Introduction: This article is adapted from the best-practice talk given by Liulishuo (Fluent Speaking) at the Intelligent Operations and Maintenance special event (Shanghai): “Unified Monitoring and Operations Practice on SLS for an Online Education Platform with Tens of Millions of Users”

Authors: Sun Wenjie, Director of Operations and Maintenance, Liulishuo; Yuan Yi, Intelligent Technology Expert, Alibaba Cloud

Quality content and customized services strengthen an enterprise's core competitiveness

In 2020, under the slogan of “suspending classes without suspending learning”, the online education market grew rapidly because of the epidemic, reaching 485.8 billion yuan. After several years of rapid growth, the market has become relatively mature. Users have different demands for different types of online education institutions, so traffic alone cannot win loyal users. For the education industry, the core competitiveness is still quality content and services. Only with high-quality course content, personalized plans based on each customer's learning habits and foundation, a good and stable product experience, and higher operating efficiency can an enterprise achieve long-term development. Across the online education industry, which is under constant adjustment, the enterprises that survive must return to the essence of education and compete on quality products, content, and services.

Distinctive teaching, powered by artificial intelligence

As the industry adjusts step by step, online education companies will gradually shift their focus from chasing user growth back to building content. In the wider market, however, syllabuses are the same and teaching methods are largely alike. Courses vary, but few are remarkable, and most companies cannot rely on content alone to stand out.

Liulishuo, however, is different. In the era of artificial intelligence, it draws on its distinctive intelligent courses and innovative technologies such as AI to give users personalized teaching and help more of them improve their English. As of March 31, 2021, Liulishuo had more than 200 million cumulative registered users and a very large database of English speech from Chinese speakers, which lets it assess each student according to their actual level. While a student practices pronunciation, an intelligent recognition and correction system dynamically captures key points of the mouth and compares them, using advanced techniques, to find the problems in the student's pronunciation. It can then give targeted guidance to fix problems in spoken expression and fundamentally help students improve their speaking.

Product experience is key, and improving system stability becomes a challenge

The rapid growth of Liulishuo's business and user base, from millions of users at the start to more than 200 million, has brought great challenges to operations and maintenance (O&M): large swings in data traffic between peak and off-peak periods, growing service complexity, and increasingly difficult analysis. On the Internet, experience is one of the most critical competitive factors. According to statistics, each additional second of delay leads, on average, to the loss of 7% of users.

Liulishuo has no separate O&M department. The O&M system for the base platform is mainly built by the R&D engineers of the Cloud-Infra team. The team's core requirements go beyond SLA and performance monitoring, alerting, and supplying data for locating problems. They also cover operating Cloud-Infra's technical value, for example utilization, cost savings, and the mesh of relationships between business services. These demands translate into the following requirements for the intelligent O&M platform:

1. Collect and monitor heterogeneous data sources, including machine metrics and utilization on K8s and ECS, Istio call logs, metrics from self-built middleware, metrics provided by cloud services, and service trace data, plus real-time collection of cost data.

2. Dynamically discover and collect all kinds of resources. Data about organizational relationships, such as departments, must also be updated in real time, so that the most accurate metrics and ownership relationships can be reported in real time.

3. Store and analyze data at large scale. Because Liulishuo's business is large and uses many kinds of cloud resources, it generates tens of TB of data per day.

4. Keep the monitoring platform itself stable. Because the platform is responsible for detecting stability problems, it must eliminate single points of failure in every component and be able to recover quickly from failures.

A one-stop intelligent O&M solution that connects the entire data collection and computing pipeline

The intelligent O&M platform built by Liulishuo has to process more than time-series data. The core business availability data also has to be calculated and analyzed from various logs, so the platform is built on two data types: Logs and Metrics. Both have community and commercial options, such as ES, Loki, SLS, Prometheus, OpenTSDB, and InfluxDB. In the end, Alibaba Cloud SLS was chosen as the logging solution and Prometheus + SLS as the time-series solution, for the following reasons:

1. SLS can store and analyze all kinds of data in a unified way and correlate Metrics with Logs on the same platform, which the other options evaluated did not offer.

2. SLS can handle very large data volumes, and in the team's experience its performance is much better than that of ES. It is also a managed service that needs no O&M, which removes the burden of keeping a self-hosted ES cluster highly reliable.

3. The time-series solution is centered on Prometheus, whose ecosystem is very mature and whose query language, PromQL, is simple to use. The SLS time-series store can serve as highly reliable remote storage for Prometheus, which solves Prometheus's reliability problem.

4. SLS provides data processing that can join logs with external data sources, which makes it easier to handle complex logs and to add Catalog information to them.

To automate as much as possible, Alibaba Cloud Log Service (SLS) has also developed a dynamic discovery mechanism for IaaS and PaaS resources in cloud scenarios. It adds newly purchased and newly created resources to monitoring and collection in real time, which avoids most manual operations.
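The core of such a discovery mechanism can be pictured as a periodic reconciliation between what the cloud inventory reports and what is already monitored. The sketch below is only an illustration under stated assumptions: the `Resource` fields, the department names, and `diff_resources` are hypothetical and are not part of SLS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A cloud resource as returned by a (hypothetical) inventory listing."""
    resource_id: str
    kind: str   # e.g. "ecs", "rds", "k8s-node"
    owner: str  # owning department or team (hypothetical field)


def diff_resources(monitored: set[Resource], discovered: set[Resource]):
    """Return (to_add, to_remove) so that monitoring matches the discovered set."""
    to_add = discovered - monitored      # new resources that need collection
    to_remove = monitored - discovered   # released resources to stop collecting
    return to_add, to_remove


# Example run with made-up resources.
monitored = {Resource("i-001", "ecs", "payments")}
discovered = {
    Resource("i-001", "ecs", "payments"),
    Resource("i-002", "ecs", "search"),
}
to_add, to_remove = diff_resources(monitored, discovered)
```

Because `Resource` is compared by value, a resource whose owner changes appears in both sets of the result, which has the effect of refreshing its ownership information.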

In addition, for each data scenario, Alibaba Cloud Log Service (SLS) was customized to meet Liulishuo's requirements:

1. Logs

  • Logs of different services are collected directly into different Logstores using SLS Logtail.
  • Not all logs need to be stored and indexed for a long time, so logs are categorized. Logs that must be audited are delivered to OSS for long-term storage. Service troubleshooting logs are kept for only two weeks, with full-text indexing enabled. For access logs, only some fields are indexed, which saves a large share of the indexing cost.
  • For NGINX access logs that are used to calculate SLA and PXX metrics, data processing maps each URL in the access log to a department, application, and method, using Catalog information stored in RDS such as mapping rules, departments, and applications.
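The URL-to-owner mapping in the last bullet can be sketched as a rule lookup. This is a minimal illustration, not the SLS data processing syntax: the rules, department names, and the `request_uri` field name are assumptions, and in the setup described above the rules would be loaded from RDS rather than hard-coded.

```python
import re
from typing import NamedTuple, Optional


class Owner(NamedTuple):
    department: str
    application: str
    method: str


# Hypothetical mapping rules; the article keeps the real ones in RDS.
RULES = [
    (re.compile(r"^/api/courses/\d+/lessons$"),
     Owner("learning", "course-service", "ListLessons")),
    (re.compile(r"^/api/users/\d+$"),
     Owner("platform", "user-service", "GetUser")),
]


def resolve_owner(url: str, rules=RULES) -> Optional[Owner]:
    """Return the owner of the first rule matching the URL path (query string ignored)."""
    path = url.split("?", 1)[0]
    for pattern, owner in rules:
        if pattern.match(path):
            return owner
    return None


def enrich(log: dict, rules=RULES) -> dict:
    """Return a copy of the log record with catalog fields added when a rule matches."""
    owner = resolve_owner(log["request_uri"], rules)
    enriched = dict(log)
    if owner is not None:
        enriched.update(owner._asdict())
    return enriched


record = enrich({"request_uri": "/api/users/42?lang=en", "status": 200})
```

Records that match no rule are passed through unchanged, so unmapped URLs stay visible and can be used to find gaps in the rule set.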

2. Data monitoring

  • Prometheus was chosen as the monitoring solution. For Liulishuo's scenario, we developed several exporters to collect Metrics from a variety of cloud products and self-built components.
  • To make Prometheus easier to use and to integrate it with the internal CI/CD system, we added a sidecar to Prometheus that watches a Git repository for changes and dynamically rebuilds the Prometheus configuration from them.
  • To improve query speed, various recording rules are configured on Prometheus and managed in Git.
  • Alerts from Alertmanager are sent directly to the internal alert center, which provides advanced functions such as scheduling and escalation.
  • To remove Prometheus as a single point of failure and to enable later correlation analysis with the Catalog, we use the SLS time-series store and have Prometheus remote-write directly to it.
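The sidecar idea above can be sketched as "detect a change in the checked-out configuration, then ask Prometheus to reload". The code below is an assumption-laden illustration rather than Liulishuo's sidecar: it assumes a separate step has already pulled the Git repository into `config_dir`, that configuration files end in `.yml`, and that Prometheus runs with `--web.enable-lifecycle` so that `POST /-/reload` is available. The function names are hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path


def config_digest(config_dir: Path) -> str:
    """Hash the names and contents of all .yml files so any change alters the digest."""
    digest = hashlib.sha256()
    for path in sorted(config_dir.rglob("*.yml")):
        digest.update(str(path.relative_to(config_dir)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def reload_prometheus(base_url: str) -> None:
    """Ask Prometheus to re-read its configuration (needs --web.enable-lifecycle)."""
    request = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    urllib.request.urlopen(request, timeout=10).close()


def sync_once(config_dir: Path, last_digest: str, reload=reload_prometheus,
              base_url: str = "http://localhost:9090") -> str:
    """Reload Prometheus if the checked-out configuration changed; return the new digest."""
    current = config_digest(config_dir)
    if current != last_digest:
        reload(base_url)
    return current
```

A sidecar loop would call `sync_once` after each pull and keep the returned digest for the next round. Passing `reload` as a parameter keeps the change detection testable without a running Prometheus.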

3. Metric calculation

  • Some core metrics are calculated from the NGINX access log. The QPS, error rate, and latency (average, PXX, and so on) of each business can be obtained at the entry point, without any intrusion into the business code.
  • Metrics such as resource utilization, middleware, and infrastructure come from the time-series store written by Prometheus. Using the Catalog, the relevant metrics can be aggregated and calculated for each department and business.
  • Because the calculated metrics are very small in volume, they can easily be stored in MySQL or ES and sent to OSS for backup.
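As an illustration of the entry-point metrics described above, the following sketch aggregates QPS, error rate, and latency per application over one time window. It is written in plain Python rather than SLS query syntax. It assumes each record already carries an `application` field from the Catalog enrichment, uses NGINX's `status` and `request_time` values, counts only 5xx responses as errors, and uses the nearest-rank definition of a percentile; all of these are assumptions, not the team's stated choices.

```python
import math
from collections import defaultdict


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending list; p is in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(records: list[dict], window_seconds: int) -> dict:
    """Aggregate access-log records per application over one time window."""
    by_app = defaultdict(list)
    for record in records:
        by_app[record["application"]].append(record)

    summary = {}
    for app, rows in by_app.items():
        latencies = sorted(row["request_time"] for row in rows)
        errors = sum(1 for row in rows if row["status"] >= 500)
        summary[app] = {
            "qps": len(rows) / window_seconds,
            "error_rate": errors / len(rows),
            "avg_latency": sum(latencies) / len(latencies),
            "p99_latency": percentile(latencies, 99),
        }
    return summary
```

With only a handful of records, P99 is simply the slowest request; the figure becomes meaningful once a window contains hundreds of requests or more.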

Building a unified intelligent O&M platform: from cost center to innovative productivity tool

Today, this intelligent O&M platform carries almost all of the company's core O&M work. It has run stably since launch and easily copes with sudden increases in data volume during promotional activities. Its business value is mainly reflected in the following areas:

  • Monitoring: The first value of the platform is monitoring and alerting of all kinds, especially for SLAs. Because the data is already associated with specific departments and business applications, it is easy to obtain the SLA of each department and application and to drive improvements across the company in a unified way.
  • Troubleshooting and fault isolation: Based on Istio access logs and Catalog information, the call relationships between applications can be calculated. A mesh of business relationships can therefore be generated in real time, and the quality of each relationship (edge) is known. When a fault occurs, the root cause can be located quickly and the fault isolated.
  • FinOps: For Cloud-Infra, the most challenging issue is cost, so cost optimization is also core work for us. The main practice is to calculate the resource utilization of each department and team, including the average utilization and various percentile (PXX) utilization figures (as shown in the table below), in order to judge how well each department uses its resources and to promote cost optimization in each department.
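The call-relationship mesh from the troubleshooting bullet can be sketched as an aggregation of access-log entries into edges between applications. This is only an illustration: the field names `source_workload`, `destination_workload`, and `response_code` are assumptions about the log schema, the `catalog` dictionary stands in for the real Catalog, and the 5% threshold is arbitrary.

```python
from collections import defaultdict


def build_call_graph(access_logs, catalog):
    """Aggregate request and error counts per (caller, callee) application edge."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for log in access_logs:
        # Map workloads to applications; unmapped workloads are grouped as "unknown".
        caller = catalog.get(log["source_workload"], "unknown")
        callee = catalog.get(log["destination_workload"], "unknown")
        edge = edges[(caller, callee)]
        edge["requests"] += 1
        if log["response_code"] >= 500:
            edge["errors"] += 1
    return {
        key: {**stats, "error_rate": stats["errors"] / stats["requests"]}
        for key, stats in edges.items()
    }


def unhealthy_edges(graph, threshold=0.05):
    """Return (edge, error_rate) pairs above the threshold, worst first."""
    bad = [(key, stats["error_rate"]) for key, stats in graph.items()
           if stats["error_rate"] > threshold]
    return sorted(bad, key=lambda item: item[1], reverse=True)
```

During an incident, the edges returned by `unhealthy_edges` give a starting point for root-cause analysis: an application whose outgoing edges are healthy but whose incoming edges are failing is a likely origin of the fault.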

Closing remarks

In the cloud era, digitization is driving business innovation across industries. Only by improving the user experience, accelerating innovation, updating infrastructure and architecture, and making good use of diverse data can a company stand out. The intelligent O&M platform launched by Alibaba Cloud is meant not only to reduce engineers' workload but also to free them from all kinds of mechanical work. We take on the “dirty work” so that downtime is significantly reduced and O&M engineers can put more of their creativity into digital and business innovation, giving enterprises a stronger competitive edge.


This article is original content from Alibaba Cloud and may not be reproduced without permission.