Brief introduction: Fengshen core functionality | DingTalk alarms + data gateway

1. Development background

1.1 User pain points

① Weak operations capability on the tenant side: customers have no effective way to obtain instance-level status, performance, and capacity data in a timely manner. Current situation: at a fixed time every day, on-site staff collect the data by hand and push it to customers.

② Inefficient troubleshooting: when the application has a problem but the cloud platform products look normal, customers do not accept that conclusion, so on-site staff still have to help them resolve it. Current situation: the cause often turns out to be an application instance whose performance or capacity is exhausted, and the troubleshooting process is lengthy and inefficient.

③ Lack of monitoring capability: cloud platform monitoring is incomplete, and reporting capabilities such as capacity management and performance management are missing. Current situation: on-site staff rely on extensive manual inspections or ad hoc scripts.

④ Poor timeliness of monitoring: the business side always notices a fault before the application and the cloud platform do, so operations work is very passive. Current situation: the customer finds the problem and informs the application team, the application team checks the application and then traces the problem to the cloud platform. This serial investigation chain is inefficient.

1.2 Solution

① Ensure business stability: track changes in the service capability of cloud products and build a business simulation model to predict the health of the customer's business in advance. An alarm is triggered when the customer's business health falls below the baseline.

② Display SLAs and raise an alarm automatically when a threshold is crossed, so that the health status of each product is quantified.

2. Development and design

2.1 System architecture



Figure 1: System architecture diagram

As shown in Figure 1, the architecture of the Fengshen system is divided into two modules: Client and Server.

  • Client: deployed in the classic "copper finch" (Tongque) container; it collects data from the various cloud products through scheduled tasks.
  • Server: deployed on ECS in a VPC. It is built on the Flask framework and is divided into two parts: data processing and data storage.

    ① Data processing: provides APIs that accept the Client's data and store it, and serves the front-end display of the data.
    ② Data storage: persists the data in an Alibaba Cloud RDS database.
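
The storage half of the Server can be pictured with a small sketch. It is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for the RDS database, and the table name `metrics`, the record fields (`product`, `instance`, `metric`, `value`, `collected_at`), and the helper names `init_store` and `store_records` are hypothetical, not taken from the Fengshen code.

```python
import sqlite3
from typing import Dict, Iterable


def init_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) metrics table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "product TEXT NOT NULL, instance TEXT NOT NULL, metric TEXT NOT NULL, "
        "value REAL NOT NULL, collected_at TEXT NOT NULL)"
    )


def store_records(conn: sqlite3.Connection, records: Iterable[Dict]) -> int:
    """Persist the records reported by a Client and return how many were written."""
    rows = [
        (r["product"], r["instance"], r["metric"], float(r["value"]), r["collected_at"])
        for r in records
    ]
    # "with conn" commits on success and rolls back if the insert raises.
    with conn:
        conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)
```

In a Flask deployment such as the one described above, a function like `store_records` would typically be called from the API handler that receives the Client's payload.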

2.2 Business architecture



Figure 2: Business architecture diagram

As shown in Figure 2, the Fengshen business architecture is divided into five sections.

  • Jiang Ziya: tenant-side alarms, mainly performance and business-related alarms for instances of ECS, RDS, and other cloud products.
  • Shen Gongbao: operations-side alarms, mainly the health status and water-level capacity of cloud products.
  • Lei Zhenzi: hardware alarms, mainly bad disks, out-of-band problems on physical machines, and so on.
  • Bigan: security alarms, mainly security alerts from Cloud Shield.
  • Yang Jian: fault alarms; an SLA algorithm processes the data of each product, and thresholds are set for P0 and P1 faults.
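
The threshold idea behind the Yang Jian section can be sketched as follows. This is a minimal illustration, not the Fengshen SLA algorithm: the function name `classify_sla`, the use of an availability percentage as input, and the threshold values 99.0 and 99.9 are all assumptions chosen for the example.

```python
from typing import Optional

# Hypothetical thresholds in percent availability; the real values are not given here.
P0_THRESHOLD = 99.0
P1_THRESHOLD = 99.9


def classify_sla(availability: float) -> Optional[str]:
    """Map an availability percentage to "P0", "P1", or None (no fault)."""
    if not 0.0 <= availability <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    if availability < P0_THRESHOLD:
        return "P0"  # most severe: below the lower threshold
    if availability < P1_THRESHOLD:
        return "P1"  # degraded: between the two thresholds
    return None      # healthy: no fault alarm
```

Checking the stricter P0 condition first keeps the two levels mutually exclusive, so each product value maps to at most one fault level.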

3. DingTalk alarm

3.1 Alarm classification

For details on how to create the robots, see [1].

Jiang Ziya

Shen Gongbao

Lei Zhenzi

Bigan

Yang Jian

3.2 Alarm display



Figure 3: Jiang Ziya



Figure 4: Shen Gongbao



Figure 5: Lei Zhenzi



Figure 6: Bigan



Figure 7: Yang Jian -1



Figure 7: Yang Jian -2



Figure 7: Yang Jian -3



Figure 7: Yang Jian -4

4. Data gateway

The data gateway is divided into two modules: obtaining data and receiving data.

  • Obtaining data: covers alarm data, full data, and performance data.

    ① Alarm data: corresponds to the alarm information pushed by the DingTalk robots, wrapped in a matching data format and exposed through an API.
    ② Full data: the source table data in the database, exposed through an API without any processing, which leaves callers free to process it as they need.
    ③ Performance data: product performance data is written periodically to a time-series database and retained for a long time, so historical performance data can be queried.

  • Receiving data: provides an external API that receives custom monitoring data, wraps it in Markdown format, and sends it as a DingTalk alert in real time.
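
The Markdown wrapping step might look like the sketch below. The outer `msgtype`/`markdown` structure follows the message body that DingTalk custom robots accept; everything else is an assumption for illustration: the function name `build_markdown_message`, the layout of the text, and the lowercase field names (`title`, `level`, `product`, `monitor_time`, `msg`), which are modelled on the demo payload in section 4.2.2.

```python
from typing import Dict


def build_markdown_message(alert: Dict[str, str]) -> Dict:
    """Wrap a custom alert in a DingTalk markdown robot message body."""
    title = alert["title"]
    lines = [
        f"### {title}",
        f"- level: {alert.get('level', 'unknown')}",
        f"- product: {alert.get('product', 'unknown')}",
        f"- time: {alert.get('monitor_time', '')}",
        "",
        alert.get("msg", ""),
    ]
    return {
        "msgtype": "markdown",
        "markdown": {"title": title, "text": "\n".join(lines)},
    }
```

The returned dictionary would then be sent as JSON to the webhook URL of the target robot; how Fengshen selects that robot is not described here.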

4.1 Data Acquisition

4.1.1 Alarm data
4.1.1.1 Request interface

Request method: POST
Request URL: http://{ip}:{port}/api/v1/sea…
IP: ecs_ip
Port: 9170
PARAM: parameter list; see [2] for details.

4.1.1.2 DEMO
import requests

# Alarm-data query endpoint; replace {ip} and {port} with the Server's address
url = "http://{ip}:{port}/api/v1/search/monitor/"
# Filter by product, alarm title, and time window
data = {"product": "MQ", "title": "backlog", "stime": "2020-01-04 00:00:00", "etime": "2020-01-04 00:01:00"}
res = requests.post(url=url, json=data)
print(res.content)

curl -H "Content-Type:application/json" \
  -X POST -d '{"type":"ALL"}' http://{ip}:{port}/api/v1/search/monitor/
4.1.1.3 Data return

(1) An alarm currently exists:
{"code": 0, "data": [{"info": "0.0.0.0, ecs, 95%\n0.0.0.1, ecs, 95%", "product": "ecs", "title": "performance warning", "level": "warning", "robot": "Ziya", "monitor_time": "2020-01-14 00:00:00", "columns": "IP, product, value"}]}

(2) No alarm data at present (the alarm has returned to normal):
{"code": 0, "data": [{"info": "", "product": "ecs", "title": "performance warning", "level": "warning", "robot": "Ziya", "monitor_time": "2020-01-14 00:00:00", "columns": "IP, product, value"}]}

(3) No data found:
{"code": 0, "data": []}

(4) Query exception:
{"code": 500, "data": "exception information"}
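
A caller has to tell these four cases apart. The sketch below shows one way to do so. It assumes, based only on the sample responses above, that `info` holds one row per line and that the comma-separated values in each row line up with the names in `columns`. The function name `parse_alarm_response` is hypothetical.

```python
from typing import Dict, List


def parse_alarm_response(payload: dict) -> List[Dict[str, str]]:
    """Turn one alarm-query response into a list of rows keyed by column name."""
    if payload.get("code") != 0:
        # Case (4): "data" carries the exception text instead of a list.
        raise RuntimeError(f"query failed: {payload.get('data')}")
    rows = []
    for alarm in payload["data"]:  # an empty list in case (3)
        columns = [c.strip() for c in alarm["columns"].split(",")]
        for line in alarm["info"].split("\n"):
            if not line.strip():  # empty "info" in case (2): the alarm has recovered
                continue
            values = [v.strip() for v in line.split(",")]
            row = dict(zip(columns, values))
            row["title"] = alarm["title"]
            rows.append(row)
    return rows
```

An empty result therefore covers both "no data found" and "alarm recovered"; a caller that needs to distinguish the two can check whether `payload["data"]` itself is empty.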

4.1.2 Full data
4.1.2.1 Request interface

Request method: POST
Request URL: http://{ip}:{port}/api/v1/sea…
IP: ecs_ip
Port: 9170
PARAM: parameter list; see [2] for details.

4.1.2.2 DEMO
import requests

# Full-data query endpoint; replace {ip} and {port} with the Server's address
url = "http://{ip}:{port}/api/v1/search/data/"
# Filter by product, title, and time window
data = {"product": "MQ", "title": "TIME", "stime": "2020-01-04 00:00:00", "etime": "2020-01-04 00:01:00"}
res = requests.post(url=url, json=data)
print(res.content)
4.1.2.3 Data return

4.1.3 Performance data
4.1.3.1 Request interface

Request method: POST
Request URL: http://{ip}:{port}/api/v1/inf…_query/
IP: ecs_ip
Port: 9170
PARAM: InfluxDB SQL

4.1.3.2 DEMO
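import requests

# Performance-data query endpoint; replace {ip} and {port} with the Server's address
url = "http://{ip}:{port}/api/v1/influxdb_query/"
# Replace the placeholder with an actual InfluxDB query statement
data = {"sql": "influxdb sql"}
# The payload is sent as form data, as in the original call
res = requests.post(url, data=data)
print(res.content)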
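
The `sql` parameter is described only as InfluxDB SQL. As a hedged illustration of what such a statement could look like, the sketch below assembles an InfluxQL query for a time window. The helper name `build_influx_query` and the measurement and field names used with it (for example `ecs_cpu` and `value`) are hypothetical, since the actual schema of the time-series database is not documented here.

```python
from datetime import datetime, timezone


def _check_identifier(name: str) -> str:
    """Allow only letters, digits, and underscores in measurement and field names."""
    if not name or not name.replace("_", "").isalnum():
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def build_influx_query(measurement: str, field: str, start: datetime, end: datetime) -> str:
    """Build an InfluxQL SELECT over the half-open window [start, end)."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if end <= start:
        raise ValueError("end must be later than start")
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamps in UTC
    start_s = start.astimezone(timezone.utc).strftime(fmt)
    end_s = end.astimezone(timezone.utc).strftime(fmt)
    return (
        f'SELECT "{_check_identifier(field)}" FROM "{_check_identifier(measurement)}" '
        f"WHERE time >= '{start_s}' AND time < '{end_s}'"
    )
```

The resulting string would be passed as the value of `sql` in the request above. Restricting identifiers to a safe character set is a design choice of this sketch, meant to keep caller input from altering the query.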
4.1.3.3 Data return

4.2 Data Receiving

4.2.1 Request interface

Request method: POST
Request URL: http://{ip}:{port}/api/v1/ins…
IP: ecs_ip
Port: 9170
PARAM:

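4.2.2 DEMO
import requests

# Endpoint for inserting third-party alerts; the address is the sample value from the original listing
url = "http://172.0.0.1:9170/api/v1/insert/third/"
data = {
    "title": "ecs performance monitoring",
    "level": "warning",
    "source": "cloud monitoring",
    "product": "ecs",
    "msg": "IP: 10.0.0.1 CPU: 98%. IP 127.0.0.1 mem: 99%",
    "robot": "Ziya",
    "submitor": "Gao Dechen",
    "monitor_time": "2021-03-10 16:00:00",
    "details": "buddy",
}
# The original listing is cut off after "res = requests"; posting the payload as JSON
# is an assumption that mirrors the earlier demos
res = requests.post(url=url, json=data)
print(res.content)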
4.2.3 Alarm display



Figure 8: Alarm display diagram

References

[1] Fengshen deployment pre-check: https://yuque.antfin-inc.com/docs/share/d3a743db-af85-47d2-89c5-4f22eb1693c5?
[2] Fengshen data, third-party API wrapper: https://yuque.antfin-inc.com/docs/share/2037fbb2-35fa-42ad-8476-ec7502e9ed33?#

We are the SRE team of Alibaba Cloud Intelligence Global Technical Services. We aim to be a technology-based, service-oriented engineering team that ensures the high availability of business systems. We provide professional, systematic SRE services that help customers make better use of the cloud and build more stable and reliable business systems on it. We look forward to sharing more technical content that helps enterprise customers move to the cloud and use it well.
