Hulk is a Go service framework developed by the Short Video R&D Department on top of GDP2 (Go Develop Platform). It is a business-oriented web development framework that provides a number of out-of-the-box components and capabilities for building web services quickly. Building on Hulk and on good development practices from inside the company and across the industry, the team has also put together an initial Go ecosystem that fits its business scenarios.

The full text is 7330 words, and the expected reading time is 12 minutes.

= = =

1. Background

The Hulk framework grew out of the Go architecture upgrade of the “Good Video” server side.

1.1 Why upgrade the architecture? What problems does the current architecture face?

In the early stage, the business needed fast, flexible iteration, so PHP was chosen as the back-end language, and it served that purpose well. However, as the Good Video business grew quickly and the server-side project (interfaces, code, and so on) expanded, the monolith-like PHP architecture ran into bottlenecks in several areas:

1. Development efficiency: All server-side engineers work in the same main code base, and dependent third-party teams also modify it.

2. Release efficiency: Another drawback of many people sharing one code base is waiting to release. Only one branch can be released at a time (in staged rollouts), so the releases behind it have to queue. This led engineers to try a “piggyback release” mode, which sped up releases but also increased release risk and did not solve the underlying problem.

3. Runtime efficiency: PHP has advantages in development efficiency and flexibility, but once the business serves tens of millions of DAU or more, the runtime efficiency and resource cost of the service must be considered. PHP's support for multithreading and coroutines is weaker than that of Java, C/C++, Go, and similar languages. The monolith-like deployment architecture on physical machines also struggles to meet requirements for resource utilization and for scaling services out and in.

4. SRE efficiency: When a stability problem occurs, we expect to detect it, locate it, and stop the loss quickly. The current Sia-based monitoring and alarms and log-based troubleshooting are still far from that goal. First, engineers have to hop between platforms and systems to collect clues; second, the clues they find are often not enough to locate the problem quickly and accurately.

These problems need to be solved in terms of the overall business architecture, deployment architecture, and infrastructure.


1.2 Why not build directly on GDP2?

One factor is that the Good Video migration to Go services started before GDP2 was officially released.


1.3 Hulk and GDP2 capability comparison

The following is a simple comparison with GDP2 from three angles, to give a first impression of Hulk's overall capabilities and of how it differs from GDP2.

1.3.1 Web Server Capabilities

Hulk currently serves mainly web applications, so let's first look at its Web Server capabilities.

1.3.2 Functions/Components

The richness of functionality/components and their capabilities greatly influence the ability of the framework to support business services.

1.3.3 Surrounding tooling and infrastructure

Frameworks are never “alone”; they need to be supported by peripheral tools and infrastructure.

NOTE:

1. During the migration to Go, we also evaluated and adopted some excellent tools and solutions from the open source community. Hulk integrates with this infrastructure by default.

1.3.4 Comparative summary

This section compares Hulk's capabilities with GDP2. The comparison can be summarized in four points:

1. Hulk uses GDP for many basic capabilities, such as BNS, NET, and CODEC;

2. Hulk wraps and enhances some general and extended components according to business scenarios, such as HttpServer, RAL, Redis, and MySQL;

3. Hulk develops and integrates capabilities that the business needs but GDP does not currently support, such as scheduled tasks, a configuration center, and service governance;

4. Hulk introduces some new infrastructure, such as a Prometheus + Grafana cluster, a Sentry cluster, and a fault locating system.

GDP2 consists of dozens of modules. Because time was limited, the comparison of individual features may contain inaccuracies.

= = =

2. Getting to know Hulk

2.1 Design Roadmap


2.2 Framework Structure

In terms of functionality, Hulk's overall capabilities can be divided into four layers:

2.2.1 Basic Components

This layer provides the basic capabilities that most projects need and that the upper-level components are likely to rely on. Hulk uses these basic components to let upper-layer applications integrate seamlessly with the infrastructure:

  • Log component: by default supports a PHP-compatible log format (used to configure Sia monitoring and alarms) and the Ftrace log format (used for log query and troubleshooting);

  • Cloud native monitoring: Prometheus is supported by default. Multi-dimensional metrics are collected for all interface requests and for remote calls such as Redis and RAL, and are displayed through Grafana;

  • Configuration center: Through the configuration center, configurations can be delivered and take effect in real time. Apollo and iConf are currently supported, with version management, hot release, gray release, permission management, review, and audit;

  • Event tracking/locating: With Sentry, some failures can be detected within seconds. Hulk stores the complete context in the exception information, such as the call stack, request, cluster, and instance, so that the cause of the problem can be located directly.

2.2.2 Common Components

The component capabilities of this layer are generic and provide some administrative control and aspect capabilities:

  • RAL component: Hulk's RAL module wraps the main RAL functions of GDP2 and also enhances them: a) it supports RAL initialization and lazy loading from strings rather than files; b) it provides several hooks, for example for Prometheus metrics collection, circuit breaking, and degradation;

  • Service governance: The framework's service governance is built on Sentinel (Alibaba's open source high-availability traffic protection component) and the configuration center. It takes traffic as the entry point and helps keep microservices stable through rate limiting, traffic shaping, circuit breaking, and degradation, all of which can be controlled dynamically;

  • Goroutine pool: a) it schedules large numbers of tasks automatically and reuses goroutines to reduce GC pressure; b) it handles panics gracefully and prevents the program from crashing; c) it provides task submission, a count of running goroutines, and a wrapped WaitGroup to support task scheduling;

  • Event notification: The framework integrates with Ruliu. After configuring the robot token in the project, you can use the Ruliu component to send alarms and notifications to specified Ruliu groups. For example, combining the Ruliu component with Sentry lets us learn about a program problem immediately and locate it quickly;
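
The goroutine pool behavior listed above (goroutine reuse, panic recovery, a running count, and a WaitGroup-based wait) can be illustrated with a minimal sketch that uses only the standard library. The `Pool` type and its methods are hypothetical names chosen for this example; they are not Hulk's actual API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Pool is a hypothetical fixed-size worker pool; it is not Hulk's actual API.
type Pool struct {
	tasks   chan func()
	wg      sync.WaitGroup // tracks submitted tasks
	running int64          // number of tasks currently executing
	onPanic func(recovered interface{})
}

// NewPool starts `workers` goroutines that are reused for every submitted task.
func NewPool(workers int, onPanic func(recovered interface{})) *Pool {
	p := &Pool{tasks: make(chan func()), onPanic: onPanic}
	for i := 0; i < workers; i++ {
		go func() {
			for task := range p.tasks {
				p.run(task)
			}
		}()
	}
	return p
}

// run executes one task and turns a panic into a callback instead of a crash.
func (p *Pool) run(task func()) {
	atomic.AddInt64(&p.running, 1)
	defer func() {
		atomic.AddInt64(&p.running, -1)
		if r := recover(); r != nil && p.onPanic != nil {
			p.onPanic(r)
		}
		p.wg.Done()
	}()
	task()
}

// Submit queues a task; it blocks while all workers are busy.
func (p *Pool) Submit(task func()) {
	p.wg.Add(1)
	p.tasks <- task
}

// Running reports how many tasks are executing right now.
func (p *Pool) Running() int64 { return atomic.LoadInt64(&p.running) }

// Wait blocks until every submitted task has finished.
func (p *Pool) Wait() { p.wg.Wait() }

// Close stops the workers; call it after Wait.
func (p *Pool) Close() { close(p.tasks) }

func main() {
	var panics, sum int64
	pool := NewPool(4, func(r interface{}) { atomic.AddInt64(&panics, 1) })

	for i := 1; i <= 100; i++ {
		n := int64(i)
		pool.Submit(func() {
			if n == 50 {
				panic("task 50 failed") // recovered by the pool, not fatal
			}
			atomic.AddInt64(&sum, n)
		})
	}
	pool.Wait()
	pool.Close()
	fmt.Println("sum:", sum, "panics:", panics)
}
```

In this sketch the panic callback is the natural place to forward the recovered value to an error tracker or a notification channel.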

2.2.3 Extended Components

The first two layers rarely touch business logic directly. The components in this layer mainly handle specific kinds of business logic and scenarios, such as Redis, MySQL, and scheduled tasks:

1. Redis component: wraps the GDP2 Redis module and adds the following features:

A) Metrics hook: monitors all Redis requests through Prometheus (latency, P99, QPS, error code distribution, and so on);

B) Sentry hook: sends Redis errors to Sentry while recording error logs;

C) Degradation hook: degrades Redis access by cluster, instance, or percentage;

D) Circuit-breaker hook: sets circuit breakers for dependent services by cluster, instance, error rate, or slow request rate;

2. MySQL component: wraps GDP2 mysql and GORM_Adapter. In addition to their existing capabilities, it adds the following extensions:

A) Metrics hook: monitors all MySQL requests through Prometheus (latency, P99, QPS, error code distribution, and so on);

B) Sentry hook: sends MySQL errors to Sentry while recording error logs;

3. Distributed lock: Hulk provides a Redis-based distributed lock. The Redis connection comes from Hulk's adaptation of the GDP2 Redis module, and the lock itself wraps the open source project RedSync.

4. Scheduled task: Supports two scheduled task modes.

A) Mode with a distributed lock: for scheduled tasks deployed on multiple instances, if the task is not idempotent, a distributed lock is needed to control which instance runs it;

B) Mode without a distributed lock: if multiple instances are deployed, the scheduled task runs on all instances at the same time;
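
The difference between the two scheduled task modes can be sketched as follows. This is an illustration under stated assumptions, not Hulk's implementation: the `Locker` interface, `RunTick`, and the `cron:` key prefix are hypothetical, and an in-memory lock stands in for the Redis-backed lock so that the example runs without external services.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Locker is a hypothetical abstraction over a distributed lock. In a real
// deployment it would be backed by Redis; here an in-memory map stands in.
type Locker interface {
	TryLock(key string, ttl time.Duration) bool
}

type memLocker struct {
	mu      sync.Mutex
	expires map[string]time.Time
}

func newMemLocker() *memLocker {
	return &memLocker{expires: map[string]time.Time{}}
}

// TryLock succeeds only if nobody holds an unexpired lock on key.
func (l *memLocker) TryLock(key string, ttl time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if until, held := l.expires[key]; held && now.Before(until) {
		return false
	}
	l.expires[key] = now.Add(ttl)
	return true
}

// RunTick is what each instance does when its timer fires. With a non-nil
// locker only the instance that wins the lock runs the job; with a nil locker
// every instance runs it. The lock is left to expire after ttl so that an
// instance whose timer fires slightly late does not run the job again.
func RunTick(locker Locker, jobName string, ttl time.Duration, job func()) bool {
	if locker != nil && !locker.TryLock("cron:"+jobName, ttl) {
		return false
	}
	job()
	return true
}

// simulate fires one tick on n instances concurrently and counts executions.
func simulate(n int, locker Locker) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	runs := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			RunTick(locker, "daily-report", time.Minute, func() {
				mu.Lock()
				runs++
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return runs
}

func main() {
	fmt.Println("with lock:", simulate(5, newMemLocker())) // one instance runs the job
	fmt.Println("without lock:", simulate(5, nil))         // all five instances run it
}
```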

2.2.4 HTTP Server

Hulk currently provides only HTTP Server capabilities. It ships many generic, efficient HTTP middlewares and exposes some management control interfaces that can be used to intervene in a running service in special cases:

  • Logger_middleware: Logs HTTP requests, responses, and elapsed time. Logging can be switched on in real time by IDC, IP, percentage, UID, or CUID.

  • Timer_middleware: Monitors HTTP requests using metrics such as availability, TP99, traffic, average response time, and error codes. Metrics are available at the service, IDC, and instance level.

  • Recover_middleware: Captures panic events on the HTTP request path and supports custom panic handler logic, such as reporting to Sentry and Ruliu, so that panics are detected and located in real time.

  • Flow_control_middleware: An interface rate-limiting component. Through the configuration center or the management interface, you can limit traffic on an interface by IDC or instance.

  • Timeout_middleware: On its own or combined with the configuration center, this middleware provides IDC-level timeout control for interfaces.

  • Other middleware can be found in the Hulk documentation

    (e.g. internal_user_middleware, Jager_OpenTracing_middleware, thirdPartY_auth_middleware, b2logger_middleware, etc.)

  • Management control interfaces: for example, health check; service governance (circuit breaking, rate limiting, degradation); metrics; and performance debugging of online instances.

2.3 Framework Ecosystem

After nearly a year of work, we have built an initial Hulk-centered Go ecosystem that fits Good Video business scenarios, including:

  • Standard directory specification: prevents every project from having a different structure and reduces the difficulty and workload of project maintenance;

  • Code generator: generates code that follows the Hulk framework, the standard directory specification, and the component usage conventions. Its purpose is to reduce non-standard use of common modules and components and to avoid inconsistent coding of common flows;

  • Hklib: Good Video's general-purpose library. It provides common functions (including many ORP general/basic functions ported during the PHP-to-Go conversion) and more than 50 clients for mid-platform services, reducing repetitive code and improving R&D efficiency and maintainability;

  • Infrastructure: Prometheus + Thanos cluster, Sentry service, Apollo cluster, Pyroscope performance analysis platform, etc.

  • iConf: The self-developed configuration center, which adds or enhances some functions on top of the open source Apollo, such as key-level publishing, more secure configuration retrieval, a simpler operation page, and tiered release;

  • Artemis: A service visualization and intelligent fault locating system, in which you can see the service deployment architecture, the service's internal call chain, multi-dimensional fine-grained near-real-time monitoring, and key logs. For availability failures, some faults can be traced to their cause and to specific code within seconds.

2.4 Framework application

At present, all Go services in the Short Video department are built on Hulk, and they have delivered some initial results and benefits in resources, interface performance, and availability.

Application status of HULK framework:

Resource and performance benefits:

Much of the resource and performance gain comes from switching the technology stack from PHP to Go; the framework gives services a convenient and efficient way to use the features of that stack.

2.5 Hulk Service Architecture


The following diagram gives a panorama of a Hulk-based microservice architecture:

  • The functional components in the framework are organized around the business's scenarios and requirements, so business logic can use the relevant components with little effort;

  • After these components are enabled, they also interact with the corresponding infrastructure to support the efficient, controllable, and stable operation of services.

Hulk component initialization and integration with the surrounding infrastructure can be done mostly through environment variables/configuration files.

= = =

3. Framework capability and application

Let’s take a look at the capabilities of the framework from some of the pain points we encounter in daily development, along with examples of how these capabilities reduce or resolve pain points.

3.1 How to improve code quality?

Code quality can directly or indirectly affect the following:

  • Code quality will directly affect code maintenance cost;

  • Code quality affects the probability of bugs;

  • Code quality will affect the efficiency of program operation;

The Hulk framework improves code quality in three ways.

3.1.1 Code organization structure

Reduce project maintenance cost and improve R&D efficiency.

  • The standard directory specification defines a common project layout (for HTTP services), so that the team does not end up with one or more layouts per person and wildly different project structures;

  • The code generator helps developers generate the project template and initialization flow, and encodes the conventions for how each directory and file is used;

3.1.2 Coding specification and static check

Improve code readability and reduce low-level code bugs

  • Comply with Baidu Go coding specification + business coding supplementary specification;

  • Use GDP code check tools: GO_fMT, GOC;

3.1.3 Supporting load testing and performance analysis platforms

Identify the load limits of a service and find potential performance issues.

  • Load and performance test platform (test environment): nGrinder

  • Program performance analysis platform: Pyroscope. You can use Hulk's built-in management interface to enable or disable continuous profiling of online instances in real time to locate online performance problems:

3.2 How to improve the speed of development iteration?

  • How to get developers to focus on business logic and implementation?

  • How can developers respond quickly and fulfill product requirements?

The Hulk framework provides the following support to speed up iteration.

3.2.1 Rich practical components/functions

Improve R&D efficiency, avoid trial and error, and reduce mistakes.

  • Enhanced components: enhanced Redis and MySQL functionality, enhanced RAL calls, and so on. Example: the Redis monitoring in the following figure, whose metrics are collected and calculated automatically by Hulk's Redis component:

  • Excellent open source components: Sentry, Prometheus + Grafana, Apollo, goroutine pool, etc. Example: Prometheus + Grafana. The Hulk framework supports Prometheus by default and automatically calculates metrics such as service availability, QPS, latency, and error codes.

  • Rich HTTP Middleware.

3.2.2 Configuration and low code support

Reduce code changes and releases, and respond to and deliver requirements faster.

  • Most components in the Hulk framework can be initialized using environment variables or configuration files;

  • Variable data and configuration in service logic can be delivered and take effect in real time through Apollo/iConf, without code changes or a lengthy release process. Example: using the out-of-the-box configuration center to deliver a configuration and have it take effect in real time:
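
On the service side, a configuration that takes effect in real time usually means that request handlers read a snapshot that can be swapped while the process is running. The sketch below shows one common way to do this with `sync/atomic`. It does not use the Apollo or iConf client APIs; `FeatureConfig`, `Store`, and `OnChange` are hypothetical names, and `OnChange` stands in for whatever callback a configuration-center client would invoke after a release.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FeatureConfig is a hypothetical set of values a service reads on every request.
type FeatureConfig struct {
	FeedPageSize  int
	EnableNewRank bool
}

// Store holds the current config snapshot. Readers never block, and an update
// replaces the whole snapshot atomically.
type Store struct {
	current atomic.Value // always holds a *FeatureConfig
}

// NewStore creates a store with the initial (for example, file-based) config.
func NewStore(initial FeatureConfig) *Store {
	s := &Store{}
	s.current.Store(&initial)
	return s
}

// Get returns the snapshot in effect right now.
func (s *Store) Get() *FeatureConfig {
	return s.current.Load().(*FeatureConfig)
}

// OnChange is what a config-center callback would call after a release.
// Invalid values are rejected, so a bad release keeps the old snapshot.
func (s *Store) OnChange(next FeatureConfig) error {
	if next.FeedPageSize <= 0 {
		return fmt.Errorf("invalid FeedPageSize %d", next.FeedPageSize)
	}
	s.current.Store(&next)
	return nil
}

func main() {
	store := NewStore(FeatureConfig{FeedPageSize: 10})
	fmt.Println("initial page size:", store.Get().FeedPageSize)

	// A new release arrives: readers see it on their next Get, no restart needed.
	if err := store.OnChange(FeatureConfig{FeedPageSize: 20, EnableNewRank: true}); err != nil {
		fmt.Println("update rejected:", err)
	}
	fmt.Println("after release:", store.Get().FeedPageSize, store.Get().EnableNewRank)

	// A bad release is rejected and the previous snapshot stays in effect.
	err := store.OnChange(FeatureConfig{FeedPageSize: 0})
	fmt.Println("bad release:", err, "still:", store.Get().FeedPageSize)
}
```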


3.3 How can I quickly detect and locate problems?

  • How do developers quickly sense problems in their services, and how do they sense serious problems in real time?

  • How can developers get detailed problem information from monitoring, logging, and alarms to quickly locate problems?

Hulk provides support from the following aspects to improve SRE efficiency.

3.3.1 Comprehensive event tracking, locating, and notification capabilities

Track developer-defined errors and send notifications about them in real time.

  • Real-time event tracking component: Sentry. Hulk provides a Sentry component out of the box, which is as easy to use as printing a log. Information in Sentry includes the code call stack, context, custom key information, and so on:

  • Notification component: Ruliu. A single line of token configuration enables Ruliu notifications, which send information that needs immediate attention to a Ruliu group in real time. Combined with Sentry, it enables real-time detection and locating of abnormal problems:

3.3.2 Prometheus + SIA Monitoring

Prometheus and Noah complement each other. Together they provide multi-dimensional, comprehensive monitoring and more information about service stability.

  • Prometheus provides developers with flexible, multi-dimensional business monitoring information;

  • Sia provides developers with log-based service stability indicators and with monitoring of containers, networks, and other resources.

3.3.3 Ftrace Log Query and Analysis Function

Hulk supports the Ftrace platform log format by default.

  • Ftrace lets you query user logs conveniently and efficiently;

  • You can run the pdo2 command to query logs with user-defined rules.

3.4 Hulk-based service visualization and intelligent fault locating system

Artemis is a service visualization and intelligent fault locating system that we developed on top of Hulk. It brings together service deployment architecture visualization, near-real-time multi-dimensional monitoring, key logs, and the service invocation chain, so that stability problems can be found and located quickly and accurately.

At present, the system is connected to many back-end services, such as nice/universal/DCC, which makes fault locating much faster. In some fault scenarios, the fault can be located within seconds and the offending lines of code can be shown.

3.4.1 Service Deployment Architecture

The instance list shows the service's IDC list, instance list, and instance details, and provides convenient debugging entry points and login commands:

3.4.2 Near-real-time multidimensional monitoring

The near-real-time monitoring in Artemis offers dimensions that Sia and Prometheus cannot provide, such as:

  • QPS, latency, and availability of a downstream service (or downstream instance) called through RAL under a given URI;

  • Monitoring information for a URI or RAL call on a single service instance;

3.4.3 Critical Logs

Because Artemis is deeply integrated with Hulk, it can capture additional information, such as dimension information, call stack, and context, whenever business code prints a log at warning level or above:

3.4.4 Service Invocation Chain

With the help of the Hulk framework, Artemis can also obtain each URI and the RAL calls that the URI depends on. From this it builds the request call chain and displays the related metrics on the chain in real time:

Links of different colors represent different availability levels: red for one 9 or less, yellow for two 9s, blue for three 9s, and gray for four 9s. The service invocation chain makes it easy to see which interfaces in a service have problems and which downstream dependency is affecting an interface's availability.

3.4.5 Use Cases

Through integration with the alarm system, when an alarm fires you can immediately find the affected service and URI in Artemis, and see whether the problem is caused by a downstream dependency, what the error is, and which line of code reported it. The following is a practical example of Artemis in use.
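
The color rule can be written down as a small function. The exact cut-offs Artemis uses are not given in the text, so the thresholds below (99%, 99.9%, 99.99%) are one reading of "two 9s", "three 9s", and "four 9s", and `LinkColor` is a hypothetical name. Integer arithmetic is used so that boundary values are not affected by floating-point rounding.

```go
package main

import "fmt"

// LinkColor maps a link's availability (successful requests out of total
// requests) to the colors named in the text. The cut-offs are an assumption.
func LinkColor(success, total int64) string {
	switch {
	case total <= 0:
		return "unknown" // no traffic, so availability is undefined
	case success*10000 >= total*9999:
		return "gray" // four 9s or better
	case success*1000 >= total*999:
		return "blue" // three 9s
	case success*100 >= total*99:
		return "yellow" // two 9s
	default:
		return "red" // one 9 or less
	}
}

func main() {
	samples := []struct{ success, total int64 }{
		{10000, 10000}, // 100%
		{9995, 10000},  // 99.95%
		{9950, 10000},  // 99.5%
		{9500, 10000},  // 95%
	}
	for _, s := range samples {
		fmt.Printf("%d/%d -> %s\n", s.success, s.total, LinkColor(s.success, s.total))
	}
}
```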

= = =

4. Summary

Although Hulk is a new Go web framework, it does not reinvent the wheel. It builds on in-house and open source software, takes into account the actual development, deployment, runtime, and operations environment of the business, and does secondary development on these frameworks and tools so that they fit real business scenarios. The Go ecosystem built around Hulk supports services through every stage: development, deployment, runtime, and operations.

Finally, we hope that the practices, solutions, and experience of the Short Video R&D Department in the Go service architecture upgrade and in framework development can be a useful reference for engineers in other business lines who face the same upgrade needs or run into problems in their Go projects.

