Content source: On August 10, 2017, Xu Ting, an advertising development technical expert at Sina, gave the talk "The Servitization Journey of Sina's Advertising System" at the 2nd APMCon China Application Performance Management Conference (Large-Scale Network Architecture Optimization track). As the exclusive video partner, IT Dakashuo publishes this content with the review and authorization of the organizers and the speaker.

Word count: 4,858 | Reading time: 13 minutes

To watch the speaker's video and slides, visit:
t.cn/EAEQLSN

Abstract

Sina began working on its advertising system quite early, distinguishing users by a combination of UserID, CookieID, user behavior logs, and other signals, and then applying frequency control for each individual user. To better monitor the performance and accuracy of the advertising service platform, the Sina advertising technology team has made a number of improvements on this foundation.
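The per-user frequency control described above can be roughly sketched as follows: a minimal impression cap keyed on UserID/CookieID. The class and field names are hypothetical, and a real system would also fold in behavior-log signals and a time window:

```python
from collections import defaultdict

class FrequencyController:
    """Per-user frequency cap: stop serving an ad to a user once
    the impression count exceeds the cap. A sketch, not Sina's code."""

    def __init__(self, cap):
        self.cap = cap
        self.counts = defaultdict(int)  # user key -> impressions served

    @staticmethod
    def user_key(user_id, cookie_id):
        # Prefer the logged-in UserID; fall back to the CookieID.
        return user_id or cookie_id

    def allow(self, user_id, cookie_id):
        key = self.user_key(user_id, cookie_id)
        if self.counts[key] >= self.cap:
            return False  # cap reached: suppress this ad for this user
        self.counts[key] += 1
        return True
```

In practice the counts would live in a shared store with expiry rather than in process memory, so that all delivery engines see the same per-user state.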

Technical architecture and pain points

SinaX’s pre-servitization architecture

Some readers may not be familiar with the technical architecture of internet advertising, so let's start with the general structure. Almost every company's technical solution includes an ADExchange — the module responsible for ad traffic access and traffic optimization (the middle module in the diagram).

The advertising traffic coming in from the top is optimized, and once optimization is complete, bid requests are sent to the DSPs at the bottom.

The whole process resembles an auction. When traffic arrives, if it is judged to be good traffic, a bid request is sent to a set of DSPs. Each DSP prices the request based on its internal algorithms and user-behavior optimization; the ADExchange receives these offers, decides the winner, and hands the ad request to the winner. The ADExchange mainly does two things: determine the winner, and regulate the ecosystem. It does not treat all the downstream engines equally, but weights them according to the needs of the advertising ecosystem at each stage.
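As a rough illustration of the winner-determination step, here is a minimal auction over DSP bids. The second-price rule used here is a common exchange convention, not something the talk confirms about SinaX, and all names are illustrative:

```python
def run_auction(bids, floor_price=0.0):
    """Pick the winning DSP for one ad request.

    bids: dict mapping DSP name -> bid price (each DSP's own valuation).
    Returns (winner, clearing_price), or (None, None) if no bid clears
    the floor. Charges the second-highest price, a common exchange rule."""
    valid = {dsp: price for dsp, price in bids.items() if price >= floor_price}
    if not valid:
        return None, None
    ranked = sorted(valid.items(), key=lambda kv: kv[1], reverse=True)
    winner, _top = ranked[0]
    # Second price: runner-up's bid, or the floor if there is no runner-up.
    clearing = ranked[1][1] if len(ranked) > 1 else floor_price
    return winner, clearing
```

A real exchange would additionally apply the per-engine weighting described above before ranking, so that ecosystem-level priorities can override raw price.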

Early on, our ADExchange was built on the Nginx server, with the business logic — the bidding rules and traffic optimization and scheduling mentioned earlier — written in Lua inside the server.

The pain point of the old SinaX architecture

Because the ADExchange does not treat delivery engines equally, the overall traffic follows a funnel model.

As the ADExchange evolved, we carried out secondary development internally to better implement the funnel model and support the business, and we ran into many problems along the way.
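The funnel model can be sketched as a chain of filter layers, each letting through only part of the traffic; the layer names and predicates below are hypothetical:

```python
def funnel(requests, layers):
    """Run ad requests through a funnel of filter layers.

    layers: list of (name, predicate) pairs; traffic that survives
    every layer reaches the delivery engines. Also returns how much
    traffic each layer let through, which is what exposed the uneven
    leakage discussed later in this article."""
    survivors = requests
    per_layer = []
    for name, keep in layers:
        survivors = [r for r in survivors if keep(r)]
        per_layer.append((name, len(survivors)))
    return survivors, per_layer
```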

Developing features in Lua is inconvenient: requirements change frequently and demand immediate responses, and some features are cumbersome to implement with embedded Lua. Lua code is also hard to maintain; whenever staff turnover occurs, it is expensive for newcomers to read the existing code.

Third, it is difficult to monitor inside the service. With Nginx and Lua, strictly speaking, the only option is to output logs in some form; it is hard to extract state from inside Nginx and feed it to key service monitoring points. Last is the testing problem: with Nginx and Lua, running a complete test or a full gray-release test is very troublesome — generally the only option is to embed code directly, since standard instrumentation approaches cannot be used.

We originally chose Nginx and Lua because Nginx had good performance and Lua could address the performance concerns. So performance was not the bottleneck for us; the real pain point was secondary development.

The pain points of the old AD engine architecture

Below the technical architecture diagram shown earlier are a number of delivery engines, and the diagram above is a concrete implementation of one engine. In 2014, these engines were built on the Tomcat server, with functions such as user-profile lookup, order retrieval, and coarse algorithmic ranking all running on Tomcat.

The biggest problem with this architecture was the threading model. Tomcat used a blocking threading model: a thread was allocated per request and could not be released until the request completed. With our heavy daily traffic, Tomcat was very expensive and inefficient.

(Problems encountered with Tomcat)

The biggest business pain point of the Tomcat architecture was that it could not cope with sudden traffic spikes. If traffic suddenly increased severalfold, then unless the previous load was very low, the only way to relieve the pressure on Tomcat was to degrade or circuit-break — which in practice costs the company revenue.
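Degrading and circuit-breaking, the relief valves mentioned above, can be sketched as follows: a minimal circuit breaker that rejects calls after consecutive failures and lets traffic retry after a cooldown. This is a generic sketch with illustrative thresholds, not Sina's implementation:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures the circuit opens and
    calls are rejected (degraded) until `reset_after` seconds pass."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: cooldown elapsed, let traffic try again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
```

The cost the text describes is visible here: while the circuit is open, every rejected request is an ad that was never served.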

Summary of technical pain points

The technical pain points we encountered throughout 2014 can be summed up in four main points.

  • Severely monolithic. All of the advertising logic and business was implemented in a single Nginx-plus-Lua and Tomcat stack, so any change was risky.

  • Interaction between modules works in plug-in mode. "Module" here refers to the algorithms. In advertising, modules go online every day, which means plug-ins must be swapped frequently, increasing system risk.

  • The programming model does not fit the business. This refers to Tomcat's blocking thread model, where a thread must complete its task before being released, which easily clogs up the business. The only workaround was to deploy a large number of Tomcat servers regardless of cost.

  • The system's running status is not transparent.

Technical analysis and selection for the pain points

In view of the above pain points, we carried out technical analysis and selection.

Function servitization. Services are independent of one another, which reduces business coupling and shortens the response time for developing product requirements.

RPC. Inter-service communication is based on RPC. After comparing the ecosystems of the Thrift, gRPC, Dubbo, and Finagle RPC frameworks, we finally adopted Twitter's open-source Finagle RPC framework.

Monitoring. For advertising-system monitoring, two points matter most: real-time monitoring and tracking of system status, and real-time analysis and statistics of business data. The real-time state of the system is not limited to the physical machines; it also covers QPS, timeout rates, and average request time. Business data mainly concerns the advertising business itself, such as an ad's real-time click-through rate and its changes. Real-time analysis of this data must be fed back promptly to advertisers and agencies, because they can directly optimize ads that are underperforming.
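The three system signals named above — QPS, timeout rate, and average request time — can all be computed from a rolling window of request samples. A minimal sketch, with illustrative window size and field names:

```python
from collections import deque

class ServiceMonitor:
    """Rolling-window stats for one service: QPS, timeout rate,
    and average latency over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, latency_ms, timed_out)

    def record(self, now, latency_ms, timed_out=False):
        self.samples.append((now, latency_ms, timed_out))
        # Evict samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def stats(self):
        n = len(self.samples)
        if n == 0:
            return {"qps": 0.0, "timeout_rate": 0.0, "avg_ms": 0.0}
        return {
            "qps": n / self.window,
            "timeout_rate": sum(1 for s in self.samples if s[2]) / n,
            "avg_ms": sum(s[1] for s in self.samples) / n,
        }
```

In a servitized deployment each service would export these numbers to a central collector instead of computing them only locally.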

The servitization journey of the Sina advertising system

Technology trade-offs

1. Service division

In the initial stage of servitization, we believed that each filtering layer of the funnel model could be treated as a service. This was not only natural in business logic but also convenient for development and for adding new features: adding a product feature would only require changing the corresponding service.

However, subsequent prototype testing showed that some services were called too frequently by their upstreams and downstreams. The funnel model looks clean in the abstract, but in practice traffic does not leak evenly through the layers — a given layer may filter out very little. The most direct impact of frequent upstream and downstream calls is excessive bandwidth consumption and a high I/O share of processing time.

We have since revised the original prototype design, and can now automatically merge services based on traffic characteristics.

(Architecture diagram after refactoring)

In the new architecture, traffic first goes through RB and PB, meaning the most expensive processing sits at the top, with filtering proceeding downward. As you can see, some traffic goes directly to the engines, bypassing the intermediate links and improving I/O and efficiency.

In the process of servitizing the ADX, our biggest lesson was that service division should not follow business logic alone; only prototype testing with real traffic reveals the division that is actually needed.

2. Advertising retrieval

(AD engine service architecture)

The advertising system goes through the following stages. All traffic, after passing through the ADX, leaks down to a specific delivery engine. Inside the delivery engine, a dedicated service queries the user profile; based on the profile, a pile of candidate delivery orders is retrieved into a candidate set, which then interacts with the algorithms to determine the most suitable orders. After the ad is delivered, a billing service handles billing. All services connect to the Log Service for log processing.

We spent a lot of time on the Index Service in the overall technical trade-off. The initial plan was to make the Index Service a standalone service, but as QPS and data volume grew, the scalability of the Index Service suffered. Later, we decided instead to push the data set periodically into the delivery service via rsync, which solved the problem of large data volumes transmitted per call that had been limiting the system's scalability.
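The rsync-push approach can be sketched from the delivery side: rather than calling an Index Service per request, the engine reads a locally pushed snapshot and reloads it only when the file changes. Path, file format, and class names are hypothetical:

```python
import json
import os

class LocalIndex:
    """Delivery-side view of the ad index. A periodic rsync job pushes
    a snapshot file to the machine; we reload it only when its mtime
    changes, so per-request cost is a local dict lookup."""

    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.orders = {}

    def refresh(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:       # a new snapshot has arrived
            with open(self.path) as f:
                self.orders = json.load(f)
            self.mtime = mtime

    def lookup(self, order_id):
        self.refresh()
        return self.orders.get(order_id)
```

The trade-off is staleness: the index is only as fresh as the push interval, which is acceptable for order data that changes on human timescales.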

Architecture with high reliability and scalability

1. System fault tolerance

When considering high reliability and scalability, we believe one of the most important issues to address is system fault tolerance, a classic concern of distributed systems. Our approach to system fault tolerance has three main points. First, design the system for errors — not only communication errors, but also physical-machine crashes and other unpredictable failures. Second, make services stateless. Third, automatically replace faulty services.

2. System reliability

We learned three main things about system reliability during the architecture process.

  • Track and monitor the program's running status and hot spots. Monitoring health status is easy to understand, but tracking hot spots matters because previously undetected hot spots in a service can emerge over time, and if they are not tracked promptly, an unexpected state can turn into a disaster immediately.

  • Monitor system exception handling in real time and isolate exception services directly.

  • Multiple backups for different service levels. Fault tolerance and high reliability may require deploying multiple replicas of the same service, which also means increased cost. In the advertising system we classify services by level: for first-level services we keep N replicas, while for low-level services we keep only one or two replicas to control cost.
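The tiered-backup policy above can be sketched as a simple mapping from service tier to replica count; the tiers and counts here are illustrative, with N = 4 standing in for the first-level replica count:

```python
def replica_plan(services, default=1):
    """Map each service to a replica count based on its tier.
    services: dict of service name -> tier (1 = most critical).
    Tier-1 services get N replicas; lower tiers get fewer to save cost."""
    counts = {1: 4, 2: 2, 3: 1}  # tier -> replicas (N = 4 here)
    return {name: counts.get(tier, default)
            for name, tier in services.items()}
```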

3. Scalability of system processing capability

When building a system, developers often assume that adding servers will improve the system's processing power. That is not necessarily true: you cannot expect to increase a system's overall processing power simply by adding physical machines and computing resources.

Only when the processing capability of a single-node service scales linearly can the expansion of the overall system's processing capability be guaranteed.

The key to expanding the system's processing capacity lies in resource allocation, and the key point of a distributed system lies in scheduling control. In the past, scheduling control might only concern a few requests, but now the emphasis is on reasonable allocation of resources. For advertising, traffic peaks vary dramatically: traffic is generally light in the early hours and peaks in the morning and again at noon. In this situation, dynamic resource allocation can greatly reduce waste.

In the past we thought the ad request was the most important thing to focus on, but with the first two forms of scalability in place, the input is no longer really the issue — the system's data hotspots are. They are often the system's bottleneck: for example, when data consistency cannot be guaranteed, we can only raise the consistency level for certain businesses, and that easily creates data hotspots.

When designing the system, it is important to check whether a data hotspot persists in the context of the business, and if so, to degrade it. The system is scalable only when monitoring or system analysis shows that the hotspot does not remain a hotspot over time.

4. Scalability of the system architecture

For the scalability of the system architecture, we referred to the Lambda architecture pattern, which divides the system into a batch layer, a serving layer, and a speed layer, so that offline data and real-time data can be combined to serve the system.

In fact, the advertising system emphasizes real-time data feedback — model computation, delivery effectiveness, user profiles, and so on; relying on offline data alone would not work well. The second point is data consistency. Since the data volume inside the ad system is not very large, the simplest way to ensure consistency is a hierarchical approach.

The third point involves consistent hashing. It can recover a failed node, but it also loses traffic. Therefore, we removed the consistent-hash allocation of all incoming requests and now distribute traffic directly based on the load capacity of the downstream services.
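Load-based distribution in place of consistent hashing can be sketched as weighted random selection proportional to each backend's spare capacity; the field names are illustrative:

```python
import random

def pick_backend(backends, rng=random.random):
    """Route a request to a backend in proportion to its spare capacity
    (capacity - load), rather than by hashing the request.

    backends: list of dicts with "capacity" and "load" keys.
    Returns the chosen backend, or None if everything is saturated
    (in which case the caller should degrade)."""
    weights = [(b, max(b["capacity"] - b["load"], 0)) for b in backends]
    total = sum(w for _, w in weights)
    if total == 0:
        return None
    r = rng() * total
    for backend, w in weights:
        if r < w:
            return backend
        r -= w
    return weights[-1][0]  # guard against floating-point edge cases
```

Unlike consistent hashing, this gives up request affinity (the same request may land on different nodes), which is acceptable once the services are stateless, as the fault-tolerance section above requires.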

(System optimization effect)

Technical harvest and experience summary

Grasping the right granularity of service division

To grasp the right granularity of service division, first be clear that services should not be divided simply according to business logic. Second, the core of service division is identifying the real business center and business support. Third, there is no uniform standard throughout the process — the key is the trade-off.

Service optimization focuses on the optimization of business logic

Developers generally assume that servitization will solve system problems, but in reality the system's hot spots do not disappear naturally with servitization.

Therefore, in the process of service-oriented optimization, the emphasis must be on optimizing business logic, because all servitization exists to realize business logic. Only with a deep understanding of the business logic can service optimization be considered purposefully, rather than purely from a technical perspective. The principle we believe must be followed throughout servitization is that the technical implementation is only the means.

Services need to be smoothly switched online

To servitize an existing system means that two systems must be switched smoothly online at some point. During development, one of our colleagues in the development department said that servitizing the system was like changing an aircraft's engine mid-flight.

There are two key points in this high-risk process. First, during the gray-release stage the new and old systems must run in parallel and remain compatible with each other. Second, the impact of each individual function on the overall system must be considered.

Service governance

For service governance, we believe it is not necessary to follow its principles to the letter; instead, we approach governance from the perspective of distributed systems — data consistency, fault tolerance, and so on. And we made some distinctions accordingly.

The first is automatic degradation and recovery of service processing capacity. The second is service protection, including load balancing and circuit breaking. The last is transparent operation of services. In practice, service governance involves a great many services, all of which need monitoring and tracking; doing this manually or with scripts is inefficient and not intuitive.

That’s all for sharing, thank you!