How to integrate new businesses flexibly and efficiently?

Platform

• Build platforms, not projects: build a "Taobao" rather than a website for just a few businesses
• Abstract and generalize from the business: if a piece of business logic is likely to recur, modularize it, turn it into a system (e.g., a batch system), and develop it into a platform capability

Dynamic

• Dynamic processes: the process for each business type can be adjusted at will without code changes
• Dynamic code: Groovy scripts adjust online logic without shipping a new release. Rules can be configured through the flexible preset options or scripted in Groovy, and indicator functions are written in Groovy too, so a change does not require a release every time.
• Dynamic configuration: consider virtual tables. A virtual table stores the structure of any table in one unified table structure, which makes configuration dynamic. The idea is similar to a NoSQL document store.
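The virtual-table idea can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the class `VirtualTableStore`, the cell layout `(table, row_id, column, value)`, and the table names are all assumptions made up for the example, and an in-memory list stands in for the unified physical table.

```python
from collections import defaultdict


class VirtualTableStore:
    """Stores rows of any logical table in one unified structure (hypothetical sketch)."""

    def __init__(self):
        # The single physical "table": one entry per cell,
        # shaped as (table_name, row_id, column, value).
        self._cells = []
        self._next_id = defaultdict(int)

    def insert(self, table, row):
        """Add a row to a logical table; no schema has to be declared first."""
        self._next_id[table] += 1
        row_id = self._next_id[table]
        for column, value in row.items():
            self._cells.append((table, row_id, column, value))
        return row_id

    def rows(self, table):
        """Rebuild the rows of one logical table from the unified cells."""
        grouped = defaultdict(dict)
        for name, row_id, column, value in self._cells:
            if name == table:
                grouped[row_id][column] = value
        return [grouped[row_id] for row_id in sorted(grouped)]


store = VirtualTableStore()
store.insert("channel_limits", {"channel": "app", "max_tps": 500})
store.insert("blacklist", {"ip": "10.0.0.1", "reason": "test"})
# A new column appears without any structural change to the store.
store.insert("channel_limits", {"channel": "web", "max_tps": 200, "enabled": True})
```

Because every logical table shares one physical layout, adding a table or a column is a data change rather than a schema change, which is what makes the configuration dynamic.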

How can I reduce response time and improve throughput?

Use storage and caching wisely

• Configuration data is loaded into local memory
• Data that is accessed frequently and repeatedly goes into Redis
• Massive data with high stability requirements goes into HBase
• Detailed data that needs to be searched quickly goes into Elasticsearch (ES)
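One way to combine these tiers is to try the fastest storage first and fall back to slower ones. The sketch below assumes a read-through policy that the text does not spell out; `TieredReader` is a hypothetical name, and plain dictionaries stand in for local memory, Redis, and HBase.

```python
class TieredReader:
    """Reads from the fastest tier that has the key (hypothetical sketch)."""

    def __init__(self, local, cache, store):
        self.local = local      # stand-in for configuration held in process memory
        self.cache = cache      # stand-in for Redis
        self.store = store      # stand-in for HBase
        self.store_reads = 0    # counts how often the slowest tier is hit

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            # Assumed policy: populate the cache so repeat reads stay fast.
            self.cache[key] = value
        return value


reader = TieredReader(
    local={"rule.threshold": 3},
    cache={},
    store={"user:1": "profile"},
)
first = reader.get("user:1")    # served by the slow tier, then cached
second = reader.get("user:1")   # served by the cache
```

A real system would also need expiry and invalidation for the cached entries, which this sketch leaves out.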

As the figure below shows, read times differ greatly between storage types. Make good use of each type, and serve each read from the fastest storage that fits.



The following figure shows a benchmark performance test of HBase. Do not overlook HBase: it can serve massive data sets with very short response times, which makes it a powerful tool for improving the performance of risk control systems. The most important accumulated data in the current risk control system is accessed through HBase.



Asynchronous

• At the system architecture level, make whatever can be asynchronous asynchronous, but do not abuse asynchrony

Here is a practical example. During a stress test, CPU sy and wa were very high, which usually means there are too many threads and CPU time is being wasted on them. On inspection, the three external calls made from asynchronous threads each took a long time, so the branch threads waited too long. Many threads were tied up waiting on IO, and thread switching was frequent.



Using dynamic process configuration, the three external calls in the main system were merged into one, and the remaining two calls continue after being decoupled through Kafka. As a result, sy and wa dropped sharply and the system was no longer overwhelmed.



Single-machine TPS: 2644.6 -> 3079

Average single-machine response time: 149.3 -> 126.03


• Asynchronous log printing: log4j2 in all-async mode greatly improves throughput
The impact of logging on TPS cannot be ignored. When I disabled all log printing, system TPS rose from 3000 to 4200.
Without logs, however, an online system cannot be operated or maintained. In a risk control system, logs are a very important troubleshooting tool. Log4j2 is designed for high-volume logging: its all-async mode makes printing fully asynchronous and uses the Disruptor as the intermediate queue to speed up printing.

Single-machine TPS: 3079 -> 3686.1

Average single-machine response time: 126.03 -> 79.35
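The principle behind asynchronous logging is that the business thread only enqueues the log event, and a separate thread does the slow writing. The sketch below illustrates that hand-off with Python's standard `QueueHandler` and `QueueListener`. It is an analogy, not log4j2 configuration: a plain queue replaces the Disruptor, and `ListHandler` and the logger name are made up for the example.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for a slow sink such as a file appender; collects messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log_queue = queue.Queue(-1)   # unbounded hand-off queue between threads
sink = ListHandler()
# The listener thread takes records off the queue and passes them to the sink.
listener = logging.handlers.QueueListener(log_queue, sink)

logger = logging.getLogger("risk.async_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

listener.start()
for i in range(3):
    # Returns as soon as the record is enqueued; no sink I/O on this thread.
    logger.info("decision %d made", i)
listener.stop()   # processes everything still queued, then joins the thread
```

The trade-off is the same in any implementation: the caller gets lower latency, and in exchange log events can be lost if the process dies before the queue is drained.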



• Reduce the number of threads, and with it system CPU time, by making network calls asynchronous: a Netty client application
To protect the throughput and execution time of the main thread, network calls often have to be made asynchronous. Some important asynchronous network calls still occupy a large number of threads in the thread pool. With many threads, sy stays high, which not only wastes CPU but can also make overall TPS collapse.
An NIO-based Netty client solves this problem. In the figure below, the left side shows one thread per connection: a large number of threads sit waiting, which drives up sy and wa. With a client built on the Netty framework, the connection threads and business callback threads are limited to a small number that stay busy, instead of spending their time in sy and wa.
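The effect of non-blocking IO on thread count can be shown without Netty. The sketch below swaps in Python's `asyncio` event loop as an analogy: one thread drives 50 concurrent waits. `call_external` is a hypothetical stand-in for an external network call, with `asyncio.sleep` simulating the network latency.

```python
import asyncio
import threading
import time


async def call_external(name, delay):
    """Stand-in for a non-blocking network call."""
    # Awaiting hands the thread back to the event loop instead of blocking it.
    await asyncio.sleep(delay)
    return name, threading.get_ident()


async def main():
    started = time.perf_counter()
    calls = (call_external(f"svc-{i}", 0.1) for i in range(50))
    results = await asyncio.gather(*calls)
    return results, time.perf_counter() - started


results, elapsed = asyncio.run(main())
# Every call ran on the same thread, yet the waits overlapped.
thread_ids = {thread_id for _, thread_id in results}
```

With one blocking thread per call, the same workload would need 50 threads that spend almost all of their time waiting. Here a single thread handles all of them, and the total time stays close to one call's latency rather than the sum of all 50.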





• Use thread pools to keep your system stable
Thread pools are actually an important way to keep your system stable, and it’s important to keep resources in a manageable range rather than overwhelming the machine by adding resources indefinitely.
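A minimal sketch of keeping resources in a manageable range: a pool with a fixed number of workers that rejects new work once too many tasks are in flight, instead of queueing without limit. `BoundedExecutor` and its parameters are hypothetical names for this example, and the reject-on-saturation policy is an assumption.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Thread pool that rejects work once max_pending tasks are in flight."""

    def __init__(self, max_workers, max_pending):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, fn, *args):
        # Fail fast rather than letting the backlog grow without bound.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("executor saturated, task rejected")
        try:
            future = self._pool.submit(fn, *args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot when the task finishes, whether it succeeds or fails.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)


gate = threading.Event()
executor = BoundedExecutor(max_workers=2, max_pending=3)
futures = [executor.submit(gate.wait) for _ in range(3)]   # fills every slot
try:
    executor.submit(gate.wait)
    rejected = False
except RuntimeError:
    rejected = True
gate.set()            # let the blocked tasks finish
executor.shutdown()
```

Rejecting the fourth task is what keeps the machine stable under overload: the caller gets an immediate error it can handle, while the tasks already accepted still complete.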


How to deal with big data?

Incremental thinking

• Problem: The result table has to be computed from the original table. Because incremental results cannot be reused, the full result table must be recomputed every day, and the computation is huge (182 hours).
• Solution: Replace the daily full computation with two steps: compute each increment into a detail table, then compute the result table in full from the detail table (in practice the first run is slow, and each subsequent run takes only a few hours).
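The two-step approach can be sketched as follows. This is an illustration under assumptions: the detail table is modeled as per-day counts per account, the result table as totals per account, and the function and field names are made up. The source does not describe the actual tables.

```python
from collections import Counter

# The "detail table": day -> compact per-account counts for that day.
detail = {}


def ingest_increment(day, raw_events):
    """Step 1: aggregate only the new day's raw events into the detail table."""
    detail[day] = Counter(event["account"] for event in raw_events)


def build_result():
    """Step 2: full computation, but over the small detail table, not raw data."""
    total = Counter()
    for counts in detail.values():
        total.update(counts)
    return dict(total)


ingest_increment("day-1", [{"account": "a"}, {"account": "b"}, {"account": "a"}])
ingest_increment("day-2", [{"account": "a"}])
result = build_result()
```

The expensive pass over raw data happens once per increment. The full recomputation still runs every time, but its input is already aggregated, which matches the pattern of a slow first run followed by much shorter ones.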



Querying massive association relationships

• Problem: In association queries, a few simple associations often lead to a very large amount of data. How can this data be processed, sorted, and paginated for manual investigation?

• Solution: Query ES multiple times and cache the paging information in Redis. The algorithm is simple, but many problems come up in practice: for example, a single IP can be associated with a massive amount of data, queries time out, and each subsequent query grows larger. Some tips: prune data early; restrict business queries, for example with limits on query time; and in multi-step ES queries, pass a batch of data as the query criteria in a single request. The distributed top-K problem is an interesting one. It is covered in explanations of how ES works and is worth studying for those who are interested.
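One possible reading of "cache the paging information" is to run the expensive search once, prune and sort the hits, and serve every later page from the cached ordering. The sketch below assumes that reading. `PagedAssociationQuery` is a hypothetical name, a dictionary stands in for Redis, and a plain function stands in for the ES query.

```python
class PagedAssociationQuery:
    """Runs the search once per key and serves pages from a cached ordering."""

    def __init__(self, search, page_size, max_results):
        self._search = search            # stand-in for the expensive ES query
        self._cache = {}                 # stand-in for Redis: key -> sorted ids
        self._page_size = page_size
        self._max_results = max_results  # early pruning bound
        self.search_calls = 0

    def page(self, key, page_no):
        if key not in self._cache:
            self.search_calls += 1
            hits = sorted(self._search(key), key=lambda hit: hit[1], reverse=True)
            # Prune early: keep only what an investigator could realistically view.
            self._cache[key] = [doc_id for doc_id, _ in hits[: self._max_results]]
        start = (page_no - 1) * self._page_size
        return self._cache[key][start : start + self._page_size]


def fake_search(ip):
    """Hypothetical data: (document id, score) pairs associated with an IP."""
    return [("o1", 0.2), ("o2", 0.9), ("o3", 0.5), ("o4", 0.7), ("o5", 0.1)]


query = PagedAssociationQuery(fake_search, page_size=2, max_results=4)
page_one = query.page("10.0.0.1", 1)
page_two = query.page("10.0.0.1", 2)
```

Paging through the results then costs one search in total rather than one per page. A production version would also need an expiry on the cached entries.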





How to maintain system stability?

Rate limiting

• When traffic is heavy during a promotion, push a flag to the rate-limiting switch of a business channel to limit its traffic
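A switch-controlled limiter can be sketched with a token bucket. The token bucket is an assumed algorithm chosen for illustration, since the text does not say how traffic is limited. `ChannelLimiter`, its parameters, and the injectable clock are hypothetical.

```python
import time


class ChannelLimiter:
    """Token bucket for one channel; it limits only while the switch is on."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity
        self.enabled = False      # the switch flag pushed by operations
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self):
        if not self.enabled:
            return True           # switch off: all traffic passes
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


fake_now = [0.0]   # a controllable clock keeps the example deterministic
limiter = ChannelLimiter(rate=2, burst=2, clock=lambda: fake_now[0])
before_switch = [limiter.allow() for _ in range(5)]
limiter.enabled = True
after_switch = [limiter.allow() for _ in range(3)]
```

Keeping the limiter in place but disabled means that turning on protection during a promotion is a configuration push, not a deployment.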

Degradation

• During peak periods, stop some operational query workloads to reduce the load on the data system, and schedule those queries to resume around midnight
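Degradation of this kind can be sketched as a gate that defers non-critical queries while a flag is set and replays them later. `OperationalQueryGate` and its methods are hypothetical names, and in a real system the replay would be triggered by a scheduler rather than called by hand.

```python
class OperationalQueryGate:
    """Defers non-critical queries while the system is in degraded mode."""

    def __init__(self):
        self.degraded = False     # switched on during peak periods
        self._deferred = []

    def submit(self, run_query, *args):
        if self.degraded:
            # Record the query instead of loading the data system now.
            self._deferred.append((run_query, args))
            return None
        return run_query(*args)

    def run_deferred(self):
        """Replay everything that was postponed, e.g. from an off-peak job."""
        pending, self._deferred = self._deferred, []
        return [run_query(*args) for run_query, args in pending]


gate = OperationalQueryGate()
immediate = gate.submit(len, "abc")     # normal mode: runs right away
gate.degraded = True
postponed = gate.submit(len, "abcd")    # peak period: deferred, returns None
gate.degraded = False
replayed = gate.run_deferred()
```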

Contingency plan

• Prepare a contingency plan before every major promotion. Planning means preparing for extreme failure scenarios so that a failure does not catch the team unprepared