On August 31, 2019, the OpenResty community held the OpenResty × Open Talk national touring salon in Chengdu. Yin Jifeng, a former engineer in the infrastructure department of Beike, shared "Using OpenResty to Build High-Performance Web Applications" at the event.

OpenResty × Open Talk is a national touring salon initiated by the OpenResty community. It invites experienced OpenResty technical experts to share their hands-on experience, promote communication and learning among OpenResty users, and drive the OpenResty open-source project forward.

Yin Jifeng, a former engineer in Beike's infrastructure department, is a multilingual enthusiast who prefers asynchronous and functional programming and loves prototyping. At Beike he successively built the WebBeacon service, an image-processing service, and others with OpenResty.

Here’s the full text:

Today I would like to introduce a relatively niche use of OpenResty: using OpenResty as a web framework to write services. I hope it brings you something new.

Why build high-performance services

First, a simple definition: a high-performance web service is one that handles more than 10,000 QPS. In my opinion, a good service is not optimized into existence; the architecture determines its performance baseline, and premature optimization is the root of all evil.

Everyone knows that a web tier can usually just scale horizontally, so why bother with high performance? In fact, some services are not suited to horizontal scaling, such as stateful services. Taking inventory of the services I used over the past few years, I found that quite a few of them are stateful, database-like services:

  • Redis, used as a unified, centralized cache
  • Kafka, a high-performance queue
  • Traditional relational databases such as MySQL and Postgres
  • Emerging non-relational databases such as MongoDB and Elasticsearch
  • Embedded databases such as SQLite3, H2 and BoltDB, which live inside the service together with its configuration
  • Time-series systems such as InfluxDB, Prometheus and so on

All of the services above are stateful. Expanding or shrinking their capacity is not simple; scaling is generally done manually, through sharding.

A company may run hundreds or thousands of machines, but basic services should only account for a small proportion of them. We therefore have high performance requirements for general-purpose basic services such as the gateway, logging, tracing and other monitoring systems, as well as API and Session/Token validation for company users.

In addition, horizontal scaling has its limits: as machines are added, the QPS the cluster provides does not grow linearly.

The benefits of high performance, in my opinion, are as follows:

  • Low cost, simple operations: scaling stops being a sensitive issue, because one machine can carry the load of many;
  • Easy traffic scheduling: fewer machines make it easier to distribute traffic, and faster deployment means faster rollback; blue-green deployment is used when versions are completely incompatible or when an extremely smooth traffic migration is needed;
  • Simplified design: in-process caching becomes more effective, and simple A/B tests can be done at the machine level;
  • And, not least, "the self-cultivation of a programmer".

How to build a high-performance service

The vast majority of web applications are IO-intensive, not CPU-intensive. You may not have an intuitive sense of how fast a CPU is, so here is an analogy: π seconds is roughly a nanocentury. People perceive the world in seconds, while the CPU works in nanoseconds. For a CPU, 3.14 seconds is like a human century, and one second corresponds to roughly "30 years". A response of 1-2 milliseconds feels fast to us, but for the CPU it is "around 20 days".

And that is just a single core; on top of it there are multiple cores. Faced with such enormous CPU performance, how do we make the most of it? This question gave rise to a programming pattern that has spread since around 2000: the asynchronous model, also known as the event-driven model. Asynchronous, event-driven programming turns blocking, slow IO operations into fast CPU operations, driving the entire application with a completely non-blocking event-notification mechanism.

I first encountered asynchronous programming in 2012 with Python Tornado at Sohu's mobile site, and since then I have done asynchronous programming in several languages and frameworks. In my view, the synchronous model means "thread pool + frequent context switches + big locks for sharing data between threads": the system has to be tuned against its current load, adjusted bit by bit, which is very difficult.

The asynchronous model supports high concurrency because it runs on an event loop that cycles round and round. If two requests arrive at the same time they may fire at the same time, and if one request does not yield the CPU promptly it will affect the other; it trades potential latency for higher concurrency. High performance is essentially "asynchrony + caching": asynchrony solves the IO-intensive problems, caching solves the CPU-intensive ones.

The major asynchronous languages and frameworks currently on the market include:

  • C and C++, in which you don't usually write web applications;
  • PHP, with Swoole, whose ecosystem is relatively weak;
  • Java, which has gained popularity again in recent years thanks to Spring Cloud and its Gateway, but whose asynchronous programming is still fairly primitive: you are still composing Then/OnError callbacks;
  • NodeJS: any talk about asynchrony is inseparable from JS and NodeJS. The JS ecosystem is an asynchronous ecosystem with no synchronous blocking at all. Over many years it evolved from callbacks, to Promise libraries, to generators, and finally to async/await, which makes asynchronous code convenient to write; async/await is now the asynchronous style most widely recognized and recommended across the industry;
  • Python, which went from yield as a generator, to yield + send as a coroutine, to yield from for delegating to sub-generators, and in the 3.x versions has incorporated asyncio into the standard library;
  • Rust, a new language, which chose a pull model for asynchronous execution because of the particularities of its memory model. async/await was expected to stabilize in 1.39.0 later this year as an MVP (minimum viable product), with Tokio, the de-facto official runtime, stabilizing within three to six months, so the first half of 2020 could see a surge in Rust for writing web applications;
  • Golang, which uses goroutines to do the switching; it is different from the others in that the code looks synchronous, but underneath the runtime schedules it asynchronously.

So why learn OpenResty? First of all, this is limited to 2015: the practice I am going to talk about today was done in 2015-2016, so it may not look so new today. But in 2015 none of the asynchronous ecosystems above, except JS, was particularly mature, so OpenResty was indeed a good asynchronous framework at that time.

What is OpenResty

Nginx

Nginx is a first-class reverse proxy server with a sound asynchronous open-source ecosystem. It has become a de-facto standard and is already in the technology stack of first-tier Internet companies. So introducing OpenResty on top of Nginx is a very low-risk move: just by introducing a module, you reuse the learning investment you have already made.

In addition, Nginx has a natural architectural design:

  • Event-driven
  • Master-worker model
  • A built-in URL router: the location rules in the config are themselves an efficient router, so there is no need to pick a routing library
  • Processing phases: the phases a request passes through, which you only really appreciate once you use OpenResty

Lua

Lua is a small, flexible programming language that is naturally C-friendly and supports coroutines. It also has a very efficient implementation, LuaJIT, which provides a very high-performance FFI interface; according to benchmarks, calling C functions through the FFI can be faster than hand-written Lua/C API bindings.

OpenResty

OpenResty is Nginx + Lua (JIT), a perfect combination of the two: the asynchronous Nginx ecosystem made programmable with Lua. Lua itself is not an asynchronous ecosystem; the reason Yichun Zhang (agentzh) built OPM for package management is that LuaRocks is a synchronous ecosystem. That is also why many OpenResty packages start with lua-resty-: the prefix signals that the library is asynchronous and can be used on OpenResty.

OpenResty application practice: WebBeacon

WebBeacon overview

WebBeacon is a tracking ("buried point") service: it logs HTTP requests for subsequent data analysis, from which metrics such as session count, visit duration and bounce rate are calculated.

dig.lianjia.com is the front end of the WebBeacon service. It is responsible for data collection, not for data processing or calculation: it receives HTTP requests and returns a 1×1 GIF image. On receiving a request, it writes a log in an internal format called UAL (Universal Access Logging), a unified, company-wide access-log format, and the information related to the HTTP request is routed to Kafka for real-time or offline statistics. That is the overview of the whole service.

There was already a version when we took over. The first version was implemented in PHP and did not perform very well: FastCGI + PHP received a request and wrote a log line to a file, then rsyslog's imfile input module read the log file and pushed it to Kafka through the Kafka output module.

So why did PHP perform poorly here? Under PHP's request model you don't need to worry about releasing resources to prevent memory leaks: unless something is initialized in an extension, everything is set up and torn down per request. The log file was therefore opened at the beginning of each request, written to, and automatically closed at the end of the request, over and over. Opening and closing a file just to write a single log line causes poor performance.

Challenges

  • For a long time, the only raw production data of Lianjia's big-data department came from this service. Without it there would be no subsequent statistical calculations or reports, so for the downstream consumers this service is of very high importance.
  • High throughput: all PC sites, mobile sites, Android, iOS, mini programs and even intranet services report to this service, so the throughput requirements are very high.
  • The service carries real business logic: it has to fetch information from different places, do some conversions, and finally persist the result, so it needs something flexible and programmable.
  • Poor resource isolation. At the time, all of Lianjia's services were deployed mixed together, isolated only by the operating system, so multiple services ran on the same machine and contended for CPU, memory, disk, and so on.

Principles to adhere to

  • Maximize performance: the higher, the better;
  • Avoid (hard) disk dependencies: because of the mixed deployment, we hoped to keep the critical path free of any disk dependency from start to finish;
  • Respond to the user instantly: the core of a WebBeacon service is the data, so the user can be answered right away without waiting, and the data work can be done in the background without holding up the request. We wanted a mechanism that takes a request, responds immediately, and handles the business logic afterwards;
  • Keep the cost of refactoring as low as possible: the project already had a version 1, so this was a refactor, in fact a rewrite. We wanted it to take as little time as possible, to inherit all the features of the original, and to add new features of its own.

The main logic

The diagram above is the phase diagram of OpenResty. Anyone who has written OpenResty programs knows it is very important: it is the core flow of OpenResty. Usually the content-generation phase is handled by an upstream, with balancer_by_lua* doing the balancing; but when building a web application you can write content_by_lua* directly, because you can produce the response yourself. At the same time, you can collect data in the access and header-filter phases and deliver it in the log_by_lua* phase, so that the user gets an immediate response.

content_by_lua

content_by_lua is really quite simple: it just mindlessly spits out fixed content:

As shown in the figure above, we declare Content-Length: 43; without it the response would default to chunked transfer encoding, which makes no sense in our scenario. The \z escape is a Lua 5.2 (and LuaJIT) long-string feature that skips the newline it follows and any whitespace after it, so the fixed GIF bytes can be written across several lines and concatenated back together.
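A minimal sketch of what this content phase can look like (the location name and the byte literal below are illustrative, not the production config):

    location = /dig.gif {
        content_by_lua_block {
            ngx.header["Content-Type"]   = "image/gif"
            ngx.header["Content-Length"] = 43   -- fixed length, avoids chunked encoding
            -- a 43-byte transparent 1x1 GIF written as decimal escapes; with \z
            -- the literal could span lines without embedding the line break
            ngx.print("GIF89a\1\0\1\0\128\0\0\0\0\0\255\255\255\33\249\4" ..
                      "\1\0\0\0\0\44\0\0\0\0\1\0\1\0\0\2\2\68\1\0\59")
        }
    }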

access_by_lua

We did a few things in the access_by_lua phase:

The first is parsing cookies. To record and issue cookies, we use cloudflare/lua-resty-cookie to parse them.

We then generate a UUID to identify the device, the request, and so on. We use OpenSSL's RAND_bytes through the FFI to generate 16 random bytes (128 bits), convert them to hex with C.ngx_hex_dump, and then slice the hex string piece by piece into the UUID layout. Because we generate UUIDs heavily, we want this to perform as well as possible.
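A hedged sketch of that idea, using the lua-resty-string wrappers around RAND_bytes and ngx_hex_dump rather than hand-written FFI bindings (the real implementation may bind OpenSSL directly):

    local resty_random = require "resty.random"   -- wraps OpenSSL RAND_bytes
    local resty_string = require "resty.string"   -- to_hex() uses ngx_hex_dump

    local function gen_uuid()
        -- 16 random bytes = 128 bits; "true" asks for cryptographically strong bytes
        local bytes = resty_random.bytes(16, true) or resty_random.bytes(16)
        local hex = resty_string.to_hex(bytes)    -- 32 hex characters
        -- slice into the 8-4-4-4-12 UUID layout
        return table.concat({
            hex:sub(1, 8), hex:sub(9, 12), hex:sub(13, 16),
            hex:sub(17, 20), hex:sub(21, 32),
        }, "-")
    end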

Besides the UUID there is the SSID, the session ID; the number of sessions is one of the recorded metrics. In a WebBeacon, a session is a continuous visit by a user within 30 minutes: as long as the user keeps visiting, it counts as the same session; once the cookie expires after 30 minutes, a new cookie is generated, that is, a new session. Note also that sessions do not cross natural days: a user who arrives at 11:40 PM only gets a 20-minute cookie, which guarantees the session does not span two days.
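A sketch of that expiry rule (the midnight cutoff below is one reading of "does not cross natural days"; the production rule may differ in detail):

    -- session cookie lives 30 minutes, but never crosses the natural day
    local function session_expires_at(now)
        now = now or ngx.time()
        local in_thirty_minutes = now + 30 * 60
        local d = os.date("*t", now)
        -- first second of the next local day
        local next_midnight = os.time({
            year = d.year, month = d.month, day = d.day,
            hour = 23, min = 59, sec = 59,
        }) + 1
        return math.min(in_thirty_minutes, next_midnight)
    end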

In addition to all of this, we have one more important piece of business logic.

Every request a mobile browser sends has a cost, so we arranged for the logs collected on the phone to be packaged and uploaded together: the logs from multiple tracking points are batched, and the upload is triggered when the user presses the home button or sends the app to the background. This keeps the traffic overhead very low. It also means we have to parse the body and break one POST request into N different tracking logs before persisting them. Here is what we did (a sketch follows the list):

  • Limit the maximum body size to 64 KB, because we don't want the body to spill to disk. Since the payload is gzip-compressed, 64 KB may correspond to the 400-500 KB of raw data we saw before, which is enough for a batched upload;
  • The client encodes, so the server has to decode: gunzip with zlib, URL-decode, json_decode, then split the single request into a list of N items and re-encode each item into a tracking log to be persisted;
  • Use table.new(#list, 0) / table.new(0, 30) / table.clear to pre-allocate and reuse tables, then json_encode, ngx.escape_uri and so on to form each individual log line;
  • If anything goes wrong along the way, degrade: write the escaped raw body into the log (log_escape), so the data can still be recovered later.
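A hedged sketch of that pipeline (the gzip binding, helper names and error handling below are assumptions; the production code may use different libraries):

    local cjson = require "cjson.safe"
    local zlib  = require "zlib"                -- assumed lua-zlib binding

    local ok_new, new_tab = pcall(require, "table.new")
    if not ok_new then new_tab = function() return {} end end

    local function split_batch()
        ngx.req.read_body()
        local raw = ngx.req.get_body_data()     -- nil if the body did not fit in memory
        if not raw then return nil, "no in-memory body" end

        -- the client gzips and URL-encodes a JSON array of tracking events
        local ok, inflated = pcall(zlib.inflate(), raw)
        if not ok or not inflated then return nil, "bad gzip" end

        local list = cjson.decode(ngx.unescape_uri(inflated))
        if type(list) ~= "table" then return nil, "bad json" end

        local logs = new_tab(#list, 0)          -- pre-allocate, avoid repeated growth
        for i, event in ipairs(list) do
            -- each event becomes one tracking-log line
            logs[i] = ngx.escape_uri(cjson.encode(event))
        end
        return logs
    end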

header(body)_filter_by_lua

The other piece of logic is collecting and summarizing fields.

Why not do this together with access_by_lua? Because during stress testing we found that one operation would coredump in the access_by_lua phase under high concurrency. It may also be that we were using something in access_by_lua in a way it is not meant to be used there.

So we moved the things that are not closely tied to the current request into header_filter_by_lua. We take the client IP from X-Forwarded-For and the UCID from the LIANJIA_TOKEN cookie. The UCID is a long string of digits; a Lua number is a double and cannot represent such a 64-bit value exactly, so we use the FFI to create a 64-bit integer, apply a series of transformations, and then print it and cut out the 20-odd-digit UCID we need.
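A minimal sketch of the 64-bit trick (the digit parsing and formatting here are illustrative, not the actual token-handling code):

    local ffi = require "ffi"

    -- Keep the id in a uint64_t cdata so no precision is lost the way it
    -- would be if the digits were converted to a Lua double.
    local function ucid_tostring(digits)
        local id = ffi.new("uint64_t", 0)
        for i = 1, #digits do
            id = id * 10 + (digits:byte(i) - 48)
        end
        -- tostring() on a uint64 cdata yields e.g. "2001234567890123456ULL";
        -- strip the "ULL" suffix to get the plain decimal UCID
        return (tostring(id):gsub("ULL$", ""))
    end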

log_by_lua

Here is the core part: how the logs are delivered in log_by_lua. We considered several options:

  • access_log is Nginx's native, built-in mechanism. It does not suit our scenario: we have terabytes of logs and a mixed deployment, so the disk is contended; and the crucial point is that read and write are blocking system calls, which would degrade performance dramatically;
  • doujiang24/lua-resty-kafka, written by Zhang Dejiang, is another option. We did not really investigate it, because Kafka's protocol and feature set are large and complicated, and this library's current features and future development might not meet our needs;
  • We finally settled on the cloudflare/lua-resty-logger-socket library, which can ship logs remotely without blocking (a usage sketch follows this list).
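A usage sketch of that library in this setup (the socket path and buffer sizes are assumptions):

    -- log_by_lua*: buffer messages in memory and flush them to rsyslog over a
    -- unix socket, without ever blocking the request
    local logger = require "resty.logger.socket"

    if not logger.initted() then
        local ok, err = logger.init{
            path        = "/var/run/ual.sock",  -- rsyslog imptcp unix socket (assumed path)
            flush_limit = 4096,                 -- flush once 4 KB has accumulated
            drop_limit  = 1048576,              -- beyond 1 MB, drop logs rather than block
        }
        if not ok then
            ngx.log(ngx.ERR, "failed to init logger: ", err)
        end
    end

    local bytes, err = logger.log(msg)          -- msg assembled in earlier phases
    if err then
        ngx.log(ngx.ERR, "failed to log message: ", err)
    end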

rsyslog

As for the log shipper, we chose rsyslog. We did not really run a technology selection here; instead we customized and optimized the existing choice:

  • The maximum length of a single log line is 8 KB; anything longer is truncated. This is rsyslog's default and it can be customized;
  • rsyslog passes messages internally through a queue, a fairly heavyweight mechanism. We chose the disk-assisted memory type, which can spill to disk when memory runs out, to ensure reliability;
  • The parser uses RFC3164 rather than a newer RFC; it is a very simple protocol. As shown in the figure above, local msg = "<142>" .. timestamp .. " " .. topic .. ": " .. msg produces a log line that conforms to RFC3164;
  • We need to write more than one log per request, so the stream has to be split into messages. We enabled the SupportOctetCountedFraming parameter, which prefixes each message with its length (as described in the corresponding RFC) to complete the log delivery (a sketch of this framing follows the list);
  • imptcp is used with a UNIX socket file handle instead of TCP or UDP, because UNIX sockets have lower overhead. It is worth noting that cloudflare/lua-resty-logger-socket does not support dgram UNIX sockets, and Cloudflare has stopped maintaining the library;
  • The output side uses the omkafka module with two special settings: dynaTopic, which lets each message be routed to a specified topic, and Snappy compression, which reduces transmission traffic;
  • Two extra modules are enabled. One is omfile, which takes what imptcp receives and also writes it locally while pushing to Kafka; we added this because a downstream consumer outage once left messages in Kafka that were not taken away in time, so important data needs a backup, even at the cost of a disk dependency. The other is impstats, which monitors queue capacity, current consumption status, and so on.
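To make the framing concrete, here is a rough sketch of how a single line could be built on the OpenResty side (the priority value, timestamp format and topic handling are illustrative):

    -- "<PRI>TIMESTAMP TAG: MSG" in the spirit of RFC3164, prefixed with its
    -- byte length so rsyslog's octet-counted framing can split the stream
    local function frame(topic, payload)
        local timestamp = ngx.localtime()       -- the real timestamp format may differ
        local line = "<142>" .. timestamp .. " " .. topic .. ": " .. payload .. "\n"
        return #line .. " " .. line             -- e.g. "87 <142>... dig: ..."
    end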

Deployment plan

As mentioned earlier, we were in a mixed-deployment situation, and going online had its own convention: tar + run.sh. We pre-compile OpenResty with all dependencies statically linked in, pull the OpenResty binaries at release time and combine them with the Lua scripts, rsync everything to a fixed location, and run it. The daemon is managed online with supervisord or systemctl.

The test environment is maintained by ourselves. The project splits into two pieces, OpenResty and rsyslog. You might be tempted to put only the OpenResty code into the repository and ignore the rsyslog configuration, but that configuration is just as important, so everything related to the project should go into the repository. We use Ansible to manage the rest of the test environment.

The performance data

The final performance data is shown below:

According to statistics I made at the beginning of 2018, the production peak was about 26,000 QPS, while a single machine peaked at about 30,000 QPS in stress tests, so one OpenResty machine could in fact carry the traffic of the entire tracking site. Log transfer is compressed down to about 30 MB, and about a billion logs are written every day. To ensure the reliability of the service, three EC2 c3.2xlarge servers were online at the time.

Conclusion

  • Overall, architecture comes first: we built a high-performance service out of three tried-and-true components, OpenResty + rsyslog + Kafka;
  • Keep the boundaries and hold the bottom line: for example, if the rule is not to touch the disk, then do everything possible not to touch the disk, to guarantee the ultimate performance;
  • High performance can look easy, but the code will still have plenty of pitfalls, many of which can be found with tools such as flame graphs. The important thing is to pay conscious attention to performance, starting with small things: avoid LuaJIT NYI (Not Yet Implemented) primitives; use table.new / table.clear to pre-allocate a table of known size once instead of growing it dynamically again and again; ngx.now is very cheap because it is cached, and its time precision is usually sufficient; UUID generation was switched from calling libuuid to generating 16 random bytes in one go; add CPU-saving measures such as the shared dict, and eventually you can build a high-performance web service.

That’s all I have to share today, thank you!

PPT download and video replay:

Using OpenResty to Build High-Performance Web Applications
