OpenResty Community: High-Performance Practices in APISIX

On July 6, 2019, the OpenResty community co-hosted the Shanghai stop of the OpenResty × Open Talk national touring salon. Yuansheng Wang, co-founder of the OpenResty Software Foundation, gave a talk titled “APISIX’s High-Performance Practices” at the event.

OpenResty × Open Talk is a national touring salon initiated by the OpenResty community. It invites experienced OpenResty experts to share their hands-on experience, promotes communication and learning among OpenResty users, and helps drive the OpenResty open source project forward. Events are held in Shenzhen, Beijing, Wuhan, Shanghai, Chengdu, Guangzhou, Hangzhou and other cities.

Yuansheng Wang is a co-founder of the OpenResty Software Foundation and the OpenResty community, the lead author of “OpenResty Best Practices”, and the founder and lead author of the APISIX project.

Here’s the full text:

Hello, everyone. I’m Yuansheng Wang, and it’s a great pleasure to be here in Shanghai. First, let me introduce myself. I joined Qihoo 360 in 2014 and got to know OpenResty at that time; before that, I was purely a C/C++ developer. While working at 360, I wrote “OpenResty Best Practices” in my spare time, hoping to help more people get started with OpenResty the right way. In 2017, I started a company together with Chun Ge (Zhang Yichun, agentzh) as a technical partner. This year I changed direction: I left that job in March to focus more on open source, and so I started APISIX. The mission of APISIX is to innovate in and implement microservice API technology, building on the open source community.

What is an API gateway

API gateways are becoming more and more important. A gateway intercepts almost all traffic and sits between the inside and the outside, where it handles security control and auditing, and it meets each enterprise’s specific needs through custom plug-ins, the most common being identity authentication. As services grow in number and complexity, more and more enterprises adopt a microservice approach. At that point, unified traffic management and scheduling through an API gateway becomes essential.

There are some differences between a microservice gateway and a traditional API gateway, including the following:

Dynamic updates: Before microservices, services did not change as often as they do now. Microservices need horizontal scaling, failure recovery, hot standby, failover and so on, so IPs, nodes and other details change much more frequently. For example, when a breaking news event hits Weibo, compute nodes must be scaled out rapidly, and new machines have to come online very quickly to absorb the load. Peaks and troughs differ markedly, and managing machines dynamically at minute-level granularity is becoming the norm.
Lower latency: Being dynamic usually comes at the cost of some added latency, because of the added complexity. Microservices have strict latency requirements, especially given today’s user experience expectations, where a latency of more than 1 second is completely unacceptable.
User-defined plug-ins: An API gateway is intended for enterprise users, who always have proprietary logic (such as special authentication and authorization), so a microservice gateway must support plug-ins written by the enterprise itself.
More centralized API management: As mentioned earlier, the API gateway intercepts all user traffic, so it is the natural place for unified API management. From the gateway’s perspective, you can see how each API is designed, whether there are delays or security issues, how fast it responds, its health information, and so on.

In addition to the basic requirements above, there are a few things we want to do differently from others with our microservice API gateway:

Focus on community: Open source brings together people with common needs, so that people from many different companies can work together to build a better product and reduce redundant development.
Simple core: The core of the product must be very simple. A complex core makes the product much more costly to get started with, which is certainly not what we want.
Scalability, top-notch performance, low latency: all of these are strictly guaranteed at the same time, and this is where we put the most effort. APISIX’s current performance is only 15% lower than that of bare OpenResty, which is something we are quite proud of.

APISIX high-performance microservices gateway

APISIX architecture and functionality

The diagram above shows the basic architecture of APISIX and lists the basic components it uses. etcd is used for configuration storage. Since etcd can be clustered, we can rely on it for dynamic scaling and high availability. etcd data can be fetched incrementally by means of watch, which lets rule updates reach APISIX nodes within milliseconds or even less. APISIX itself is stateless, so it is easy to scale horizontally.

The other component is JSON Schema, a standard used primarily to validate data. JSON Schema currently has four publicly available versions, and we chose RapidJSON because it supports all four relatively completely.

The Admin API and APISIX in the figure can be deployed together or separately. The Admin API receives the requests submitted by users. Before the request parameters are saved to etcd, they are fully validated against the JSON Schema. With this validation in place, the data in etcd can be assumed to be valid.

The right side of the figure above shows the actual traffic coming from external users. APISIX subscribes to all configuration rules from etcd and hands them to the routing engine (libr3). libr3 is currently the default routing engine; I covered it in detail in my earlier talk in Wuhan (https://www.upyun.com/opentalk/428.html). libr3 is a routing engine based on prefix trees, which is very efficient, and it is also powerful because it supports regular expressions.

APISIX v0.5 has the following features:

APISIX performance

Typically, introducing the dozen or so features mentioned above comes with a performance drop, but by how much? I ran a performance comparison. As shown in the figure above, on the right is a fake service I wrote for testing. It is essentially empty: it only reads some variables through ngx_lua and passes them to fake_fetch, which does nothing. The subsequent HTTP filter and log phases do no computation either.

We then ran stress tests against APISIX and against the fake service on the right. The results show that APISIX is only 15% slower, which means you get all the features mentioned above at the cost of a 15% performance decrease.

As for the specific numbers, the tests ran on Alibaba Cloud compute instances. A single worker reaches 23-24K QPS, and 4 workers reach 68K QPS.

The current state of APISIX

The latest version is v0.5, and the architecture is based on etcd + libr3 + RapidJSON. Code coverage in v0.4 was below 5%, but in the latest version it has reached 70%, with the core code at 95%. Coverage of the surrounding code is relatively low, mainly because the plug-ins lack tests.

We had planned to ship an online management interface in v0.5 to lower the barrier to entry, but unfortunately it was not finished. This has to do with our own expertise: we are not good at front-end work and needed front-end experts to help. Our plan is to ship it in v0.6 (note: v0.6 has since been released: https://github.com/iresty/apisix/blob/master/CHANGELOG_CN.md#060).

OpenResty programming philosophy and optimization tips

I’ve been working with OpenResty for six years now, starting in 2014. In the OpenResty world, the philosophy is to keep each request small, because by default Nginx manages memory by putting everything a request allocates into a memory pool and destroying the pool when the request ends. If requests do not come and go quickly, the pool keeps allocating and only releases everything at the very end, consuming a lot of resources in the meantime, which is not what Nginx is good at. So for long-lived connections in OpenResty, you need to be very careful not to let the memory pool grow too large.

Also, create as few temporary objects as possible. There are two kinds of temporary objects: tables, and strings produced by concatenation, where two variables are joined to create a new string. This seems harmless in many languages, but in OpenResty you should minimize such operations. Lua is simple, but it is also a high-level language with a good GC, so we do not need to track the life cycle of every variable and only have to allocate. However, if you abuse temporary variables, the GC stays busy and overall performance suffers. Lua is good at dynamic logic and flow control; for CPU-intensive tasks, implementing them in C/C++ is still recommended.

Today I mainly want to share optimization tips on how to write good Lua, since that topic has a larger audience. In the APISIX core, we use some fairly specific optimization techniques, which are described below.

Tip 1: delay_json

For example, if the current log level is INFO, we expect the JSON encode to happen as usual. At the ERROR level, we do not want the JSON encode to happen at all, and it would be ideal if it were skipped automatically. So how do we get close to that?

delay_encode simply assigns the data and the force flag to two fields of delay_tab and does nothing else. The actual encoding is deferred to the __tostring metamethod that is overloaded on delay_tab. This is different from the JSON encode calls you usually see. When a log entry is really written and one of the arguments is a table, OpenResty converts it to a string by checking whether a __tostring metamethod is registered, and calls that metamethod if there is one. With this encapsulation, we strike a good balance between high performance and ease of use.
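The deferred encoding described above can be sketched roughly as follows. This is a minimal illustration rather than the APISIX source: fake_json_encode stands in for a real JSON encoder, the log function is a hypothetical logger, and the numeric levels are assumed to follow the Nginx convention where a larger number is more verbose.

```lua
-- Hypothetical stand-in for a real JSON encoder; it counts calls so we can
-- observe when encoding actually happens. The `force` argument is ignored.
local encode_calls = 0
local function fake_json_encode(data, force)
    encode_calls = encode_calls + 1
    local parts = {}
    for k, v in pairs(data) do
        parts[#parts + 1] = tostring(k) .. "=" .. tostring(v)
    end
    table.sort(parts)
    return "{" .. table.concat(parts, ",") .. "}"
end

-- One shared holder object; the encode is deferred to __tostring.
local delay_tab = setmetatable({data = nil, force = false}, {
    __tostring = function(self)
        return fake_json_encode(self.data, self.force)
    end,
})

-- Cheap: stores two fields, performs no encoding, creates no new table.
local function delay_encode(data, force)
    delay_tab.data = data
    delay_tab.force = force
    return delay_tab
end

-- Hypothetical logger: arguments are stringified only if the message
-- will really be written.
local ERR, INFO = 4, 7
local function log(cur_level, level, msg)
    if level > cur_level then
        return nil            -- filtered out: tostring(msg) never runs
    end
    return tostring(msg)      -- triggers __tostring, i.e. the real encode
end

local skipped = log(ERR, INFO, delay_encode({a = 1}))   -- no encode
local written = log(INFO, INFO, delay_encode({a = 1}))  -- encode happens
```

Because delay_tab is a single shared object, the returned value should be consumed immediately and not stored for later use.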

Tip 2: Hash vs. Prefix Tree vs. Traversal

Lua table hash: the fastest way to match; the drawback is that it can only do exact matching.
Prefix tree: advanced prefix matching (with regex support), done with libr3.
Traversal: always the worst.

In APISIX, I have combined the hash with the prefix tree. If your request and routing rules do not involve advanced matching, the hash is used by default to ensure efficiency. If fuzzy matching logic is present, the prefix tree is used.
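A rough sketch of the "hash first, then prefix matching" idea follows. The route tables and handler names are made up for illustration, and the linear longest-prefix scan is only a stand-in for the libr3 prefix tree, which is a C library and is not reproduced here.

```lua
-- Exact-match routes live in a plain Lua table (hash lookup).
local exact_routes = {
    ["/hello"] = "hello_handler",
    ["/status"] = "status_handler",
}

-- Prefix rules. A real gateway would hand these to a prefix tree such as
-- libr3; the scan below is a simplified stand-in for that tree.
local prefix_routes = {
    {prefix = "/api/", handler = "api_handler"},
    {prefix = "/api/v2/", handler = "api_v2_handler"},
}

local function match_prefix(uri)
    local best, best_len = nil, -1
    for _, r in ipairs(prefix_routes) do
        local len = #r.prefix
        if len > best_len and uri:sub(1, len) == r.prefix then
            best, best_len = r.handler, len
        end
    end
    return best
end

local function match(uri)
    -- Fast path: one hash lookup for exact matches.
    local handler = exact_routes[uri]
    if handler then
        return handler
    end
    -- Slow path: fall back to prefix matching.
    return match_prefix(uri)
end
```

With these tables, match("/hello") is answered by the hash alone, while match("/api/v2/users") falls through to the prefix rules.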

Tip 3: ngx.log is NYI

Since ngx.log is NYI (“not yet implemented” in LuaJIT’s JIT compiler, so calls to it cannot be JIT-compiled), we want to minimize how often the following code is reached:

return ngx_log(log_level, ...)

To minimize this, first compare the level of the message with the current log level. If the comparison shows that the message would not be written, return immediately. This avoids doing the log processing and handing the message to the Nginx core only to find out that nothing needs to be written, which wastes a lot of resources.

The stress tests mentioned above were all run with the log level set to ERROR. We had added a lot of debug logging and left it in place, and its presence did not affect the performance results at all.
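The level check can be sketched as below. The numeric levels mirror the Nginx convention (ngx.ERR is 4, ngx.DEBUG is 8), ngx_log is a stand-in that only counts calls, and cur_level is assumed to have been read once from the configuration (in OpenResty it could come from the ngx.errlog module).

```lua
-- Nginx convention: a smaller number means a more severe level.
local ERR, WARN, INFO, DEBUG = 4, 5, 7, 8

-- Assumption: the current level is read once at startup.
local cur_level = ERR

-- Stand-in for ngx.log; counts how often the expensive path is taken.
local sink_calls = 0
local function ngx_log(level, ...)
    sink_calls = sink_calls + 1
    return true
end

local function log(level, ...)
    if level > cur_level then
        return false              -- filtered in Lua, ngx_log never called
    end
    return ngx_log(level, ...)
end

local dropped = log(DEBUG, "verbose detail")   -- returns before ngx_log
local passed = log(ERR, "something failed")    -- reaches ngx_log
```

The early return is what keeps leftover debug statements cheap: they cost one integer comparison instead of a call into the Nginx core.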

Tip 4: GC for CDATA and Table

Scenario: when a table object is collected by the GC, you want to trigger specific logic that releases an associated resource. So how do we register a GC handler for a table? Please refer to the following example:

When we cannot control the entire life cycle of a Lua table, we can register a GC handler as shown in the figure above. Once the table object is no longer referenced, the GC handler is triggered and the associated resources are freed.
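Since the figure is not reproduced here, the following is one possible sketch of registering a GC handler on a table. It assumes the common newproxy trick for Lua 5.1 and LuaJIT, where tables ignore __gc; the helper name setmt__gc and the resource names are illustrative.

```lua
local function setmt__gc(t, mt)
    if type(newproxy) == "function" then
        -- Lua 5.1 / LuaJIT: tables ignore __gc, so attach a userdata proxy
        -- to the table and let the proxy's finalizer call our handler.
        local prox = newproxy(true)
        getmetatable(prox).__gc = function() mt.__gc(t) end
        t[prox] = true
    end
    -- Lua 5.2+: __gc present in the metatable at this point is honored.
    return setmetatable(t, mt)
end

local released = {}

local function use_resource()
    local conn = setmt__gc({name = "conn-1"}, {
        __gc = function(self)
            -- Release the resource tied to this table.
            released[#released + 1] = self.name
        end,
    })
    return conn.name
end

use_resource()      -- the table becomes unreachable after this call
collectgarbage()    -- force full collections so the finalizer runs
collectgarbage()
```

In production code the handler would free a file descriptor, a C object or similar; the explicit collectgarbage calls are only there to make the effect observable.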

Tip 5: How to protect memory resident CDATA objects

We ran into a problem when using the R3 C library. We add a lot of routing rules to R3 and then build an R3 tree. As long as the rules do not change, the R3 tree is used over and over again. R3 does not allocate additional memory internally to store the rule data; it only keeps the pointer addresses. However, the Lua variable passed in from outside may be a temporary, which the Lua GC collects automatically once nothing references it any more. The resulting symptom is that the contents of the memory addresses referenced inside R3 suddenly change, and route matching eventually fails.

Now that we know the cause, the solution is simple. We just need to prevent variable A from being released prematurely, by making the life cycle of variable A in Lua the same as the life cycle of the R3 object.
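One way to tie the two life cycles together is to anchor every string on the Lua object that wraps the C tree. The sketch below uses a hypothetical Router wrapper and omits the real FFI call, so it only shows the anchoring pattern, not the actual R3 binding.

```lua
-- Hypothetical wrapper around a C routing tree that stores raw pointers.
local Router = {}
Router.__index = Router

function Router.new()
    return setmetatable({
        anchors = {},   -- Lua-side references that keep the strings alive
        count = 0,
    }, Router)
end

function Router:insert(path)
    -- A real binding would pass `path` to the C library here, and the C
    -- side would keep the pointer without copying the bytes.
    self.count = self.count + 1
    -- Anchor the string on the router: it can now only be collected
    -- together with the router object itself.
    self.anchors[self.count] = path
end

local r = Router.new()
for i = 1, 3 do
    r:insert("/user/" .. i)   -- temporary strings built by concatenation
end
collectgarbage()              -- the anchored strings survive collection
```

As long as r is reachable, everything in r.anchors stays alive, so pointers held on the C side remain valid.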

Tip 6: ngx.var.* is slower

As you know, C is a compiled language and does not support dynamic lookups by itself. You can see how ngx.var.* is implemented internally in the Nginx source code, or in a flame graph. To fetch a variable dynamically, the implementation has to go through a hash lookup and then read the variable’s value according to internal rules.

The solution is to use the lua-var-nginx-module library (http://github.com/iresty/lua-var-nginx-module), which takes a very simple approach with nothing technically sophisticated about it. For example, to get the client IP, it pulls the relevant code out in C and reads the value through the Lua FFI. This small piece of code improves APISIX performance by 5%. The downside is that you have to add this third-party module when compiling OpenResty, which makes it a little more costly to get started.

Tip 7: Reduce the number of garbage objects per request

As OpenResty developers, we want to keep the number of garbage objects per request as low as possible. If you understand this well, you are already in the top 50%.

Reducing unnecessary string concatenation does not mean refusing to concatenate when you really need to. It means keeping the cost in mind at all times and cutting down on useless concatenations. When these small details add up, the performance gain can be huge.
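As a small illustration of the difference, the two hypothetical functions below build the same string. The first creates a new intermediate string on every step, while the second collects the pieces and joins them once with table.concat.

```lua
-- Repeated `..`: every step allocates a new intermediate string that
-- becomes garbage as soon as the next step runs.
local function join_naive(items)
    local s = ""
    for i = 1, #items do
        if i > 1 then
            s = s .. ","
        end
        s = s .. items[i]
    end
    return s
end

-- Single join: only the final string is created.
local function join_concat(items)
    return table.concat(items, ",")
end

local items = {"a", "b", "c"}
local naive = join_naive(items)
local joined = join_concat(items)
```

Both calls return the same text; the difference is the number of short-lived strings left behind for the GC.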

Tip 8: Reuse the table

Let’s start with the basic version: table.clear. When you need a temporary table, the usual way to write it is

local t = {}

Let’s look at the drawbacks. If we create a temporary table t at the start of a function, t is reclaimed when the function exits. The next time the function is entered, another temporary table t is created. In Lua, creating and destroying tables is very expensive, because a table is a complex object. It is unlike simple objects such as numbers and strings, whose allocation and release can be handled within a single structure, so frequent table creation keeps the GC busy.

If you only need a single table object per worker, you can use table.clear to reuse the temporary table again and again, as with the temporary table local_plugin_hash in the figure above.
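A minimal sketch of this reuse pattern follows. table.clear is a LuaJIT extension, so the code falls back to a plain loop when it is unavailable; the scratch table and the count_unique function are illustrative names.

```lua
-- table.clear is a LuaJIT extension; fall back to a plain loop elsewhere.
local ok, clear_tab = pcall(require, "table.clear")
if not ok then
    clear_tab = function(t)
        for k in pairs(t) do
            t[k] = nil
        end
    end
end

-- One scratch table per worker (module-level upvalue), reused by every call
-- in place of a fresh `local t = {}`.
local scratch = {}

local function count_unique(names)
    clear_tab(scratch)            -- wipe leftovers from the previous call
    local n = 0
    for _, name in ipairs(names) do
        if not scratch[name] then
            scratch[name] = true
            n = n + 1
        end
    end
    return n
end
```

This is only safe when the table is not in use across a yield, since every request handled by the worker shares the same scratch table.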

Reuse the table: the advanced version, tablepool

Some Lua tables have a per-request life cycle: the object is usually allocated when the request comes in and released when the request ends, and tablepool is a good fit for that. tablepool is a pool of tables that can be reused. The official documentation is at https://github.com/openresty/lua-tablepool#synopsis; reading it alongside the way APISIX uses it makes it easier to understand.

In APISIX, there are two places that use it most intensively: besides the recycling site shown above, there is the site where tables are fetched. After being recycled, these tables can be reused by other requests. Under the unified control of tablepool, the pool may hold dozens or hundreds of fixed objects, which are used repeatedly without being destroyed. When this technique is used correctly, performance can improve by at least 20%, which is a very significant gain.
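The fetch-on-entry, release-on-exit pattern can be sketched with a tiny pool written in plain Lua. The fetch and release names mirror the functions documented for lua-tablepool, but the implementation below is only an illustrative stand-in, and the pool name api_ctx is made up.

```lua
-- Minimal stand-in for a table pool, keyed by pool name.
local pools = {}

local function fetch(name)
    local pool = pools[name]
    if pool and #pool > 0 then
        return table.remove(pool)      -- reuse a previously released table
    end
    return {}                          -- pool is empty: create a new table
end

local function release(name, tb)
    for k in pairs(tb) do
        tb[k] = nil                    -- clear before putting it back
    end
    local pool = pools[name]
    if not pool then
        pool = {}
        pools[name] = pool
    end
    pool[#pool + 1] = tb
end

-- Per-request usage: fetch when the request starts, release when it ends.
local function handle_request(uri)
    local ctx = fetch("api_ctx")
    ctx.uri = uri
    local summary = "handled " .. ctx.uri
    release("api_ctx", ctx)
    -- ctx is returned only so the reuse can be observed below; real code
    -- must not touch a table after releasing it.
    return summary, ctx
end

local summary1, first = handle_request("/a")
local summary2, second = handle_request("/b")
```

Here both requests end up using the same table object, so the second request allocates nothing new.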

Tip 9: the right way to use lrucache

Let’s briefly introduce lrucache. lrucache caches and reuses data within a worker. Its big advantage is that it can store any object. Shared memory, by contrast, shares data between different workers, but it can only store simple values, and some things cannot be shared across workers at all, such as functions and cdata objects.

Our wrapper around lrucache mainly covers the following:

Keep the key short and simple: The most important thing when designing a key is to keep it simple. The worst design is a key stuffed with a lot of content but little useful information. People generally use strings as keys, but a key can also be an object such as a table. The key should be as clear as possible, contain only the content you care about, and leave out everything else to reduce the concatenation cost.
Use a version to reduce garbage cache entries: This is a breakthrough I made in APISIX. I pulled the version out of the key, and the combination of lrucache + version greatly reduces garbage cache entries.
Reuse cached data in the stale state.

The figure above shows the lrucache wrapper. Reading from the bottom up, the key is /routes, followed by the version number conf_version. The global function looks up the cache by key + version, including stale data, and returns the cached object directly if one is found. If not, it calls creat_r3_router. creat_r3_router is responsible for creating a new object and accepts only one parameter, routes, which is passed in as routes.values.

This layer of encapsulation hides details such as lrucache’s new call and its item count, so many things are out of sight, although we may still need to care about them when customization is required. To make things easier for plug-in developers to understand and use, APISIX needs this layer of encapsulation.

△ lrucache best practices 1

△ lrucache best practices use cases

The figure above shows how version is used to reduce garbage cache entries and reuse stale cached data; it is the code of the lrucache wrapper. First, look at the second line: it fetches the cache object by key, then compares the object’s cache_ver with the version passed in. If they are the same, the cache object is known to be usable.

stale_obj, which is sparsely documented, occurs in only one case: the cached object has already expired in lrucache, but it is only on the verge of eviction and has not been completely thrown away yet. In the figure above, the cache_ver of the stale data is compared with the incoming version. If the versions match, the stale data is still valid. So as long as the source data has not changed, it can be used again. This lets us reuse stale_obj and avoid creating a new object.

To expand on the earlier point that version reduces garbage cache entries: without a separate version, we would need to write the version into the key. Every version change would then generate a new key, and the old, outdated entries would remain in the cache with no way to purge them, so the number of objects in lrucache would keep increasing. With the approach above, each key corresponds to exactly one cached table, and there are no separate cached objects for different versions, which reduces the total cache size.
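Since the figures are not reproduced here, the following sketch shows one way the key + version + stale logic could look. It is not the APISIX source: new_stub_cache is a tiny stand-in whose get returns a value and a stale value in the same order as lua-resty-lrucache, expire is a made-up helper that simulates an entry timing out, and fetch_by_version and create_router are illustrative names.

```lua
-- Stand-in cache: get returns (value, stale_value).
local function new_stub_cache()
    local live, stale = {}, {}
    return {
        get = function(self, key) return live[key], stale[key] end,
        set = function(self, key, val) live[key] = val; stale[key] = nil end,
        -- Simulates an entry expiring without being evicted yet.
        expire = function(self, key) stale[key] = live[key]; live[key] = nil end,
    }
end

local cache = new_stub_cache()
local creations = 0

local function create_router(routes)
    creations = creations + 1
    return {routes = routes}
end

local function fetch_by_version(key, version, create_fn, ...)
    local obj, stale_obj = cache:get(key)
    if obj and obj._cache_ver == version then
        return obj                     -- fresh entry with the same version
    end
    if stale_obj and stale_obj._cache_ver == version then
        cache:set(key, stale_obj)      -- expired but source unchanged: revive
        return stale_obj
    end
    obj = create_fn(...)               -- version changed or nothing cached
    obj._cache_ver = version
    cache:set(key, obj)                -- same key: the old version is replaced
    return obj
end

local r1 = fetch_by_version("/routes", 1, create_router, {"a"})
local r2 = fetch_by_version("/routes", 1, create_router, {"a"})
cache:expire("/routes")
local r3 = fetch_by_version("/routes", 1, create_router, {"a"})
local r4 = fetch_by_version("/routes", 2, create_router, {"a", "b"})
```

Because the version lives on the cached object and not in the key, a version bump overwrites the single /routes entry, so no stale per-version entries accumulate.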

That’s all I have to share today, thank you!

Video of the talk and slide download:

APISIX’s High-Performance Practices