On July 6, 2019, the OpenResty community held the OpenResty × Open Talk national tour salon · Shanghai station. Ye Jing, head of the Platform Development Department at Upyun, shared “The Application of OpenResty in Upyun Cloud Storage” at the event. OpenResty × Open Talk is a national touring salon initiated by the OpenResty community: it invites experienced OpenResty technical experts to share their OpenResty experience, promotes communication and learning among OpenResty users, and helps drive the OpenResty open source project forward. The salon is being held in Shenzhen, Beijing, Wuhan, Shanghai, Chengdu, Guangzhou, Hangzhou and other cities.

Ye Jing, head of the Platform Development Department at Upyun, is mainly responsible for the design and development of Upyun’s elastic cloud processing platform and internal private cloud, as well as work related to the file upload interface. He has studied languages such as Python, Lua and Go in depth, has rich experience in ngx_lua and OpenResty module development, focuses on high-concurrency and high-availability service architecture design, and has extensive hands-on experience with Docker containers. He is also a keen participant in the open source community and regularly shares his open source experience.

Here’s the full text:

Hello everyone, I am Ye Jing from Upyun. Today I would like to share with you the application of OpenResty in the Upyun storage system. On the one hand, I will introduce how OpenResty is applied; on the other hand, I will introduce the principles behind the Upyun storage system.

Distributed storage, and public cloud storage systems in particular, cannot get away from three requirements:

  • High availability: the system must never become unserviceable. Even if a few machines go down, it should still accept writes and stay as readable as possible;
  • Easy expansion: storage capacity keeps growing, and growing fast. If the system cannot be expanded quickly and conveniently, the whole system comes under great operational pressure;
  • Easy maintenance: a storage system has many components, and these components must be very easy to maintain and must not depend on each other too heavily.

Storing Data I: Splitting

Partitioning

The first step in splitting stored data is partitioning, which is a very important concept and the most common practice in distributed systems. In a large database it is common to divide the whole data set into smaller subsets, and the most common method is to partition by key. For Upyun’s cloud storage the key is the path part of the URL, and the data is partitioned by ordering those paths from A to Z. Partitioning this way makes prefix scans easy: listing a directory is just listing the files that share the same path prefix, which is very convenient when the keys are kept in order.

The second step is to hash the key in order to spread accesses out, as described in more detail below.

All of this work is done in Lua code, and the data split is done by OpenResty.

The image above shows the hashing of the key. The original request for a stored file is a URL, but if that URL were also used directly as the key when writing to storage, hot spots would be very serious. Upyun has over 500,000 paying customers, and we have observed that many of them, especially the large ones, name their files by date, so the key prefix is often a date. Recently uploaded files are always the hottest, so all new uploads would land on the same machine and saturate its bandwidth. What we do instead is turn the URL of the file into a “hash” that is not the MD5 of the key or the output of some hash algorithm, but simply an internally generated UUID corresponding to the file, and we record the mapping between the two.
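
As a rough illustration of this mapping step, here is a minimal ngx_lua sketch; gen_uuid and meta_store are hypothetical stand-ins, not Upyun’s actual code:

```lua
-- Map an external key (URL path) to an internal UUID and record the mapping.
local resty_random = require "resty.random"
local resty_string = require "resty.string"

local function gen_uuid()
    -- 16 random bytes rendered as hex; a stand-in for a real UUID generator
    return resty_string.to_hex(resty_random.bytes(16))
end

local function put_object(meta_store, url_key)
    local uuid = gen_uuid()
    -- record the external-key -> internal-UUID mapping in the metadata cluster
    local ok, err = meta_store:set(url_key, { uuid = uuid })
    if not ok then
        return nil, "failed to record key mapping: " .. (err or "unknown")
    end
    return uuid
end
```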

Index splitting

The index in the storage system is the file’s metadata. Metadata includes the original key, the internal key, the file size, which cluster the file lives in, and so on.

The flow in the figure above is the upload flow as seen from external storage access. The first step is a PUT request whose URL is the key of the file. The request then reaches the OpenResty layer, a storage gateway built on OpenResty. The gateway generates an internal UUID for the URL and records the mapping. ngx_lua exposes the raw request socket via ngx.req.socket, so the gateway takes that socket and streams the body into a cluster called Block, which is where the binary contents of files are actually stored. The whole process is streamed, reading and writing at the same time, so large files do not cause problems. Only when the file data has been fully received are the UUID and the rest of the metadata written to the KeyIndex (the metadata cluster).
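
A minimal sketch of that streaming PUT path is shown below; block_client, meta_client, uuid and url_key are assumed to be provided by surrounding code, and the read size is arbitrary:

```lua
-- Stream the request body into the Block cluster, then publish the metadata.
local sock, err = ngx.req.socket()        -- raw downstream request socket
if not sock then
    ngx.log(ngx.ERR, "failed to get request socket: ", err)
    return ngx.exit(500)
end

local remaining = tonumber(ngx.var.http_content_length) or 0
local total = 0
while remaining > 0 do
    local chunk, rerr = sock:receive(math.min(remaining, 64 * 1024))
    if not chunk then
        ngx.log(ngx.ERR, "client aborted upload: ", rerr)
        return ngx.exit(499)
    end
    block_client:write(uuid, chunk)       -- write the chunk out immediately
    remaining = remaining - #chunk
    total = total + #chunk
end

-- only after the body is fully stored does the file become visible
meta_client:put(url_key, { uuid = uuid, size = total })
```

The important property is the ordering: the KeyIndex entry is written only after the body has been fully streamed into the Block cluster.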

Internal content splitting

The second step is splitting the Block data, which is a fairly simple process. When the uploaded data is received and written to the Block cluster, it is not all written to the same place; it is split up. Upyun supports uploads of up to 40 TB per file, while the largest single disk we use today is 8 TB, so how can a single 40 TB file be supported? The answer is to split it into blocks: if the data is cut into 10 MB chunks, those chunks can be spread across different machines and disks, as long as the mapping between them is recorded.

The OpenResty gateway receives data on exactly this principle. It first reads one block of data, say 10 MB, names it “UUID-0” and writes it to the Block cluster; it then reads the next 10 MB, which becomes “UUID-1”, and writes that to the Block cluster too. By the time a file of roughly 1 GB has been received, it may have produced more than a hundred “UUID-number” chunks stored on different machines and disks, which is how very large files are supported.
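
The chunk-naming part of that loop could be sketched like this, again with hypothetical read_exact and block_client helpers:

```lua
-- Cut the incoming body into fixed-size blocks named "<uuid>-<sequence>".
local BLOCK_SIZE = 10 * 1024 * 1024   -- 10 MB per block

local seq = 0
while true do
    -- read_exact is a hypothetical helper returning up to BLOCK_SIZE bytes
    -- from the request socket, or nil once the body is exhausted
    local block = read_exact(sock, BLOCK_SIZE)
    if not block then
        break
    end
    block_client:write(uuid .. "-" .. seq, block)
    seq = seq + 1
end
-- the metadata entry later records the uuid plus the number of chunks (seq)
```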

The way data is received and written is worth pointing out. In the usual OpenResty setup you do some validation and rate limiting in the access or rewrite phase and then hand the body off to Nginx’s proxy_pass. Here there is no proxy_pass at all: Lua code controls the reading and writing of the data and returns the response, so the whole process is driven by Lua.

In summary, the splitting described above happens in three steps:

  • The first split maps a file path (URL) to one of several Meta clusters; this is a fixed partition. The storage has multiple Meta clusters and many Block clusters, and one Meta cluster can correspond to multiple Block clusters. When a file is uploaded, a policy decides which cluster it should be stored in, and the first step is to determine which storage partition the URL belongs to. When creating a storage bucket you often see options such as an East China, South China or North China data center; at that moment it is already fixed which Meta cluster the data of that bucket will always be written to. That is what this step does;
  • The second split maps one Meta cluster to multiple Block clusters; the Meta cluster chooses among them according to internal weights and configuration;
  • The third split happens inside the Meta and Block subsystems, which partition internally and spread a single piece of data across different disks on different machines (see the configuration sketch after this list).
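
A purely illustrative routing table for the first two splits might look like the following; the field names are assumptions, not Upyun’s real configuration format:

```lua
-- Illustrative routing table covering split 1 and split 2.
local routing = {
    -- split 1: a bucket/space is pinned to one Meta cluster forever
    spaces = {
        ["bucket-foo"] = { meta_cluster = "meta-east-1" },
        ["bucket-bar"] = { meta_cluster = "meta-south-1" },
    },
    -- split 2: each Meta cluster writes to several Block clusters by weight
    meta_clusters = {
        ["meta-east-1"] = {
            block_clusters = {
                { name = "block-e1", weight = 10 },
                { name = "block-e2", weight = 30 },  -- newer cluster, takes more writes
            },
        },
    },
}

-- split 3 (spreading chunks across machines and disks) happens inside the
-- Meta and Block subsystems themselves and is not visible at the gateway.
return routing
```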

Storing Data II: Routing

The second part introduces the routing in the storage.

Routing mode selection

When routing comes up, the pattern people usually think of first is the proxy mode, which is mode (2) in the middle of the figure above. A proxy layer sits in the middle; the MySQL or Redis nodes below it only do single-node storage, while the proxy in front knows the routing for all of the nodes underneath it, and every request is proxied through it.

In the first mode, on the left, all nodes are peers and every node knows where the data lives. When a Redis client accesses data it can go to any node; if the data happens to be on that node, it is returned directly.

The third mode is often used in the Java ecosystem; HBase, for example, uses it. The routing information is held by the client, and the client finds the right node directly, cutting out a lot of intermediate hops. The problem is that such clients can become very complex, and for a cloud storage system this mode is definitely not a good idea: our client is a REST API client that speaks plain HTTP and cannot carry routing information itself.

Routing tier ② is the OpenResty storage gateway. It holds the routing information and knows which cluster a URL should go to. The picture above shows the flow of a GET request that downloads a file. When a URL comes in, the gateway first goes to the Meta cluster (KeyIndex on the left), looks up the internal UUID corresponding to the URL, then uses the internal “UUID-number” keys to read the blocks out of the Block cluster and streams them back to the client piece by piece. That is how GET works.
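
In ngx_lua terms the GET path boils down to something like the sketch below, where meta_client and block_client are hypothetical wrappers around the Meta and Block clusters:

```lua
-- Simplified GET flow at the gateway.
local entry, err = meta_client:get(url_key)   -- KeyIndex lookup: URL -> UUID, size, chunk count
if not entry then
    return ngx.exit(404)
end

ngx.header["Content-Length"] = entry.size
ngx.send_headers()

-- read the "UUID-number" chunks back in order and stream them downstream
for seq = 0, entry.chunks - 1 do
    local block = block_client:read(entry.uuid .. "-" .. seq)
    ngx.print(block)
    ngx.flush(true)     -- flush each chunk so the response is streamed
end
```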

Meta cluster routes are all fixed, and they are decided at several levels:

  • By user or space: the space (bucket) at the front of a URL is fixed to a particular storage cluster;
  • By storage type: ordinary storage and low-frequency storage each live in their own fixed, unchanging cluster;
  • By indexing function.

Listing directories

Listing a directory through the Upyun gateway is essentially a key prefix match, and we built a separate directory system to implement the directory function.

The data in the KeyIndex (Meta cluster) on the left of the figure above is synchronized in real time, and the necessary fields are pushed into the directory index. Listing a directory only needs the key name, size, type and modification time of each file, so those fields are extracted and fed into the directory system. When the gateway receives a list-directory request, it goes straight to the directory system and lists the entries by prefix match.
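
Conceptually, the gateway side of a list request is just a prefix scan against that directory index; dir_index below is a hypothetical client and the request shape is illustrative:

```lua
-- List a "directory" by scanning keys that share the same path prefix.
local cjson = require "cjson.safe"

local prefix = ngx.var.uri              -- e.g. "/photos/2019/07/"
local entries = dir_index:scan({
    prefix = prefix,                    -- only keys sharing this prefix
    limit  = 100,                       -- page size
    fields = { "name", "size", "type", "mtime" },
})
ngx.say(cjson.encode(entries))
```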

Filtering files by upload time

It is not uncommon to be asked to list files by when they were uploaded: files uploaded recently, files uploaded on a certain day, or files uploaded a year ago. For this we have to build a separate chronological index. This is different from a local file system, which is small enough to list however you like; cloud storage holds more than 100 billion files, so it is impossible to answer such requests unless the index is built in advance rather than at the moment the request arrives.

Block cluster routing

The routing of the Block cluster works like that of the Meta cluster: it is split by storage type and user space. In addition, each Block cluster can be given a different weight to control how much data is written to it; if a new Block cluster should take more data, its weight can simply be raised.
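
The weight mechanism can be illustrated with a simple weighted random pick; this is a generic sketch, not the actual selection code:

```lua
-- Pick a Block cluster for a new write in proportion to its weight.
local function pick_block_cluster(clusters)
    local total = 0
    for _, c in ipairs(clusters) do
        total = total + c.weight
    end
    local r = math.random() * total
    for _, c in ipairs(clusters) do
        r = r - c.weight
        if r <= 0 then
            return c.name
        end
    end
    return clusters[#clusters].name
end

-- raising "block-e2"'s weight sends proportionally more new data its way
local target = pick_block_cluster({
    { name = "block-e1", weight = 10 },
    { name = "block-e2", weight = 30 },
})
```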

lua-resty-checkups is a module we open sourced a long time ago (https://github.com/upyun/lua-resty-checkups). It runs on almost every ngx_lua machine at Upyun, has been in use for several years and is very stable. Its main job is to manage upstream addresses: active health checks, passive health checks, dynamic updates of upstream addresses, routing strategies and so on. All the routing functions mentioned above are implemented through this module.

The OpenResty gateway periodically pulls the latest configuration from Consul and caches it in its own processes. At the moment we fetch the routing configuration once a minute and cache it in each process, and every worker process operates on that configuration. On the configuration side, Upyun has also open sourced a project called slardar (https://github.com/upyun/slardar); the configuration mechanism here works on exactly the same principle as slardar’s, and many of the modules were brought over from it directly.
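
A per-worker refresh along those lines could be sketched as follows, assuming Consul’s KV HTTP API and the lua-resty-http client; the KV path and configuration shape are made up for illustration:

```lua
-- Refresh the routing configuration from Consul once a minute, per worker.
local http  = require "resty.http"
local cjson = require "cjson.safe"

local routing_conf   -- per-worker cache read by the routing code

local function refresh_conf()
    local httpc = http.new()
    local res, err = httpc:request_uri(
        "http://127.0.0.1:8500/v1/kv/storage/routing?raw", { method = "GET" })
    if not res or res.status ~= 200 then
        ngx.log(ngx.WARN, "consul fetch failed: ", err or res.status)
        return
    end
    local conf = cjson.decode(res.body)
    if conf then
        routing_conf = conf          -- swap in the new routing table
    end
end

-- call this from init_worker_by_lua_block
local function init_worker()
    ngx.timer.every(60, refresh_conf)
end
```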

Storing Data III: Common Features

HEAD

The HEAD operation checks whether a file exists; it has nothing to do with the file’s actual data. When a HEAD request arrives, the gateway simply takes the URL and looks it up to see whether the file exists.

DELETE

The DELETE operation does not touch the Block data, because in a storage system a delete does not actually remove the data from disk; it only marks the file as deleted in the metadata (KeyIndex). The real cleanup is done by an asynchronous worker that collects the files marked as deleted and hands them to GC, which removes them for real, and that GC is deliberately delayed. There is no need to GC immediately after a file is marked deleted, because deletes are sometimes mistakes; to leave room to recover from such mis-operations we usually defer GC by 7 days or even a month. As part of the delete, the gateway also sends a message to a Kafka queue via the Lua Kafka module, indicating that this is a delete operation; the GC consumer consumes that message and, when the time comes, goes to the Block data and removes the file.
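
A simplified version of that delete path, using the lua-resty-kafka producer, might look like this; the topic name, message payload and the meta_client helper are assumptions:

```lua
-- Mark the file as deleted in the metadata, then enqueue a GC event.
local producer = require "resty.kafka.producer"
local cjson    = require "cjson.safe"

local brokers = { { host = "10.0.0.1", port = 9092 } }

local function delete_object(meta_client, url_key)
    -- 1. mark the file as deleted in the KeyIndex; blocks stay on disk for now
    local ok, err = meta_client:mark_deleted(url_key)
    if not ok then
        return nil, err
    end

    -- 2. tell the GC consumer about it; the actual block removal happens days later
    local p = producer:new(brokers, { producer_type = "async" })
    p:send("storage-gc", url_key, cjson.encode({
        op  = "delete",
        key = url_key,
        ts  = ngx.time(),
    }))
    return true
end
```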

Other common functions

In addition to the operations just described, there are many other operations in the storage system:

  • Move: rename;
  • Copy: copy a file;
  • Append: append write;
  • Patch: modify part of a file’s meta information;
  • Mkdir: create a directory;
  • Random: random write.

We have not implemented random write yet, although random read works fine. Apart from random write, everything else is implemented in Lua code, and these operations are a good example of writing business logic with OpenResty.

Storing Data IV: Capacity Expansion

Next we will look at storage expansion, which has less to do with OpenResty but is an issue every storage system faces. Expansion involves two aspects: expanding the Meta cluster and expanding the Block cluster.

Meta cluster expansion

The Meta cluster stores the metadata of files. Each value is actually very small, perhaps only a few hundred bytes, and never more than about 1 KB no matter how large the file is. Expanding it is relatively easy: add a machine, and because the total volume is small, rebalancing finishes very quickly.

In fact we rarely expand the Meta cluster; as far as I recall Upyun has only done it once in all these years, because the capacity of the Meta cluster can be calculated in advance. For example, to support 100 billion files at roughly 1 KB of metadata each you need on the order of 100 TB, so a few hundred terabytes of Meta capacity is certainly enough; you buy that batch of machines up front and never have to think about expansion again. In general, scaling a Meta cluster is relatively simple.

Block cluster expansion

What is more troublesome is expanding the Block cluster. Block data can be large or small, and a cluster may hold dozens or even hundreds of petabytes. If you have a cluster of several petabytes and you add a machine and rebalance, every other machine has to move part of its data onto the new one. That is a frightening operation: it can take days or even a week, during which the whole cluster is busy shuffling data around, and that is bound to affect the business.

We wanted to avoid this balance operation as much as possible, so we adopted a simple trick: avoid rebalancing within a cluster, and when more capacity is needed, just add a whole new cluster. Of course, sometimes a balance really is necessary; in that case we let it run slowly over a few days or a week. But our usual practice is to estimate the number of machines and the capacity the next cluster will need, add the entire cluster at the gateway layer, and give it a higher weight so that most new data is written to it. That is how the capacity of the whole cloud storage is expanded.

Storing Data V: Other Features

Replication

Both the Meta cluster and the Block cluster need replication, since we use multi-copy storage or EC (erasure coding) storage. The Meta cluster can use PostgreSQL or MySQL, or HBase, which gets replication from HDFS; with PostgreSQL or MySQL you have to configure master-slave replication yourself or provide some synchronization and replication mechanism for it.

In addition, backing up the Meta data is very important, because the Meta cluster determines whether any of the data can be accessed at all, and a problem there would be very serious. So we also write the Meta data to Kafka at the gateway layer. Another approach is to install a plugin directly in the database and have it feed changes into Kafka.

Replication of the Block cluster is more complex; it is usually handled inside the cluster itself and has little to do with the gateway layer.

Transactions

Transactions are also a very important concept in storage. In a cloud storage system there is no way to provide transactions the way a single-node database does; we can only provide them at the level of a single object, guaranteeing consistency for operations on that object. This requires the Meta cluster to support a CAS (compare-and-set) operation: an object cannot be written by two writers at the same time, one of them will fail, and the metadata written later prevails.
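
Conceptually the CAS guarantee can be expressed as a versioned update loop; meta_client:set_if_version below is a hypothetical primitive standing in for whatever the Meta cluster actually provides:

```lua
-- Compare-and-set update of a metadata entry, retried a few times on conflict.
local function cas_update(meta_client, url_key, mutate)
    for _ = 1, 3 do
        local entry = meta_client:get(url_key)      -- returns { version = n, ... }
        local new_entry = mutate(entry)
        new_entry.version = entry.version + 1
        -- only succeeds if nobody has bumped the version since we read it
        local ok = meta_client:set_if_version(url_key, new_entry, entry.version)
        if ok then
            return true
        end
        -- somebody else won the race; re-read and try again
    end
    return nil, "cas conflict"
end
```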

Rate limiting

For rate limiting, OpenResty has the official lua-resty-limit-traffic module; we added a token bucket method along the same lines, and that token bucket module is also open sourced on our GitHub (lua-resty-limit-rate). We use it internally; in our tests it is the smoothest approach and copes well with bursts of requests.
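
The token bucket idea itself is small enough to sketch with an ngx.shared dict; this illustrates the algorithm only and is not the lua-resty-limit-rate API:

```lua
-- Minimal token bucket: refill by elapsed time, spend one token per request.
-- nginx.conf needs:  lua_shared_dict rate_limit 10m;
local dict = ngx.shared.rate_limit

local RATE     = 200      -- tokens added per second
local CAPACITY = 400      -- bucket size, i.e. the largest allowed burst

local function allow(key)
    local now    = ngx.now()
    local tokens = dict:get(key .. ":tokens") or CAPACITY
    local last   = dict:get(key .. ":ts") or now

    -- refill according to elapsed time, capped at the bucket capacity
    tokens = math.min(CAPACITY, tokens + (now - last) * RATE)
    dict:set(key .. ":ts", now)

    if tokens < 1 then
        dict:set(key .. ":tokens", tokens)
        return false                      -- over the limit: reject or delay
    end

    dict:set(key .. ":tokens", tokens - 1)
    return true
end
```

A production module also has to make the read-update-write sequence atomic across workers, which is part of what the real shared-dict-based implementation takes care of.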

Beyond Distributed Storage

In fact, building a cloud storage system is not just building a gateway and the storage itself; there are many supporting pieces behind it, such as the API layer. Upyun’s API is also written in Lua and contains a lot of business logic: the form API, for example, involves form parsing, parameter parsing, uploading to the storage gateway and so on. In addition, the authentication algorithms and resumable upload are written in Lua as well. Resumable upload means that a large file, say a dozen gigabytes, can be cut into 1 MB blocks that are transferred to the storage separately; the storage first writes these blocks to the Block cluster, and when the final “finish” message arrives it assembles the temporary pieces into the complete file.
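
The resumable-upload flow can be reduced to two operations, sketched below with hypothetical clients; the part size, identifiers and field names are illustrative:

```lua
-- Each part is written as a temporary block; "finish" publishes the whole file.
local function upload_part(block_client, upload_id, part_no, body)
    -- parts are addressed by upload id plus part number until the upload completes
    return block_client:write(upload_id .. "-part-" .. part_no, body)
end

local function finish_upload(meta_client, upload_id, url_key, part_count, total_size)
    -- writing the KeyIndex entry is what makes the assembled file visible;
    -- until then the parts are just anonymous temporary data
    return meta_client:put(url_key, {
        uuid  = upload_id,
        parts = part_count,
        size  = total_size,
    })
end
```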

The Upyun cloud storage system

The diagram above shows how the modules of the Upyun cloud storage system relate to each other. OpenResty mainly sits in the upper left corner. UpyunAPI is the API layer of the cloud storage system; it handles things like authentication, authorization and the upload form API. Avalon is the OpenResty cloud storage gateway, which all internal storage-related traffic passes through, including CDN GET traffic. On the left is the Meta cluster, which has many components, including HBase, PostgreSQL, Redis and the backup jobs. On the right are the consumers: the storage system relies on many consumers for specific jobs such as automatic expiration, TTL, GC and bad-disk repair. At the bottom is the Block cluster, where the actual data is stored.

Upyun’s OpenResty-related open source projects

The following are some of the open source projects mentioned above. They can all be found in Upyun’s GitHub repositories and are widely used inside Upyun; they mainly include:

[1] upyun/slardar: https://github.com/upyun/slardar

[2] upyun/lua-resty-checkups: https://github.com/upyun/lua-resty-checkups

[3] upyun/lua-resty-limit-rate: https://github.com/upyun/lua-resty-limit-rate

Speech video and PPT download:

The Application of OpenResty in Upyun Cloud Storage – Upyun