In the era of big data, enterprises have higher demands for DBAs. At the same time, NoSQL as a new technology in recent years, also received more and more attention. Based on the DBA work of Mr. Meng Xianyao, SRA, and the experience of big data operation and maintenance, this paper will share two major contents: 1. The company’s architecture evolution on KV storage and the problems to be solved in operation and maintenance; 2. Second, NoSQL how to select and future development of some thinking.

As of April 20, 2018, NoSQL officially has 225 solutions, using a very small subset of them for each company. The products highlighted in blue in the image below are currently being used by Twitter.

The origin of the no

1946 the first general-purpose computer was created. But it wasn’t until 1970, with the advent of RDMBS, that a universal data storage solution was found. By the 21st century, the DT era has made data capacity the most intractable problem, to which Google and Amazon have proposed their own NoSQL solutions, such as Google’s Bigtable in 2006. NoSQL was officially coined at a technology conference in 2009, and there are now 225 solutions.

NoSQL differs from RDMBS in two key respects: first, it provides schemaless flexibility and supports very flexible schema changes; Second, scalability, native RDBMSS are only suitable for single machines and small clusters. NoSQL, on the other hand, is distributed from the start, addressing read/write and capacity scalability issues. The above two points are also the root cause of NoSQL.

There are two main ways to achieve distribution: Replication and Sharding. Replication solves read scalability and HA (high availability) problems, but not read and capacity scalability. Sharding addresses read and write and capacity scalability. Typical NoSQL solutions combine the two.

Sharding mainly resolves data partitioning, including interval partitioning (such as Rowkey partitioning in Hbase) and hashing partitioning. In order to solve the monotony and balance problems of hash distribution, virtual nodes are mainly used in the industry at present. Codis, described below, also uses virtual nodes. Virtual nodes are equivalent to establishing a layer of virtual mapping between data shards and managed servers.

Currently, NoSQL is categorized mainly by data model and access mode.

Several common NoSQL solutions

The Redis system scale is shown below. The following describes several problems encountered in the o&M process.

The first is the technology architecture evolution process. Getuan started as a message push service for APP developers. Before 2012, The business volume of Getuan was relatively small. At that time, we used Redis for cache and MySQL for persistence. From 2012 to 2016, with the rapid development of individual push business, a single node could not solve the problem. In the case that MySQL could not solve high QPS and TPS, we developed Redis sharding scheme by ourselves. In addition, we developed our own Redis client to implement basic clustering functions, support custom read/write ratios, and monitor and isolate failed nodes, slow monitoring, and check the health of each node. However, this kind of architecture does not give much consideration to the issue of operation and maintenance efficiency, and lacks operation and maintenance tools.

When we were planning to improve our operations tools, we found that the Pea Pod team had open-source Codis, which provided us with a good option.

The advantages of pushing Codis+

Codis is a proxy-based architecture that supports native clients, web-based cluster operations and monitoring, and also integrates with Redis Sentinel. It can improve the work efficiency of our operation and maintenance, and HA is easier to land.

However, in the process of using, we also found some limitations. So we came up with Codis+, which is some enhancements to Codis.

First, use 2N+1 copy scheme to solve the problem of Master single point during failure.

Second, Redis quasi-semi-synchronous. Set a threshold such that the slave is only readable within 5 seconds.

Third, resource pooling. You can expand resources by adding RegionServer to HBase.

In addition, there are rack awareness capabilities and cross-IDC capabilities. Redis itself is set up for a single room and does not take these issues into account.

So why don’t we use the native rRedis cluster? There are three reasons: first, the native cluster, which combines the routing and forwarding function with the actual data management function in one function, if one function goes wrong, it will cause data problems; 2. In large clusters, it takes time for P2P architecture to reach the consistency state. Codis is a tree architecture, so this problem does not exist. Third, the cluster has not passed the endorsement of the large platform.

In addition, regarding Redis, we have also recently been looking at a new NoSQL solution, Aerospike, which we are positioning to replace part of the cluster Redis. The problem with Redis is that data is resident in memory, which is expensive. We expect to use Aerospike to reduce TCO costs. Aerospike has the following characteristics:

I. Aerospike data can be stored in memory or SSD, and SSD has been optimized.

Second, resource pooling, operation and maintenance costs continue to reduce.

Support for rack awareness and synchronization across IDCs, but this is an enterprise version feature.

At present, we have two internal businesses using Aerospike. After measurement, we found that a single physical machine carrying a single Inter SSD 4600 can achieve QPS of nearly 10W. For businesses with large capacity but low QPS requirements, the Aerospike solution can be selected to save TCO.

During the evolution of NoSQL, we also encountered some operational issues.

Standardized installation

We are divided into three parts: OS standardization, Redis file and directory standard, Redis parameter standardization, all with SaltStack + CMDB implementation;

Capacity expansion and reduction

As the technology architecture evolves, it becomes easier to scale up and downsize, in part because CODIS alleviates some of the problems. Of course, if the Aerospike is selected, the relevant operation is very easy.

Good monitoring, reduce operation and maintenance costs

Most operations students should read SRE: The Secrets of Google Operations and Maintenance carefully. It puts forward many valuable methodologies in theory and practice, which is highly recommended.

A push Redis monitors complexity

There are three cluster architectures: self-research, CODIS2 and CODIS3, which collect data in different ways.

The three monitored objects are clusters, instances, and hosts. Metadata is required to maintain logical relationships and be aggregated globally.

There are three kinds of personalization: push Redis clusters, some clusters need to have multiple copies, some do not. Some nodes are allowed to be full, while others are not. There are persistence policies, some do not persist, some do persist, some do persist + remote backup, these business characteristics put forward high requirements for our monitoring flexibility.

Zabbix is a very complete monitoring system, and I have used it as the main monitoring system platform for more than three years. But it has two defects: one is that it uses MySQL as the back-end storage, TPS has a ceiling; Second, it is not flexible enough. For example, if a cluster is placed on 100 machines, it is very difficult to do aggregation indicators.

Xiaomi’s Open-Falcon solves this problem, but it also creates new ones. For example, there are few alarm functions, no strings are supported, and manual operations are sometimes added. Then we added functionality to it and we didn’t have a major problem.

The figure below is a push operation platform.

The first is the IT hardware resource platform, which maintains the physical information of the host dimension. For example, which switch is connected to the host in which rack, which floor of which machine room and so on, which is the basis of rack perception and cross-IDC and so on.

The second is the CMDB, which is to maintain the software information on the host, which instances are installed on the host, which clusters the instances belong to, which ports we use, and what personalized parameter configuration these clusters have, including the different alarm mechanisms, all through the CMDB. The data consumers of the CMDB include the Grafana monitoring system and the monitoring acquisition program, which we developed ourselves. Then the CMDB data will come alive. If there is only static data with no consumer, the data will be inconsistent.

The Grafana monitoring system aggregates data from multiple IDCs, and our operations only need to look at the large screen every day.

Slatstack, for automated publishing, standardization and productivity.

The acquisition program is developed by ourselves and highly customized to the company’s business characteristics. And ELK (not Logstach, fileBeat) for the log center.

Through these, we set up a push the whole monitoring system.

Here are some of the pits encountered during the construction process.

1. The master/slave reset causes the pressure of the host node to explode and the master node cannot provide services.

There are many reasons for master-slave resets.

The Redis version is low and the probability of master/slave reset is high. Redis3 has a much lower probability of master/slave reset than Redis2, and Redis4 supports incremental synchronization even after a node is restarted, which is an improvement of Redis itself.

We currently mainly use 2.8.20, which is relatively easy to generate master/slave resets.

Redis master/slave resets typically trigger one of the following conditions.

1. The repl-backlog-size is too small, default is 1M, if you have a large number of writes, it is easy to breach the buffer; 2, repl – a timeout, Redis master-slave default once every 10 seconds ping, ping 60 seconds don’t push will master-slave reset, the reason may be that the network jitter, total node pressure, unable to respond to the package, etc.; Tcp-baklog, default is 511. The default limit for the operating system is 128, which can be raised moderately. We raised it to 2048, which provides some fault tolerance for network packet loss.

These are the reasons for the master/slave reset, which can have serious consequences. The Master is overwhelmed and cannot be serviced, so the business determines that the node is unavailable. The response time becomes long. All the hosts on which the Master resides are affected.

Two, the node is too large, partly caused by human reasons. The first is that the efficiency of splitting nodes is low, much slower than the growth of the company’s business volume. In addition, there are too few sharding. Our shards are 500, CODIS is 1024, and CODIS native is 16384. Too few shards is also a problem. If you develop your own distributed solution, you must set the number of sharding slightly larger, to avoid business growth than you expected. When the node becomes too large, the persistence time increases. We need to persist the 30G node, and the remaining memory of the host should be larger than 30G. If not, you will use Swap to cause the persistence time of the host to increase greatly. A 30GB node persistence may take four hours. A high load may also cause a master/slave reset, causing a chain reaction.

As for the pits we encountered, let’s share a few practical cases.

The first example is a master-slave reset. This happened two days before the Spring Festival, which is the peak of message push service. Let’s briefly restore the failure scenario. First, massive message delivery leads to increased load. Then, the pressure of the Redis Master increases and TCP packets are backlogged. Packet loss occurs in the OS. Packets are lost from the ping of the Redis Master and slave, triggering the repl-timeout 60 seconds threshold, and the Master and slave are reset. At the same time, the Swap and IO saturation are close to 100% due to the large node size. The solution is very simple, we first disconnect the master and slave. The first reason is that the parameters are not reasonable, mostly the default value, and the second reason is that the node is too large to magnify the fault effect.

The second example is a recent problem that CODIS encountered. This is a typical failure scenario. After a host hung down, CODIS started the master-slave switchover. After the master-slave switchover, the services were not affected. However, when we tried to reconnect the master-slave, we found that the connection could not be made, so we reported an error. This error is not difficult to check, in fact, the parameter setting is too small, also due to the default value. When the Slave pulls data from the Master node, new data is stored in the Master buffer. If the Slave does not finish pulling data, the Master buffer exceeds the upper limit, causing a Master/Slave reset and an infinite loop.

Based on these cases, we have compiled a list of best practices.

Configure CPU affinity. Redis is a single-node structure, which will affect CPU efficiency.

2. The node size should be controlled at 10G.

Third, the remaining memory of the host should be greater than the maximum node size +10G. Master/slave reset requires the same size of memory, this must be left enough, if not enough, using Swap, it will be difficult to reset successfully.

Try not to use Swap. Responding to a request in 500 milliseconds is worse than hanging up.

5. Tcp-backlog, repl-backlog-size, and repl-timeout are increased.

6. Master does not persist, Slave does AOF+ timing reset.

Finally, some personal thoughts and suggestions. There are five principles for choosing the NoSQL that suits you:

1. Business logic. First of all, we need to understand our own business characteristics, such as KV type is found in KV; If it is a pattern, look in the pattern, and the range will be reduced by 70% to 80%.

2. Load characteristics, QPS, TPS and response time. When choosing a NoSQL solution, you can measure the performance index of a single machine in a certain configuration. Under the condition that Redis has enough remaining hosts, the QPS of 400,000-500,000 per host is completely OK.

3. Data scale. The larger the data, the more questions to consider and the less selective. At a few hundred terabytes or petabytes, there’s not much choice but Hadoop.

4. Operation and maintenance cost and whether it can be monitored, and whether it can be easily expanded or reduced.

5. Others. For example, is there a successful case, is there a complete documentation and community, is there any official or corporate support. Can let others step on the pit after we smooth past, after all, their own step on the pit cost is quite high.

Conclusion: About the definition of NoSQL, there has been a paragraph on the network: from 1980 know SQL, to 2005 Not only SQL, and now NoSQL! The development of the Internet is accompanied by the updating of technical concepts and the improvement of relevant functions. Behind the technological progress, it is the continuous learning, careful thinking and unremitting attempts of every technician.

Dry goods share

  • Build MySQL master-slave replication based on Docker

  • Technical challenges and practice summary behind 100 billion visits of wechat circle of Friends

  • Python MySQL and SQLAlchemy operations summary

  • Operation and maintenance must be basic: do you know how to play touch

  • Operation and maintenance must be basic: AWK from entry to god

  • What does functools do in Python?

  • O&m attention: Redis vulnerability utilization and defense

  • O&m development answer: What happens from entering the URL to displaying the page

  • Python minimum coding specification

  • Shell Scripting: Gameplay you don’t know

  • Seven of the best Open Source Web servers available for the enterprise

  • Do you know how to check Linux distribution names and version numbers?

  • Common MySQL high availability solution selection interpretation

  • Why learn Linux?

  • Six general scenarios for highly available architectures

  • Ubuntu 18.04 LTS is officially out, but let’s talk about RHEL

  • The loss of opportunity, no developed operating system of the pain of the big country

  • Advanced Python skills: Bubble sorting, 80% of interview questions

  • Who do I go to for help in Linux? Get a man!!

  • All that netstat and port stuff in Linux

  • Twitter has suffered a worldwide outage, prompting some to complain that it is the end of the world.

  • The whole operation and security thing

  • Git served: Git 12 years old, here are 12 tips for Git

  • Wi-fi is also available on the Linux command line

  • Red Hat Enterprise Linux 7.5 is here

  • Linux GitHub gameplay sharing

  • Add “recycle bin” to “rm” command

  • I heard there is another DNS out of the wall. 1.1.1.1?

  • MySQL operations: Index and query performance optimization