• How Data Sharding Works in a Distributed SQL Database
  • Sid Choudhury
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Ultrasteve
  • Proofread by: Jane Eldq, Jake Ggie

How does data sharding work in a distributed SQL database

Today, companies of all sizes are embracing rapid modernization of their user-facing applications as part of a broader digital transformation strategy. As a result, the relational database management systems (RDBMS) these applications depend on now need to support much larger volumes of data and transactions. In this scenario, a single RDBMS often becomes overloaded quickly. Data sharding is one of the most common architectures used to solve this problem, allowing an RDBMS to achieve better performance and higher scalability. In this article, we'll explore what sharding is, how it can be used to scale a database, and the pros and cons of several common sharding architectures. We will also look at how data sharding is implemented in distributed SQL databases such as YugaByte DB.

What exactly is sharding?

Sharding is the process of dividing a large table into smaller pieces of data, called shards, that are distributed across multiple servers. Each shard is a horizontal partition: a subset of the overall data set that is responsible for a corresponding portion of the overall workload. The central idea is to spread data that would be too large for a single node across a cluster of database nodes. Sharding is therefore also called horizontal partitioning. The distinction between horizontal and vertical partitioning comes from the traditional tabular view of a database: a database can be partitioned vertically (storing different columns of a table on different nodes) or horizontally (storing different rows across multiple database nodes).

Figure 1: Vertical vs. horizontal partitioning (source: Medium)

Why shard the database?

As a business grows, applications that rely on a single RDBMS eventually hit a performance bottleneck. Constrained by CPU, disk, and memory capacity, database performance will inevitably degrade. In a non-sharded database, reads and routine operations become painfully slow. Vertical scaling (also known as scaling up), that is, adding more resources to the database server, has a number of drawbacks, and its cost ultimately outweighs its benefits.

Horizontal partitioning of tables, on the other hand, means more computing resources to handle queries, giving you shorter response times and faster index builds. Sharding also makes more efficient use of newly added capacity, because it continuously rebalances data and workload across the additional nodes. On top of that, maintaining a cluster of smaller, cheaper servers is much less expensive than maintaining one large server.

In addition to addressing scalability, sharding also mitigates the impact of unexpected outages. When an unsharded database goes down, all of its data becomes inaccessible, which can be disastrous. With sharding, even if one or two nodes fail, the other nodes still hold the remaining shards and can continue to serve reads and writes, as long as they sit in different failure domains. Overall, sharding increases the storage capacity of a cluster, reduces processing time, and delivers higher availability at a lower cost than vertical scaling.

Manual sharding

For applications with large data volumes, having the database handle sharding automatically, from creating the shards to balancing the load across them, is a great benefit. Unfortunately, monolithic databases like Oracle, PostgreSQL, and MySQL, and even newer distributed SQL databases like Amazon Aurora, do not support automatic sharding. This means that if you want to keep using these databases, you must shard manually at the application layer, which greatly increases development complexity. Your application needs an extra layer of sharding code to know how the data is distributed and where each piece of data lives. You also have to decide which sharding strategy to use, how many shards you ultimately need, and how many nodes to run them on. And as the business evolves, the sharding strategy and the shard key may have to change with it.

One of the major challenges of manual sharding is uneven shards. Distributing data disproportionately causes shards to become unbalanced: some nodes are overloaded while others sit nearly idle. Because overloaded nodes slow down overall response times and can crash the service, we want to avoid putting too much data on any single shard. The problem can also occur with too few shards, since the data is then spread across only a handful of shards. That may be acceptable in development and test environments, but not in production. Uneven data distribution, hot spots on some nodes, and shards holding too little data can all lead to shard and service resource exhaustion.

Finally, manual sharding complicates operations. Backups now have to run across multiple servers. Data migrations and schema changes must be coordinated carefully to ensure that every shard keeps the same table structure. And without adequate optimization, database join operations spanning multiple servers become inefficient and hard to execute.

Common automatic sharding architectures

Sharding has a long history, with many sharding architectures and implementations developed over the years for deployment in a wide range of systems. In this section, we’ll discuss the three most common implementations.

Hash-based sharding

Hash-based sharding feeds the shard key into a hash function, and the resulting hash value determines which shard the row is stored on. With a well-distributed hash function such as ketama (a consistent-hashing algorithm), the data is spread evenly across servers, reducing the risk of overloading individual nodes. Under this approach, rows with similar shard-key values are unlikely to land on the same shard, so the architecture is well suited for targeted operations on individual keys.

Figure 2: Hash-based sharding (source: MongoDB documentation)
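
As a minimal illustration of the idea (not how any particular database implements it), the sketch below hashes a shard key and takes the result modulo an assumed shard count. All names and the shard count are made up for the example.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, purely illustrative

def shard_for(key: str) -> int:
    """Map a shard key to a shard index with a stable hash."""
    # hashlib.md5 is used only because it is stable across processes,
    # unlike Python's built-in hash(); any uniform hash function works.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % NUM_SHARDS

for user_id in ("alice", "bob", "carol", "dave"):
    print(user_id, "-> shard", shard_for(user_id))
```

Note that a plain modulo scheme moves most keys whenever the shard count changes, which is why consistent hashing (such as ketama) or fixed hash ranges, as YugaByte DB uses below, are preferred in practice.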

Range-based sharding

Range-based sharding splits the data according to ranges of the shard-key values. Rows with nearby key values are likely to fall into the same range, and therefore into the same shard. Each shard keeps the same schema as the original database, so sharding a row is as simple as determining which range its key falls into and placing it in the corresponding shard.

Figure 3: Range-based sharding
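
As a rough sketch of the mechanism (the split points and keys are made up for illustration), the snippet below picks a shard by comparing the key against a sorted list of range boundaries; a range query then only has to touch the shards whose ranges overlap the requested interval.

```python
import bisect

# Made-up split points on the shard key (say, a last-name column):
# shard 0: key < "g", shard 1: "g" <= key < "n",
# shard 2: "n" <= key < "t", shard 3: key >= "t"
SPLIT_POINTS = ["g", "n", "t"]

def shard_for(key: str) -> int:
    """Pick the shard whose range contains the key."""
    return bisect.bisect_right(SPLIT_POINTS, key)

def shards_for_range(lo: str, hi: str) -> list[int]:
    """A range query only touches shards whose ranges overlap [lo, hi]."""
    return list(range(shard_for(lo), shard_for(hi) + 1))

print(shard_for("miller"))          # 1
print(shards_for_range("h", "p"))   # [1, 2]
```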

Range-based sharding makes reading contiguous ranges of data, i.e. range queries, efficient. However, with this scheme the user has to choose the shard key up front, and if the shard key is chosen poorly, some nodes may end up overloaded.

A good rule of thumb is to choose a shard key with high cardinality and a low repetition rate. These keys are typically stable, neither growing nor shrinking over time. If the shard key is chosen poorly, data ends up unevenly distributed across the shards, and some data is accessed far more often than the rest, creating bottlenecks on the heavily loaded shards.

The ideal remedy for uneven shards is automatic splitting and merging. If a shard grows too large or one of its rows becomes a hot spot, the shard should be split into smaller shards that are then redistributed evenly across the nodes. Likewise, when there are too many small shards, they can be merged back together.
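
Here is a toy illustration of splitting, assuming a shard is just an in-memory dictionary of rows: the oversized shard is cut at its median key, and the resulting split point would then have to be registered in the cluster's shard-range metadata. Real databases split shards online and move data incrementally; this only shows the bookkeeping.

```python
def split_shard(rows):
    """Split an oversized shard (modeled as a dict of rows) at its median key.

    Returns the two halves and the split point; the cluster's metadata
    (mapping of key ranges to shards) must then be updated accordingly.
    """
    keys = sorted(rows)
    mid = keys[len(keys) // 2]
    left = {k: v for k, v in rows.items() if k < mid}
    right = {k: v for k, v in rows.items() if k >= mid}
    return left, right, mid

shard = {"anna": 1, "bob": 2, "carl": 3, "dora": 4}
left, right, split_at = split_shard(shard)
print(split_at, left, right)  # carl {'anna': 1, 'bob': 2} {'carl': 3, 'dora': 4}
```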

Geo-based sharding

In geo-based sharding, data is sharded on a user-specified column whose values map to geographic locations, and the resulting shards are placed in the corresponding regions. For example, in a cluster deployed across the US, the UK, and the rest of Europe, the Country_Code column of the user table can determine which region holds each shard, helping satisfy regulations such as the GDPR (General Data Protection Regulation).
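
A minimal sketch of that routing decision, with a made-up country-to-region table: the value of the Country_Code column picks the region whose shards must hold the row.

```python
# Made-up mapping from a row's Country_Code value to the region whose
# shards must hold that row (e.g. keeping EU users' data in the EU for GDPR).
REGION_FOR_COUNTRY = {
    "US": "us-east",
    "GB": "eu-west",
    "DE": "eu-west",
    "FR": "eu-west",
}

def region_for(country_code: str) -> str:
    # Fall back to a default region for countries without a pinned location.
    return REGION_FOR_COUNTRY.get(country_code, "us-east")

print(region_for("DE"))  # eu-west
```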

Sharding in YugaByte DB

YugaByte DB is a Google Spanner-inspired, high-performance distributed SQL database with built-in automatic sharding and strong resilience. It currently supports hash-based sharding by default; the project is under active development, and range-based and geo-based sharding will be added later this year. Each data shard in YugaByte DB is called a tablet, and each tablet is placed on a corresponding tablet server.

Hash-based sharding

With hash-based sharding, each table is assigned a hash space from 0x0000 to 0xFFFF (a 2-byte range), which can accommodate as many as 64K tablets for very large data sets or clusters. Consider the table with 16 tablets shown in Figure 4: the entire 2-byte hash space is divided into 16 equal parts, one per tablet.

Figure 4: Hash-based sharding in YugaByte DB
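
To make the numbers concrete, the sketch below divides the 2-byte hash space into 16 equal tablet ranges, mirroring Figure 4; the boundaries are computed here for illustration and are not read from YugaByte DB itself.

```python
TOTAL_HASH_RANGE = 0x10000  # the 2-byte hash space: 0x0000 .. 0xFFFF
NUM_TABLETS = 16

def tablet_ranges(num_tablets: int = NUM_TABLETS):
    """Divide the 2-byte hash space evenly into per-tablet hash ranges."""
    step = TOTAL_HASH_RANGE // num_tablets
    return [(i * step, (i + 1) * step - 1) for i in range(num_tablets)]

for i, (start, end) in enumerate(tablet_ranges()):
    print(f"tablet-{i}: 0x{start:04X} .. 0x{end:04X}")
# tablet-0: 0x0000 .. 0x0FFF, tablet-1: 0x1000 .. 0x1FFF, ..., tablet-15: 0xF000 .. 0xFFFF
```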

For both reads and writes, the primary key is first converted into an internal key and its corresponding hash value. The operation is then routed to the tablet that owns that portion of the hash space (Figure 5).

Figure 5: Determining which tablet to use in YugaByte DB

For example, as shown in Figure 6, suppose you want to insert a row with key k and value v into the table. The hash of k is computed, the corresponding tablet and tablet server are looked up, and the request is then sent directly to that server for processing.

Figure 6: Storing value v for key k in YugaByte DB
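
Putting Figures 5 and 6 together, here is an illustrative sketch of the write path: hash the key into the 2-byte space, find the tablet that owns that slice of the hash space, look up the tablet server hosting it, and send the request there. The hash function and the tablet-to-server placement below are stand-ins, not YugaByte DB's actual internals.

```python
import hashlib

NUM_TABLETS = 16
STEP = 0x10000 // NUM_TABLETS
# Hypothetical tablet-to-server placement; in reality this comes from the
# cluster's metadata, not a hard-coded table.
TABLET_LEADERS = {i: f"node-{i % 3 + 1}" for i in range(NUM_TABLETS)}

def hash16(key: str) -> int:
    """Stand-in 2-byte hash; YugaByte DB uses its own internal hash function."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:2], "big")

def route_write(key: str, value: str) -> str:
    h = hash16(key)                   # hash of the internal key, 0x0000 .. 0xFFFF
    tablet = h // STEP                # tablet owning this slice of the hash space
    server = TABLET_LEADERS[tablet]   # tablet server hosting that tablet
    return f"PUT({key!r}, {value!r}): hash 0x{h:04X} -> tablet-{tablet} on {server}"

print(route_write("k", "v"))
```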

Range-based sharding

For range-based sharding, SQL tables can declare the first column of the primary key as ascending (ASC) or descending (DESC), so that the data is stored in the chosen sort order on a single shard (tablet). The team is currently working on features such as dynamic tablet splitting (driven by criteria such as range boundaries and load) and richer SQL syntax for explicitly specifying split ranges.
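
As a rough sketch of what ASC-ordered storage buys you inside a single tablet (a toy structure, not YugaByte DB's storage engine), the snippet below keeps rows sorted by primary key so that a contiguous key range can be scanned in ascending or descending order without touching any other tablet.

```python
import bisect

class Tablet:
    """Toy tablet that keeps rows sorted by primary key (ASC)."""

    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # overwrite an existing row
        else:
            self.keys.insert(i, key)        # insert in sorted position
            self.values.insert(i, value)

    def scan(self, lo, hi, desc=False):
        """Return rows with lo <= key < hi, in ASC or DESC key order."""
        i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_left(self.keys, hi)
        rows = list(zip(self.keys[i:j], self.values[i:j]))
        return rows[::-1] if desc else rows

t = Tablet()
for k in (5, 1, 3, 2, 4):
    t.put(k, f"row-{k}")
print(t.scan(2, 5))             # [(2, 'row-2'), (3, 'row-3'), (4, 'row-4')]
print(t.scan(2, 5, desc=True))  # [(4, 'row-4'), (3, 'row-3'), (2, 'row-2')]
```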

Conclusion

Data sharding is a solution for handling large data sets and meeting the scalability requirements of business applications. There are many sharding architectures to choose from, each with different strengths. Before deciding on one, clearly outline your project's requirements and expected workload. Manual sharding should be avoided in most cases because of how much application-logic complexity it adds. YugaByte DB is a distributed SQL database with automatic sharding; it currently supports hash-based sharding, and range-based and geo-based sharding will be available soon. You can check out this tutorial to see YugaByte DB's automatic sharding in action.

What's next?

  • Compare YugaByte DB in depth with CockroachDB, Google Cloud Spanner, and MongoDB.
  • Get started with YugaByte DB on macOS, Linux, Docker, or Kubernetes.
  • Contact us to learn about licensing and pricing, or to schedule a technical overview.
