0. Abstract

Bigtable is a distributed storage system for managing petabytes of structured data. It is used by many projects at Google, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable in terms of data size, latency, and so on, yet Bigtable has succeeded in providing a flexible, high-performance solution for all of them. This paper describes the data model provided by Bigtable and the design and implementation of the system.

1. Introduction

This section is mostly general background, together with an overview of the topics covered in the sections that follow.

2. Data Model

Bigtable is a sparse, distributed, persistent, ordered, multidimensional map. The map is indexed by a row key, a column key, and a timestamp, and each value in the map is an uninterpreted array of bytes, i.e. (row:string, column:string, time:int64) -> string. A row key is an arbitrary string (currently up to 64KB, though 10-100 bytes is typical for most users). Every read or write of data under a single row key is atomic (regardless of how many different columns it touches), a design decision that makes it easier for clients to reason about the system's behavior when concurrently updating the same row. Bigtable maintains data in lexicographic order by row key. The row range of a table is dynamically partitioned, and each row range, called a tablet, is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property to get good locality of data access by choosing their row keys appropriately.
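To make the shape of this map concrete, here is a minimal in-memory sketch in Python. The SparseTable class and its method names are only illustrative, not Bigtable's API; the reversed-hostname row key "com.cnn.www" is borrowed from the webtable example in the original paper.

```python
import time

# Minimal in-memory sketch of Bigtable's data model:
# (row:string, column:string, time:int64) -> string.
# Class and method names are illustrative, not Bigtable's actual API.
class SparseTable:
    def __init__(self):
        # Only cells that are actually written are stored -- the map is sparse.
        self.cells = {}  # (row, column, timestamp) -> value

    def write(self, row, column, value, timestamp=None):
        if timestamp is None:
            timestamp = int(time.time() * 1e6)  # server-assigned "real time"
        self.cells[(row, column, timestamp)] = value

    def scan_row_range(self, start_row, end_row):
        # Rows are kept in lexicographic order, so a short row-range scan
        # touches a contiguous slice of the keyspace (and thus few tablets).
        for (row, col, ts), value in sorted(self.cells.items()):
            if start_row <= row < end_row:
                yield (row, col, ts), value

table = SparseTable()
# Row keys are chosen by the client; reversing the hostname keeps pages
# from the same domain lexicographically close, improving locality.
table.write("com.cnn.www", "contents:", "<html>...</html>")
```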

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in the same column family is usually of the same type. The number of distinct column families in a table is intended to be small (a few hundred at most), and families rarely change while a table is in use. In contrast, there is (by design) no upper limit on the number of columns a table may have. A column key is named using the syntax family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings.
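A tiny illustration of the family:qualifier convention (the "anchor" family is the webtable example from the original paper); the helper function below is hypothetical:

```python
# A column key is "family:qualifier"; the family name must be printable,
# while the qualifier may be an arbitrary string (here, a referring site).
def split_column_key(column_key):
    family, _, qualifier = column_key.partition(":")
    return family, qualifier

assert split_column_key("anchor:cnnsi.com") == ("anchor", "cnnsi.com")
assert split_column_key("contents:") == ("contents", "")
```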

Timestamps are 64-bit integers. Bigtable can store multiple versions of the same piece of data, and these versions are indexed by timestamp. Timestamps can be assigned by Bigtable (in which case they represent real time) or explicitly set by the client application. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so the most recent version is read first. To make managing versioned data easier, Bigtable supports two per-column-family garbage-collection settings, one time-based and one count-based; see the original paper for details.
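A hedged sketch of those two garbage-collection settings, keeping either the last n versions or only sufficiently recent ones; collect_garbage is an illustrative helper, not Bigtable's interface:

```python
# Versions of a cell are kept in decreasing timestamp order (newest first).
# A column family can ask to keep only the last n versions, or only versions
# newer than a given age; this helper is an illustrative approximation.
def collect_garbage(versions, max_versions=None, max_age_seconds=None, now=None):
    versions = sorted(versions, key=lambda v: v[0], reverse=True)  # (ts, value)
    if max_versions is not None:
        versions = versions[:max_versions]
    if max_age_seconds is not None and now is not None:
        cutoff = now - max_age_seconds * 1_000_000  # timestamps in microseconds
        versions = [v for v in versions if v[0] >= cutoff]
    return versions
```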

3. API

The Bigtable API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column-family metadata, such as access control rights. Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key, but it does not support general transactions across row keys. The rest of this section is skipped here; see the original paper.
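A conceptual sketch of what a single-row read-modify-write guarantees; the RowStore class and its per-row lock are stand-ins for illustration and say nothing about how tablet servers actually implement atomicity:

```python
import threading
from collections import defaultdict

# Conceptual sketch only: Bigtable guarantees atomicity within one row,
# and this per-row lock merely models that guarantee for illustration.
class RowStore:
    def __init__(self):
        self._rows = defaultdict(dict)                # row -> {column: value}
        self._row_locks = defaultdict(threading.Lock)

    def read_modify_write(self, row, fn):
        with self._row_locks[row]:                    # atomic per row
            self._rows[row] = fn(dict(self._rows[row]))
            return self._rows[row]

store = RowStore()
# Increment a counter column atomically; concurrent callers on the same row
# are serialized, but there is no transaction spanning multiple rows.
store.read_modify_write(
    "com.cnn.www",
    lambda cols: {**cols, "stats:hits": str(int(cols.get("stats:hits", "0")) + 1)})
```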

4. Building Blocks

Bigtable is built on several other pieces of Google infrastructure; for example, it uses GFS to store logs and data files. Bigtable also depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, monitoring machine status, and so on. Bigtable data is stored internally in the Google SSTable file format. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Each SSTable contains a sequence of blocks (typically 64KB each, though the size is configurable), and a block index stored at the end of the SSTable is used to locate blocks. Bigtable also relies on a highly available and persistent distributed lock service called Chubby, which Bigtable uses for a variety of tasks: to ensure there is at most one active master at any time; to store the bootstrap location of Bigtable data (see 5.1); to discover tablet servers and finalize tablet server deaths (see 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists.
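A simplified sketch of how a block index at the end of an SSTable supports point lookups: binary-search the in-memory index to pick one candidate block, then read and scan only that block. The SSTableReader layout below is an assumption for illustration, not the real SSTable format:

```python
import bisect

# Simplified sketch of an SSTable lookup: the block index (loaded into memory
# when the SSTable is opened) maps each block's last key to that block, so a
# point read needs at most one block fetch from disk/GFS.
class SSTableReader:
    def __init__(self, blocks):
        # blocks: list of sorted lists of (key, value) pairs, ~64KB each on disk.
        self._blocks = blocks
        self._index = [block[-1][0] for block in blocks]  # last key of each block

    def get(self, key):
        i = bisect.bisect_left(self._index, key)   # candidate block
        if i == len(self._blocks):
            return None
        for k, v in self._blocks[i]:               # scan only that block
            if k == key:
                return v
        return None
```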

5. Implementation

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) to accommodate changes in workload. The paper then briefly describes the responsibilities of the master and the tablet servers; see the original for details.

5.1 Tablet Location

Bigtable uses a three-level hierarchy, analogous to a B+ tree, to store tablet location information. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user tablets. The root tablet is simply the first tablet in the METADATA table, but it is treated specially: it is never split, which guarantees that the location hierarchy never has more than three levels. In the METADATA table, each row key encodes the identity of a user tablet: its table identifier and its end row. Each METADATA row stores roughly 1KB of data in memory. With a modest limit of 128MB per METADATA tablet, this three-level scheme is sufficient to address 2^34 tablets.
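The 2^34 figure follows directly from the stated sizes; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the 2^34 figure.
metadata_row_size = 1 * 2**10          # ~1KB per METADATA row (one tablet's location)
metadata_tablet_size = 128 * 2**20     # 128MB per METADATA tablet

rows_per_metadata_tablet = metadata_tablet_size // metadata_row_size   # 2**17
# The root tablet points at up to 2**17 METADATA tablets, and each of those
# points at up to 2**17 user tablets:
addressable_user_tablets = rows_per_metadata_tablet ** 2               # 2**34
assert addressable_user_tablets == 2**34
```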

5.2 Tablet Assignment

Each tablet is assigned to at most one tablet server at a time. The master keeps track of the set of live tablet servers and the current assignment of tablets to servers, including which tablets are unassigned. When a tablet is unassigned and a tablet server with sufficient room is available, the master assigns the tablet to that server.

Bigtable uses Chubby to keep track of tablet servers (this is essentially the role ZooKeeper plays in comparable systems). When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. When a tablet server terminates, it attempts to release its lock so that the master can reassign its tablets more quickly. The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible. To do this, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master cannot reach the tablet server, the master tries to acquire an exclusive lock on that server's file itself. If the master succeeds in acquiring the lock, it deletes the server's file, which ensures that the server can never serve again. Once the server's file has been deleted, the master moves all tablets previously assigned to that server into the set of unassigned tablets. To make the Bigtable cluster less vulnerable to network problems between the master and Chubby, the master kills itself if its Chubby session fails. As noted above, master failures do not change the assignment of tablets to tablet servers.

When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. The master executes the following steps in order at startup (a code sketch follows this list).
(1) The master grabs a unique master lock in Chubby, which prevents more than one master from being active at the same time.
(2) The master scans the servers directory in Chubby to find the live tablet servers.
(3) The master communicates with every live tablet server to discover which tablets are already assigned to that server.
(4) The master scans the METADATA table to learn the set of tablets. Whenever this scan finds a tablet that is not already assigned to a tablet server, the master adds it to the set of unassigned tablets, making it eligible for later assignment.
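A rough sketch of that startup sequence; chubby and metadata are hypothetical placeholder objects, and none of the method names correspond to real Chubby or Bigtable APIs:

```python
# Sketch of the master startup steps described above. `chubby` and `metadata`
# are hypothetical placeholders for the Chubby lock service and the METADATA
# table; the method names are invented for illustration.
def master_startup(chubby, metadata):
    # (1) Grab the unique master lock so at most one master is active.
    chubby.acquire_exclusive_lock("/bigtable/master-lock")

    # (2) Scan the servers directory in Chubby to find live tablet servers.
    live_servers = chubby.list_live_tablet_servers("/bigtable/servers")

    # (3) Ask each live tablet server which tablets it already serves.
    assigned = set()
    for server in live_servers:
        assigned.update(server.list_loaded_tablets())

    # (4) Scan the METADATA table; any tablet not reported by a server is
    # placed in the unassigned set for later assignment.
    return {tablet for tablet in metadata.all_tablets() if tablet not in assigned}
```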