Data Model

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The above is the paper's definition of Bigtable; the key words are sparse, distributed, and sorted map. One more key word matters: structured. Bigtable is an ordered dictionary (key-value map) whose key is (row:string, column:string, time:int64) and whose value is an arbitrary string. In the web-page storage example, the row key is the URL (reversed, so that pages from the same site are stored together). A column key is composed of "column family:qualifier"; in the figure above, contents and anchor are column families, and a column family can contain one or more columns. Time distinguishes different versions of a cell, and based on it Bigtable offers two garbage-collection policies: keep only the last n versions, or keep only versions that are new enough.

Bigtable stores structured data: the column families are defined when the table is created, much like the schema of a relational database. The number of column families is usually small, but the columns under a column family are added dynamically and can be numerous. In the example above, some pages may be referenced by only one anchor while others are referenced by many; although every page has the anchor column family, the number of columns it contains differs from row to row, which is why the table is called sparse.

The figure above may not seem intuitive, so look at the one I drew. As shown in that figure:

  • Dimension 1: Key (yellow)
  • Dimension 2: Attributes (columns and column families)
  • Dimension 3: Time (blue)

The same key, under different attributes and at different times, stores a value. Unlike a traditional relational database, which stores data row by row, the huge, three-dimensional BigTable is a sparse column-oriented store. The essence of its data model is one enormous map: (row key + column + timestamp) => value.
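
To make the map-like data model concrete, here is a minimal Python sketch of my own of a (row, column, timestamp) => value map with "newest version wins" reads; the row keys, columns, and values are made up for illustration.

```python
# A minimal sketch (my own) of the Bigtable data model as one big map:
# (row:string, column:string, timestamp:int64) -> value:string.
table = {}

def put(row, column, value, ts):
    table[(row, column, ts)] = value

def get_latest(row, column):
    """Return the newest version of a cell (highest timestamp wins)."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

# Webtable-style example: reversed URL as row key, "contents:" and
# "anchor:<referrer>" as column keys.
put("com.cnn.www", "contents:", "<html>...v1", ts=1)
put("com.cnn.www", "contents:", "<html>...v2", ts=2)
put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=1)

print(get_latest("com.cnn.www", "contents:"))   # "<html>...v2"
```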

Rows

A row key in a table is an arbitrary string (currently up to 64KB in size, although 10-100 bytes is the typical size used in practice). Every read or write of the data under a single row key is atomic. BigTable maintains data in lexicographic order by row key. The row range of a table is partitioned dynamically; each row range is called a Tablet, which is the unit of load balancing and data distribution. As a result, reading a short row range is very efficient and typically requires communicating with only a few machines. Users can exploit this property by choosing row keys so that data accessed together ends up stored close together (good locality).
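
As a small illustration of row-key locality, here is my own sketch of the reversed-hostname trick from the paper's webtable example; the URLs are arbitrary.

```python
# Sketch (my own) of row-key locality with reversed hostnames.
def reverse_url(url):
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + "/" + path

rows = sorted(reverse_url(u) for u in [
    "www.cnn.com/world", "edition.cnn.com/sport", "www.example.org/a",
])
# -> ['com.cnn.edition/sport', 'com.cnn.www/world', 'org.example.www/a']

# Pages of one site now form a short, contiguous row range (often one tablet),
# so scanning them touches only a few machines.
cnn_pages = [r for r in rows if r.startswith("com.cnn.")]
print(cnn_pages)
```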

Column Families

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (data in the same column family is compressed together). Access control and both disk and memory accounting are performed at the column-family level.

Timestamp

Each cell in BigTable can hold multiple versions of the same data, indexed by timestamp. BigTable timestamps are 64-bit integers; when BigTable assigns them, they represent real time in microseconds. The versions of a cell are stored in decreasing timestamp order, so the most recent version is read first.
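
As a rough illustration of the two per-column-family garbage-collection settings mentioned earlier (keep only the last n versions, or keep only versions newer than some age), here is a small sketch; the function and parameter names are mine, not Bigtable's.

```python
# Rough sketch (my own) of version garbage collection: "last n" and "new enough".
import time

def gc_versions(versions, keep_last_n=None, max_age_seconds=None, now_micros=None):
    """versions: list of (timestamp_micros, value); returns the versions kept."""
    now_micros = now_micros if now_micros is not None else int(time.time() * 1_000_000)
    kept = sorted(versions, reverse=True)            # newest first, as Bigtable stores them
    if keep_last_n is not None:
        kept = kept[:keep_last_n]
    if max_age_seconds is not None:
        cutoff = now_micros - max_age_seconds * 1_000_000
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]
    return kept

print(gc_versions([(1, "a"), (2, "b"), (3, "c")], keep_last_n=2))   # [(3, 'c'), (2, 'b')]
```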

API

  • Create and delete tables and column families
  • Change cluster, table, and column-family metadata, such as access control rights
  • Write and delete values in BigTable, look up values from individual rows, or iterate over a subset of the data in a table

BigTable supports several other features that let clients manipulate data in more sophisticated ways. In particular, BigTable supports single-row transactions, which perform atomic read-modify-write operations on data stored under a single row key. General transactions across row keys are not supported.

  • Provides an interface for batching writes across row keys at the client (see the sketch after this list)
  • Supports the execution of client-supplied scripts in the address space of the servers
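
Here is my own sketch of the client-side "batching writes across row keys" feature; this toy buffer only illustrates the idea, and atomicity still holds per row, not across the whole batch.

```python
# Sketch (my own) of client-side write batching across row keys.
class WriteBatch:
    def __init__(self, flush_limit=3):
        self.pending = []                      # buffered (row, column, value) mutations
        self.flush_limit = flush_limit

    def set(self, row, column, value):
        self.pending.append((row, column, value))
        if len(self.pending) >= self.flush_limit:
            self.flush()

    def flush(self):
        # Group buffered mutations by row so each row's changes go out as one
        # atomic single-row mutation; different rows are independent writes.
        by_row = {}
        for row, column, value in self.pending:
            by_row.setdefault(row, []).append((column, value))
        for row, mutations in by_row.items():
            print("apply to", row, mutations)  # stand-in for the RPC to a Tablet server
        self.pending = []

batch = WriteBatch()
batch.set("com.cnn.www", "anchor:cnnsi.com", "CNN")
batch.set("org.example.www", "contents:", "<html>...")
batch.flush()
```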

Building Blocks

BigTable is built on top of several other pieces of Google infrastructure.

BigTable uses the distributed Google File System (GFS) to store log and data files. A BigTable cluster typically operates in a pool of shared machines that also run other distributed applications, so BigTable processes often share machines with processes from other applications. BigTable relies on a cluster management system to schedule jobs, manage resources on shared machines, handle machine failures, and monitor machine status.

The Google SSTable file format is used internally to store BigTable data (see also LevelDB, LSM trees, SSTable). An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. It supports looking up the value associated with a given key and iterating over all key/value pairs in a given key range. Internally, each SSTable contains a sequence of blocks; typically each block is 64KB, although the size is configurable. A block index stored at the end of the SSTable is used to locate blocks; it is read into memory when the SSTable is opened. A lookup therefore needs at most one disk seek: we first find the appropriate block by binary search in the in-memory block index, then read that block from disk. Optionally, an SSTable can be read completely into memory, in which case lookups require no disk access at all.
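
A minimal sketch of that lookup path, with a made-up block index (my own illustration; real blocks are about 64KB):

```python
# Binary-search the in-memory block index, then read a single block from disk.
import bisect

block_index = [("apple", 0), ("melon", 65536), ("zebra", 131072)]   # (last key in block, offset)
index_keys = [k for k, _ in block_index]

def locate_block(key):
    """Return the offset of the one block that may contain `key` (one disk seek)."""
    i = bisect.bisect_left(index_keys, key)
    return block_index[i][1] if i < len(block_index) else None       # None: past the last block

print(locate_block("grape"))   # 65536: "grape" can only live in the block ending at "melon"
```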

BigTable relies on Chubby, a highly available and persistent distributed lock service (you can loosely think of it as ZooKeeper, although there are real differences; see the paper for details: research.google.com/archive/chu…). BigTable uses Chubby for a number of tasks:

  • Ensure that there is at most one active master at any time
  • Store the bootstrap location of BigTable data (see Section 5.1)
  • Discover Tablet servers
  • Finalize Tablet server deaths
  • Store BigTable schema information (the column family information of each table)
  • Store access control lists

If Chubby is unavailable for an extended period of time, BigTable becomes unavailable as well.

In fact, there is one more very important building block: Borg, Google's cluster (fleet) management software/service.

Architecture

The paper does not include an overall architecture diagram; a rough overall architecture looks like the following.

The BigTable implementation consists of three main functional components:

  • A library that is linked into every client
  • One master server
  • Many Tablet servers

Tablet servers can be dynamically added to or removed from a cluster as the workload changes. The master server is responsible for assigning tablets to Tablet servers, detecting the addition and expiration of Tablet servers, balancing Tablet-server load, and garbage collecting files in GFS. It also handles schema changes, such as table and column-family creation.

Each Tablet server manages a set of tablets, typically 10 to 1000 per server. It handles read and write requests for the tablets it has loaded and splits tablets that have grown too large.

To keep the load on the master low, clients do not read data through the master: they read and write directly from the Tablet servers. Because BigTable clients do not rely on the master for tablet location information, most clients never communicate with the master at all.

Tablet Location

Bigtable uses a three-tier architecture similar to a B+ tree to store Tablet location information.

The first level is a file stored in Chubby that contains the location of the Root Tablet. The Root Tablet holds the locations of all the tablets of a special METADATA table, and each METADATA tablet holds the locations of a set of user tablets. The Root Tablet is really just the first tablet of the METADATA table, but it is treated specially: it is never split, so that the tablet location hierarchy never exceeds three levels.

The METADATA table stores the location of each tablet under a row key that encodes the tablet's table identifier and its last (end) row.

Details: the client library caches tablet locations. If the client does not know a tablet's location, or discovers that cached location information is incorrect, it moves up the tablet location hierarchy. If the client's cache is empty, the location algorithm needs three network round-trips, including one read from Chubby. If the cache is stale, it can take up to six round-trips, because stale entries are only discovered on misses. Although tablet locations are kept in memory, so no GFS access is required, we further reduce this cost by having the client library prefetch tablet locations: whenever it reads the METADATA table, it reads the locations of more than one tablet.

The METADATA table also stores secondary information, including a log of all events pertaining to each tablet, which is useful for performance analysis and debugging.
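
To tie the three levels together, here is my own sketch of the location hierarchy: a Chubby file points to the Root Tablet, the Root Tablet indexes METADATA tablets, and METADATA rows (keyed by table id + end row) point to user tablets. All names are made up, and only the final METADATA-to-user-tablet step is spelled out in code.

```python
# Sketch (my own) of the three-level tablet location hierarchy.
import bisect

chubby_file = "root tablet @ tabletserver-7"                       # level 1: file in Chubby

# Level 2: the Root Tablet (first METADATA tablet, never split) maps the end
# row of each METADATA tablet to the server holding it.
root_tablet = [("metadata-row-m", "tabletserver-3"), ("metadata-row-z", "tabletserver-9")]

# Level 3: METADATA rows, keyed by (table id, end row of the user tablet).
metadata_rows = [(("webtable", "com.cnn.zzz"), "tabletserver-12"),
                 (("webtable", "org.zzz.www"), "tabletserver-15")]

def locate_user_tablet(table_id, row):
    """Return the server of the user tablet whose row range covers `row`."""
    keys = [k for k, _ in metadata_rows]
    i = bisect.bisect_left(keys, (table_id, row))    # first tablet whose end row >= row
    return metadata_rows[i][1] if i < len(metadata_rows) else None

print(locate_user_tablet("webtable", "com.cnn.www"))                # tabletserver-12
```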

Tablet Assignment

Each Tablet is assigned to at most one Tablet server at a time. The master keeps track of the set of live Tablet servers and of the current assignment of tablets to Tablet servers, including which tablets are unassigned.

BigTable uses Chubby to keep track of Tablet servers: when a Tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory, and the master monitors that directory to discover Tablet servers. A Tablet server stops serving its tablets if it loses its lock.
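
Here is my own sketch of that mechanism, using a toy in-memory stand-in for Chubby; the directory and file names are illustrative, not real Chubby paths.

```python
# Sketch (my own) of tracking Tablet servers via exclusive locks in a directory.
class ToyChubby:
    def __init__(self):
        self.locks = {}                                   # path -> current holder

    def acquire(self, path, holder):
        """Exclusive lock: succeeds only if nobody else holds it."""
        if path in self.locks:
            return False
        self.locks[path] = holder
        return True

    def release(self, path):
        self.locks.pop(path, None)

    def list_dir(self, prefix):
        return [p for p in self.locks if p.startswith(prefix)]

chubby = ToyChubby()

# A Tablet server starts: it creates and locks a uniquely named file in a
# well-known servers directory and serves tablets only while it holds the lock.
assert chubby.acquire("/bigtable/servers/ts-042", holder="tabletserver-42")

# The master discovers live Tablet servers by monitoring that directory.
print(chubby.list_dir("/bigtable/servers/"))

# If the server loses its lock (network trouble, crash), the master will
# eventually declare it dead and reassign its tablets.
chubby.release("/bigtable/servers/ts-042")
```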

Tablet Serving

The data of a tablet is ultimately written to GFS; the physical form of a tablet in GFS is a set of SSTables. The figure below shows the basic read and write paths.

When a Tablet server receives a write request, it first checks that the request is well formed and authorized. It then appends a record to the tablet's commit log, and only after the append succeeds does it insert the record into the memtable. This is the same approach most database implementations use today: mutations are first appended sequentially to a log and only then applied to the store, because a sequential append is far cheaper than a random write; if data were written randomly in place, each write would take longer, and a machine failure in the middle of a write would be more likely to lose data.

When the Tablet server receives a read request, it performs a merged lookup over the memtable and the SSTables; since both store keys and values in lexicographic order, the merged read is fast.

Note: do not make the mistake of thinking that the Tablet server itself stores the data (apart from the in-memory memtable); the data's real location is known only to GFS. When the master assigns a tablet to a Tablet server, what the server effectively obtains is the set of SSTable file names for that tablet; through the index mechanism it determines which SSTable file holds the required data and then reads that SSTable's data from GFS. A single SSTable file may itself be spread across several chunkservers.
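
My own sketch of the write and read paths just described: a write is appended to the commit log and then inserted into the memtable; a read merges the memtable with the (immutable, sorted) SSTables. All data here is illustrative.

```python
# Sketch (my own) of the tablet write and read paths.
commit_log = []                   # append-only; sequential writes are cheap and crash-safe
memtable = {}                     # in-memory view of recent writes
sstables = [{"a": "old-a", "b": "old-b"}, {"c": "old-c"}]   # older data living in GFS

def write(key, value):
    commit_log.append((key, value))      # 1. append to the log first
    memtable[key] = value                # 2. only then apply to the memtable

def read(key):
    """Merged view: the memtable shadows SSTables, newer SSTables shadow older ones."""
    if key in memtable:
        return memtable[key]
    for sstable in reversed(sstables):   # newest SSTable first
        if key in sstable:
            return sstable[key]
    return None

write("a", "new-a")
print(read("a"), read("c"))              # new-a old-c
```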

Compactions

When the memtable grows past a certain threshold, the current memtable is frozen and a new one is created. The frozen memtable is converted into an SSTable and written to GFS. This is called a minor compaction. Each minor compaction produces a new SSTable, shrinks memory usage, and reduces the amount of commit log that must be replayed if the server process dies. Since minor compactions keep producing SSTables, there are also merging compactions, which merge a few SSTables and the memtable contents into one new SSTable, and major compactions, which rewrite all SSTables into exactly one SSTable and can discard deleted data for good.
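
A minimal sketch (my own) of a minor compaction; the threshold and data are made up.

```python
# Freeze the memtable once it passes a threshold, emit an SSTable, start fresh.
MEMTABLE_LIMIT = 4                             # made-up threshold; the real one is much larger

memtable = {"b": 2, "a": 1, "d": 4, "c": 3}
sstables = []

def maybe_minor_compaction():
    global memtable
    if len(memtable) >= MEMTABLE_LIMIT:
        frozen = dict(sorted(memtable.items()))   # freeze and keep keys in sorted order
        sstables.append(frozen)                   # in reality: written to GFS as an SSTable
        memtable = {}                             # new, empty memtable for fresh writes
        # the commit-log entries covered by `frozen` no longer need replaying on recovery

maybe_minor_compaction()
print(len(sstables), memtable)                 # 1 {}
```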

Refinements

This part of the paper describes a number of optimizations; I will not cover them in detail, just touch on each one.

Locality groups

For more efficient reads, clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet, and some useful tuning parameters can be specified per locality group.

Compression

Clients can decide whether the SSTables of a locality group are compressed. Many clients use a two-pass custom compression scheme: the first pass uses the Bentley and McIlroy [6] scheme, which compresses long common strings across a large window; the second pass uses a fast compression algorithm that looks for repetitions within a small 16KB window.

Cache

To improve read performance, Tablet servers use two levels of caching. The Scan Cache is a higher-level cache that stores the key-value pairs returned by the SSTable interface to the Tablet server code. The Block Cache is a lower-level cache that stores SSTable blocks read from GFS.

Bloom filters

A Bloom filter per SSTable lets the server check whether an SSTable might contain data for a given row/column pair, which drastically reduces disk reads for rows and columns that do not exist.
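
A tiny sketch (my own) of the idea: if the filter answers "definitely not present", the SSTable is never touched. The filter size and keys below are arbitrary.

```python
# Toy Bloom filter guarding an SSTable lookup.
import hashlib

class TinyBloom:
    def __init__(self, bits=256):
        self.bits, self.array = bits, 0

    def _positions(self, key):
        for seed in (b"1", b"2", b"3"):                    # three hash functions
            digest = hashlib.sha256(seed + key.encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array & (1 << p) for p in self._positions(key))

bloom = TinyBloom()
bloom.add("com.cnn.www|anchor:cnnsi.com")
print(bloom.might_contain("com.cnn.www|anchor:cnnsi.com"))   # True
print(bloom.might_contain("org.missing.www|contents:"))      # almost surely False: skip the SSTable
```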

Commit-log implementation

Bigtable also optimizes the tablet commit log. If each tablet had its own log file, a very large number of files would be written concurrently in GFS, so instead a single commit log is kept per Tablet server, mixing the mutations of all its tablets, which greatly improves write efficiency. To rebuild a tablet from such a shared log, the log entries are first sorted by the key ⟨table, row name, log sequence number⟩ so that each tablet reads only its own contiguous portion.
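
My own sketch of that recovery step; the log entries are made up.

```python
# Sort shared-log entries by <table, row name, log sequence number> so each
# recovering tablet reads only its own contiguous slice of the log.
log_entries = [
    ("webtable", "com.cnn.www", 3, "put contents:"),
    ("usertable", "alice", 1, "put name:"),
    ("webtable", "com.cnn.www", 2, "put anchor:cnnsi.com"),
]

log_entries.sort()                        # groups each tablet's mutations together, in order

def entries_for(table, row_range):
    lo, hi = row_range
    return [e for e in log_entries if e[0] == table and lo <= e[1] < hi]

print(entries_for("webtable", ("com.cnn.", "com.cnn.\uffff")))   # only this tablet's slice
```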

Speeding up tablet recovery

When the master moves a tablet from one Tablet server to another, the source Tablet server first performs a minor compaction on that tablet, which reduces the amount of unflushed commit-log state that the target server would otherwise have to recover.

Exploiting immutability

All generated SSTables are immutable. For example, when reading data from an SSTable, no synchronization of file-system accesses is needed, so concurrency control over rows can be implemented very efficiently. The only mutable data structure accessed by both reads and writes is the memtable; to reduce contention on it, each memtable row is copy-on-write, which allows reads and writes to proceed in parallel.

Discussion

Single-row transactions in Bigtable

Bigtable provides single-row transaction support, including single-row read-update-write and atomicity across multiple columns of a single row. The original paper says very little about single-row transactions, so this part is based on my own understanding. Bigtable is a distributed store, but a Bigtable single-row transaction is not a distributed transaction: a single row of Bigtable data always lives within one tablet, so it never spans Tablet servers. A Bigtable single-row transaction is an ACID transaction.

  • A: Atomicity, reflected in two aspects: the data cannot be changed by other transactions during a read-update-write, and changes to multiple columns either all succeed or all fail.
  • C: Consistency. Bigtable has no integrity constraints to enforce.
  • I: Isolation. Bigtable transactions are single-row only, so the isolation level is effectively read committed (RC) and nothing stronger is needed.
  • D: Durability. Bigtable sits on top of the GFS distributed file system; a write is considered durable once it has been written successfully.

So the key point of this single-row transaction is atomicity. For the atomicity of single-row, multi-column operations, Bigtable writes an operation log before writing to memory; if something fails mid-write, replaying the log restores atomicity across the columns. The atomicity of read-update-write is guaranteed by CAS (compare-and-swap). Bigtable's single-row transaction is therefore very simple, even much simpler than a MySQL single-row transaction, which is probably why the paper barely discusses it.
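
My own sketch of a read-update-write guarded by compare-and-swap on the cell's current version, as described above; the storage layout is a toy stand-in, not Bigtable's actual format.

```python
# CAS-guarded read-update-write on a single cell.
cells = {("com.cnn.www", "counter:views"): (7, "41")}      # (row, column) -> (version, value)

def cas_write(row, column, expected_version, new_value):
    """Apply the write only if nobody has changed the cell since it was read."""
    version, _ = cells.get((row, column), (0, None))
    if version != expected_version:
        return False                                        # conflict: the caller retries
    cells[(row, column)] = (version + 1, new_value)
    return True

# Read-update-write: read the cell, compute the new value, write it back with CAS.
version, value = cells[("com.cnn.www", "counter:views")]
ok = cas_write("com.cnn.www", "counter:views", version, str(int(value) + 1))
print(ok, cells[("com.cnn.www", "counter:views")])          # True (8, '42')
```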

Chat πŸ† technology project stage v | distributed those things…