Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

8 Practical Application

As of August 2006, Google was running 388 non-test Bigtable clusters on various server clusters, totaling about 24,500 Tablet servers. Table 1 shows an approximate distribution of Tablet servers per cluster. Many of these clusters are used for development, so they are idle for significant periods. One group of 14 busy clusters with 8,069 Tablet servers handled an aggregate of more than 1.2 million requests per second, with incoming RPC traffic of about 741 MB/s and outgoing RPC traffic of about 16 GB/s.

Table 2 shows data for some of the tables currently in use. Some tables store data that is served to users, while others store data for batch processing. Total table size, average cell size, the percentage of data served from memory, and the complexity of the table schema all vary widely. This section describes how three product teams use Bigtable.

8.1 Google Analytics

Google Analytics helps Web site administrators analyze traffic patterns on their sites. It provides aggregate statistics (such as the number of unique visitors per day and the page views per URL per day) as well as site-tracking reports (for example, the percentage of users who purchased a product, given that they earlier viewed a particular page).

To enable the service, Web site administrators embed a small JavaScript program in their Web pages. This program is invoked whenever a page is visited and records the information Google Analytics needs, such as a user identifier and information about the page being fetched. Google Analytics summarizes this data and makes it available to the webmaster.

We briefly describe two of the tables used by Google Analytics. Each row of the RawClick table (approximately 200 TB of data) stores one end-user session. The row name is a tuple containing the Web site's name and the time at which the session was created. This naming scheme ensures that sessions for the same Web site are stored contiguously and sorted chronologically. The RawClick table compresses to 14% of its original size.
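
To make the row-naming scheme concrete, the sketch below builds a RawClick-style row key from a (Web site, session-creation-time) tuple so that keys for the same site share a prefix and sort chronologically under Bigtable's lexicographic row order. The key layout and the separator are illustrative assumptions, not the production schema.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative only: encode a (web site, session creation time) tuple as a
// row key. Keys for the same site share a prefix, so they sort contiguously,
// and the zero-padded timestamp keeps them in chronological order under
// lexicographic row ordering.
std::string RawClickRowKey(const std::string& site, uint64_t session_start_usec) {
  char ts[21];
  std::snprintf(ts, sizeof(ts), "%020llu",
                static_cast<unsigned long long>(session_start_usec));
  return site + "#" + ts;  // '#' is an assumed separator, not from the paper.
}

int main() {
  std::printf("%s\n", RawClickRowKey("maps.google.com", 1156000000000000ULL).c_str());
  std::printf("%s\n", RawClickRowKey("maps.google.com", 1156000123456789ULL).c_str());
  return 0;
}
```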

The Summary table (approximately 20 TB of data) contains various predefined summaries for each Web site. Periodically scheduled MapReduce jobs generate the Summary table from the RawClick table; each job extracts recent session data from the RawClick table. The overall throughput of this pipeline is limited by the throughput of GFS. The Summary table compresses to 29% of its original size.
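
The following sketch illustrates the shape of that summarization step with assumed, simplified logic (it is not Google's MapReduce job): a "map" over sessions keyed by site, followed by a "reduce" that aggregates per-site statistics destined for the Summary table.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Minimal sketch: one record per session keyed by site, aggregated into
// per-site counts that would be written back to the summary table.
struct Session { std::string site; int page_views; };

int main() {
  std::vector<Session> raw_clicks = {
      {"example.com", 3}, {"example.com", 7}, {"news.example.org", 2}};

  std::map<std::string, std::pair<int, int>> summary;  // site -> (sessions, views)
  for (const Session& s : raw_clicks) {                // "map" + "reduce" combined
    summary[s.site].first += 1;
    summary[s.site].second += s.page_views;
  }
  for (const auto& [site, agg] : summary)
    std::printf("%s: %d sessions, %d page views\n", site.c_str(), agg.first, agg.second);
  return 0;
}
```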

8.2 Google Earth

Google provides users with high-resolution satellite imagery of the Earth's surface through a collection of services. Users can access this imagery through the Web-based Google Maps interface (maps.google.com) or through the custom Google Earth client software. These products let users navigate the Earth's surface: they can pan, view, and annotate satellite imagery at many different resolutions. The system uses one table to preprocess data and a different set of tables to serve client data.

The preprocessing pipeline uses one table to store raw imagery. During preprocessing, the imagery is cleaned and consolidated into the final serving data. This table contains approximately 70 terabytes of data and is therefore served from disk. The images are already compressed efficiently, so Bigtable compression is disabled for this data.

Each row in the Imagery table corresponds to a single geographic segment. Rows are named so that adjacent segments are stored next to each other. The table contains a column family that tracks the data sources for each segment. This column family has a large number of columns: essentially one for each raw input image. Since each segment is built from only a few images, this column family is very sparse.
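
The paper does not describe the actual key format, but one common way to obtain the "adjacent segments stored together" property is a Z-order (Morton) key that interleaves tile coordinates; the sketch below is a purely hypothetical illustration of that idea.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical sketch: interleave the bits of the x/y tile coordinates so
// that spatially nearby tiles tend to have lexicographically nearby row keys.
std::string ImageryRowKey(uint32_t tile_x, uint32_t tile_y) {
  uint64_t key = 0;
  for (int bit = 31; bit >= 0; --bit) {
    key = (key << 1) | ((tile_x >> bit) & 1);
    key = (key << 1) | ((tile_y >> bit) & 1);
  }
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(key));
  return std::string(buf);  // hex-encoded so lexicographic order matches numeric order
}

int main() {
  std::printf("%s\n", ImageryRowKey(5, 9).c_str());
  std::printf("%s\n", ImageryRowKey(5, 10).c_str());
  return 0;
}
```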

The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. During some of these MapReduce jobs, the system as a whole processes more than 1 MB/s of data per Tablet server.

The serving system uses one table to index data stored in GFS. This table is relatively small (about 500 GB), but it must serve tens of thousands of queries per second per data center with low latency. It is therefore hosted across hundreds of Tablet servers and contains in-memory column families.

8.3 Personalized Search

Personalized Search (www.google.com/psearch) is an opt-in service that records user queries and clicks across a variety of Google services, such as Web search, Images, and News. Users can browse their search history to revisit previous queries and clicks, and they can request personalized search results based on their historical Google usage patterns.

Personalized Search stores each user's data in Bigtable. Each user is assigned a unique user ID, and that user ID is used as the row name. Each type of user action is stored in its own column family (for example, one column family stores all Web queries). Each data item uses the time at which the corresponding user action occurred as its Bigtable timestamp. Personalized Search generates user profiles with a MapReduce over Bigtable; these profiles are used to personalize live search results.
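
A minimal in-memory model of this layout is sketched below (it is not the Bigtable client API): one row per user ID, one column family per action type, and each cell timestamped with the time of the user action.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Illustrative model of the layout described above.
using Timestamp = uint64_t;
// row -> column family -> timestamp -> value
using Table = std::map<std::string,
                       std::map<std::string, std::map<Timestamp, std::string>>>;

void RecordAction(Table& t, const std::string& userid, const std::string& family,
                  Timestamp when_usec, const std::string& value) {
  t[userid][family][when_usec] = value;  // cell timestamp = time of the action
}

int main() {
  Table user_data;
  RecordAction(user_data, "user1234", "web_query", 1156003000000000ULL, "weather tokyo");
  RecordAction(user_data, "user1234", "image_click", 1156003005000000ULL, "image-result-url");
  std::printf("stored %zu column families for user1234\n", user_data["user1234"].size());
  return 0;
}
```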

Personalized Search data is replicated across several Bigtable clusters, which increases availability and reduces the latency caused by the distance between clients and the Bigtable cluster. The Personalized Search team originally built a client-side replication mechanism on top of Bigtable to keep all replicas consistent. The current system instead uses a replication subsystem built into Bigtable itself.

The Personalized Search storage system allows other groups to add per-user data in their own columns, and many Google services now use it to store user-level configuration parameters and settings. Sharing the table among many groups resulted in an unusually large number of column families. To better support sharing, we added a simple quota mechanism that limits the storage any particular client can consume in shared tables; this mechanism provides some isolation between the product groups that use the Personalized Search system for per-user storage.
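
The paper gives no details of the quota mechanism, so the following is only a hedged sketch of the idea: track per-client usage in a shared table and reject writes that would exceed the client's quota. The class and the accounting policy are assumptions made for illustration.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Hypothetical per-client storage quota for a shared table.
class QuotaTracker {
 public:
  void SetQuotaBytes(const std::string& client, long long bytes) { quota_[client] = bytes; }

  // Returns true if the write fits within the client's quota and records it.
  bool TryCharge(const std::string& client, long long bytes) {
    auto it = quota_.find(client);
    if (it == quota_.end()) return false;            // unknown client: reject
    if (used_[client] + bytes > it->second) return false;
    used_[client] += bytes;
    return true;
  }

 private:
  std::map<std::string, long long> quota_, used_;
};

int main() {
  QuotaTracker quotas;
  quotas.SetQuotaBytes("product_team_a", 1LL << 30);  // 1 GiB
  std::printf("write accepted: %d\n", quotas.TryCharge("product_team_a", 4096));
  std::printf("write accepted: %d\n", quotas.TryCharge("product_team_b", 4096));
  return 0;
}
```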

9 Lessons learned

We have learned many useful lessons in designing, implementing, maintaining, and supporting Bigtable.

For example, we found that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. The failures we encountered included memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems (Chubby, for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we gained experience, we addressed these problems by changing various protocols. For example, we added checksumming to our RPC mechanism. We also removed assumptions made by one part of the system about another part; for example, we stopped assuming that a given Chubby operation could return only one of a fixed set of error codes.
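
As an illustration of RPC checksumming (not Google's actual RPC layer or checksum algorithm), the sketch below attaches a checksum to each payload on send and verifies it on receipt, so that memory or network corruption is detected rather than silently propagated; FNV-1a is used only because it is short.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a hash used here as a simple stand-in checksum.
uint64_t Fnv1a(const std::string& data) {
  uint64_t h = 1469598103934665603ULL;
  for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
  return h;
}

struct RpcMessage { std::string payload; uint64_t checksum; };

RpcMessage MakeRpc(const std::string& payload) { return {payload, Fnv1a(payload)}; }
bool VerifyRpc(const RpcMessage& m) { return Fnv1a(m.payload) == m.checksum; }

int main() {
  RpcMessage m = MakeRpc("tablet-load request");
  std::printf("intact: %d\n", VerifyRpc(m));
  m.payload[0] ^= 0x01;  // simulate a single-bit corruption in transit
  std::printf("after corruption: %d\n", VerifyRpc(m));
  return 0;
}
```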

Another lesson is to delay adding a new feature until it is clear how that feature will be used. For example, we originally planned for our API to support general-purpose transactions, but because there was no immediate need for them, we did not implement them. Now that Bigtable has many real applications, we have found that most of them require only single-row transactions. Some applications want distributed transactions to maintain secondary indexes, and we address that need with a specialized mechanism. The new mechanism is much less general than distributed transactions, but it is more efficient (especially for updates that span hundreds of rows) and it interacts better with our scheme for optimistic cross-data-center replication.
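
The sketch below illustrates the single-row guarantee itself: an atomic read-modify-write confined to one row. The per-row lock is a stand-in used to show the semantics, not Bigtable's implementation.

```cpp
#include <cstdio>
#include <map>
#include <mutex>
#include <string>

// One row: a set of cells protected by a per-row lock (illustration only).
struct Row {
  std::mutex mu;
  std::map<std::string, std::string> cells;  // column -> value
};

// Atomic read-modify-write confined to a single row.
void AtomicIncrement(Row& row, const std::string& column) {
  std::lock_guard<std::mutex> lock(row.mu);
  long long v = row.cells.count(column) ? std::stoll(row.cells[column]) : 0;
  row.cells[column] = std::to_string(v + 1);
}

int main() {
  Row counters;
  AtomicIncrement(counters, "anchor:visits");
  AtomicIncrement(counters, "anchor:visits");
  std::printf("anchor:visits = %s\n", counters.cells["anchor:visits"].c_str());
  return 0;
}
```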

We also found that proper system-level monitoring is very important (that is, monitoring both Bigtable itself and the client processes using Bigtable). We extended our RPC system so that, for a sample of the RPCs, it keeps a detailed trace of the important actions performed on behalf of that RPC. This feature has allowed us to detect and fix many problems, such as lock contention on Tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the METADATA table when METADATA Tablets are unavailable. Each Bigtable cluster is registered in Chubby, and monitoring lets us see each cluster's status, its size, which software version it is running, how much traffic is flowing into it, and whether there are any problems causing unexpectedly high latency.
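
A hedged sketch of the RPC-sampling idea follows: for a sampled fraction of RPCs, record a detailed trace of the important actions performed on behalf of that RPC. The sampling policy and trace format are assumptions for illustration.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Per-RPC trace that records events only when the RPC was sampled.
struct RpcTrace {
  bool sampled = false;
  std::vector<std::string> events;
  void Log(const std::string& event) { if (sampled) events.push_back(event); }
};

RpcTrace StartRpc(double sample_rate) {
  RpcTrace t;
  t.sampled = (std::rand() / (RAND_MAX + 1.0)) < sample_rate;
  return t;
}

int main() {
  RpcTrace trace = StartRpc(/*sample_rate=*/1.0);  // always sample for the demo
  trace.Log("acquired tablet lock");
  trace.Log("commit: wrote mutation to GFS log");
  trace.Log("released tablet lock");
  for (const std::string& e : trace.events) std::printf("%s\n", e.c_str());
  return 0;
}
```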

The most valuable lesson we learned is the value of simple designs. Given the size of our system (about 100,000 lines of non-test code), future additions, and the variety of unplanned scenarios, we found that simplicity of design and code greatly aids maintenance and debugging. Take the Tablet server membership protocol as an example. Our first protocol was simple: the Master periodically issued leases to Tablet servers, and a Tablet server killed itself if its lease expired. Unfortunately, this protocol reduced availability significantly in the presence of network problems and was also sensitive to Master recovery time. We redesigned the protocol several times until it performed well, but the resulting protocol was too complex and depended on Chubby features that were rarely exercised by other applications, and we wasted a great deal of time debugging obscure corner cases in both the Bigtable and Chubby code. In the end, we scrapped this protocol and moved to a newer, simpler protocol that relies only on widely used Chubby features.
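
The first, simple protocol can be sketched roughly as follows: the Master grants a lease, and a Tablet server that cannot renew its lease before it expires stops serving (in the real system it kills itself). The durations and the renewal interface here are illustrative assumptions.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

struct Lease { Clock::time_point expiry; };

// Renewal succeeds only while the Master is reachable.
bool RenewLease(Lease& lease, std::chrono::seconds term, bool master_reachable) {
  if (!master_reachable) return false;  // e.g. a network partition
  lease.expiry = Clock::now() + term;
  return true;
}

int main() {
  Lease lease{Clock::now() + std::chrono::seconds(1)};
  bool master_reachable = false;        // simulate losing contact with the Master
  while (Clock::now() < lease.expiry) {
    RenewLease(lease, std::chrono::seconds(1), master_reachable);
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
  }
  std::printf("lease expired: tablet server stops serving\n");
  return 0;
}
```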

10 Related work

Some of the components of the Boxwood project overlap with Chubby, GFS, and Bigtable, since Boxwood also supports distributed agreement, locking, distributed chunk storage, and distributed B-tree storage. Where they overlap, Boxwood's components target a lower level: Boxwood aims to provide building blocks for creating higher-level services such as file systems and databases, whereas Bigtable directly supports data storage for client applications.

A number of projects have implemented distributed data storage or higher-level services over wide-area networks, often at "Internet scale," such as the work on distributed hash tables in the CAN, Chord, Tapestry, and Pastry projects. These systems address concerns that do not arise for Bigtable, such as highly variable bandwidth, untrusted participants, and frequent reconfiguration; in addition, Bigtable does not aim for decentralized control or Byzantine fault tolerance.

In terms of the distributed data-storage model offered to application developers, we believe the key-value pair model provided by distributed B-trees and distributed hash tables is too limiting. Key-value pairs are a useful building block, but they should not be the only building block provided to developers. The model we chose is richer: it supports sparse semi-structured data. Nonetheless, it is still simple enough to lend itself to a very efficient flat-file representation, and it is transparent enough (via locality groups) to allow users to tune important behaviors of the system.
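
The kind of tuning referred to here can be sketched as follows (an illustrative stand-in, not Bigtable's schema-definition API): column families are grouped into locality groups, and a locality group can be marked as served from memory or given different compression settings.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical schema description: each locality group bundles column
// families and carries per-group tuning flags.
struct LocalityGroup {
  std::vector<std::string> column_families;
  bool in_memory = false;   // serve this group from memory
  bool compressed = true;   // e.g. disabled for already-compressed imagery
};

int main() {
  std::map<std::string, LocalityGroup> schema;
  schema["metadata"] = {{"language", "checksum"}, /*in_memory=*/true, /*compressed=*/true};
  schema["contents"] = {{"contents"}, /*in_memory=*/false, /*compressed=*/true};
  for (const auto& [name, lg] : schema)
    std::printf("%s: %zu families, in_memory=%d\n", name.c_str(),
                lg.column_families.size(), lg.in_memory);
  return 0;
}
```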

Several database vendors have developed parallel database systems that can store massive amounts of data. Oracle RAC uses shared disks to store data (Bigtable uses GFS) and a distributed lock manager (Bigtable uses Chubby). IBM DB2 Parallel Edition is based on a shared-nothing architecture similar to Bigtable's: each DB2 server is responsible for a subset of the rows in a table, which it stores in a local relational database. Both products provide a complete relational model with transactions.

Bigtable's locality groups provide some of the compression and disk-read performance benefits of column-based storage systems, such as C-Store and commercial products like Sybase IQ, SenSage, KDB+, and the ColumnBM storage layer in MonetDB/X100. AT&T's Daytona database provides both vertical and horizontal data partitioning in flat files and achieves good data compression. Locality groups do not, however, support CPU-cache-level optimizations such as those in the Ailamaki system.

Bigtable's use of a memtable and SSTables to store updates to tables is similar to the way the Log-Structured Merge Tree stores updates to index data. In both systems, sorted data is buffered in memory before being written to disk, and read operations must merge data from memory and disk.
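
A minimal sketch of that memtable/SSTable split follows, with the layout and flush policy simplified for illustration: recent writes go to an in-memory sorted map, older data lives in immutable runs, and a read merges both, with the memtable taking precedence.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using Run = std::map<std::string, std::string>;  // sorted key -> value

struct Store {
  Run memtable;                 // buffer for current writes
  std::vector<Run> sstables;    // older immutable runs, newest first

  void Write(const std::string& key, const std::string& value) { memtable[key] = value; }

  void Flush() {                // memtable becomes an immutable run
    sstables.insert(sstables.begin(), memtable);
    memtable.clear();
  }

  // A read merges memory and disk-resident runs, newest first.
  bool Read(const std::string& key, std::string* value) const {
    if (auto it = memtable.find(key); it != memtable.end()) { *value = it->second; return true; }
    for (const Run& run : sstables)
      if (auto it = run.find(key); it != run.end()) { *value = it->second; return true; }
    return false;
  }
};

int main() {
  Store store;
  store.Write("com.cnn.www", "old contents");
  store.Flush();
  store.Write("com.cnn.www", "new contents");
  std::string v;
  if (store.Read("com.cnn.www", &v)) std::printf("%s\n", v.c_str());
  return 0;
}
```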

C-Store and Bigtable share many characteristics: both use a shared-nothing architecture, and both maintain two different data structures, one for recent writes and one for long-lived data, with a mechanism for moving data from one form to the other. The two systems differ significantly in their APIs, however: C-Store behaves like a relational database, whereas Bigtable provides a lower-level read/write interface and is designed to support many thousands of operations per second per server. C-Store is also a read-optimized relational database, whereas Bigtable performs well on both read-intensive and write-intensive applications.

Bigtable must also solve some of the load- and memory-balancing problems faced by shared-nothing databases. Our problem is somewhat simpler: (1) we do not consider the possibility of multiple copies of the same data, possibly in alternate forms due to views or indexes; (2) we let users decide which data belongs in memory and which belongs on disk; and (3) we have no complex queries to execute or optimize.

11 Conclusion

Bigtable clusters have been in production use since April 2005, and before that date we spent roughly seven person-years on design and implementation. As of August 2006, more than sixty projects were using Bigtable. Our users are satisfied with the performance and high availability that Bigtable provides, and they can scale the capacity of their clusters by simply adding machines as their resource demands grow.

Given Bigtable's unusual programming interface, an interesting question is how difficult it has been for users to adapt to it. New users are sometimes uncertain of how best to use the Bigtable interface, but they eventually become comfortable with it, and the fact that many Google products use Bigtable successfully demonstrates that our design works well in practice.

We are currently working on several new features, such as support for secondary indexes and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas. We have also begun deploying Bigtable as a service for other product teams, so that individual teams do not need to maintain their own Bigtable clusters. As these service clusters grow, we will need to deal with more resource-sharing issues within Bigtable itself.

Finally, we have found that there are significant advantages to building our own storage solution at Google. We gained a great deal of flexibility by designing our own data model for Bigtable, and because we control Bigtable's implementation, as well as the other Google infrastructure it depends on, we can resolve bottlenecks and inefficiencies quickly as they arise.
