“This is the 12th day of my participation in the First Challenge 2022. For details: First Challenge 2022.”

An overview of the

Facebook TAO[1], or ‘The Associations and Objects’, is an appropriate way to store names for Facebook graphs.

To sum up, TAO is Facebook’s solution to the problem of updating and association reading of super-large data in social scenarios. Its core features are as follows:

  1. Provides graph apis specialized for Facebook social news flow scenarios, such as point-and-search, one-time relational queries, and time-range queries.
  2. Two layers architecture, MySQL as the storage layer, MemeCache as the cache layer; The cache layer can be subdivided into two layers: primary and secondary.
  3. Multi-room expansion, highly read oriented performance optimization, only final consistency guarantee.

Author: A miscellany of wood birdswww.qtmuniao.com/2021/10/07/…

Historical evolution

Facebook’s early data was stored in MySQL [2]. When MySQL failed, Zuckerberg introduced MemCache as a cache layer in 2005 to deal with more frequent read requests. MySQL and MemCache have since become part of Facebook’s storage stack.

Facebook data request loads generally conform to temporal locality (that is, the most recently updated data is most easily accessed) rather than spatial locality. However, data in MySQL is usually not stored in time order, so the block cache in MySQL InnoDB engine does not match this feature. In addition, MemCache itself only provides a memory-based KV access model. In order to utilize this memory more efficiently, Facebook needs to customize its cache strategy for social scenarios to make as many read requests hit as possible.

Exposing these engineering details, including two-tier storage clusters, and self-organizing caches, to application-layer engineers creates a lot of engineering complexity, leads to more bugs, and slows product iteration. To address this issue, Facebook used PHP in 2007 to create an abstraction layer on the server side, based on the graph storage model and providing apis around points (objects) and edges (joins). Because likes, events, pages, etc. in social scenarios can be easily expressed through graph models, this layer of abstraction greatly reduces the mental burden of application layer engineers.

However, as more apis are required, the disadvantages of implementing the graph model layer (on WebServer) and the data layer (on MySQL and MemCache clusters) separately become apparent:

  1. Minor updates from the edge set invalidate the entire edge set, reducing the cache hit ratio.
  2. Requesting a tiny subset of the edge list also requires pulling the entire edge list from storage to server.
  3. Cache consistency is difficult to maintain.
  4. At the time, MemCache clusters were difficult to coordinate to support a pure client-side scare avoidance strategy.

All of these issues can be addressed by redesigning a unified, graph-based storage tier. TAO has been working with a team inside Facebook since 2009. Since then, TAO has evolved into a distributed service that supports billions of reads and millions of writes per second, deployed on massive machines across regions.

Graph model & API

The most basic components of a graph are points and edges, corresponding to TAO, Objects and Associations. Both objects and joins can contain a set of properties represented by key-value pairs.

Object: (id) → (otype, (key  value)*)
Assoc.: (id1, atype, id2) → (time, (key  value)*)
Copy the code

Note: Edges in TAO are directed edges.

Take social networks as an example. The objects can be users, clocking in, places, comments, and links can be friends, commenting, clocking in, clocking in somewhere, and so on.

Imagine an event on Facebook: Alice and Bob punch a card on the Golden Gate Bridge. Cathy comments: I wish I was there. David liked this comment.

After represented by graph model, as shown in Figure B) :

As you can see, all data items such as users, places, punched, comments are represented as typed objec. The relationships between objects such as LIKED_BY, FRIEND, COMMENT are typed objec. It is represented as typed associations.

In addition, although the connections in TAO are unidirectional, in practice most of the relationships are bidirectional. In this case, an inverse edge can be added to represent this bidirectional relationship.

Finally, since joins are triples, two objects can have multiple edges of different types, but only one edge of the same type. But in some non-social scenarios, you might want to have multiple edges of the same type.

Object API

Operations around objects are common additions, deletions, changes, and searches (create/delete/set-fields/get).

Objects of the same object type have the same set of attributes (key values *), that is, an object type corresponds to a fixed set of attributes. You can add or delete attributes of an object type by modifying its Schema.

Association API

The basic operation of Association is to add, delete, change and check. The additions, deletions and revisions are as follows:

Assoc_add (ID1, atype, ID2, time, (k→v)*) -- Add or overwrite assoc_delete(ID1, atype, ID2) -- Delete assoc_change_type(ID1, atype, id2, Newtype) - modificationCopy the code

It is worth mentioning that if its reverse edge ((ID1, INV (atype), ID2)) exists, the above API will work on its reverse edge simultaneously. Because the join is bidirectional in most scenarios, Facebook applies its edge API default behavior to both edges.

In addition, each Association is automatically marked with an important special attribute: Association Time. Because of the time locality of The Facebook load, this timestamp can be used to optimize the cache dataset to improve cache hit ratio.

Association Query API

The query API around Association, which is TAO’s core API, has the most traffic. These load types include:

  1. The specified(id1, type, id2)Is usually used to determine whether there is a corresponding join between two objects, or to obtain the attributes of the corresponding join.
  2. The specified(id1, type)A range query that requires result sets to be sorted in descending chronological order. Here’s a common scenario:What are the last 50 comments on this article?. In addition, it is best to provide access in the form of iterators.
  3. The specified(id1, type)Query the number of outgoing edges. For example, query * what is the number of likes for a comment? * This type of query is so common that it is best to save it directly to be able to return results in constant time.

Although joins are numerous, the nearest range is the focus query object (time locality), so the join query API is centered around time range queries.

To this end, TAO defines the most basic set of joins as an Association List. An Association List is a set of all joins starting with ID1 and with an atype edge type, arranged in chronological descending order.

Association List: (id1, atyle) -> [a_new, ..., a_old]
Copy the code

Based on this, several more fine-grained interfaces are defined:

// start with id1 and end with the point contained in id2set
// Creation time Time The value is low <= time <= high
// join set.assoc_get(id1, atype, id2set, high? , low?)// Return the number of join sets
assoc_count(id1, atype)

// return a subset of joins whose subscripts satisfy pos, pos+limit
Pos is the subscript in the Association List
assoc_range(id1, atype, pos, limit) 

Time <= high ** in reverse order **
Time >= low
assoc_time_range(id1, atype, high, low, limit)
Copy the code

Why are the result sets arranged in chronological descending order? This is because when a Facebook page’s news stream is displayed, it always shows the most recent data first, and then loads older data as it continues to drop down.

Here’s an example:

•"50"Most recent comments on Alice's checkin"632, COMMENT, 0.50• "How many checkins at the GG Bridge?" ⇒ assoc_count (534, CHECKIN)
Copy the code

architecture

TAO architecture is divided into two layers: a caching layer and a storage layer.

Storage layer

TAO uses MySQL as the storage layer for historical reasons mentioned earlier.

As a result, TAO’s external API is eventually converted into MySQL statements for the storage layer, but MySQL queries are relatively simple. Of course, the storage layer can also use a NoSQL storage engine such as LevelDB, so that the query will be translated as a prefix traversal. Of course, choosing a storage engine depends not only on the convenience of API translation, but also on non-API factors such as data backup, batch import and export, and multi-copy synchronization.

A single MySQL service cannot hold all TAO data, so TAO uses a MySQL cluster to support the storage layer. To distribute data evenly across multiple MySQL machines, TAO uses a consistent hash algorithm to logically slice data (shard). Each slice is stored in a MySQL DB. Each Object is associated with a shard when it is created, and the shard_id is added to the object_id, so that the shard does not change during the entire life of the Object.

Specifically, the data stored in MySQL mainly includes two tables, a point table, a side table. Where, point and its outedge will exist in the same MySQL DB to minimize the cost of associated query. All point attributes, when saved, are serialized to a column called data. In this way, objects of different types can be stored in a single table. Edges are saved similarly to points, but are additionally indexed on id1, AType,and AndTime fields to facilitate range queries based on a point’s outgoing edges. Additionally, to avoid the overhead of querying the number of edges, an additional table is used to hold the number of associations.

Buffer layer

** Read and write penetration. **TAO’s storage layer implements all external apis and completely shields the storage layer from clients. That is, Clients only interacts with the cache tier, which is responsible for synchronizing data to the storage tier. The cache layer also consists of multiple cache servers. A set of cache servers that can Serve any TAO request is called a Tier. A single request is routed to a single cache server, not across multiple servers.

The cache strategy uses the classic LRU. It is worth noting that because TAO’s edges are bidirectional by default, when the Client writes an edge, it changes from the cache layer to two directed edges that are responsible for turning them into write-out and write-back edges, but TAO does not guarantee atomicity. If it fails, the intermediate results are removed through garbage collection.

Two-tier architecture. Each logical Shard (Shard) in TAO is essentially isomorphic. The cache layer of each logical shard consists of a group of cache servers, consisting of a Leader cache server and a group of Follower cache servers.

Among them, the Followers cache server is the outer layer and the Leader server is the inner layer. All clients only deal with Followers. The Followers cache server itself only handles read requests. If a read fails or a write request is found, it will be forwarded to the corresponding Leader cache server.

If the load of read requests continues to increase, expand the Follower cache server.

If access to some objects is significantly higher than others, TAO identifies them by logging access frequency, then caches them on the client side, and maintains consistency by version number.

Consistency. After receiving parallel write requests from multiple followers, the Leader will sequence them, serialize them to the storage layer for synchronous read and write, and then return them. For write requests, other Follower services are notified asynchronously to update the corresponding data, so TAO can only provide final consistency. The benefit of this is that the read request throughput is high.

Expand in multiple places. TAO’s read request frequency is about 25 times that of the write frequency, and a single datacenter is not sufficient for a Facebook global scenario. TAO uses a master-slave architecture as a whole, with both datacenters deploying a storage + Cache layer as Primary-Secondary. All write requests are routed from the Leader Cache in the datacenter to the Primary datacenter. The primary data center storage layer then asynchronously passes back to the slave data center. However, the Leader Cache from the data center does not wait for the local store to synchronize data back to the Leader Cache. The Leader Cache then updates the replica and notifies the Followers to the Replica. TAO’s design maximizes the guarantee that a read request will be satisfied within a DataCenter at the expense of clients reading stale data. That is, sacrificing consistency to reduce latency and increase throughput.

consistency

When it comes to consistency and usability trade-offs, TAO chooses the latter. For high availability and extreme performance, the weakened consistency model — ultimate consistency — was chosen. Because in most Facebook scenarios, being unavailable is worse than being incorrect. TAO achieves better read after write consistency in most common scenarios.

The same data in TAO is first backed up by master-slave Region. Second, in the same Region, the leader-follower Cache is used for two layers of Cache. When updating, data in different locations is not synchronized, resulting in data inconsistency. In TAO, given enough time after an update, all copies of the data converge and reflect the latest update. Normally, the interval is no more than 1s. This is fine in most scenarios on Facebook.

For scenarios that have special requirements for consistency, the application layer can mark requests as critical. When TAO receives a request with this mark, it forwards it to the Master Region for processing, thus achieving strong consistency.

reference

Paper [1] TAO: www.usenix.org/system/file…

[2] Facebook technology blog, the power of TAO – figure: engineering.fb.com/2013/06/25/…

[3] Meetup TAO: www.notion.so/Meetup-1-Fa…

[4] Stanford 6.S897 courseware: cs.stanford.edu/~matei/cour…


I am Green teng mu bird, a distributed system programmer like photography, welcome to pay attention to my public number: “Mu Bird miscellaneous Record”.