10,000 attributes, 10 billion records, 100,000 requests per second: how to design the architecture?

Some business scenarios have no fixed storage schema yet hold an enormous number of rows. How can storage and retrieval for this kind of business be implemented architecturally? Today we discuss the technical details of the architecture behind 58's most important data, the "post".

I. Background and business introduction

What is the core data of 58?

58 is an information platform with many vertical categories: recruitment, real estate, second-hand goods, second-hand cars, Yellow Pages and so on. Each category has many sub-categories. Whatever the category, the core data is the "post".

Voice-over: Is it like a big forum?

What are the characteristics of the information in each classified post?

Anyone who has visited 58 will recognize the following:

(1) Attributes vary widely between categories: recruitment posts and second-hand posts have completely different attributes, as do second-hand mobile phones and second-hand home appliances. There are probably close to ten thousand attributes at present;

(2) The data volume is huge, on the order of 10 billion records;

(3) Every attribute may be queried, and any combination of attributes may be queried together: recruitment searches by position/experience/salary range, second-hand mobile phones by color/price/model, second-hand goods by refrigerator/washing machine/air conditioner;

(4) Throughput is high, several hundred thousand requests per second.

How can the technical problems of 10 billion records, 10,000 attributes, multi-attribute combination queries and 100,000 concurrent queries be solved? Let's take it step by step.

II. The most obvious solution

Every company grows from small to large, so let's set concurrency and data volume aside for now and look first at:

(1) how to meet the requirement for attribute extensibility;

(2) how to meet the requirement for multi-attribute combination queries.

Voice-over: When a company is starting out, concurrency and data volume are small, so the business problems have to be solved first.

How can the storage requirements of the business be met?

In the beginning, the business had only one category, recruitment, and the post table might look something like this:

tiezi(tid, uid, c1, c2, c3);

So how can combined queries across these attributes be supported?

The most obvious approach is to use composite indexes:

index_1(c1, c2)

index_2(c2, c3)

index_3(c1, c3)

As the business grows and a new real estate category is added, what happens to storage?

Several columns can be added to meet the storage requirement, so the post table becomes:

tiezi(tid, uid, c1, c2, c3, c10, c11, c12, c13);

where:

c1, c2 and c3 are the recruitment category attributes
c10, c11, c12 and c13 are the real estate category attributes

Adding columns solves the storage problem.

And how are the query requirements met?

First, there is generally no need for combined queries across the attributes of different businesses. So the only option is to create several more composite indexes to serve the real estate category's queries.

Voice-over: I can't imagine how many indexes it would take to cover every two- and three-attribute query.
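To put a rough number on that remark, the sketch below counts the distinct two- and three-attribute combinations within one category. It counts combinations to be covered, not the minimum number of indexes (a database can often reuse the leading columns of a composite index), and the function name and the 20-attribute category are made up for illustration.

```python
from math import comb


def combo_count(n_attrs: int) -> int:
    """Number of distinct 2- and 3-attribute combinations among n_attrs columns."""
    return comb(n_attrs, 2) + comb(n_attrs, 3)


for label, n in [("recruitment (c1..c3)", 3),
                 ("real estate (c10..c13)", 4),
                 ("hypothetical category with 20 attributes", 20)]:
    print(label, combo_count(n))
```

Even a modest category with 20 attributes already has over a thousand combinations, which is why indexing every combination inside the database does not scale.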

As more and more businesses are added, does this approach start to look unworkable?

III. Vertical splitting is one idea

Adding columns is one way to extend storage, and adding tables is another. Vertical splitting is a common storage extension scheme.

How is the data split vertically by business?

It can be done like this:

tiezi_zhaopin(tid, uid, c1, c2, c3);

tiezi_fangchan(tid, uid, c10, c11, c12, c13);

What problems does vertical splitting cause when the businesses differ and both data volume and throughput are huge?

These tables and their corresponding services are maintained by different departments. That looks flexible, with each R&D team forming its own closed loop, but it is exactly where the tragedy begins:

(1) How is tid standardized?

(2) How are attributes standardized?

(3) How do you query by uid (all posts published by one user)?

(4) How do you query by time (the latest posts)?

(5) How do you query across categories (for example, the home page search box)?

(6) The technology stack spreads out: some teams store in Mongo, some in MySQL, some in self-developed storage;

(7) Many components are developed repeatedly;

(8) Maintenance costs are high;

(9)…

Voice-over: Just think about it: an e-commerce platform could not keep a separate product table for every category.
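Problems (1) and (3) in the list above can be made concrete with a small sketch. It uses in-memory SQLite purely as a stand-in (the article's real tables live in separate services and storage engines), and the column types, sample rows and helper name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tiezi_zhaopin (tid INTEGER, uid INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE tiezi_fangchan (tid INTEGER, uid INTEGER,
                                 c10 TEXT, c11 TEXT, c12 TEXT, c13 TEXT);
    -- Each table hands out its own ids, so tid = 1 exists twice.
    INSERT INTO tiezi_zhaopin VALUES (1, 42, 'driver', '8000', 'bj');
    INSERT INTO tiezi_fangchan VALUES (1, 42, 'a', 'b', 'c', 'd');
""")

# This list has to grow every time a new business adds its own table.
CATEGORY_TABLES = ["tiezi_zhaopin", "tiezi_fangchan"]


def posts_by_uid(uid: int) -> list[tuple[str, int]]:
    """Return (table, tid) for every post of a user by querying each table in turn."""
    found = []
    for table in CATEGORY_TABLES:
        rows = conn.execute(f"SELECT tid FROM {table} WHERE uid = ?", (uid,))
        found.extend((table, tid) for (tid,) in rows)
    return found


print(posts_by_uid(42))
```

A single "my posts" query fans out to every category table, and the tid alone no longer identifies a post.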

IV. 58's approach: three central services

First: unified post center service

A platform-style startup may have many categories, each with its own heterogeneous data to store. There is no need to agonize over whether to split or merge: unifying the basic data and basic services is good practice.

Voice-over: This is platform business.

How can the heterogeneous data of different categories be stored uniformly?

(1) The attributes common to all categories are stored uniformly;

(2) The attributes specific to a single category are stored as JSON, alongside the category type and the common attributes.

More specifically:

tiezi(tid, uid, time, title, cate, subcate, xxid, ext);

(1) Some common fields are extracted and stored as separate columns;

(2) The meaning of ext is defined by cate, subcate, xxid and so on;

(3) ext stores the business-specific data of the different lines of business.

For example:

For a recruitment post, ext is:

{"job": "driver", "salary": 8000, "location": "bj"}

For a second-hand post, ext is:

{"type": "iphone", "money": 3500}
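As a minimal sketch of this layout, the snippet below keeps the common columns as real columns and serializes the category-specific part into ext as JSON. In-memory SQLite stands in for the real storage; the column types, the cate/subcate/xxid values and the helper names are assumptions, not the IMC implementation.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tiezi (
        tid INTEGER PRIMARY KEY, uid INTEGER, "time" INTEGER, title TEXT,
        cate TEXT, subcate TEXT, xxid INTEGER, ext TEXT)
""")


def save_post(tid, uid, time, title, cate, subcate, xxid, ext: dict) -> None:
    """Store common fields as columns and category-specific fields as JSON."""
    conn.execute("INSERT INTO tiezi VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 (tid, uid, time, title, cate, subcate, xxid, json.dumps(ext)))


def load_post(tid: int) -> dict:
    """Forward lookup by tid; ext is decoded back into a dict."""
    uid, cate, ext = conn.execute(
        "SELECT uid, cate, ext FROM tiezi WHERE tid = ?", (tid,)).fetchone()
    return {"tid": tid, "uid": uid, "cate": cate, "ext": json.loads(ext)}


# Two heterogeneous posts share one table (sample values are made up).
save_post(1, 42, 1700000000, "Driver wanted", "zhaopin", "driver", 0,
          {"job": "driver", "salary": 8000, "location": "bj"})
save_post(2, 42, 1700000100, "Used iPhone", "ershou", "phone", 0,
          {"type": "iphone", "money": 3500})
print(load_post(1)["ext"], load_post(2)["ext"])
```

The table schema never changes when a category adds an attribute; only the contents of ext do.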

The post data, 10 billion records in all, is sharded into 256 databases. Heterogeneous business data is stored in ext, MySQL is the storage, a post center service sits on top, and memcache serves as the cache. An architecture this uncomplicated solves a big business problem. This is 58's most important service, the post center IMC (Info Management Center).

Voice-over: In 2016 the underlying storage of this service was switched entirely from MySQL to a self-developed storage engine, but the architectural concept remains the same.
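The read and write paths of such a service might look like the sketch below. The article only states that there are 256 databases with memcache in front; routing by tid modulo 256, the cache-aside pattern, and the in-memory dicts standing in for MySQL and memcache are all assumptions.

```python
NUM_SHARDS = 256                              # shard count given in the article

shards = [dict() for _ in range(NUM_SHARDS)]  # stand-in for the 256 databases
cache = {}                                    # stand-in for memcache


def shard_index(tid: int) -> int:
    """Assumed routing rule: post id modulo the number of shards."""
    return tid % NUM_SHARDS


def put_post(tid: int, post: dict) -> None:
    """Write to the owning shard, then drop any stale cached copy."""
    shards[shard_index(tid)][tid] = post
    cache.pop(tid, None)


def get_post(tid: int):
    """Cache-aside read: try the cache first, fall back to the shard."""
    if tid in cache:
        return cache[tid]
    post = shards[shard_index(tid)].get(tid)
    if post is not None:
        cache[tid] = post
    return post
```

Because every lookup is keyed by tid, each request touches exactly one shard, which is what lets the post service stay simple.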

With the storage of massive heterogeneous data solved, new problems appear:

(1) The keys in ext are stored again in every record, which takes a lot of space. Can they be compressed?

(2) cateid is no longer enough to describe the contents of ext. Categories are hierarchical and of uncertain depth. Can ext be self-describing?

(3) Attributes must be addable at any time, to preserve extensibility.

Having solved the storage of massive heterogeneous data, the next step is to solve category extensibility.

Second: unified category attribute service

How many attributes each business has, what those attributes mean, what constraints apply to their values and so on: coupling all of this to the post service is clearly unreasonable. So what should be done?

Abstract out a unified category and attribute service that manages this information independently, and represent the JSON keys in the post table's ext field by numbers to reduce storage space.

Voice-over: Post tables only store meta information, regardless of business meaning.

As shown in the figure above, the keys in the JSON are no longer long strings such as "salary", "location" and "money" but the numbers 1, 2, 3 and 4. What these numbers mean, which sub-categories they belong to, and the validation constraints on their values are all stored in the category and attribute service.

Voice-over: Category table stores business information, as well as constraint information, decoupled from the post table.

For example:

(1) key 1 stands for job and belongs to sub-category 100 of the recruitment category; its value must be a string of [a-z] characters shorter than 32;

(2) key 4 stands for type and belongs to sub-category 200 of the second-hand category; its value must be a short;

With this, the ext field in the original post table becomes:

{"1": "driver", "2": 8000, "3": "bj"}

{"4": "iphone", "5": 3500}

There is a uniform constraint on both key and value.

In addition, if the value of a key in ext is validated not by a rule but against a set of enumeration values, an enumeration table is needed to check the value.

This enumeration check means that the value of the key=4 attribute (the type field of second-hand mobile phones in the attribute table) is not merely checked as a "short" but must be one of a fixed set of enumeration values.

So an ext like this, with key=4 and value=iphone, is no longer acceptable:

{"4": "iphone", "5": 3500}

It has to store the enumeration value instead:

{"4": "5", "5": 3500}
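The key compression and value checks described in this section can be sketched as a small attribute registry. This is an illustration under assumptions, not the CMC implementation: the constraints for keys 2, 3 and 5, the extra enumeration entry "6", the reading of "less than 32" as a length limit, and all class and function names are invented here.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attr:
    name: str                        # readable meaning of the numeric key
    subcate: int                     # sub-category that owns the attribute
    check: Callable[[object], bool]  # validation rule for the value


# Enumeration table for key 4. "5" -> iphone follows the article's example;
# "6" -> xiaomi is a made-up extra entry.
ENUMS = {4: {"5": "iphone", "6": "xiaomi"}}

ATTRS = {
    1: Attr("job", 100, lambda v: isinstance(v, str)
            and re.fullmatch(r"[a-z]{1,31}", v) is not None),
    2: Attr("salary", 100, lambda v: isinstance(v, int)),     # assumed rule
    3: Attr("location", 100, lambda v: isinstance(v, str)),   # assumed rule
    4: Attr("type", 200, lambda v: v in ENUMS[4]),            # enumeration check
    5: Attr("money", 200, lambda v: isinstance(v, int)),      # assumed rule
}


def validate(ext: dict) -> bool:
    """True if every key is registered and every value passes its check."""
    return all(int(k) in ATTRS and ATTRS[int(k)].check(v) for k, v in ext.items())


def describe(ext: dict) -> dict:
    """Expand numeric keys and enumeration ids back into readable form."""
    out = {}
    for k, v in ext.items():
        key = int(k)
        out[ATTRS[key].name] = ENUMS[key][v] if key in ENUMS else v
    return out
```

With this registry, `{"4": "iphone", "5": 3500}` fails validation while `{"4": "5", "5": 3500}` passes, and `describe` turns the compact form back into readable names without the post table knowing anything about business meaning.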

In addition, the category attribute service can record the hierarchical relationships between categories:

(1) The primary categories are recruitment, real estate, second-hand…

(2) Under second-hand there are secondary categories: second-hand furniture, second-hand mobile phones…

(3) Under second-hand mobile phones there are tertiary categories: second-hand iPhone, second-hand Xiaomi, second-hand Samsung…

(4)…
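A hierarchy of uncertain depth like this is commonly kept as a parent pointer per category. The sketch below uses the category names from the list above; the dict representation and the function name are assumptions.

```python
# child -> parent; None marks a primary category
PARENT = {
    "recruitment": None,
    "real estate": None,
    "second-hand": None,
    "second-hand furniture": "second-hand",
    "second-hand mobile phones": "second-hand",
    "second-hand iPhone": "second-hand mobile phones",
    "second-hand Xiaomi": "second-hand mobile phones",
    "second-hand Samsung": "second-hand mobile phones",
}


def path(category: str) -> list[str]:
    """Return the chain from the primary category down to `category`."""
    chain = []
    node = category
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain[::-1]


print(path("second-hand iPhone"))
```

Adding a fourth or fifth level needs no schema change, only more entries.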

The category service interprets the post data, describes the category hierarchy, keeps every category's attributes extensible, and ensures that each attribute value is validated properly. It is another of 58's unified core services, CMC (Category Management Center).

Voiceover: Are category and attribute services like SKU extension services in e-commerce systems?

(1) Category hierarchy, corresponding to the category hierarchy system in e-commerce;

(2) Attribute extension, corresponding to the attribute of SKU of various commodities in e-commerce;

(3) Enumeration value validation, corresponding to enumerated attribute values such as color: red, yellow, blue.

The category service solves key compression, key description, key extension, value validation and the category hierarchy. One problem remains: the posts in each category have different attributes and different query requirements, so how can retrieval and combined retrieval be supported across 10 billion records and 10,000 attributes?

Third: unified search service

When the data volume is large, composite indexes cannot satisfy every query over every attribute. "External indexes plus a unified search service" is a common practice:

(1) The database serves forward lookups by post ID;

(2) Every retrieval requirement that is not a lookup by post ID is served uniformly by external indexes.

Operations on metadata and index data work as follows:

(1) A lookup by tid goes directly to the post service;

(2) When a post is modified, the post service notifies the search service so that the index is modified as well;

(3) Complex queries on posts are served by the search service.
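The three rules above can be sketched with two in-memory structures. This is an illustration only: the dicts stand in for the post service and the search service, only exact-match conditions are supported, and every name is an assumption.

```python
from collections import defaultdict

posts = {}                 # tid -> ext dict (stand-in for the post service)
index = defaultdict(set)   # (key, value) -> tids (stand-in for the search service)


def _unindex(tid: int) -> None:
    """Remove the old version of a post from the index, if there is one."""
    for k, v in posts.get(tid, {}).items():
        index[(k, v)].discard(tid)


def save_post(tid: int, ext: dict) -> None:
    """Rule (2): write the post, then update the index to match."""
    _unindex(tid)
    posts[tid] = dict(ext)
    for k, v in ext.items():
        index[(k, v)].add(tid)


def get_post(tid: int):
    """Rule (1): a lookup by tid goes straight to the post store."""
    return posts.get(tid)


def search(**conditions) -> list[int]:
    """Rule (3): any other query intersects one tid set per condition."""
    sets = [index.get((k, v), set()) for k, v in conditions.items()]
    return sorted(set.intersection(*sets)) if sets else []
```

For example, after saving two posts with `type="iphone"` and different prices, `search(type="iphone", money=3500)` returns only the matching tid, and the caller then fetches the full post with `get_post`.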

Voice-over: This search service handles 80% of the requests to 58.com. Whether they come from PC or app, and whether they hit the home page, a city page, a category page, a list page or a details page, they eventually turn into a search request. It is another unified core service of 58, e-Search, and the search engine is entirely self-developed.

A brief description of this self-developed search engine's architecture:

To cope with a data volume on the order of 10 billion, throughput in the hundreds of thousands, and the business lines' varied and complex queries, scalability is the design focus:

(1) A unified proxy layer serves as the entrance; because it is stateless, system performance can be expanded by adding machines;

(2) A unified result aggregation layer is also stateless, so it too can be expanded by adding machines;

(3) In the search kernel retrieval layer, the service and its index data are deployed on the same machine. The index data is loaded into memory when the service starts, so requests are served from memory and access is fast:

To scale data capacity, index data is sharded horizontally; in theory, adding shards allows unlimited expansion
To scale the performance of a single shard, the same shard is replicated; in theory, adding machines allows unlimited expansion
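How sharding and replication fit together with the aggregation layer can be sketched as follows. This is a toy model, not e-Search: sharding by tid modulo the shard count, random replica selection, and all class and function names are assumptions.

```python
import random


class Searcher:
    """One retrieval node holding one shard's index in memory."""

    def __init__(self, docs: dict):
        self.docs = docs  # tid -> set of terms

    def search(self, term: str) -> list[int]:
        return [tid for tid, terms in self.docs.items() if term in terms]


def build_cluster(all_docs: dict, num_shards: int, replicas: int) -> list:
    """Split docs by tid % num_shards, then serve each shard from `replicas` nodes."""
    shards = [{} for _ in range(num_shards)]
    for tid, terms in all_docs.items():
        shards[tid % num_shards][tid] = terms
    return [[Searcher(shard) for _ in range(replicas)] for shard in shards]


def aggregate(cluster: list, term: str) -> list[int]:
    """Ask one replica of every shard, then merge the partial results."""
    hits = []
    for replica_group in cluster:
        hits.extend(random.choice(replica_group).search(term))
    return sorted(hits)
```

More shards mean each node holds less data; more replicas per shard mean each node sees fewer requests. The two knobs are independent, which is the point of the two sub-items above.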

System latency: for retrieval over 10 billion posts, including request splitting and merging and posting-list ("zipper") intersection, the aggregation layer can return results within 10 ms.
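Reading "zipper intersection" as the intersection of sorted posting lists (an interpretation, since the article does not spell it out), the classic two-pointer merge looks like this. The function name and sample lists are made up.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two ascending tid lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a is behind, advance it
        else:
            j += 1   # b is behind, advance it
    return out


# Posts matching type=iphone AND location=bj (hypothetical posting lists).
print(intersect([1, 3, 5, 7], [3, 4, 5, 8]))
```

Each list is scanned at most once, so the cost is linear in the combined length of the two lists.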

Voice-over: The entry layer is developed in Java; the aggregation layer and the retrieval layer are developed in C.

In the post business, consistency is not the main concern. e-Search periodically rebuilds the full index, so that even if the data becomes inconsistent, the inconsistency does not last long.

V. Summary

This has been a long article, so here is a brief summary. Faced with a business demand of 10 billion records, 10,000 attribute columns and a throughput of 100,000, the problem can be solved with a metadata service, an attribute service and a search service:

The metadata (post) service solves the storage problem
The category attribute service solves the category decoupling problem
The search service solves the retrieval problem

Any complex problem should be solved step by step.

The thinking matters more than the conclusion. I hope you found this useful.
