Engineering Architecture Practice of the Data Middle Platform: Shell's Big Data Development Platform

This is my 75th original article.

This is the second offline event I have attended this year. Events have been a bit dense lately: the three I had registered for all fell at the same time, so I had to pick the one that interested me most, Shell's. It did not disappoint; every topic was one I care about. Shell's CTO line also made an appearance: five leads shared, in turn, the architecture evolution of the big data platform, the OLAP platform, the DMP platform, the recommendation platform, and the algorithm platform.

The whole event was packed with practical content, and I took away a lot. Many of the lessons are useful for companies starting from zero.

Today I will walk you through “Shell's one-stop big data development platform practice”; a PPT download is available at the end of the article. My photography skills are poor and I could not fix the glare from the big screen, so the pictures are not pretty. Please forgive me.

First, a photo of the speaker for the big data development platform session, to anchor the post.

The main data sources of Shell’s big data platform can be divided into three categories:

People: sellers (owners), buyers (home buyers, renters), and brokers;
Property: the real estate dictionary, which I introduced in an earlier article (link at the end of this article). In 2008 Shell set up a team dedicated to real estate master data and built a dictionary of 200 million homes, assigning each home a unique ID. Isn't this exactly the ONE ID of a data middle platform?
Behavior: online browsing, plus offline communication, house viewings, negotiation, and other behaviors.
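
To give a rough idea of what "one unique ID per home" implies, the sketch below derives a stable ID from a normalized address. The field names, the normalization rule, and the hashing scheme are all assumptions made for illustration; the talk did not describe how Shell actually builds its real estate dictionary.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseAddress:
    # Hypothetical address fields; a real dictionary would carry many more.
    city: str
    community: str
    building: str
    unit: str
    room: str


def normalize(text: str) -> str:
    """Lowercase and drop all whitespace so trivial spelling variants collide."""
    return "".join(text.lower().split())


def house_id(addr: HouseAddress) -> str:
    """Derive a deterministic 16-character ID from the normalized fields."""
    fields = (addr.city, addr.community, addr.building, addr.unit, addr.room)
    key = "|".join(normalize(f) for f in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


# The same home typed two different ways maps to the same ID.
a = HouseAddress("Beijing", "Sunny Garden", "Building 3", "Unit 2", "501")
b = HouseAddress("beijing", "sunny  garden", "building 3", "unit 2", "501")
print(house_id(a) == house_id(b))
```

Once people and behavior records carry such an ID, they can be joined on it directly. Real address matching is far harder than this (aliases, renamed communities, missing fields), which is presumably why it took a dedicated team.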

For a big data platform, the most important capability is to deliver data in various forms to every department cheaply, quickly, and accurately. But as at every company, Shell's platform has kept evolving.

In fact, this is consistent with a basic principle of architecture: good enough for now, moderately ahead of need. After all, meeting business needs comes first, which matches the idea of a Minimum Viable Product (MVP). Many companies now approach big data by hiring a director, then another director, then an architect, and then aiming straight for the most advanced data middle platform. That is the kind of company you should avoid if you can.

Simon blow tax classmate, wish good!

Shell's earliest big data development platform was simple and crude: the classic Kafka + Sqoop + HDFS + Hive stack, with Oozie for task scheduling. Processed data was written to MySQL, and the report platform read MySQL directly for display.

Don't dismiss this as low-end. In fact, this architecture is enough for a small or medium-sized company for a long time. Hire one mid-level big data engineer, two junior engineers, and a report engineer, and the team can hold out for quite a while.

The Shell folks were very candid and listed the advantages and disadvantages of each architecture; I will not repeat the details here.

An architecture evolves either because experts planned it or because pain forced it. I think both apply to Shell. I say expert planning because all five Shell leads referred to the same core architectural ideas in their talks, which suggests a good internal culture of technical sharing and a solid basis for cooperation. I say forced because Shell has grown so fast. Even with Lianjia as its foundation, the complex business and the flood of demands must have been painful to handle.

As this version of the big data architecture shows, the overall framework follows the Lambda architecture. A real-time processing path has been added: data is processed by Storm and Spark Streaming and then written directly to HBase, which serves real-time data to external consumers through an API.
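
For readers unfamiliar with the Lambda pattern, here is a minimal sketch of the idea: a batch layer recomputes results from the full history, a speed layer covers events since the last batch run, and the serving layer merges the two. Plain Python dicts stand in for Hive, Storm or Spark Streaming, and HBase; the class and method names are hypothetical and say nothing about how Shell implemented it.

```python
from collections import defaultdict


class LambdaViews:
    """Toy Lambda architecture: a batch view plus a real-time (speed) view."""

    def __init__(self):
        self.master_log = []                # append-only record of every event
        self.batch_view = {}                # recomputed periodically from all history
        self.speed_view = defaultdict(int)  # counts events since the last batch run

    def ingest(self, house_id: str) -> None:
        """Record one page-view event in the log and in the speed layer."""
        self.master_log.append(house_id)
        self.speed_view[house_id] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from scratch, then reset the speed layer."""
        counts = defaultdict(int)
        for house_id in self.master_log:
            counts[house_id] += 1
        self.batch_view = dict(counts)
        self.speed_view.clear()

    def query(self, house_id: str) -> int:
        """Serving layer: merge the batch result with the real-time delta."""
        return self.batch_view.get(house_id, 0) + self.speed_view.get(house_id, 0)


views = LambdaViews()
for h in ["h1", "h2", "h1"]:
    views.ingest(h)
views.run_batch()        # batch view now holds h1=2, h2=1
views.ingest("h1")       # arrives after the batch run, lives only in the speed view
print(views.query("h1"))  # 3
```

The price of this pattern is that the same logic exists twice, once per layer, which is one reason the maintenance cost of such architectures tends to grow with the business.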

Compared with the previous version, many improvements were made on the data processing side, including building a data warehouse and an ad hoc query engine, and adding data products that provide self-service query and analysis. Compared with MySQL + REST API, isn't the gain in efficiency plain to see?

MOLAP mainly relies on Kylin. As the OLAP platform talk explained later, Shell is a heavy user of Kylin.
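
The core idea behind MOLAP engines such as Kylin is to pre-aggregate measures for combinations of dimensions, so that queries read a small precomputed result instead of scanning raw rows. The sketch below shows that idea in plain Python; the function names and sample data are made up for illustration, and this is not Kylin's API or storage format.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the sum of `measure` for every subset of `dimensions`."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from a precomputed cuboid, without touching raw rows.

    `group_by` must list dimensions in the same order used to build the cube.
    """
    return cube[tuple(group_by)]


rows = [
    {"city": "Beijing", "type": "sale", "viewings": 3},
    {"city": "Beijing", "type": "rent", "viewings": 2},
    {"city": "Shanghai", "type": "sale", "viewings": 5},
]
cube = build_cube(rows, ["city", "type"], "viewings")
print(query(cube, ["city"]))  # {('Beijing',): 5.0, ('Shanghai',): 5.0}
print(query(cube, []))        # {(): 10.0}
```

The number of cuboids doubles with every added dimension, so the trade-off is build time and storage in exchange for fast queries.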

Doesn't this architecture already look like a data middle platform? Note that Shell has also started trying TiDB, which looks like the general trend. For data ingestion, they began using Alibaba's DataX, DataBus, and similar products, and built Odin on the visualization side. Data opening added data exchange and subscription, while the management layer added metadata management, data quality, and data permissions. At this point the foundation of a data middle platform is fairly complete. In short, this version added many visual programming tools to simplify development, added many management and automated operations tools, standardized data and put quality controls in place, and opened up a large amount of data so that data assets are put to use.

There is not much to say about data management; it is much the same everywhere. But I am really interested in Shell's master data, the real estate dictionary, which is a true killer asset. Someone in the audience asked how the real estate dictionary is merged, but Zong Qiang is not on that team and did not answer in detail.

Early on, data integration meant crude Sqoop and Kafka jobs, and maintaining them was exhausting. After the switch to DataX, DataBus, and similar tools, efficiency is excellent. When presenting this slide, Zong Qiang said that new databases and new tables are picked up automatically, and that changes to table structure are also synchronized automatically. This is an interesting point. The technical side is easy enough: just read the table structure from the business database. But how does it work with the business development teams? Won't automatic structure synchronization break downstream warehouse tasks? My guess is that they monitor structure changes: if a change does not affect downstream tasks, such as an added field, the task continues; if existing fields change, the task is stopped and an alert is raised. On top of that, the business development side presumably has approval and notification steps for releasing database structure changes, so that major changes are announced in advance.
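
The policy I am guessing at can be sketched in a few lines. This is only my speculation written as code, not Shell's implementation: the schema snapshots are plain dicts of column name to type, and the action names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    retyped: list = field(default_factory=list)

    @property
    def breaking(self) -> bool:
        """Dropped or retyped columns can break downstream warehouse tasks."""
        return bool(self.removed or self.retyped)


def diff_schema(old: dict, new: dict) -> SchemaDiff:
    """Compare two {column_name: column_type} snapshots of a source table."""
    return SchemaDiff(
        added=sorted(c for c in new if c not in old),
        removed=sorted(c for c in old if c not in new),
        retyped=sorted(c for c in old if c in new and old[c] != new[c]),
    )


def decide(old: dict, new: dict) -> str:
    """Return the (hypothetical) action the sync job should take."""
    diff = diff_schema(old, new)
    if diff.breaking:
        return "stop_and_alert"      # existing fields changed: halt and notify
    if diff.added:
        return "sync_and_continue"   # new fields only: propagate, keep running
    return "continue"                # nothing changed


old = {"id": "bigint", "price": "decimal(12,2)"}
print(decide(old, {**old, "area": "int"}))                    # sync_and_continue
print(decide(old, {"id": "bigint", "price": "varchar(32)"}))  # stop_and_alert
```

Even "safe" added fields can hurt a downstream job that uses `SELECT *` with positional inserts, so in practice the classification would need to know how each task consumes the table.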

The Shell folks are really bold: they showed screenshots of their internal systems (the PPT has more; I will not include them all here). Data product managers can borrow freely from them, and they are more complete than what you would get by copying other data middle platform products. On the job scheduling side, to keep tasks robust there are several lines of defense: a SQL execution test, then a data accuracy test, and only then release to production.
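
These lines of defense can be pictured as gates a job must pass in order. The sketch below uses Python's built-in `sqlite3` as a stand-in test database; the function names, the sample table, and the "compare against a known total" accuracy check are assumptions for illustration, since the talk did not detail how each test is implemented.

```python
import sqlite3


def sql_executes(conn, sql: str) -> bool:
    """Gate 1: the statement must compile against the test database."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False


def data_is_accurate(conn, sql: str, expected_total: int) -> bool:
    """Gate 2: the job output must match an independently known total."""
    total = sum(row[1] for row in conn.execute(sql))
    return total == expected_total


def can_go_online(conn, sql: str, expected_total: int) -> bool:
    """A job is released only if it passes both gates, in order."""
    return sql_executes(conn, sql) and data_is_accurate(conn, sql, expected_total)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE viewings (city TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO viewings VALUES (?, ?)", [("Beijing", 3), ("Shanghai", 5)]
)

job_sql = "SELECT city, SUM(cnt) FROM viewings GROUP BY city"
print(can_go_online(conn, job_sql, expected_total=8))                           # True
print(can_go_online(conn, "SELECT city FROM no_such_table", expected_total=8))  # False
```

Ordering the cheap check first means a job with broken SQL never reaches the more expensive data comparison.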

Data quality here is handled mainly by improving the development process, the task monitoring system, and after-the-fact data quality monitoring. This part is a little weak: it lacks data quality analysis, assessment, validation, and issue management. My guess is that meeting business needs came first; after all, if the numbers are wrong, someone will come and find you.

Finally, data openness. Several Shell colleagues made the same point: however valuable data is, if it is not opened up, it is garbage. I agree completely. Data that just sits there is a cost; only when it is opened up and shared does it have value. The OLAP platform, DMP platform, recommendation platform, and algorithm platform covered in the later talks all get their data from the big data platform, and the Shell app also draws a large amount of data from it.

However, I noticed that the data middle layer of the big data platform uses MySQL, HBase, and ClickHouse, but apparently not Elasticsearch (ES). I don't know why.

All in all, the architecture evolution of Shell's big data platform is well worth studying; it is a living case. Many thanks to the experts of Shell's CTO line. I hope to come back more often. Also, the cakes at the tea break were delicious.

I have also put the content together as a PPT. You can get the download link by sending “shell big Data” as a message to this account.

Best read together with the following articles:

6 layer commercial moat | shells of all solutions!

Analyze | shell behind the secret weapon – this ACN!

Thinking | according to the formula of shells chaozuoye, can succeed?

I need your upvotes. I love you
