Is the Data Warehouse Dead? Long Live the Data Lake! (source: Big Data Architect)

The other day, I dissected one of the hottest data modeling posts of the past few days, and pointed out that the real reason the Baidu engineer in that post "saw wide tables but no modeling" is that the core logic of the entire data world has changed.

And then the modeling group erupted in wild mockery.

Data warehouse veterans from the big tech companies also weighed in from on high, surveying the landscape and holding forth with ease.

Why the mockery? Because we all know the game is no longer the old data-first, engineering-first Tetris; it is now the customer-first, business-first Temple Run.

Yet the vast majority of enterprise data warehouse engineers are still stuck churning out wide tables.

The game has changed

In the early years, before the business changed so frequently, strategy was set once a year and KPI policy was released once a year.

We had plenty of time to plan: business modeling, domain modeling, logical modeling, physical modeling, model validation. Like love in the old days: slow, when a lifetime was only enough to love one person.

Back then, everyone in an industry played more or less the same way, so a classic data model like FSLDM could be applied off the shelf: one model to cover an entire industry.

But now? Nobody plays like anyone else. Even Douyin and Kuaishou, the two direct competitors in short video, run on very different logic:

One favors algorithmic recommendation; the other favors social relationships.

Not to mention the red-hot community group-buying space, where everyone is scrambling for market share and the business model changes by the day.

I never imagined I would one day build a KPI data warehouse plus settlement system just to support monthly adjustments to KPI policy!

The gameplay has really changed! The world has changed!

The model has changed

In this era of changing the engine mid-flight, the traditional warehouse's orderly construction logic no longer works, and things have started evolving in two very strange directions.

In one direction, companies with massive scale, strong engineering, and stable businesses, such as Alibaba and Meituan, began experimenting with a new modeling philosophy.

Their subject-area division no longer follows the old ideal of "neutral and universal"; it goes for "specific and dedicated" instead. When every business plays differently, how could a universal model ever be abstracted out? So they chose to divide subject areas by business process, because that makes it easier to support the metrics system built on top.

In traditional modeling, the DWD layer must be built with normal-form modeling and generally does not serve consumers directly; if a department needs detail-level data, a dedicated DM is set up to meet that need.

The big companies' method, however, is to compress the scope of normal-form modeling as much as possible and extend the depth of dimensional modeling. Starting from a structured metrics system at the top, the dimensional model pushes all the way down into the DWD layer.

Yes, the DWD layer is dimensionally modeled too. So where do ID unification, code mapping, and data flattening happen? In the ETL.

Oh wait, no! It should be called ELT: load first, transform later. With so much data flowing in, the first problem to solve is ingestion throughput.
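
To make this concrete, here is a minimal ELT sketch in PySpark: the raw data is bulk-loaded untouched first, and the transform into a dimensionally modeled DWD table (ID unification, code mapping, flattening against dimensions) happens afterwards. All table and column names (raw_orders, dim_user, fact_orders, and so on) are hypothetical placeholders, not from the original post.

```python
# Minimal ELT sketch with PySpark. All table and column names are
# hypothetical placeholders; the post names no specific tables.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-dwd-sketch").getOrCreate()

# E + L: the raw events were already bulk-loaded, untouched, into the ODS layer.
raw_orders = spark.table("ods.raw_orders")
dim_user = spark.table("dim.dim_user")

# T: the transform runs after loading -- ID unification, code mapping,
# and flattening the detail rows against their dimensions.
dwd_fact_orders = (
    raw_orders
    # unify IDs coming from two source systems into one key
    .withColumn("user_id", F.coalesce("new_user_id", "legacy_user_id"))
    # map raw status codes to readable values
    .withColumn("status",
                F.when(F.col("status_code") == 1, "paid")
                 .when(F.col("status_code") == 2, "refunded")
                 .otherwise("created"))
    # flatten: attach dimension attributes directly onto the detail row
    .join(dim_user.select("user_id", "city", "member_level"),
          "user_id", "left")
)

dwd_fact_orders.write.mode("overwrite").saveAsTable("dwd.fact_orders")
```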

The other direction is new business lines, whether at startups or inside big companies. These scenarios are defined by constant change: the business changes, the product features change, and the operational databases change with them.

In this scenario, the logic of traditional data warehouse construction fails completely. Nobody can design, in so little time, a warehouse model that accommodates a biweekly iteration pace.

So they chose the crude but effective wide table!

This is the real story behind the Baidu post everyone mocked. It's not that they refuse to model; they simply have neither the time nor the conditions to model.

Is the data warehouse dead?

After all, big companies with stabilizing businesses are the minority; the far more common case is the startup or small company still in trial-and-error mode, constantly adjusting its business.

In hell mode, where the business pivots monthly, the product iterates every two weeks, and the operational databases keep changing without anyone telling you, the data warehouse is all but declared dead!

It’s like playing a game.

It used to be Tetris: we designed carefully, every brick placed just so, stacked neatly, waiting for the long straight piece to arrive.

Now it's Temple Run: the controls are still just up, down, left, right, but there is no time to think about structure or layout. Hesitate for a second, and the monster bites you from behind.

And even for big companies whose businesses are stabilizing, the data warehouse brings huge troubles of its own. Just as electric-vehicle owners live with range anxiety, almost every offline engineer lives in fear of task failure.

A failed task means the report doesn't come out, which means eye-rolls from operations and a deduction from your performance score.

On top of that, incremental ingestion schemes keep growing more complex, thanks to late-arriving data, convoluted business logic, and so on, to the point that some small companies simply fall back to full daily loads, which makes data latency even worse.

And the T+1 latency that always seemed perfectly normal for an offline warehouse has become the last straw, because the business can no longer be satisfied with yesterday's numbers.

"We didn't do anything wrong, but somehow, we lost." The Nokia CEO's words still ring in our ears.

What's that? You say the Lambda architecture can handle it? Sure, it works, but can you ever get the real-time results and the offline results to agree?

So tell me: how are you going to save a data warehouse that is past the Internet's age limit?

Long live the data lake

While the Internet's HR was handing the over-age data warehouse its termination letter, another HR was handing an offer to a youngster born in 2010.

That youngster is the data lake.

Its father is James Dixon, then CTO of Pentaho. When James created it, he had no idea this kid would grow so big; all he wanted was a single place to dump all the data from the tapes so he could explore it.

And now the data lake has grown into a giant! With clever designs such as snapshot-based storage, snapshot isolation, solid atomicity, and a new metadata layer, the data lake offers unified batch and stream processing, first-class incremental storage, and storage that compute engines can query in place.
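
As a small illustration of those snapshot features, here is a sketch using Delta Lake as one concrete lake table format (the post names none; Hudi and Iceberg offer equivalents). The path and version number are hypothetical.

```python
# Snapshot reads on a lake table, using Delta Lake as one concrete example.
# Requires a Spark session with the delta-spark package configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-snapshot-sketch").getOrCreate()

path = "/lake/dwd/fact_orders"  # hypothetical table path

# Every write produces a new snapshot; readers always see a consistent version.
current = spark.read.format("delta").load(path)

# Time travel: read the table exactly as it was at an earlier snapshot.
as_of_v42 = (spark.read.format("delta")
             .option("versionAsOf", 42)  # hypothetical version number
             .load(path))
```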

What do these features mean?

For ETL engineers, this means there is no T+1 in the data lake! It’s so exciting!
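
Here is what "no T+1" can look like in practice: because the lake table stores increments, a job can consume new rows continuously instead of waiting for tomorrow's batch. Again a sketch, using Delta Lake's streaming source as one example; table paths and column names are hypothetical.

```python
# Quasi-real-time consumption of a lake table -- no waiting for the T+1 batch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-t-plus-1-sketch").getOrCreate()

# Read the lake table as an unbounded, incrementally growing stream.
orders = spark.readStream.format("delta").load("/lake/dwd/fact_orders")

# Continuously maintain a quasi-real-time aggregate for the business.
query = (orders.groupBy("city").count()
         .writeStream
         .outputMode("complete")
         .format("memory")             # demo sink; production would use a table
         .queryName("orders_by_city")
         .start())
```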

But what excites big data architects even more is that the data lake doesn't just mean throwing every kind of data into one place; it also means the birth of a new architecture!

A universal architecture: one that lets algorithm engineers trawl the raw data, lets big data engineers pull out a quasi-real-time wide table whenever asked, supports incremental, quasi-real-time ingestion and real-time analytics, and spares big data engineers from getting up early to check whether last night's tasks failed.

The architecture has changed

The most frustrating part of the Kappa architecture is Kafka, which has to double as both message queue and database. This is exactly why the Kappa architecture cannot store large volumes of data.

But that is precisely the weakness the data lake can fix: swap Kafka's storage role for the data lake, and the problem is solved. Kafka can finally retire from its baffling part-time job as a "database."
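
A sketch of that swap, under the same assumptions as above (Spark plus a Delta-format lake table; broker, topic, and paths are hypothetical): Kafka still carries the live events, but long-term storage and replay move to the lake.

```python
# "Kappa on a lake": Kafka stays a message queue; history lives in the lake.
# Requires the spark-sql-kafka and delta-spark packages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-on-lake-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "orders")
          .load())

# Persist the full event stream to the lake; Kafka retention can stay short,
# because replays and backfills now read from the lake table instead.
(events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
 .writeStream
 .format("delta")
 .option("checkpointLocation", "/lake/_checkpoints/orders_raw")
 .start("/lake/ods/orders_raw"))
```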

Meanwhile, the "data silo" problem of the traditional data warehouse vanishes instantly before the data lake. The lake is a hodgepodge; everything goes in!

And all kinds of components are already being wired up to data lake products. The data lake has truly become a lake!

This architecture is amazing!

You can spin up a processing engine, pull data out of the lake, build it into a wide table, and hand it to the operations team.

You can also write a DAG that cleans the data and pushes it onward into another database.

People who know the data well can point a query engine directly at the lake and look things up themselves.

Algorithm engineers can likewise connect straight to the lake, feed raw data from it into their algorithms, and train models.

Crucially, the OLAP engine can also connect directly to the data lake!

This is awesome! In other words, we can build one super OLAP system on top of this: quasi-real-time, no elaborate layered builds, no fretting over overnight tasks, and any business requirement can be delivered fast!
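
For a taste of that, here is a sketch of an interactive query fired straight at the lake, using Trino as one example of an OLAP engine with lake-format connectors; the host, catalog, and table names are hypothetical.

```python
# Ad-hoc OLAP straight against the lake, via the trino Python client.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="lake",         # hypothetical catalog backed by the lake format
    schema="dwd",
)
cur = conn.cursor()
cur.execute("""
    SELECT city, count(*) AS orders
    FROM fact_orders
    WHERE order_date = current_date
    GROUP BY city
    ORDER BY orders DESC
""")
for city, orders in cur.fetchall():
    print(city, orders)
```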

You may think I'm being sensational, but the facts say otherwise. So which way should data warehouse folks go?

To be honest, I don't know how to answer that. Times change, technology advances, and whoever fails to keep up is inevitably eliminated.

Whether the warehouse is dead, no one can say; but the data lake has truly arrived. So let's get to work. Come on!

What do you think?