Brief introduction: In the "Real-time Data Warehouse Bootcamp", Alibaba Cloud researcher Wang Feng, Alibaba Cloud senior product expert Liu Yiming, and other technology and product experts on Realtime Compute for Apache Flink and Hologres joined forces to build the bootcamp curriculum and carefully polish the course content, aiming squarely at the pain points learners currently face. The courses analyze the architecture, scenarios, and practical application of the real-time data warehouse, progressing from the basics to advanced topics. Seven excellent courses will help you grow from a beginner into an expert in 5 days!

This article is compiled from the live course "Real-time data warehouses help Internet real-time decision-making and precision marketing" (Part 1). Video link: https://developer.aliyun.com/learning/course/807/detail/13886

In recent years, the real-time data warehouse has been a hot topic in the Internet industry, and its main application scenarios are real-time decision-making and marketing. These fields have high requirements for the precision and timeliness of data analysis, and they sit at the forefront of the technology.

Online business and fine-grained operations drive data warehouses toward real-time and interactive analysis

Let’s take a look at some of the trends in data analytics over the past few years, as shown in the chart above.

It can be seen that the basic trend of data analysis is evolving from batch processing to interactive analysis and stream computing. Ten years ago, big data was mostly about the problem of scale: processing massive data through parallel computing. At that time we were mainly doing data cleaning and data model design, and there was not much demand for analysis.

Nowadays, our big data teams have essentially become data analysis teams. There are more and more demands on the consolidation of data models, the ability to support interactive analysis, query response latency, and QPS. Data analysis is no longer just storing data and then analyzing it; in many scenarios the computation must happen as the data arrives rather than afterwards. For example, on Double 11, reporting how much trading volume was reached within how many seconds is typically not a matter of computing over the transaction data after the fact, but of producing results in real time as the transactions happen.

As a result, interactive analysis and stream computing have developed almost in parallel, and these two new scenarios place many different requirements on the technologies behind them. Batch processing is mostly about parallelism, while in the field of interactive analysis we began to rely heavily on pre-computation, in-memory computing, indexing, and other techniques, which has also driven the evolution of big data technology as a whole.

In summary, data analysis now has to support more and more online business. Online business covers everything that happens whenever we open a mobile phone: the recommended products and display ads on the screen all need results returned within a few milliseconds, driven by data-powered recommendation. Without such recommendation, click-through rates and conversion rates would be very low.

Therefore, our data business is supporting more and more online business. Online business has very high requirements on query latency, QPS, and granularity, which pushes the data warehouse to evolve toward real-time and interactive analysis.

Alibaba's typical real-time data warehouse scenarios

There are many usage scenarios at Alibaba, such as the real-time GMV dashboard on Singles' Day. The GMV is a final, conclusive number, but for data analysts the work is only beginning at that point. We need to drill down: which products, in which channels, for which groups of people, with which promotion methods, achieved what conversion, and which conversions were not achieved, and so on. These analyses are very fine-grained, and what they need are fine-grained results.

After the analysis, all products and customer groups are tagged. Through tagging, we can guide the online applications to do recommendation, analysis, audience selection, and so on, so a lot of intermediate business data is also generated along the way.

Another type of business is monitoring-oriented, such as sudden jitter or growth in orders, jitter in network quality, and monitoring of live video streams. These are also typical application scenarios of the real-time data warehouse.

The "muddled" state of the big data real-time data warehouse architecture

In the past we have seen many companies build their real-time data warehouses, and Alibaba itself has also gone down a very complicated route.

The diagram above is an architecture diagram I drew. I was very excited the first time I saw it; as a big data architect at the time, drawing that many arrows felt like an achievement. But when I actually built such a system, I found its development efficiency and operation efficiency very frustrating.

The system evolved from the top left corner: messages were collected into messaging middleware, followed by an offline processing step. At that time we did not have much real-time technology, so we had to solve the scale problem first. Through offline processing we turned the data into small result sets and stored them in MySQL and Redis. This is the typical dual architecture that separates processing from serving.

Once the data has been reduced to a small result set, it can be served to upper-layer applications, including reporting applications, recommendation applications, and so on. After that, data analysis needs to become real-time, and "T+1" offline processing alone can no longer meet the demand. The fresher the data, the more context it carries and the greater its value. At this point a number of pre-computation technologies were adopted, such as Flink. Flink directly consumes events from Kafka and does computation in an event-driven way. Because Flink is distributed, it is very scalable; it can move the computation forward and, in an event-driven way, push end-to-end latency to the extreme.
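As a minimal sketch of this event-driven pattern, the Flink SQL below consumes a hypothetical Kafka click stream and continuously pre-aggregates it per channel and second. All table names, fields, and connector options are illustrative assumptions, not the actual production jobs described in this article.

```sql
-- Hypothetical Flink SQL: event-driven pre-aggregation over a Kafka click stream.
CREATE TABLE click_events (
  user_id    BIGINT,
  channel    STRING,
  sku_id     BIGINT,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',                       -- assumed topic/broker settings
  'topic' = 'clicks',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- Small result set: per-channel, per-second counts, so the serving side stays fast.
CREATE TABLE channel_second_agg (
  channel      STRING,
  window_start TIMESTAMP(3),
  clicks       BIGINT
) WITH (
  'connector' = 'print'    -- stand-in sink; a real job would write to HBase/Hologres etc.
);

INSERT INTO channel_second_agg
SELECT channel,
       TUMBLE_START(event_time, INTERVAL '1' SECOND) AS window_start,
       COUNT(*) AS clicks
FROM click_events
GROUP BY channel, TUMBLE(event_time, INTERVAL '1' SECOND);
```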

Then we also save the result set into some medium. The result set processed by Flink is very compact, usually a KV structure stored on systems such as HBase or Cassandra, which is the fastest way to serve big-screen reports. For the Double 11 big screen, for example, we definitely cannot wait to run statistics over tens of millions or hundreds of millions of transaction records at query time; the query performance would never be acceptable. So we pre-aggregate the raw data into result sets at, say, per-channel per-second granularity. By the time we do the big-screen analysis, a very large data set has become a very small one, and the performance is pushed to the extreme.

Now we have the scale of processing and the speed of processing, and on the surface both seem to meet the demand, but the cost is not small. If we want the computation to be fast enough, we need to move the computation forward, which reduces the flexibility of the data model, because we have already aggregated all the data into one result set through Flink.

For example, if some business logic is not defined at the beginning, say you first analyze an aggregation over three dimensions and later want to analyze an aggregation over a fourth dimension, you cannot do it, because you have not computed it in advance. Flexibility is sacrificed here.
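A small illustration of this loss of flexibility, using hypothetical tables: a result set pre-aggregated on three dimensions cannot answer a question that groups by a fourth dimension, whereas a detail table still can.

```sql
-- Hypothetical pre-aggregated result set: only channel/region/category were planned for.
-- SELECT crowd, SUM(gmv) FROM gmv_agg_3d GROUP BY crowd;   -- impossible: 'crowd' was aggregated away

-- With the detail (DWD) table kept, the fourth dimension is still available:
SELECT crowd, SUM(gmv) AS gmv
FROM   orders_dwd      -- assumed detail table with channel, region, category, crowd, gmv
GROUP  BY crowd;
```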

At this point, technologies that are more flexible than HBase and more real-time than Hive, such as ClickHouse and Druid, are adopted. Data can be written in real time and, once written, supports a certain degree of interactive, self-service analysis.

We then find that data processing has scattered the same data across three different kinds of media: offline processing media, near-real-time processing media, and fully real-time processing media. Three kinds of media often need three teams to maintain them, and as people in those teams change over time, the processing logic inevitably drifts in subtle ways. The result is that the same indicator computed through three different channels gives three different results; this happens in almost every company.

That is only the superficial problem. It is even more painful for the analysts, who need to use different data, access different systems, learn different interfaces, and deal with different access control mechanisms, all of which is very inconvenient. As a result, many companies build a so-called data middle platform to shield the differences between the underlying physical engines. In this case, federated computing technologies such as Presto and Drill are used.

Federated computing has a history of more than two decades, evolving from the earliest data and information system integration to data virtualization. It is a "data stays put, computation moves" technology, which sounds very good, but when it is put into production the query experience is unpredictable: we do not know whether a given query will be fast or slow, because it is pushed down to different engines. If it is pushed down to Hive it will not be that fast; if it is pushed down to ClickHouse it might be faster. For an analyst who does not know what to expect from the system, that is unpredictability. One time a report might open in five seconds, another time in five minutes, and the analyst is left wondering whether the five minutes is normal or abnormal. If it ran on Hive it would be called normal, and if it ran on Drill it would be called abnormal. This unpredictability makes it hard for such systems to go into production, so at that point we again have to aggregate the data into small result sets for analysis. As we can see, the whole link is composed of multiple components nested and depending on each other layer upon layer. Besides the problem mentioned earlier that maintenance by different teams causes inconsistent data definitions, the cost of maintaining the data also becomes very high.

We often run into situations where a number on a report looks wrong, for example an indicator suddenly rises or falls sharply one day. At that point no one can confirm whether the problem lies in data quality, in the processing logic, or in the data synchronization link, because the intermediate chain is too long. If a field is modified or a new field is added at the data source, every link has to be re-run. So this architecture is fine as long as it runs normally, but once there is a problem with data quality or data scheduling, the layered dependencies make the operation and maintenance cost very high.

At the same time, people who know so much technology are hard to find and expensive. Often, one company produces such people and then they are poached by other big companies, making it difficult to recruit and train such people.

This is the current state of the architecture.

The complexity of big data today is reminiscent of the 1960s

The architecture above reminds me of the 1960s, when databases were just emerging, and of the late 1970s, when relational databases were born. How did we manage the state of data in the 1960s?

In the 1960s, every programmer had to write their own state maintenance. Some people used file systems, but file systems alone are very fragmented and hard to maintain, because the data is interrelated. Then came systems with a network structure, where one file could point to another, but network structures are relatively complex to manage because of loops, nesting, and so on. There were also hierarchical structures, another way of describing data. As you can see, it was very expensive for programmers in the 1960s to manage the state of the data themselves.

By the 1980s we had basically stopped doing that, because we stored state in databases as much as possible, and relational databases made this much easier. Although there are many kinds of relational databases, most of them expose SQL interfaces, which dramatically reduces the cost of storing, querying, and managing data.

This has changed a lot in the past 20 years, from the era of big data to NoSQL to NewSQL, with the birth of various technologies like Hadoop, Spark, Redis, etc., which has made the data field flourish.

But I have always felt that something is not quite right here, and that it will not necessarily stay this way in the future. Big data is booming but not unified, so I wonder whether it is possible to converge big data technologies in the future and solve problems in a more descriptive way, instead of every programmer having to learn different components, learn dozens of different interfaces, and configure thousands of parameters. To improve development efficiency, these technologies may need to be further consolidated and simplified.

The core requirement of real-time data warehouse: timeliness

Let's go back and look at what the business requirements of the real-time data warehouse are, setting aside these technical components, and what data architectures those business requirements call for.

What is real time? Many people think real-time analysis means that the time between when data is generated and when it can be analyzed is short enough, but this is not entirely accurate. In my opinion, real-time data warehouses involve two kinds of "real time": one is short end-to-end latency, and the other can be called punctuality, meaning that when we actually analyze the data, we can get valid data and draw conclusions in time.

The first kind of real-time is a technical concept, but does our business necessarily need such low latency?

Usually the business is not making a decision every second. When we open a report, what we care about at that moment may be the data of the last day, the last hour, five minutes ago, or one second ago. Experience tells us, however, that if a latency of five minutes can be achieved, 99% of data analysis needs can basically be met. That is the case for analysis scenarios; for online serving scenarios it is certainly not enough.

So most companies probably don’t require that much real time, because the cost behind it is different.

Because once we pursue short end-to-end latency, we need to move a lot of the business logic forward; we cannot wait until query time. Since the required query latency is very short, we must aggregate, process, and widen a lot of data in advance, so that the amount of computation at analysis time is small enough for the latency to be short enough.

So if you pursue end-to-end latency, a lot of computation has to happen up front. As mentioned earlier, if you pre-compute everything you lose flexibility, so there is a cost. In contrast, if what you pursue is a punctual system, you can push part of the computation logic to query time in exchange for flexible analysis scenarios.

For example, if we pursue latency only, we save nothing but the final GMV on Singles' Day; that is one kind of scenario, and the matter ends there. But it does not meet the company's requirements: the company needs detailed reports and needs to know when things sold well and when they did not. In such cases, pre-computing everything in advance is definitely not suitable. More detailed data should be kept, so that we can do more interactive analysis, correlation analysis, and exploratory analysis. The requirements behind these two kinds of systems are different.

Relatively speaking, I think what most companies want is a punctual system, which requires the ability to compute after the fact, to write in real time, to analyze in real time (even if the analysis is not quite as efficient), and to analyze flexibly.

The core requirement of real-time data warehouse: data quality

Data quality is an important part of real-time data warehouse construction. As just mentioned, if we pursue only timeliness and not data quality, pre-processing the data into a result set that tells us GMV has reached 10 billion, the vast majority of bosses will not believe it, because behind that 10 billion there may be data quality problems, or the computation logic may have been written incorrectly. So the system must be able to guarantee data quality.

Data quality has two aspects: how long it takes to discover a quality problem, and how long it takes to correct it. The solutions to these two are not quite the same.

If you want to discover data quality problems, the state of the computation needs to be persisted: you want the data warehouse engine to keep the details, and you want the aggregated data to be persisted as well, so that the data can be checked. In this way, when the boss asks why an indicator has grown so much and which customer drove the growth, you can trace the reason through the detailed data. Whether you can correct the mistakes found during analysis is also a key issue.

Some systems can only be read but not changed, which is a common shortcoming of many big data systems. Big data systems are great at dealing with problems of scale, but dealing with small problems, like correcting individual pieces of data, is particularly difficult. Because each correction has to touch a large chunk of data, or there is no primary key and the entire file has to be replaced, updating is really difficult.

I prefer a system that can fix data problems over one that can only find them.

Correcting a problem means that when a data quality issue is found, you can simply update the state of the data: single-row updates, single-column updates, batch updates, and so on, with a very simple way to refresh the data. Data refreshes happen frequently; for example, there may be problems with the quality of upstream data, or the processing logic may have been written incorrectly, all of which require refreshing the data.
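As a minimal sketch of such corrections in a SQL-based warehouse such as Hologres (which speaks PostgreSQL-style SQL), assuming hypothetical orders_dwd and orders_staging tables:

```sql
-- Hypothetical corrections on a detail table; table and column names are assumptions.
-- Single-row fix: one order was recorded with the wrong amount.
UPDATE orders_dwd SET gmv = 99.00 WHERE order_id = 1001;

-- Single-column fix: a mis-mapped channel code for one day of data.
UPDATE orders_dwd SET channel = 'app' WHERE channel = 'APP_OLD' AND dt = '2020-11-11';

-- Batch refresh: reload one day of data from a corrected upstream staging table.
DELETE FROM orders_dwd WHERE dt = '2020-11-11';
INSERT INTO orders_dwd SELECT * FROM orders_staging WHERE dt = '2020-11-11';
```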

Secondly, I also hope that when data is corrected, it only has to be corrected in one system.

As we just saw, when a piece of data at the source is wrong, it flows repeatedly through the four or five links behind it. This means that once the data is wrong, the correction work has to be done in all four or five links, where there is a lot of data redundancy and inconsistency, and correcting every link is particularly complex. Therefore, we hope the state of the warehouse is stored in only one place, so that we only need to correct it in one place.

Core requirement of real-time data warehouse: cost optimization

The cost is divided into three parts: development cost, operation and maintenance cost and manpower cost.

Development cost means how long it takes for a business requirement to go live. Does it take a team or a single person, a week or a day? Development cost is something many companies care a lot about; if requirements cannot be developed in time, a lot of business needs are suppressed. We hope that IT staff no longer have to keep responding to the business team's requests for numbers. It often happens that after the data has finally been processed, the business team reports that it was processed incorrectly, and after the IT staff have painstakingly corrected it, the business team says the campaign is already over and the data they wanted to see is now meaningless. A good data warehouse system must therefore decouple technology from business: the technical team ensures the stable and reliable operation of the data platform, while the business team fetches data on a self-service basis as much as possible and builds the reports they are interested in by dragging and dropping. That is what we consider a good system. Only in this way can we lower the threshold of development, let more people fetch data themselves, achieve reuse of data assets, and enable self-service development by the business.

At the same time, in order to speed up the development efficiency, the link must be short enough.

As we saw in the architecture diagram at the beginning, if there are four or five links in the middle, every link has to have scheduling configured and every link has to be monitored for failures, so development efficiency is bound to be low. If the development link is short enough and data movement is reduced, development efficiency is much higher. Therefore, to improve development efficiency, we hope to decouple technology from business and also to make the data link short enough.

O&M cost, simply put, means too many clusters and too much money spent.

We have to run four or five sets of clusters, with scheduling and monitoring duplicated for each. So if there is an opportunity to rebuild the warehouse in the future, we should consider how to reduce cost: fewer clusters, less operation and maintenance, and more capability, including elasticity. At the end of the month or during promotional campaigns, when the company's demand for computation and analysis rises, the system can elastically scale out and back in to adapt to different computing loads. This is also a very important capability that saves the company O&M cost.

The third cost is manpower cost, including recruitment cost and learning cost.

Big data is a relatively complex system. Anyone who has worked with Hadoop knows that thousands of parameters and more than a dozen interdependent components mean the failure of any node can cause all kinds of problems in other systems.

In fact, the learning and O&M costs are relatively high. The database example mentioned earlier illustrates this well: development cost can be reduced by using a descriptive language, that is, SQL, which dramatically lowers the development threshold. Most students took database courses as undergraduates, so using SQL is more efficient than technologies that require learning APIs and SDKs. Such talent is also easier to find in the market, which allows companies to shift more energy from low-level platform operation and maintenance to mining the value of the data on top.

Using standard SQL to interface with other systems, whether development tools or BI presentation tools, is also more convenient.

These are some of the technical requirements derived from the business requirements.

The first generation of real-time data warehouse: the database stage

Let's take a look at whether past real-time data warehouse technologies can meet these requirements.

The first generation of real-time data warehouse technology is called the database stage, which is the typical Lambda architecture stage.

In this architecture, each business requirement basically gets its own data link. The data link is split into a real-time part and an offline part. The data is collected into messaging middleware; one part is processed in real time and the result set is stored in MySQL/HBase, while the other part is processed offline with Hive/Flink, and the result sets are also stored in MySQL/HBase. Both parts turn big data into small data and then serve it.

The problems with this architecture have been analyzed by many people: the two pipelines are redundant with each other, resulting in data inconsistency. That part is relatively straightforward, but the more important problem here is smokestack-style development.

When the business side puts forward a new requirement, the programmer works out which data source to pull from, which third-party data sources to join, how to aggregate and process them, and then generates a result set. In the end we find that the hundreds of reports generated this way are 80% redundant with one another. It is often the case that one line of business looks at three metrics while another looks at two of them, with only one field changed in between; the raw data is the same, but one report has one more statistical field and the other one less. Yet we still have to develop end to end all over again, which seriously reduces development efficiency. We have developed thousands of reports, but we don't know whether anyone is still using them, so we don't dare to retire them.

In addition, whenever fields in any data source are added, removed, or modified, the corresponding adjustment and maintenance work is almost a disaster. If one person did all the development this would not be a big problem, but we have seen many development teams where four or five people each keep writing scripts; then some people leave and others join, and in the end no one dares to delete a former colleague's code. Eventually there are thousands of scheduled scripts, and every day is spent firefighting, maybe a script failure, maybe a data quality failure. This makes the operation and maintenance cost very high.

So this kind of smokestack development is quick to get started with, but it is not sustainable in practice. The way we solve the smokestack problem is to consolidate the shared parts, which brings us to the second stage of the real-time data warehouse: the traditional data warehouse stage.

The second generation of real-time data warehouse: the traditional data warehouse stage

The data warehouse is a very good concept: consolidate the computed indicators that can be reused. Therefore the data warehouse is layered, typically into DWD, DWS, and ADS layers. Through layer-by-layer consolidation, the shared parts sink down and the differing parts move up, reducing the problem of repeated construction. This is the basic methodology that data warehousing has distilled over several decades.

This approach uses Kafka to drive Flink, and Flink does some dimension table joins during the computation. The dimension table join basically has Flink look up a key-value system to widen the records; after widening, the result is written to another Kafka topic, then a second round of aggregation and summarization produces the DWS or ADS layer, and finally the results are stored in an OLAP/HBase system.
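A minimal sketch of that widening step as a Flink SQL lookup join, with a JDBC-backed table standing in for the key-value dimension store. All table names, fields, and connector options are illustrative assumptions.

```sql
-- Hypothetical Flink SQL: widen a click stream with a dimension table via a lookup join.
CREATE TABLE click_events (
  user_id   BIGINT,
  sku_id    BIGINT,
  proc_time AS PROCTIME()          -- processing-time attribute needed for the lookup join
) WITH (
  'connector' = 'datagen'          -- stand-in source; production would read from Kafka
);

CREATE TABLE item_dim (
  sku_id    BIGINT,
  item_name STRING,
  category  STRING
) WITH (
  'connector' = 'jdbc',                      -- stand-in for the KV/dimension store
  'url' = 'jdbc:postgresql://host:5432/db',  -- assumed endpoint
  'table-name' = 'item_dim'
);

CREATE TABLE clicks_wide (
  user_id   BIGINT,
  sku_id    BIGINT,
  item_name STRING,
  category  STRING
) WITH (
  'connector' = 'print'            -- stand-in sink; production would write to Kafka/Hologres
);

INSERT INTO clicks_wide
SELECT c.user_id, c.sku_id, d.item_name, d.category
FROM click_events AS c
LEFT JOIN item_dim FOR SYSTEM_TIME AS OF c.proc_time AS d
  ON c.sku_id = d.sku_id;
```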

Why is the result store part OLAP and part HBase?

OLAP systems have table structures that are well suited to data warehouse layering, but such systems cannot support online applications. Online applications require tens of thousands of queries per second; the query patterns are relatively simple, but the QPS requirement is very high, and most analytical database systems struggle to meet it. So at this point we have to put that data into HBase to provide millisecond-level response.

This architecture solves the earlier smokestack problem, but we still see data redundancy in the OLAP/HBase systems: the same data, describing the same logic, exists redundantly in both the data warehouse and the key-value system. The company's business teams choose based on the SLA: if they are sensitive to latency they put the result in HBase; if they are not sensitive to latency but need query flexibility they put it in OLAP.

On the other hand, HBase is not convenient to develop against, because it is a key-value system with a relatively simple structure: all queries must be accessed by key, and the value part has no schema. Once business teams find data quality problems, they cannot simply inspect the value of a certain row or column, or update it ad hoc. These are limitations of a schema-free system, and metadata management is not convenient either.

The third generation of real-time data warehouse: the analysis service convergence stage

The third stage of the real-time data warehouse is the analysis-serving unification stage. This stage has been realized inside Alibaba, and most external companies are still exploring the road toward it.

This stage differs from the previous one in two ways. On the one hand, the serving side is unified: whether it is the OLAP system or the point-lookup system, they go through one unified system, which reduces data fragmentation, reduces interface inconsistency, reduces the back-and-forth movement of data between different systems, and achieves unified storage. This makes it much easier to check and correct the state of the data. With the interfaces unified into one system, security, access control, and application development interfaces can be consistent, and optimizations can be done on the server side.

On the other hand, the data processing link has also been optimized to some extent: there is no Kafka in this link. Whether to use Kafka is an option, since some systems have event-driven capabilities of their own. Hologres, for example, has built-in binlog-based event driving, so you can use Hologres in place of Kafka, combining its binlog with Flink to achieve real-time layering across DWD, DWS, and ADS, which is a very useful capability. With such an architecture, the only components left are Flink and Hologres, with no other external systems in between, and the link can still be driven in real time. The data is not split up: all of it is stored in the database.
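A heavily hedged sketch of what driving the DWS layer from the DWD binlog might look like in Flink SQL. The connector and option names below (including 'binlog' = 'true') are my assumptions and may differ from the actual Hologres connector; connection settings such as endpoint and credentials are omitted. The point is the shape of the pipeline: binlog source, continuous aggregation, upsert sink.

```sql
-- Assumed Flink source reading change events (binlog) from a Hologres DWD table.
CREATE TABLE orders_dwd_binlog (
  order_id BIGINT,
  channel  STRING,
  gmv      DECIMAL(18, 2)
) WITH (
  'connector' = 'hologres',        -- connector/option names are assumptions
  'tablename' = 'orders_dwd',
  'binlog' = 'true'                -- assumed switch for binlog consumption
);

-- Assumed Hologres DWS sink table, continuously upserted per channel.
CREATE TABLE orders_dws_channel (
  channel STRING,
  gmv     DECIMAL(18, 2),
  PRIMARY KEY (channel) NOT ENFORCED
) WITH (
  'connector' = 'hologres',
  'tablename' = 'orders_dws_channel'
);

INSERT INTO orders_dws_channel
SELECT channel, SUM(gmv) FROM orders_dwd_binlog GROUP BY channel;
```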

The key benefit is that the development link is shorter, so the cost of mistakes is lower. Second, the detailed state of the data is stored (it is hard for HBase to store detailed state); with the details stored in the serving system, the cost of checking data and fixing errors becomes very low. In addition, there are fewer components, and correspondingly the operation and maintenance cost is lower.

Alibaba Double 11 real-time business practice

Alibaba's Singles' Day is done in the way described above, and the throughput on Singles' Day is about the largest in the world.

As you can see, the data goes through the real-time channel into the messaging middleware, typically click-stream data first, and then some dimension-table widening is done through Flink. The click stream records which ID clicked on which SKU; at this point the record is widened around the SKU, translating it into which product, which category, which audience, and through which channel it was clicked. Once the dimensions are widened, the data is stored in Hologres, where Hologres acts as the real-time data warehouse, and the other part of the data is synchronized offline to the offline warehouse MaxCompute, which acts as the historical, global data set. We often compute changes in consumer behavior over a past period; this part of the processing is not very time-sensitive but involves a very large data volume, so it is done in MaxCompute.

At analysis time, Hologres and MaxCompute are linked together by federated queries, so you don’t need to have all the data in one place, and users can still reduce data redundancy by doing federated queries in real time and offline.
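A hedged sketch of such a federated query: Hologres can expose MaxCompute tables as foreign tables, but the exact DDL below (server name, schema mapping) is my assumption, and all table and column names are hypothetical, so treat this as the shape of the idea rather than exact syntax.

```sql
-- Assumed: expose a MaxCompute (ODPS) table in Hologres as a foreign table.
IMPORT FOREIGN SCHEMA my_odps_project LIMIT TO (orders_history)
  FROM SERVER odps_server INTO public;   -- server/project names are assumptions

-- Federated query: join today's real-time orders (local) with historical behavior (foreign).
SELECT t.channel,
       SUM(t.gmv)          AS gmv_today,
       AVG(h.gmv_last_30d) AS avg_gmv_last_30d
FROM   orders_dwd     AS t
JOIN   orders_history AS h ON t.user_id = h.user_id
GROUP  BY t.channel;
```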

Apache Flink – the de facto industry standard for real-time computing

Apache Flink has become the de facto industry standard for real-time computing, used across all industries, and Alibaba is the leader in this area. Behind every successful open source technology there is a successful commercial company, and the commercial company behind Flink is Alibaba.

The real-time computing part is simple to choose: Flink. But choosing the warehouse system is not easy, because there are many options, which can be divided into three categories: transactional systems, analytical systems, and serving systems.

A transactional system is the system that generates the data. It carries many business systems, and it is not easy to run analysis directly on it: the analytical load is likely to affect the stability of the online system, and the performance is not guaranteed, because these systems are optimized for random reads and writes and are basically row-store based.

For statistical analysis scenarios the IO is very expensive, and the DBA's job is to keep the online system stable. So the first thing we did was to separate the transactional system from the analytical system and synchronize the data into the analytical system.

The analytical system has many optimizations made specifically for analysis. We use column storage, distribution, aggregation, semantic-layer modeling, and similar techniques to simplify and enrich the data model for analysis and then improve the performance of data analysis. This is a system built for analysts.

Another kind of system is called the serving system, which in the past was mainly NoSQL; today it includes HBase, Cassandra, and Redis. I consider this a kind of analysis as well, just a relatively simple one: it serves a lot of lookups by primary key. This kind of system is also a data-driven scenario; it supports more online applications and has the capability of high-throughput writes and updates.

As you can see, after data is generated at the source there are generally two outlets: one goes into the analytical system, and the other goes into the serving system to support online applications. We find that data becomes fragmented across these different systems, which then means data movement, data relocation, data dependencies, and inconsistent interfaces, so we started to look for ways to innovate.

The first innovation is at the boundary between the transactional system and the analytical system: could one system do both transactions and analysis? This is an appealing idea, and it does work in some scenarios.

But such a system has some limitations. The first limitation is that its bottom line is to guarantee transactions. Transactions are very expensive in terms of computation; you can see that most analytical systems do not support transactions, because transactions bring a lot of locks, and with distributed locks the overhead becomes even bigger. So this kind of system gains some capability but sacrifices part of its performance, and it is not suitable for high-concurrency scenarios.

Another limitation is that the data model does not fit. A TP system may generate hundreds of tables, but analysts are not willing to look at hundreds of tables; they would rather consolidate them into a few large fact tables and dimension tables, which better matches the requirements of the analytical semantic layer.

To sum up, it is difficult for one system's data model to fit both the TP and AP sides, so I think HTAP systems have significant limitations.

At the other end of the spectrum, we consider whether one system can support both the analytical scenario, which needs to be flexible enough, and the online serving scenario, which needs high concurrency, fast queries, and updatable data. This is also a very important direction.

If we can support these two scenarios with one system, the cost of data migration and data development drops a great deal.

We can see that these two scenarios have a lot in common: for example, the data is mostly read-only, there is no requirement for locks, and the data needs to be processed and abstracted in advance. The overlap between the analytical system and the serving system as defined here is very large, so there is room for innovation.

Common real-time data warehouse architecture selection reference

Next, let's compare, across various dimensions, using Flink together with other common database technologies on the market to build a real-time data warehouse.

Take MySQL, for example. It provides a very good interface; developing against MySQL is convenient and it supports a rich set of functions, but its scalability and data flexibility are much weaker. Because Flink plus MySQL turns the data into a final result set for use, many intermediate states are missing, the data has to be moved around, and the cost of corrections is very high.

Thinking further: MySQL does not scale well, while HBase scales very well and can handle very large data volumes. However, HBase's ability to refresh data is relatively weak; it is not convenient to update a given row or column at any time, and the flexibility of its model is poor, because it is basically a KV system and anything other than lookups by key is very inconvenient.

Then there is ClickHouse, which has been very popular in recent years. It is very fast and well suited to wide-table scenarios. If data analysis only needs a single wide table, it is fine, but analysis usually needs more than a wide table: it needs many table joins, nesting, and window computations, and for those scenarios ClickHouse is somewhat strained. ClickHouse does not support standard SQL, and many functions and join operations are not supported; in join-heavy scenarios in particular the performance degradation is obvious. In addition, ClickHouse has a big limitation in metadata management: it lacks a distributed metadata management system, so the operation and maintenance cost is relatively high.

In addition, there are data lake technologies, such as Flink with Hudi/Iceberg, which give the big data platform a certain ability to update data while still handling data at scale. So it performs well for data correction, but in terms of query performance this kind of system has not been optimized much: to get good performance you need indexes and deep integration with the query engine. Hudi/Iceberg basically stop at updatability of the storage layer and optimization of metadata management, so their query performance is fairly average.

The last one is Hologres. Relatively speaking, it is a strong system, with good data model flexibility, self-service analysis, scalability, data correction ability, and operability. It supports full table schemas, users can join and nest arbitrary tables, and it supports second-level query response.

If you are building an online system that requires high-QPS responses in the tens or hundreds of thousands, Hologres can support that too. In addition, it has a fully distributed, scalable architecture that can scale elastically as the load changes. Data updates are flexible: you can update data directly with SQL UPDATE and DELETE. It is a managed service, and elastic scaling is done through a web console, so operation and maintenance are relatively simple; development uses a standard SQL interface, so anyone who can write JDBC or ODBC programs can develop against it.

To sum up, I think Hologres combines the advantages of the other systems and makes up for some of their deficiencies. It is currently the recommended choice for a real-time data warehouse architecture.

Next Generation Big Data Warehouse Concept HSAP: Integration of Analysis and Service

HSAP: Hybrid Serving & Analytical Processing

The design concept behind Hologres is called HSAP: supporting hybrid serving and analytical processing workloads at the same time. We hope to achieve unified storage, so that whether data is written in batch or in real time, the write efficiency is high enough.

Secondly, the service interface is unified: whether it is an internal interactive analysis scenario or an online point-lookup scenario, output goes through the same data service interface. By integrating the two scenarios, development efficiency can be improved.

  • Hologres = Better OLAP + Better Serving + Cost Reduced

Hologres, which belongs to Alibaba's self-developed big data brand MaxCompute, is a cloud-native distributed analysis engine. It supports high-concurrency, low-latency SQL queries over PB-scale data, and supports scenarios such as real-time data warehousing and interactive analysis of big data.

In our definition, Hologres is a better OLAP: it does what past OLAP systems had to do, being fast enough, flexible enough, and so on.

Second, it is a better serving system: it can also support the high-throughput writes, updates, and 100K+ QPS point-lookup queries of a serving system.

Cost reduced does not mean the system must be cheap; it means that with this system the learning cost and the operation and maintenance cost can be sharply reduced. These are the benefits the system brings.

Cloud native real-time warehouse solution: real-time computing Flink version + Hologres

At the architectural level, it is a more flexible architecture.

After real-time data is ingested, the processing layer is divided into two links: one is detail-layer processing, the other is aggregation-layer processing.

Detail-layer processing does the cleaning, joining, and transformation, with the computation done in Flink. After processing, the detailed results can be stored directly. If what is being processed is order data, this is basically enough: even at a company operating at the scale of tens of millions, that is already a relatively large volume of order data, and it can still be stored as details.

This kind of detailed data can directly provide reporting services externally, without much secondary aggregation. In the cleaning and processing step, dimension-table widening is often needed, and this kind of point-lookup scenario also fits Hologres very well: what used to be done with HBase/Redis can be done by creating a table in Hologres. So in the detail-processing stage, Hologres can serve both as the row-store table used for widening and as the store for the detailed data; both the read and the write scenarios are supported.

If it is behavioral data or click-stream data, the volume is usually much larger and the value per record is lower, so keeping only the full details makes analysis relatively inefficient. In that case, secondary aggregation and pre-processing can be done: the data can be processed into a lightly aggregated layer, the DWS layer, and stored, or processed further into the ADS layer, a final result set for a specific business scenario, and stored. Such scenarios become smaller data sets that can support higher QPS.

So overall, this kind of warehouse system gives us more flexibility: for interactive analysis, keep the details as much as possible; for online point lookups, process the data into a table that can be queried by primary key. This brings a lot of flexibility to development.

Processing does not always have to go through stream computing; in some cases the detailed data already archived in the database can also be processed by batch scheduling inside the database to do further aggregation and pre-processing.

Real-time warehouse scenario 1: Ad hoc queries

Hologres defines three ways to implement a real-time data warehouse, the first of which is the ad hoc query.

Ad hoc queries are the situation where you do not know in advance what the query pattern looks like, so you save the data and provide as much flexibility as possible.

Therefore, in this case we suggest that the data of the operational layer (ODS layer) only go through cleaning, simple joining, and basic data quality work, and then be stored as detailed data, without much secondary processing or aggregation. Because it is not yet clear how the data will be analyzed, services can be provided by encapsulating views: build as many views as needed.

Views are a very good encapsulation of logic: complex computations such as joins and nesting of underlying tables are encapsulated in advance, but nothing is materialized ahead of time. So if the original data is updated, we only ever refresh one layer of data; the update happens in only one place, and there is no need to update all the aggregate tables above, because what is aggregated is the view, not a table.

Therefore the flexibility of this kind of computation is very high, but its query performance is not necessarily the best, because the view does a lot of computation: every time the view is queried, the underlying raw data has to be joined and nested, so the computation is relatively heavy. It suits scenarios with low QPS requirements and high flexibility requirements.
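A small sketch of the ad hoc pattern, with hypothetical table and view names: detail tables stay thin, and the analysis logic lives in views that are evaluated at query time.

```sql
-- Hypothetical ad hoc setup: keep details, expose analysis logic through views.
CREATE VIEW dws_channel_gmv_v AS
SELECT o.channel,
       d.category,
       DATE_TRUNC('hour', o.pay_time) AS pay_hour,
       SUM(o.gmv)                     AS gmv
FROM   orders_dwd AS o                  -- assumed detail table
JOIN   item_dim   AS d ON o.sku_id = d.sku_id
GROUP  BY o.channel, d.category, DATE_TRUNC('hour', o.pay_time);

-- Any dimension combination can still be asked for later, since nothing is materialized:
SELECT channel, SUM(gmv)
FROM   dws_channel_gmv_v
WHERE  pay_hour >= NOW() - INTERVAL '1 day'
GROUP  BY channel;
```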

Real-time warehouse scenario 2: minute-level quasi-real time

The second scenario is called minute-level quasi-real time. Compared with the previous one, this scenario requires higher QPS, and the queries are relatively more fixed.

At this point we take the part that used to be a view and materialize it into a table. The logic is the same logic, but it becomes a table, and once it is a table, the amount of data scanned by a query is much smaller and query performance improves. This is also a relatively simple approach: with the DataWorks scheduler, minute-level scheduling can achieve quasi-real time, generating a scheduled batch every 5 or 10 minutes, which can meet the requirements of most companies' business scenarios.
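A hedged sketch of that materialization step as a scheduled SQL job (the scheduling itself would be configured in DataWorks; the table and view names continue the hypothetical example above and are assumptions):

```sql
-- Hypothetical job, run every 5-10 minutes: rebuild the recent window of the
-- aggregate table from the view defined earlier.
DELETE FROM dws_channel_gmv
WHERE  pay_hour >= NOW() - INTERVAL '1 day';   -- refresh only the recent window

INSERT INTO dws_channel_gmv
SELECT channel, category, pay_hour, gmv
FROM   dws_channel_gmv_v
WHERE  pay_hour >= NOW() - INTERVAL '1 day';
```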

In this way development becomes very simple: if any link or any piece of data goes wrong, you just re-run the DataWorks schedule, so operation and maintenance are also very simple.

Real-time warehouse scenario 3: Real-time incremental data statistics

There are also scenarios that are very sensitive to data latency and must process data as it is generated. In this case, incremental computation is used: Flink aggregates the detail layer, the aggregation layer, and so on, ahead of time; after aggregation, the results are stored and then served externally.

Unlike traditional incremental computation, we recommend that the intermediate state also be persisted. The benefit is increased flexibility for subsequent analysis and a sharp reduction in the cost of correcting data errors. We have seen many companies use the older approach of stringing the data together fully in real time through Kafka topics; once there is any problem with the quality of the intermediate data, it is hard to correct the data in Kafka and hard to find where the data went wrong, so the cost of troubleshooting is very high.

When the data of each topic is also synchronized into Hologres, if there is a problem with the data, you can correct the corresponding table in the database and then re-process the data in the database to realize the correction.

Three selection principles for the Hologres real-time data warehouse scenarios

In actual business, we can choose according to the situation. The selection principles for the three Hologres real-time data warehouse scenarios are as follows:

  • If the workload is purely offline, MaxCompute is preferred;
  • If the real-time requirement is simple, the data volume is small, and only incremental data is needed to compute the results, Scenario 3 is suitable: real-time incremental data statistics;
  • If there is a real-time requirement but it is not very strict, and development efficiency comes first, the minute-level quasi-real-time solution is preferred, i.e. Scenario 2: minute-level quasi-real time;
  • If the real-time requirement is very high and there is a need for ad hoc queries, with relatively sufficient resources, Scenario 1 is suitable: ad hoc queries.

Alibaba Customer Experience System (CCO) data warehouse real-time transformation

CCO service operation system: digital operation capability determines the service experience of consumers and merchants.

Alibaba Customer Experience System (CCO) real-time data warehouse transformation

The real-time data warehouse has experienced three stages: database -> traditional data warehouse -> real-time data warehouse.

  • Business difficulties: 1) High data complexity, covering the whole chain of purchase, ordering, payment, and after-sales, with 90% of the data being real-time; 2) Large data volume: logs (10 million/s), transactions (10 million/s), consultations (10 thousand/s); 3) Rich scenarios: large real-time monitoring screens, real-time interactive data analysis products, and ToC online applications.

  • Technical Architecture: DataHub + Real-time Compute Flink + Hologres + MaxCompute
  • Benefits: 1) Overall hardware resource cost decreased by 30+%; 2) High-throughput real-time writes: supports write requirements of ten million records/second for row storage and hundreds of thousands of records/second for column storage; 3) Simplified real-time links: development oriented around common layers, with reuse; 4) Unified serving: supports multi-dimensional analysis and high-QPS service-oriented query scenarios at the same time; 5) MC-Hologres query service: on November 11, 2020 the average query latency was 142 ms, and 99.99% of queries finished within 200 ms; 6) Supports the construction of 200+ real-time data screens and provides stable data query services for nearly 300+ internal service staff.

Analysis of e-commerce marketing activities

Let’s take a marketing example.

In the past, marketing campaigns were all planned a month in advance: what kind of ads to place where, to whom, and what kind of vouchers to issue. On the Internet the requirements are higher. In scenarios like Double 11, there is more and more demand for adjusting strategy in real time: the campaign may last only one day, so we need to know, in real time, the transaction ranking, the transaction composition, the inventory situation, and the conversion rate of each traffic channel. For example, when a page is opened we may initially want to recommend a certain product, but as time goes on we find that the product's conversion rate is very low, and at that point we need to adjust the marketing strategy.

Therefore, in this scenario of real-time marketing, there are very high requirements for real-time data analysis. It is necessary to understand the conversion rate/transaction rate of each channel, each type of population and each type of commodity in real time, adjust marketing strategies according to the situation in real time and finally improve the conversion rate/transaction rate.

The general architecture used internally at Alibaba for this is shown above: QuickBI is used to build the data products, Hologres for interactive analysis, and Flink and MaxCompute for data processing. This is a classic architecture.

Real-time computing Flink version + RDS/KV/OLAP solution

The real-time computing Flink version + RDS/KV/OLAP architecture is an early solution: all the computation logic is strung together through Kafka and processed, and then Flink aggregates the results into result sets. Its limitation is that development efficiency is poor and resource consumption is very large.

We know that data analysis may involve N dimensions, such as time, region, product, audience, category, and so on, and each dimension can be subdivided further, for example first-level and second-level campaign venues, third-level categories, and audience portraits that include spending power, education level, geography, etc. Correlation analysis over any combination of dimensions may produce some conclusion.

So we say the old approach required a huge amount of computation because there are 2ⁿ combinations. To pre-compute every possible angle of analysis and store it, so that no matter which combination of dimensions the analyst picks in a report a corresponding result can be found, is an astronomical amount of computation; no programmer can write that many jobs, and whatever has not been pre-computed has no result set.

On top of that, suppose we started with a combination of three dimensions, and suddenly the boss decides those three dimensions are not enough to figure out what is really going on today and wants to add another one. But we did not write that logic in advance, and by then the online data had already been consumed, so there was no way to recompute it, or the cost of recomputing was very high. So it is an approach with very high resource consumption. And after development we had computed thousands of intermediate tables without being able to tell which of these result sets and combinations were actually used; everything had to be computed first and put into temporary tables in the database, which is very costly.

Real-time computing Flink version + Hologres interactive query solution

As shown above, the structure after the transformation has not changed fundamentally; it is still processing logic. The biggest change is that a lot of the intermediate processing has been replaced by views, which do their computation inside the database, moving some of what used to be pre-computation to query time.

The raw data is still processed through messaging middleware, parsed and deduplicated by Flink, and then stored as details. The detailed data no longer goes through lots of secondary aggregation; instead, many logical views are used. No matter which dimensions you want to look at, you can write the SQL at any time, and the view can go online at any time without affecting the original data. In this way we can query whatever we want, no analysis work is wasted, and development efficiency is very high. From computing 2ⁿ combinations in the past, now only one copy of the details is stored, with some light aggregation layers built according to business scenarios and the DWD/DWS logic encapsulated as views, which greatly improves development efficiency.

Keeping operation and maintenance costs down requires the architecture to have very good elasticity and scalability, which is also a capability of Hologres. During Double 11, when traffic is ten times larger, it can scale out elastically at any time, which is very convenient.

Helping the business make fast decisions and adjustments

This has brought very big gains. In the past, a very complicated stream computing pipeline was needed, and it could take several hours from the time the data was produced to the time the results came out. That meant we could only see data from a few hours ago, and after adjusting the business logic we had to wait another few hours to see the effect, which severely limited the agility of the whole business.

With the Flink + Hologres architecture, data can be analyzed in real time as soon as it is generated, and the business can make real-time decisions and adjustments at any time; the flexibility of analysis is very high. For example, many customers reported that products they expected to sell well at the beginning turned out, by the evening, to be completely different from expectations, and the best-sellers were not the ones anticipated. These are problems that cannot be seen from reports planned in advance. In real scenarios there is a lot of demand for bringing new business requirements online flexibly; if a new requirement can go live within a few minutes or an hour, these flexible scenarios can be handled, which is also a big benefit of the Hologres real-time data warehouse.

Efficient real-time development experience

We can see that there are dozens of different indicators on this big screen. In the past, developing this kind of system was relatively complex: such data could only be produced by aggregating different data sources in different ways through different scheduling jobs.

After adopting Hologres, the development could be completed in about 2 person-days, because there is no longer so much scheduling behind it; only a few SQL statements need to be written, which greatly improves development efficiency.

The system can write more than 50 billion records in a single day, with a peak write throughput of over one million records per second (100W+/s) and a query response latency under 1 second. It performs very well both in write volume and in the analysis experience.

Building a data warehouse with Hologres goes through three stages

Hologres supports not only full real-time but also quasi-real-time and incremental real-time computation methods. At different stages of data warehouse construction, Hologres is used in different ways; the stages include exploration, rapid development, and maturity.

The exploration stage is the scenario where flexible analysis is required, the data model is not yet fixed, and the analysts are not sure what they want to do with the data. Data warehouse construction at this stage should focus on building up the detail layers. First, the company's data should be centralized; if it is not centralized, analysis efficiency cannot be guaranteed. Building the ODS layer is the main task, with DWD as the auxiliary one. To ensure the quality of the ODS layer, the basic work of cleaning, joining, and widening the data should be done well to generate some DWD detail-layer data, and analysis is then provided directly on that detailed data. The ADS layer should be kept thin, without much scheduling and aggregation on top, because the method of analysis is still uncertain at this point, so construction of the lower layers takes priority. The detailed data can be stored in Hologres in either column or row orientation.

The second stage is the rapid development stage. Its main characteristic is that the company starts to consider a data middle platform and begins to staff roles such as data product manager and data analyst. At this point the company's indicator system starts to be built up, and DWD alone is no longer sufficient to consolidate reusable data assets. We need to build some reusable wide tables on top of DWD, and some performance-critical scenarios can also be turned into ADS-layer tables through secondary processing. Hologres supports both row and column storage; the reusable indicator tables are mainly column-store, and if an ADS point-lookup scenario is needed, further processing can turn them into row-store tables.
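A hedged sketch of what choosing the storage orientation might look like in Hologres DDL. The set_table_property call reflects my understanding of the Hologres convention and should be checked against the current documentation; all table and column names are hypothetical.

```sql
-- Hypothetical column-store wide table for reusable analysis (DWS).
BEGIN;
CREATE TABLE dws_channel_item_gmv (
  channel TEXT,
  sku_id  BIGINT,
  dt      DATE,
  gmv     DECIMAL(18, 2),
  PRIMARY KEY (channel, sku_id, dt)
);
CALL set_table_property('dws_channel_item_gmv', 'orientation', 'column');  -- assumed property name
COMMIT;

-- Hypothetical row-store table for ADS point lookups by primary key.
BEGIN;
CREATE TABLE ads_user_profile (
  user_id BIGINT PRIMARY KEY,
  crowd   TEXT,
  gmv_30d DECIMAL(18, 2)
);
CALL set_table_property('ads_user_profile', 'orientation', 'row');
COMMIT;
```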

The third stage is the mature and stable stage. At this point the company's indicator system is relatively complete and there are many reusable indicators, so the DWS aggregation layers that can be reused are consolidated out of the earlier per-scenario aggregations. Hologres also provides technical support at this stage: as mentioned, its built-in binlog can drive continuous aggregation from DWD to DWS.

As you can see, data warehouse construction is divided into different stages, each with a different emphasis, and at each stage different Hologres capabilities can be used to meet different needs.

Conclusion

So to conclude, what is Hologres?

First of all, Hologres is a relatively simple system to develop against: you can use it as long as you can write SQL. It places almost no restrictions on SQL, and supports standard JDBC connections as well as join and window operations.

Hologres is Postgres compatible, so whether you use a development tool or a BI tool, you can connect to Holo through the Postgres protocol.

All the data structures in Holo are plain tables with data types, primary keys, and indexes, all familiar concepts, so the learning cost is very low.

Secondly, the system's queries are fast enough. It supports real-time writes, and data is queryable as soon as it is written; end-to-end real time is the most basic requirement, and written data can be queried without having to be flushed to disk first.

Beyond that, it is very native to the big data ecosystem, whether Flink or MaxCompute. Flink has native connectors for it and supports Holo's native binlog interface for the highest-throughput performance. Hologres and MaxCompute are connected at the file system level, so in many cases data processed in MaxCompute does not need to be exported into Hologres; Hologres can query it as external tables, with better performance than querying MaxCompute directly. If you want even better performance, you can still import the MaxCompute tables into Hologres.

Traditionally, loading data from MaxCompute into almost any other OLAP system takes a long time, but because MaxCompute connects to the underlying file system of Hologres, the synchronization process does not read the data out and write it into another system; it works through the file system's underlying interface, much like a copy operation, so performance can improve by 10 to 100 times.

The most important point is the improvement of development efficiency. In the past, we might build the system by week, but now we can build the system by day. We can spend more time on data mining, rather than on the low-level operation and maintenance of the data platform.

There are also many benefits in terms of friendliness. Hologres is a hosted service that gives users the flexibility to scale up and down. During scaling, the compute nodes and the storage can be expanded independently: when compute is not enough, compute can be expanded on its own, without having to expand compute and storage at the same time.

This technology has been rigorously verified in Alibaba's own scenarios. Unlike many other technologies, it was not built just to be packaged and sold as a product; inside Alibaba it has been repeatedly validated over four years, across multiple Double 11s and dozens of different BUs. We believe it is already a mature technology that can be used by more users, so we have turned it into a cloud-hosted service for a wide range of users. Through the HSAP architecture, multiple scenarios can be supported with a simpler architecture, achieving better cost performance.
