Author: "Data analysis is not a thing" (www.jianshu.com/p/05a8db84e…)

The data middle platform is known as the next stop for big data. The concept was raised by Alibaba, and its core idea is data sharing. In 2015, Alibaba proposed the strategy of "big middle platform, small front end". In 2018, the "Tencent middle platform debate" put the middle platform back at the center of the conversation.

In 2019, it seems everyone is talking about the data middle platform, but not everyone is quite sure what it means. Is the data middle platform a lofty concept that only big tech companies need to consider? Should ordinary enterprises build one? Will its emergence pose a disruptive challenge to existing data practitioners?

The data middle platform is not a big data platform!

First of all, it is not a platform and it is not a system. If some vendor says they have a data middle platform to sell you, sorry, they are lying.

To answer what the data middle platform is, we first have to discuss what a middle platform is. There is no clear definition, but as engineers we can start by thinking of the middle platform as a kind of middle layer. And since it is a middle layer, the middle platform is at heart a technical term, so we can discuss it entirely from a technical angle.

We can apply Gartner's Pace Layering to understand why a middle layer exists, and thereby better understand its positioning and value. Pace Layering says that systems can be layered according to their rate of change, so that reasonable boundaries and services can be analyzed and designed layer by layer.

In data development, the core data model changes relatively slowly and the workload of maintaining data is large, while the pace of business innovation, and with it the demand for data, changes very quickly.

The data middle platform emerged to make up for the lag in responsiveness between data development and application development caused by this mismatch in speed.

Efficiency issues: why does it take more than ten days of application development to add a report? Why can't I get user recommendation lists in real time? When business people raise a small question about the data, the investigation drags on, and it turns out the data was changed at the source, which ultimately delays the release date.

Collaboration issues: when a business application is developed, its requirements are similar to those of other projects, but the data has to be developed all over again because it is maintained by a different project team.

Capability issues: data processing and maintenance is a relatively specialized discipline that requires dedicated professionals, but very often we have a large number of application developers and only a handful of data developers.

All three of these problems slow down application development teams. This is the key value of the middle platform: keeping the front-end development team's pace free from the drag of back-end data development.

The data middle platform is a logical concept: it aggregates and governs cross-domain data, abstracts and encapsulates that data as services, and delivers them to the front end to create business value.


The Data API is the core of the data middle platform. It is the bridge between the front end and the back end: data is served through APIs, rather than by handing the database directly to the front end and letting it develop against the data itself. How Data APIs are produced, how to produce them faster, how to make them clearer, and how to improve the quality of the data behind them are the questions the data middle platform is built around.
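To make the idea concrete, here is a minimal sketch of what a Data API can look like: the front end calls an HTTP endpoint instead of querying the warehouse directly. Flask and sqlite3 merely stand in for a real API gateway and data warehouse, and all table, field, and route names (user_profile, lifecycle_stage, /api/v1/user-profile) are hypothetical.

```python
# A minimal sketch of a "Data API": the front end calls an HTTP endpoint
# instead of querying the warehouse directly. Flask and sqlite3 stand in
# for the real API gateway and warehouse; all names are illustrative.
import sqlite3
from flask import Flask, jsonify, abort

app = Flask(__name__)
DB_PATH = "warehouse.db"  # stand-in for the real warehouse connection

def query_one(sql, params):
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(sql, params).fetchone()
        return dict(row) if row else None

@app.route("/api/v1/user-profile/<user_id>")
def user_profile(user_id):
    # The API exposes a curated, documented view of the data,
    # not the underlying tables themselves.
    profile = query_one(
        "SELECT user_id, city, lifecycle_stage FROM user_profile WHERE user_id = ?",
        (user_id,),
    )
    if profile is None:
        abort(404)
    return jsonify(profile)

if __name__ == "__main__":
    app.run(port=8080)
```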

These concepts sound rather abstract, so let's use Alibaba's example to make them concrete.

The Alibaba data middle platform explained

Panorama of the Alibaba data middle platform enabling the business

In the architecture diagram, the bottom layer is mainly data collection and access: data from each business (Taobao, Tmall, Hema, and so on) is accessed in its own format and extracted to the computing platform. On top of that, the OneData system builds a "public data layer" organized around business segments plus analysis dimensions.

On top of this public data layer, data systems are built according to business needs: a consumer data system, an enterprise data system, a content data system, and so on.

Only after this deep processing can data deliver its value to products and the business; finally, unified data services are exposed through the data service middleware "OneService".

The three systems of the Alibaba data middle platform

After years of practice, the Alibaba Cloud data middle platform has developed its core capability framework: product + technology + methodology.

Tempered by practice across the Alibaba ecosystem, the data middle platform on the cloud builds data and manages data assets intelligently from a business perspective rather than a purely technical one, and provides services such as data invocation, data monitoring, data analysis, and data presentation.

It is the engine for building data intelligently and making data intelligent. Guided by the OneData, OneEntity, and OneService methodologies, the core capabilities of the data middle platform on the cloud have steadily accumulated. At Alibaba, almost everyone knows these three systems.

OneData is committed to unifying data standards so that data is an asset rather than a cost; OneEntity aims to unify entities so that data can be integrated rather than isolated; OneService aims to unify data services so that data can be reused rather than replicated.

These three systems are not just methodology; they are backed by deep technical accumulation and continuously optimized products, which together form the core capability framework of the data middle platform on Alibaba Cloud.

How the Alibaba data middle platform supports and enables the business

The Alibaba data middle platform has been tested across every business in the Alibaba ecosystem, including new retail, finance, logistics, marketing, travel, health, entertainment, social networking, and other fields.

Beyond building its own core capabilities, the data middle platform empowers the business front end upward and connects to the unified computing back end downward.

The six data technology domains of the data middle platform

At the beginning of the construction of Alibaba's public data layer, six data technology domains were planned: data model, storage governance, data quality, security and permissions, platform operations and maintenance, and R&D engineering.

By the second phase of the public data layer project, the storage governance domain had expanded into resource governance and then been upgraded to data asset governance; the security and permissions domain had been upgraded to data trust; platform operations and maintenance, with much of its work already landed in products, was no longer advanced as a separate data technology domain; the data model and data quality domains were still in progress but with much new substance added; and the intelligent black box domain appeared as a newcomer.

The data technology domains are therefore not fixed; they keep expanding and evolving as the business develops and technology breaks through.

So, what about a real-time data middle platform?

The following is a logical framework for implementing a real-time data middle platform; note that the most critical layer is the real-time model.

1. Real-time access

Different types of data need different access methods: Flume + Kafka is now the standard, with other approaches for files, databases (DSG), and so on. For example, telecom operators have real-time data such as orders and calls in the B domain, and location and Internet access records in the O domain.
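As a minimal sketch of the Kafka side of real-time access: collectors such as Flume push raw events into a topic, and downstream jobs consume it. The topic name "o-domain-location", the broker address, and the event fields are assumptions; kafka-python is used here purely for illustration.

```python
# A minimal sketch of real-time access on the Kafka side: raw events land
# in a topic and are consumed by downstream jobs. Topic and broker names
# are assumptions.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "o-domain-location",
    bootstrap_servers="localhost:9092",
    group_id="realtime-access-demo",
    auto_offset_reset="latest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value  # e.g. {"user_id": "...", "cell_id": "...", "ts": 1690000000}
    print(event)           # in a real pipeline this would feed the compute layer
```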

2. Computing framework

For example, a Kappa-based architecture unifies real-time and offline business development. Compared with the traditional Lambda architecture, developers only have to face one framework, so development, testing, and operations are all easier, and the strengths of the Flink streaming framework can be fully exploited: high throughput, millisecond response, and unified batch/stream processing.

For example, the streaming component slices incoming data in real time, the batch component provides an offline data model held resident in memory, and the two are joined during processing in a batch-stream association.
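The sketch below illustrates that batch-stream association in plain Python rather than Flink: an offline model is loaded into memory and periodically refreshed, and each streaming event is enriched against it. All names (load_offline_user_model, user_tier, the refresh interval) are hypothetical.

```python
# A simplified sketch of batch-stream association: streaming events are
# joined against an offline model kept resident in memory and refreshed
# periodically. Plain Python stands in for Flink; all names are illustrative.
import time

def load_offline_user_model():
    # Stand-in for reloading the latest offline model (e.g. from HDFS/Hive).
    return {"u001": {"user_tier": "gold"}, "u002": {"user_tier": "silver"}}

def event_stream():
    # Stand-in for a Kafka/Flink source of real-time events.
    for uid in ["u001", "u002", "u001"]:
        yield {"user_id": uid, "ts": int(time.time())}

REFRESH_INTERVAL = 3600  # reload the offline model hourly
offline_model = load_offline_user_model()
last_refresh = time.time()

for event in event_stream():
    if time.time() - last_refresh > REFRESH_INTERVAL:
        offline_model = load_offline_user_model()
        last_refresh = time.time()
    profile = offline_model.get(event["user_id"], {})
    enriched = {**event, **profile}  # the batch-stream join
    print(enriched)
```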

3. Real-time model

Like the data warehouse model, the real-time model must first be business-oriented. Operators, for example, have a whole series of real-time scenarios: traffic operations, service reminders, competitive response, new-user acquisition, driving foot traffic to service halls and stores, voice consumption, operations evaluation, real-time care, real-time alerting, real-time insight, real-time recommendation, and so on. You always need to extract common data model elements from your real-time business.

For example, in a real-time user-acquisition campaign targeting migrant workers, a possible trigger scenario is to launch marketing toward users who enter a transportation hub and stay there for more than 10 minutes. The common element "dwell time at a location" may therefore be a reusable real-time model, as the sketch below illustrates.
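Here is a minimal sketch of that reusable dwell-time element, assuming location events carrying user_id, cell_id, and a Unix timestamp. The hub cell list and the 10-minute threshold are illustrative assumptions, not the operator's actual configuration.

```python
# A minimal sketch of the reusable "dwell time at a location" element.
# Event fields, HUB_CELLS, and the threshold are illustrative assumptions.
DWELL_THRESHOLD = 10 * 60          # 10 minutes, in seconds
HUB_CELLS = {"cell_rail_station"}  # cells that map to transportation hubs

entered_at = {}   # (user_id, cell_id) -> timestamp of first event at that cell
notified = set()  # campaigns already triggered, to avoid duplicates

def on_location_event(event):
    key = (event["user_id"], event["cell_id"])
    entered_at.setdefault(key, event["ts"])
    dwell = event["ts"] - entered_at[key]
    if event["cell_id"] in HUB_CELLS and dwell >= DWELL_THRESHOLD and key not in notified:
        notified.add(key)
        print(f"trigger marketing for {event['user_id']} (dwell {dwell}s)")

# A fuller version would reset the entry time when the user leaves the cell.
on_location_event({"user_id": "u001", "cell_id": "cell_rail_station", "ts": 0})
on_location_event({"user_id": "u001", "cell_id": "cell_rail_station", "ts": 700})
```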

Vertically, the real-time model can be divided into a DWD layer and a DW layer. The DWD model applies standardized naming and filtering to all kinds of real-time data, which makes standardized data management easier. DW models fall into three categories: the dynamic model, the event model, and the time-series model; each suits different scenarios and needs an appropriate storage format.

Dynamic model: aggregates and analyzes real-time data into real-time statistical indicators, such as real-time service operation metrics. It can be stored in Kafka or HBase; a small counter sketch follows below.
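A minimal sketch of the HBase variant, assuming indicators are kept as atomic counters with one row per metric per minute. The table name "realtime_metrics" and column family "m" are assumptions; happybase talks to HBase through its Thrift server.

```python
# A sketch of a dynamic model: real-time indicators kept as HBase counters,
# one row per metric per minute bucket. Table and column names are assumed.
import time
import happybase  # pip install happybase

connection = happybase.Connection(host="localhost")  # HBase Thrift server
metrics = connection.table("realtime_metrics")

def record_order(event):
    # Row key: metric name + minute bucket, e.g. "order_count:202406011230"
    minute = time.strftime("%Y%m%d%H%M", time.localtime(event["ts"]))
    row = f"order_count:{minute}".encode()
    metrics.counter_inc(row, b"m:value", 1)

record_order({"ts": int(time.time()), "order_id": "o123"})
```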

Event model: real-time data is abstracted into a series of business events, such as user location change events derived from location log tracks, which can trigger LBS location marketing. A typical location event model design, which can be stored in an MQ or in Redis, might look like the sketch below:
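A minimal sketch, assuming raw location points are turned into "location changed" business events and published onto a Redis stream. The key and stream names (loc:last_cell, loc:events) and the event fields are hypothetical.

```python
# A sketch of the event model: raw location points become "location changed"
# business events pushed onto a Redis stream. Key/stream names are assumed.
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def on_location_point(point):
    last = r.hget("loc:last_cell", point["user_id"])
    if last is None or last.decode() != point["cell_id"]:
        event = {
            "event_type": "location_changed",
            "user_id": point["user_id"],
            "from_cell": last.decode() if last else "",
            "to_cell": point["cell_id"],
            "ts": point["ts"],
        }
        r.xadd("loc:events", {"payload": json.dumps(event)})  # publish the event
        r.hset("loc:last_cell", point["user_id"], point["cell_id"])

on_location_point({"user_id": "u001", "cell_id": "cell_a", "ts": 1690000000})
```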

You can also design a sliding window model, for example keeping only the latest minute of each user's location points in a sliding window, as sketched below:
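A minimal sketch of that sliding window in Redis, assuming each user's recent points are kept in a sorted set scored by timestamp and trimmed as they age out. The key naming scheme (loc:window:<user_id>) and window length are assumptions.

```python
# A sketch of a sliding-window model: a user's recent location points live in
# a Redis sorted set scored by timestamp, with expired points trimmed away.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
WINDOW_SECONDS = 60  # keep only the latest minute of points

def add_location_point(point):
    key = f"loc:window:{point['user_id']}"
    member = json.dumps({"cell_id": point["cell_id"], "ts": point["ts"]})
    r.zadd(key, {member: point["ts"]})                        # add the new point
    r.zremrangebyscore(key, 0, point["ts"] - WINDOW_SECONDS)  # drop expired points

def window_points(user_id, now):
    key = f"loc:window:{user_id}"
    return [json.loads(m) for m in r.zrangebyscore(key, now - WINDOW_SECONDS, now)]
```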

Time-series model: it mainly stores users' online time, spatial location, and related information so that business scenarios can be computed quickly; dwell duration, for example, is very easy to derive from it. It can be stored in HBase or a TSDB (time-series database), as sketched below:
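A minimal sketch of the HBase variant, assuming one row per user per location point keyed by user_id plus timestamp, so a prefix scan returns the trajectory in time order and dwell durations fall out of it. The table name "user_trajectory" and column family "t" are assumptions.

```python
# A sketch of the time-series model in HBase: rows keyed by user_id:timestamp
# so a prefix scan yields a user's trajectory, from which dwell durations
# can be derived. Table and column names are assumed.
import happybase

connection = happybase.Connection(host="localhost")
trajectory = connection.table("user_trajectory")

def store_point(point):
    row = f"{point['user_id']}:{point['ts']:012d}".encode()
    trajectory.put(row, {b"t:cell_id": point["cell_id"].encode()})

def dwell_durations(user_id):
    # Walk the user's points in time order and sum up time spent per cell.
    durations, prev_cell, prev_ts = {}, None, None
    for row, data in trajectory.scan(row_prefix=f"{user_id}:".encode()):
        ts = int(row.decode().split(":")[1])
        cell = data[b"t:cell_id"].decode()
        if prev_cell is not None:
            durations[prev_cell] = durations.get(prev_cell, 0) + (ts - prev_ts)
        prev_cell, prev_ts = cell, ts
    return durations
```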

4. Real-time service

A real-time model alone is not enough; the data middle platform also needs to provide graphical, process-driven, orchestration-capable data development tools to truly reduce the cost of real-time data development. However, because offline and real-time data are processed with different technical means, the two kinds of data are mostly developed and managed on different platforms.

For example, our offline data models are managed through the DACP platform, but real-time data sits outside DACP and is often handled by each application itself: the application has to write dedicated scripts to consume and process the raw data from the stream processing engine. Not only is the barrier to entry high, the waste of resources is also quite serious, and every real-time application ends up as an island of streaming data.

From the application's point of view, what the business actually needs is a unified platform for data development and management, where offline and real-time data are managed as one object, with capabilities such as hybrid orchestration and hybrid association, and where simple SQL-style customization can output whatever data an application needs, so that efficient real-time and offline data services can be provided externally.

5. Real-time applications

If the data middle platform supports rapid orchestration of real-time data, then by our estimate the cycle of developing, testing, and deploying data for a real-time scenario shrinks from 0.5-1 months to 1-2 days, which is hugely beneficial.

The amount of data Alibaba processes has reached the EB level, equivalent to the storage of 1 billion high-definition movies. On November 11, 2016, real-time processing peaked at 94 million records per second, yet it took only 2.5 seconds from a user generating data at the source, through collection, integration, and computation, to serving the data and displaying it at the front end.

"Umeng+" is a data company formed by integrating and upgrading several data companies acquired by Alibaba. To cite just a few of the indicators Umeng+ disclosed in 2017: it covered 1.4 billion active devices, 6.85 million websites, and 1.35 million apps, and processed about 28 billion records on an average day. All of this is built on Alibaba's powerful data processing technology.

When there is enough real-time data and the scenarios are rich enough, the need to establish a real-time data middle platform becomes very strong.

As internal and external big data operations deepen, we find demand growing and growing; you will be surprised to discover that demand often rises as your technical capability strengthens. Very often, technology is the primary productive force, something our product and operations managers responsible for monetization know well.

That is why I have been wondering whether we could build a true real-time data middle platform capable of creating large numbers of real-time applications quickly and efficiently, and so take the management and application of big data to a new level. In the end, that is the road we have taken.

