Brief introduction: At the 2021 Alibaba Cloud Financial Data Intelligence Summit, in the special session on cloud native as the “dark horse” driving intelligent operational growth, Wei Chuangxian, senior technical expert for Alibaba Cloud Database, explained from the data-link perspective how the cloud native data warehouse supports digital operations, full-link marketing, and Alibaba Group’s Double 11 business, and presented best-practice cases and application scenarios from financial customers. This article is compiled from the speech recording and slides.

Wei Chuangxian, senior technical expert, Alibaba Cloud Database

1. Background and trend

(I) Alibaba’s 15 years of cloud computing practice

Alibaba’s 15 years of cloud native development can be roughly divided into three stages.

The first stage, from 2006 to 2015, is the stage of Internet application architecture, when cloud native went from 0 to 1. In the earliest days, Alibaba built middleware for Taobao, which was the first prototype of the cloud. At that time we were running Oracle databases on IBM minicomputers. However, Alibaba found that as Taobao’s traffic kept growing, the Oracle machines could no longer meet business demands: within three months we would no longer be able to store and process our data. The problem was serious enough that Alibaba launched its plan to remove IBM, Oracle, and EMC from its stack (“de-IOE”).

At that point, Alibaba’s business was doing very well, but it faced many technical challenges. Alibaba therefore founded Alibaba Cloud (Aliyun) in 2009 to develop the Feitian operating system, opening its cloud era. Taobao and Tmall were merged to build the business middle platform, and the three core middleware systems were launched at that time.

The Feitian operating system is a distributed operating system built on Apsara. On top of its basic common modules sit two core services: Pangu and Fuxi. Pangu is the storage management service and Fuxi is the resource scheduling service; storage and resource allocation in the Feitian kernel are managed by these two services. The core Feitian services are compute, storage, database, and network.

To help developers build applications in the cloud easily, Feitian provides a rich set of connection and orchestration services for linking and organizing these core services, including notifications, queues, resource orchestration, and distributed transaction management.

The top layer of Feitian is the cloud marketplace, the first platform for software trading and delivery created by Alibaba Cloud. It is like the “App Store” of cloud computing: users can activate “software + cloud computing resources” with one click on the Alibaba Cloud website. Thousands of products are on sale in the marketplace, which supports software and services delivered as images, containers, orchestration templates, APIs, SaaS, services, and downloads.

This was the earliest cloud infrastructure, and a cloud-native architecture.

In 2011 we began working on container scheduling and started containerizing the Group’s online businesses. In 2013, the self-developed Feitian operating system fully supported the Group’s business.

In 2015, Alibaba Cloud’s cloud native technology was no longer used only for Alibaba’s internal business; it also began to be offered commercially. That is the first stage.

The second stage, from 2016 to 2019, is the stage in which the core systems became fully cloud native.

Since 2017, we have used cloud native technology not only for online workloads but for offline workloads as well. The Double 11 Shopping Festival produces a huge amount of transaction data, and the back-end analysis and post-processing of that data is done offline. We unified the underlying online and offline resource pools on cloud native technology, supporting million-scale e-commerce transactions.

By 2019, 100% of Alibaba’s core systems were running on the cloud. This was very difficult, because Alibaba’s business volume is so large that no ordinary system can support it.

The third stage, from 2020 to now, is the stage of comprehensively upgrading to next-generation cloud native technology. Alibaba established a cloud native technology committee and made cloud native upgrading a new technology strategy. Alibaba’s core systems now use cloud native products across the board to support its major promotions. Alibaba Cloud’s cloud native technology has been comprehensively upgraded, and the Serverless era has begun.

(II) Alibaba Cloud’s view of cloud computing

How does Alibaba view cloud computing? What is the difference between cloud computing and traditional technology?

Consider a village where every family has to dig its own well. Each family decides how large a well to dig based on factors such as the size of the household, roughly how much water it needs, and whether guests are likely to come. If more people than expected are in the house, or there is a drought, there may not be enough water. Besides the cost of digging the well, its day-to-day maintenance is also expensive.

Mapped onto an enterprise, this means building its own IT infrastructure: renting machine-room space from a carrier and buying servers to run its own services. If these machines sit idle, the company still has to pay for them, which is very costly.

The cloud solves this problem by pooling resources through virtualization, which in the well example corresponds to building a waterworks. A waterworks differs from a well in two ways. First, its supply is so large that even if 100 guests arrive there is enough water to meet demand. Second, instead of paying a lot upfront to dig a well, you pay only for what you use. Even after the water line is connected, if you never use it you never pay.

This gives enterprises two benefits. The first is speed: companies need to make decisions quickly, and instead of spending a lot of time “digging wells” they get a service that works out of the box. The second is a low upfront investment.

That is what the cloud offers. So what is cloud native?

Cloud native is a standardized service, with which many things no longer need to be planned in advance. For example, if I want to carry out a digital transformation, my requirement is simple: I need someone to provide the service, and the provider allocates as much as I ask for, with no preparation on my side. As my business grows, the underlying infrastructure grows with it, so it is highly elastic. This greatly reduces the cost and effort for enterprises, lets them focus on what they do best, and greatly improves their efficiency.

From the above example, the following points are easy to understand.

First of all, we believe that Container + K8S will become the new interface of cloud computing, which is a trend in the future.

Second, the entire software life cycle changes. The software life cycle used to be very long; with cloud native technology, iteration becomes faster and faster. Extending downward, this enables software-hardware integration; extending upward, it enables architecture modernization.

Finally, cloud native accelerates the digital upgrading of enterprises. Digital transformation used to be very complicated: buying machines, databases, and applications could take three to five years. Today an enterprise can complete its digital transformation in just a few months.

(III) Industry trends: Data production/processing is undergoing qualitative changes

In terms of industry trends, what is going to happen to the data in the future and what is going to happen to applications?

First of all, we think data is going to explode in size. The global data size in 2020 was about 40 ZB. How big is 40 ZB? If each movie is 1 GB, 40 ZB is about 40 trillion movies, or roughly 5,000 movies for every person in the world.

In addition, we predict that the global data size in 2025 will be 430% of that in 2020, and the global data size is growing every year.
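
As a rough check on the scale of these figures, the arithmetic can be sketched in a few lines of Python. The 40 ZB and 430% figures come from the talk; the 1 GB movie size is the talk’s own illustration, while the world population of about 7.8 billion and the use of decimal units are assumptions made here.

```python
# Back-of-the-envelope check of the data-size figures (decimal units assumed).
ZB = 10**21  # bytes in one zettabyte
GB = 10**9   # bytes in one gigabyte

data_2020_bytes = 40 * ZB          # 2020 figure quoted in the talk
movie_bytes = 1 * GB               # illustrative size of one movie
world_population = 7_800_000_000   # assumption: rough 2020 world population

movies_total = data_2020_bytes // movie_bytes
movies_per_person = movies_total / world_population

data_2025_zb = 40 * 430 / 100               # 430% of the 2020 figure
annual_growth = (430 / 100) ** (1 / 5) - 1  # implied compound annual growth

print(f"{movies_total:.1e} movies in total")
print(f"{movies_per_person:.0f} movies per person")
print(f"{data_2025_zb:.0f} ZB projected for 2025")
print(f"{annual_growth:.1%} implied annual growth")
```

Under these assumptions, the 2025 projection works out to 172 ZB, which corresponds to compound growth of roughly one third per year.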

The second is real-time data production/processing. We used to look at reports once a month; with big data, we can look at yesterday’s data every day. Data is now becoming more and more real-time, enabling responses within seconds. Take marketing as an example. During the Double 11 (Singles’ Day) shopping festival, when merchants find that a promotion in their store is not working, they can adjust their advertising or delivery strategy within a minute or a few minutes to get a better result. If the data were reported daily, by the time they saw it on November 12 the promotion would already have lost most of its effect. Real-time data therefore plays a very important role in scenarios like this, and it will also drive applications to become real-time.

The third is the intelligence of data production/processing. At present, unstructured data accounts for 80% of all data, mainly including text, graphics, images, audio, video, etc. Especially in the current popular field of live broadcasting, intelligent processing of unstructured data enables us to know the audience’s preferences and other information, so as to facilitate better business development. In addition, unstructured data continues to grow at a rate of 55% per year and will become a very important source of data analysis in the future.

The fourth is the accelerating move of data to the cloud. We think this move is unstoppable, just as gasoline cars will eventually be replaced by electric vehicles. It is projected that 49% of data will be stored in the cloud by 2025, and that 75% of databases will be on the cloud by 2023.

(IV) Industry trend: Cloud computing accelerates the evolution of database systems

Another industry trend cannot be ignored: cloud computing accelerates database system evolution.

First, let’s look at the history of database development. Databases took off in the 1980s and 1990s. At that time they were mainly commercial databases, such as Oracle and IBM DB2, some of which still dominate the market today.

In the 1990s, open source databases such as PostgreSQL and MySQL appeared. MySQL is used more in China, while PostgreSQL is used more abroad. After the 1990s, data volumes grew larger and larger. While the volume was small, PostgreSQL or MySQL on a single machine could solve the problem. With the explosive growth of data, storing and analyzing large volumes required distributed systems or minicomputers.

What is the importance of data analysis?

Take Snowflake, a data warehouse company that went public with a market cap of $100 billion and is worth $70 billion today, which is a lot for a company that makes only one product. Why is it worth so much?

Some time ago I talked with a professor. He said that for today’s enterprises, especially Internet companies in e-commerce or live streaming, the biggest cost in the early days was manpower, with salaries as the main expense. Now the biggest expense is information and data. To plan a company’s future, you need a lot of data to analyze what customers want most, what they need most, and how the industry is developing. As a result, companies have to buy a lot of data and do a lot of data analysis, and the cost of this has exceeded the cost of personnel. That is why a company that only does data warehousing has a market cap of $70 billion.

After 2000, people began to use Hadoop and Spark. From 2010, cloud native, integrated distributed products began to appear, such as those from AWS and AnalyticDB.

(V) Industry trend: Data warehouse accelerates from Big Data to Cloud-Native + Fast Data evolution

The slide shows the evolution of data warehousing: from offline to online, to integrated online and offline processing, and then to distributed computing. Capabilities have moved from statistics to AI; data types from structured data to the integration of structured and unstructured multimodal data; workloads from OLAP to HTAP; hardware from simple upgrades to software-hardware integration; and delivery from on-premise to cloud native + Serverless.

In different stages of evolution, there are a variety of products to support.

(VI) Evolution of database system architecture

The evolution of database system architecture can be understood through a simple analogy. At first one person works alone; then it becomes a workshop with ten people working in it; then it grows into several factories with many more workers. This is the whole development history of the data warehouse: from a single machine to a distributed system, with more and more people using the data.

Databases have developed the way a business does. At first a store can be run by two people, one responsible for production and the other for sales. As it grows, it gets more and more customers; it is still one store, but it may have ten employees. Later the business grows even larger, with 100,000 employees working across 10 sites. That is the distributed cloud native data warehouse.

(VII) Industry trends: key technologies of cloud native database

The slide lists the key technologies of cloud native databases.

Two of these technologies are worth a brief explanation. The first is cloud native. What does cloud native mean here? Suppose a user buys a database. When business volume is low, or the database is not used at all during a public holiday, the user is charged less; when business volume is high, the user is charged more. Charging according to demand is a requirement for our data warehouse.

The second is security and trust. For example, Alibaba has an investment department. If it invests 5 million yuan in Company A and 1 million yuan in Company B, all of this information is highly private and cannot be disclosed. If the information is managed by an employee, that employee may leave, and a leak after departure is hard to pursue legally. The question is how to keep this highly private information fully encrypted, so that even a DBA with the highest privileges cannot see it, making the system secure and trustworthy. This is explored in more detail later in this article.

2. Cloud native and big data applications

(I) Business challenges

The business faces a number of challenges, mainly in four areas.

The first is that the data is scattered and inconsistent, and there are so many sources of data that it’s a big challenge to get it together.

The second is that the system is extremely complex, with 40+ systems or components. A traditional Hadoop-based stack needs many of them: HDFS at the bottom, YARN and HBase above it, then Hive, Flink, and so on above those. It is very complex.

The third is that analysis is not real-time. A traditional big data architecture can only deliver T+1 data, that is, data that becomes available the day after it is produced.

Finally, there is high learning cost. The iteration speed of versions of different technologies is very fast and the learning cost is very high.

(II) Cloud native data warehouse + cloud native data lake to build a new generation of data storage and processing scheme

Alibaba Cloud therefore starts from the simplest possible architecture: one or two products cover the whole architecture, which makes things simpler for users, who can solve a wide range of problems with SQL alone. For example, data stored in OSS and data from the various production systems can be analyzed together.

(III) Cloud native data warehouse: Cloud native

The cloud native character of the cloud native data warehouse shows up first in storage: if there is only one piece of data, storage is allocated for only one piece of data, and as the amount of data grows, more storage is allocated automatically.

Compute works the same way: if there is no computation or analysis to run, no resources are allocated; resources are allocated only when there is a computation or analysis request. Everything is paid for as needed, and resources are elastic.
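
The difference between paying for peak capacity and paying for actual use can be illustrated with a small sketch. This is a toy cost model, not Alibaba Cloud’s billing logic: the demand figures, the unit price, and the function names `fixed_cost` and `elastic_cost` are all hypothetical.

```python
def fixed_cost(hourly_demand, unit_price):
    """Cost when capacity is sized for the peak and paid for every hour."""
    peak = max(hourly_demand)
    return peak * unit_price * len(hourly_demand)


def elastic_cost(hourly_demand, unit_price):
    """Cost when only the units actually used in each hour are billed."""
    return sum(units * unit_price for units in hourly_demand)


# Hypothetical demand in compute units per hour: idle at first, one short peak.
demand = [0, 0, 0, 2, 4, 10, 4, 2]
price = 1.5  # hypothetical price per compute unit per hour

print(fixed_cost(demand, price))    # peak of 10 units billed for all 8 hours
print(elastic_cost(demand, price))  # only the 22 unit-hours actually used
```

With this made-up workload, the fixed deployment is billed for 80 unit-hours while the elastic one is billed for 22, which is the “waterworks versus well” argument in numbers.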

(IV) Cloud native data warehouse: integration of database and big data

The slide shows the key technologies in the cloud native data warehouse. The first is hybrid row-column storage, which supports both high-throughput writes and high-concurrency queries.

The second is the mixed load, where you can both run ETL and do queries.

There are also smart indexes. Getting the most out of a database normally requires understanding the business and understanding indexes: knowing what affects queries and what affects writes. We want the system to be more intelligent here so that users do not have to manage these things themselves.
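
To make the row-versus-column trade-off behind hybrid storage concrete, here is a minimal sketch of the two layouts in plain Python. It only illustrates the general idea and is not AnalyticDB’s actual storage format; the table contents and helper names are hypothetical.

```python
# The same three orders, held in two layouts (hypothetical sample data).
rows = [
    {"order_id": 1, "user": "a", "amount": 30},
    {"order_id": 2, "user": "b", "amount": 80},
    {"order_id": 3, "user": "a", "amount": 120},
]


def get_order(table, order_id):
    """Row layout: a point lookup returns the whole record at once."""
    return next(r for r in table if r["order_id"] == order_id)


def to_columns(table):
    """Column layout: one list per column, built from the row layout."""
    return {name: [r[name] for r in table] for name in table[0]}


columns = to_columns(rows)
total_amount = sum(columns["amount"])  # an aggregate reads only one column

print(get_order(rows, 2))
print(total_amount)
```

Row layout suits writes and point lookups of whole records; column layout suits analytical scans over a few columns. A hybrid engine aims to serve both access patterns from one system.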

(V) A new generation of data warehouse solutions

The slide shows the architecture of the new-generation data warehouse solution. At the bottom is the data warehouse, and above it is the data warehouse model layer. Alibaba has built many models for the Taobao index, data insight, and so on, including linking all related information through a single ID. This information is aggregated into a model. The model layer has a data construction and management engine, which handles warehouse planning, code development, data asset management, data services, and so on.

At the top is business empowerment, which has many applications, including regulatory reporting, business decision making, risk warning, and marketing and operations.

(VI) Data security on cloud

Let’s expand on data security on the cloud. Every company has top-secret data, and that data faces many threats: administrators or users exceeding their authority, theft of data backups, malicious modification of data, and so on. The requirements are therefore as follows. Data must stay encrypted throughout storage, query, and sharing, so that no one (including administrators) can obtain the plaintext. Log integrity must be guaranteed in an untrusted environment, so that no one (including administrators) can tamper with log files. Query results must be correct in an untrusted environment, so that no one (including administrators) can tamper with them.

The earlier solution was very simple: encrypt the data when writing it to the database. For example, the value 123 is scrambled by encryption into something like 213 or 312. This looks like a good approach, so what is wrong with it? It cannot support queries. Suppose we want to find transactions of more than 50 yuan. After encryption, 50 is no longer 50; it may become 500, while the original 500 may become 50, so the query cannot be evaluated. The database becomes pure storage and cannot serve analytical queries.
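
The range-query problem can be demonstrated with a short sketch. A keyed HMAC from the Python standard library stands in for field-level encryption here, which is an assumption made only for illustration: it is deterministic and scrambles the order of values, which is all the example needs, but it is not a reversible cipher and not the scheme used by any particular product. The key and the function name `scramble` are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def scramble(amount: int) -> str:
    """Stand-in for field-level encryption: keyed, deterministic scrambling."""
    return hmac.new(KEY, str(amount).encode(), hashlib.sha256).hexdigest()


amounts = list(range(1, 101))               # transaction amounts, 1..100 yuan
stored = {a: scramble(a) for a in amounts}  # what the database would hold

# Equality lookups still work: equal inputs always scramble to equal outputs.
print(stored[50] == scramble(50))

# A range filter does not: comparing scrambled values says nothing about
# whether the original amount was greater than 50.
threshold = scramble(50)
by_ciphertext = {a for a, c in stored.items() if c > threshold}
by_plaintext = {a for a in amounts if a > 50}
print(len(by_plaintext), len(by_ciphertext), by_ciphertext == by_plaintext)
```

The filter on scrambled values returns an essentially arbitrary subset of rows, so a `WHERE amount > 50` style query cannot be answered without first decrypting the data.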

(VII) Encrypted data on the cloud never leaks

Is there a way to run data analysis while keeping the data confidential, and still use the original SQL?

The core of the solution is hardware. With ApsaraDB RDS (PostgreSQL version) + DPCA bare metal servers (security chip with TEE technology), the key is stored in advance, and all computation and logic run inside the encryption hardware. Because the whole process is protected by that hardware, even if someone copies all of the system’s memory, the copied data is all encrypted. Operations and maintenance staff therefore cannot leak the top-secret data even if they get hold of it.

3. Best practices

Let’s look at a few best practices:

DMP: Full link marketing

DMP stands for Data Management Platform; it is also called a data marketing platform.

What is at the core of marketing? It is finding people: finding the group that cares most about what you offer. The industry term for this is “circling people”, that is, audience selection.

What kind of scenario requires circling people? Suppose that today we want to find people who are interested in cloud native to discuss it with. The process of finding those people is called circling people.

Another scenario is on Tmall and Taobao. In the run-up to Singles’ Day, a merchant may judge that a customer is likely to buy a dress or a bag this year, treat that customer as a potential buyer, and push coupons to them.

The most critical part is precise audience targeting: being able to tell groups of people apart accurately. There are about 800 million e-commerce consumers in China, and the core task is to circle the people who are interested in a certain product and push messages to them.

Alibaba circles people on top of the data warehouse. First, we find seed groups of several million people. These seed groups are what we consider high-quality customers, such as those who spend more than 5,000 yuan or more than 10,000 yuan on Taobao every month. Once these people have been selected, the second step is to cluster them.

Clustering means dividing these millions of people into small groups. One group might like cosmetics, another digital products, and another books. Suppose the cosmetics group has 100,000 people. Most of them have probably bought cosmetics already, so they are likely not to buy again this time.

So, we need to find out among the 800 million consumers who are actually likely to buy cosmetics. How do we do that?

We convert each customer’s purchasing behavior and purchase history into a vector using an AI model. If two customers have similar purchasing behavior, the distance between their vectors is very small. The approach is then simple: take the seed users who are interested in, say, digital products, search the 800 million consumers for the vectors nearest to those seeds, which might yield 10 million people, and send those 10 million people digital-product ads or coupons. This is how the marketing is carried out.

There are several aspects at the heart of this process.

The first is clustering and segmenting the crowd and knowing their historical transactions, so the data must support multi-dimensional analysis along any dimension.

The second is being able to run specific analyses over all the data in the warehouse.

The third is approximate vector retrieval after clustering: finding the people whose vectors are similar to each cluster’s vector so that messages can be pushed to them.

This is the capability we have and is currently implemented based on AnalyticDB.
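
The seed-to-lookalike step can be sketched as a nearest-neighbour search over behaviour vectors. The version below is a brute-force, exact search in plain Python using cosine similarity; a production system would use an approximate index over far larger data, and nothing here reflects AnalyticDB’s actual interface. The vectors, user ids, and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lookalikes(seed_vectors, candidates, top_k):
    """Return ids of the top_k candidates most similar to the seed centroid."""
    center = centroid(seed_vectors)
    ranked = sorted(
        candidates,
        key=lambda uid: cosine_similarity(candidates[uid], center),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical behaviour vectors: [cosmetics, digital products, books]
seeds = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]]  # seed users who like digital products
users = {
    "u1": [0.9, 0.1, 0.0],
    "u2": [0.1, 0.7, 0.2],
    "u3": [0.0, 0.1, 0.9],
    "u4": [0.2, 0.9, 0.1],
}

print(lookalikes(seeds, users, top_k=2))
```

Here the two users whose vectors lean toward digital products are selected, while the cosmetics and books buyers are left out. Scanning every candidate is fine for four users but not for 800 million, which is why approximate retrieval matters at that scale.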

One more requirement is ad-hoc queries. For example, we need to find people who are interested in digital products and did not buy, say, an iPhone12 last year, because they might buy one this year. Or, for someone who bought an iPhone12 and AirPods together last year, we think there is a good chance they will buy an Apple keyboard or an Apple computer. We need to run a variety of queries over these people’s transactions in order to pinpoint our target audience.
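
The two ad-hoc queries described above can be sketched in SQL. The example uses Python’s built-in `sqlite3` module as a stand-in engine so that it runs anywhere; the `purchases` table, its columns, and the sample rows are hypothetical, and the same SQL shape would be issued against the data warehouse in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, product TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("u1", "iPhone12", 2020),
        ("u1", "AirPods", 2020),
        ("u2", "iPhone12", 2020),
        ("u3", "AirPods", 2020),
        ("u4", "Kindle", 2020),
    ],
)

# Users who bought both an iPhone12 and AirPods in 2020.
both = conn.execute(
    """
    SELECT user_id FROM purchases
    WHERE year = 2020 AND product IN ('iPhone12', 'AirPods')
    GROUP BY user_id
    HAVING COUNT(DISTINCT product) = 2
    ORDER BY user_id
    """
).fetchall()

# Users who did not buy an iPhone12 in 2020.
no_iphone = conn.execute(
    """
    SELECT DISTINCT user_id FROM purchases
    WHERE user_id NOT IN (
        SELECT user_id FROM purchases
        WHERE product = 'iPhone12' AND year = 2020
    )
    ORDER BY user_id
    """
).fetchall()

print(both)
print(no_iphone)
```

Queries like these change from campaign to campaign, which is why the warehouse has to answer arbitrary combinations of filters quickly rather than only precomputed reports.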

Fine-grained advertising management

Business challenges:

1) Keyword-search ad placement events must be written in real time at high concurrency;

2) All users query conversion rates through the dashboard at the same time, so the QPS of complex queries is high;

3) Response-time requirements are strict, so that the prime time for price adjustment is not missed.

Business value:

1) Unified keyword management across multiple sites and stores;

2) Thousands of TPS of concurrent writes handled;

3) Real-time analysis of massive data, with intelligent price adjustment by time period;

4) Fast identification and analysis of keywords to maximize revenue.

Online mall

Business challenges:

1) Analysis on the traditional MySQL database has hit its limit: complex reports over tens or hundreds of millions of rows cannot be returned;

2) Complex reports must return within seconds;

3) The solution must be compatible with the MySQL ecosystem;

4) As the business grows rapidly, requirements for compute and storage differ.

Business value:

1) RDS + AnalyticDB as a joint HTAP solution, isolating business workloads from analysis;

2) 2-10x improvement in analysis performance;

3) Distributed architecture with horizontal scaling and flexible allocation, supporting different data volumes and access volumes.

From 2020 to now is the stage of comprehensively upgrading to next-generation cloud native technology: the Serverless era. Alibaba has established a cloud native technology committee and made cloud native upgrading a new technology strategy. The cloud native data warehouse will gain more new features in the future to solve more of the industry’s core pain points. Stay tuned.

Related reading:

Cloud native data warehouse AnalyticDB MySQL Edition

Cloud native data warehouse AnalyticDB PostgreSQL Edition
