Introduction: Amid the wave of data middle platform construction now in full swing across the financial industry, many financial institutions are still unsure about it. Where is the middle platform headed? How exactly should data assets be managed? Alibaba's path to building its middle platform can serve as a reference for financial institutions. A few days ago, at the 2021 Alibaba Cloud Financial Data Intelligence Summit, Guan Tao, a researcher in the Alibaba Cloud Intelligence Computing Platform Division, gave a full account of the three core elements of how Alibaba builds its data middle platform and of the platform technology behind it, including the four typical stages of the data platform's development, the four technical challenges of supporting the middle-platform business, and the four major technology trends in data platforms.

By Guan Tao, researcher, Alibaba Cloud Intelligence Computing Platform Division

The four stages of Alibaba's data platform development

To build a data middle platform, a powerful data platform is indispensable as its base. To some extent, the four stages of Alibaba's data platform development are also the four stages of its data middle platform development. Across these stages you can see Alibaba extracting commercial value from its own data, consolidating data systems that were originally built separately in a divide-and-conquer fashion, forming new ideas about turning data into assets and applying data efficiently, and facing organizational change in the course of data platform governance.

Phase 1: Businesses flourish and discover the value of data

From 2009 to 2012, Alibaba's e-commerce business entered a period of explosive growth, and many well-known business teams emerged, such as Taobao, 1688, AliExpress and Etao. Each ran data-driven operations across all of its scenarios, so the business side had strong demand for data. At that time Alibaba's technology was almost entirely built on the IOE architecture, and its core data system was Oracle. Within two years, Alibaba had built the largest Oracle cluster in Asia. By 2010, however, Oracle could no longer meet the computing requirements: many data jobs were delayed, many requests went unmet, and the high cost meant it could no longer support business growth.

Alibaba began to take the next generation of its data platform seriously and launched two parallel projects. One was "Ladder 1", based on the open source Hadoop stack, with which several business teams built several Hadoop clusters that reached 4,000 servers. The other was "Ladder 2" (ODPS, now MaxCompute), launched as Alibaba's self-developed product, with a cluster of about 1,200 machines. The "Sheepdog" business of Ant's small and micro loans unit was the first to try it. The process of launching Ladder 2 was described as "human flesh cloud computing" and "step-by-step trial computing". In 2018, Academician Wang Jian read from "Into Thin Air" on CCTV's "Reader" program to describe the situation and convictions behind the self-developed data platform at that time. The two projects competed and cooperated inside Alibaba, exploring the direction of Alibaba's data platform in parallel. During this period, almost every business built its data systems vertically, racing ahead in small, independent closed loops within its own line of business.

Phase 2: Vertical closed loops in each business, and data silos appear

From 2012 to 2015, Alibaba's e-commerce business developed rapidly and more new businesses emerged. In 2013, it founded Cainiao and launched the "all-in wireless" strategy. In 2014, Alitrip was established, alongside deals with AutoNavi and Intime. In 2015, it launched DingTalk and Lingshoutong (Retailtong), established Koubei, and took a controlling stake in Ali Health. During this period Alibaba grew into 12 business departments running 9 different platform systems, each with a different architecture, and digitizing a single user journey required multiple data systems across multiple business units (BUs). Data silos were becoming more and more serious, and data costs kept rising. Building a unified data platform became urgent, and this was the starting point of Alibaba's data middle platform.

Meanwhile, Ladder 1 and Ladder 2 were undergoing major changes. On March 28, 2013, Yun Zheng, an architect in the Technology Security Department of Alibaba Group, sent an email to the group's senior executives: "According to the situation of data increment and future business growth, the storage and computing capacity of Ladder 1 and Ladder 2 will reach the bottleneck on June 21 this year." At that point, many businesses would be unable to launch because of technical limits. This meant the data platform could no longer run Ladder 1 and Ladder 2 in parallel and had to choose one of them. If Ladder 1 were chosen, how could Hadoop's 5,000-node limit be broken? For financial business, how could an open source system guarantee the security and availability of big data? There was no industry reference for a cross-data-center solution, so how could that be solved? Services interacted frequently, so how could data exchange across data centers be kept stable? This series of technical difficulties gradually pushed the data platform onto the path of self-development.

In the end, several technology departments of Alibaba Group joined forces and chose Ladder 2 to take on the 5K challenge. In just a few months, Ladder 2 grew from 1,500 machines to 5,000, broke through the limit of a single physical data center, passed a 10x stress test, and supported cross-cluster computing and high availability, laying a solid technical foundation for Alibaba's big data development in the years that followed.

After the technical breakthrough of the 5K project, new pressure followed. Rapid business growth led to rapid growth in data volume. How to manage data uniformly, secure it uniformly, and offer unified, open capabilities became the core concerns of the data platform. To this end, Alibaba launched a well-known internal project to synchronize the data of all business departments onto a unified big data platform for unified management. The project lasted two years and involved every business unit of Alibaba. Along the way, the general data platform capabilities were gradually productized and brought up to financial grade. Seen from that time, the construction of Alibaba's data platform was a process of comprehensive data unification, and also the construction and migration of China's first super-large-scale data platform.

Phase 3: The data middle platform supports sustainable business development

From 2015 to 2018, the data middle platform methodology began to take shape inside Alibaba, which kicked off the construction of Alibaba's data middle platform. In 2015, after Alibaba Group announced its "middle platform strategy", it began to build a more flexible organizational and business mechanism of "large middle platform, small front end" suited to the DT (data technology) era. Every operator in Alibaba could develop data-driven operation strategies covering the full user life cycle. Business Advisor began to explore the commercialization of data, and more businesses began to move toward real time. However, the rapid growth of data and computing, and the high resource consumption that came with it, raised the problem of data governance. The Alibaba team began to think about how to implement the data middle platform methodology at the platform layer, so that the data platform could support the construction of the data middle platform.

· Whose data is it? Who will use it? Who controls it? Who is responsible for data quality?
· The platform team and the business team are two separate teams. How should costs be split between them?
· How can the middle platform methodology be implemented on the data platform? How should it be governed?
· What should be done when data volume grows rapidly and outpaces business growth?
· One core table is 12 PB and every department makes its own copy. What if that wastes tens of millions every year?
· I know I need to delete half of the data, but which half?

Behind these questions lie data governance and the treatment of data as an asset. A platform system is needed to carry the methodology and make it truly unified. On the data platform side, DataWorks provides a one-stop capability for large-scale collaborative data development and governance, and MaxCompute supports server clusters on the order of 100,000 machines. Together they serve all BUs of Alibaba Group and the daily work of more than 200,000 employees, supporting the sustainable development of every business.

Phase 4: The data middle platform on the cloud grows alongside the business

After 2018, Alibaba's whole data platform system became very mature, and the platform side and the business side reached a very good state of coordination. The business side recognizes the value of the data platform, and the business and technical departments grow together. The data middle platform's service to the business has entered a positive cycle, which has become a sign that the data middle platform has been built successfully. In 2018, all of Alibaba's internal systems began moving to the cloud, and by 2021 the data middle platform and the business had moved to the cloud together: 100% of the core systems for Double 11 ran on the cloud, and Alibaba went fully cloud native. At 538,000 transactions per second, Alibaba Cloud withstood the world's largest traffic peak; the data middle platform covers all BUs of Alibaba Group; operations staff ("xiaoer") discover and analyze problems promptly, enabling real-time operational decisions; and new services such as short video and live streaming continue to emerge. Alibaba's data middle platform construction has been successful and is still developing at high speed.

The MaxCompute intelligent data warehouse makes Double 11-scale processing an everyday routine, and the integrated lakehouse is gradually becoming the next-generation big data platform architecture. The data middle platform built on DataWorks serves all businesses, supports hundreds of data applications within the group, and, through full-link data governance, supports the rapid growth of the group's business with low cost growth.

Four core challenges to building a data platform

The core indicator of a data middle platform's success is not system efficiency or platform efficiency, but "data efficiency". Alibaba measures data efficiency mainly along four dimensions: scale and elasticity, data cost, data correctness and maintainability, and data utilization.

Under this core indicator, methodology, organization, and platform capability are the three core elements of a successful data middle platform. So what methods lie behind a well-built data platform, and what difficulties arise during construction? There is a great deal of work behind it. Here I will cover only four business-facing aspects and leave aside the challenges of the storage and compute engines.

Challenge 1: Data asset management system

For data assets, the first question to answer is: what are an enterprise's data assets? Each BU of Alibaba has a panoramic view of its own data assets. We manage 99.9% of Alibaba's data assets through a single view, in which the storage and computing costs of every department are quantified and shown directly to managers. The second question: how should assets be viewed? For an enterprise, are assets just a cost figure? Through the data asset perspective, Alibaba lets managers see where their data comes from, whom it serves, and who their closest partners are. It also meets the needs of data flow auditing. The third question: how do assets scale? How can newly merged, acquired, or incubated businesses quickly replicate this asset system? Tools such as DataWorks provide data middle platform modeling tools that supply standardized blueprints for construction, divide data into business domains, and support intelligent modeling, so that a new business can quickly reuse an existing mature data architecture and the asset system can scale.
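The "where does my data come from, whom does it serve" view can be pictured as a lineage graph. The following is a minimal sketch, not Alibaba's or DataWorks' implementation; the table names, department names, and the `LINEAGE` structure are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream table, downstream table,
# department that owns the downstream table).
LINEAGE = [
    ("ods.orders", "dwd.order_detail", "trade"),
    ("dwd.order_detail", "ads.sales_report", "finance"),
    ("dwd.order_detail", "ads.recommend_features", "search"),
]


def build_lineage(edges):
    """Index edges both ways: where data comes from and whom it serves."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst, _dept in edges:
        upstream[dst].add(src)
        downstream[src].add(dst)
    return upstream, downstream


def consumers_by_department(edges, table):
    """Count the direct downstream tables each department builds on `table`."""
    counts = defaultdict(int)
    for src, _dst, dept in edges:
        if src == table:
            counts[dept] += 1
    return dict(counts)


upstream, downstream = build_lineage(LINEAGE)
print(upstream["dwd.order_detail"])      # sources of this table
print(downstream["dwd.order_detail"])    # tables that depend on it
print(consumers_by_department(LINEAGE, "dwd.order_detail"))
```

A real asset view would also attach storage and compute cost to each node; the same index makes it possible to roll such costs up by department.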

Challenge 2: Data quality system

For data quality, the first question to answer is: how is quality defined beforehand? The financial industry often speaks of account reconciliation, and Alibaba's data must be reconciled too. To reconcile tens of millions of tables, we introduced the concept of "quality rules". With more than 7 million quality rules and more than 10,000 new ones every day, how could they be assigned by hand? Alibaba built 37 rule templates, and with intelligent rule recommendation and matching the adoption rate reached 75%. The second question: how is quality enforced during processing? What if more than 7 million quality rules consume a large amount of computing resources, and how can that cost be reduced? Using intelligent techniques we built a data quality scheduling engine and an ETL engine that trigger quality monitoring in real time after data changes, apply priority strategies, and run checks during idle periods. The third question: how can quality checks after the fact be automated? Rules are static, but data is alive. How should periodic fluctuations and changes be handled? We integrate a lot of artificial intelligence into data quality work: machine learning learns what the generated data normally looks like, predicts dynamic thresholds, and matches periodic fluctuations algorithmically.
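To make the idea of a dynamic threshold that follows periodic fluctuations concrete, here is a minimal sketch. It is a simple statistical stand-in (mean plus or minus k standard deviations over points at the same position in the cycle), not the machine learning approach the talk refers to; the function names, the weekly `period`, and `k` are assumptions.

```python
from statistics import mean, stdev


def dynamic_threshold(history, period=7, k=3.0):
    """Return (low, high) bounds for the next value of a periodic metric.

    history: metric values, oldest first, one per day.
    Only values at the same phase of the cycle as the next point are used,
    so a weekend peak is not compared against weekday values.
    """
    n = len(history)
    same_phase = [history[i] for i in range(n) if (n - i) % period == 0]
    if len(same_phase) < 2:
        raise ValueError("need at least two full periods of history")
    mu = mean(same_phase)
    sigma = stdev(same_phase)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history, value, period=7, k=3.0):
    """True when `value` falls outside the learned bounds."""
    low, high = dynamic_threshold(history, period, k)
    return not (low <= value <= high)


# Three weeks of daily row counts: weekdays near 100, weekends near 200.
history = [100, 102, 98, 101, 99, 200, 205,
           101, 99, 100, 102, 98, 198, 202,
           99, 101, 102, 100, 101, 203, 199]
print(dynamic_threshold(history))   # bounds for the next Monday
print(is_anomalous(history, 200))   # a weekend-sized value on a Monday
```

A single static threshold wide enough to accept weekend values would also accept a weekend-sized value on a weekday; the per-phase bounds do not.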

Challenge 3: Data security system

For data security, the questions are how to lower the cost of use and improve ease of use, how to cover the full data life cycle, how to control permissions, how to desensitize (mask) data, and how to identify sensitive behavior for data tracing. Alibaba has accumulated more than 20 different security governance rules, which ultimately help the platform meet compliance requirements for personal data while keeping up with rapid business growth.
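As an illustration of data desensitization, the sketch below masks selected fields of a record before it is shown to a user. The field names, masking formats, and `MASKING_RULES` table are hypothetical and are not taken from Alibaba's security rules.

```python
import re

# An 11-digit phone number: keep the first 3 and last 4 digits.
PHONE_RE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


def mask_phone(text):
    """Replace the middle four digits of any 11-digit number with asterisks."""
    return PHONE_RE.sub(r"\1****\2", text)


def mask_name(name):
    """Keep the first character of a name and hide the rest."""
    return name[:1] + "*" * (len(name) - 1) if name else name


# Hypothetical rule table: which masking function applies to which field.
MASKING_RULES = {"phone": mask_phone, "name": mask_name}


def mask_record(record, rules=MASKING_RULES):
    """Return a copy of `record` with sensitive string fields masked."""
    return {
        key: rules[key](value) if key in rules and isinstance(value, str) else value
        for key, value in record.items()
    }


print(mask_record({"name": "Alice", "phone": "13812345678", "city": "Hangzhou"}))
```

Keeping the rules in one table means a newly classified sensitive field only needs one new entry, not changes in every query path.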

Challenge 4: Data governance system

When data governance enters deep water, the questions become how to keep data cost growth from outpacing business growth, and how to motivate people to manage data and build cost awareness. At Alibaba, data governance relies on the engine, the platform, and people working together. The engine pursues the best possible cost per unit of compute, constantly breaking the linear relationship between the rapid growth of data computation and the growth of cost. The platform uses health scores, with compute health and storage health as the core indicators, to run governance campaigns for each group's data. People are encouraged to carry out data governance and management using the platform's full-link tools, which together form a technical operating system for data governance. The cost and value of the platform layer are shown clearly through cost reports. Over 12 years of data platform construction, Alibaba has thus accumulated productized data capabilities in data assets, quality, security, and governance.
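The talk does not say how a health score is computed. As one hedged illustration, the sketch below scores storage health per group as the share of storage that has been read recently; the `TableUsage` fields, the 90-day cutoff, and the formula are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    name: str
    owner_group: str
    size_tb: float
    days_since_last_read: int


def storage_health(tables, cold_after_days=90):
    """Score from 0 to 100: percentage of storage read within the cutoff."""
    total = sum(t.size_tb for t in tables)
    if total == 0:
        return 100.0
    cold = sum(t.size_tb for t in tables if t.days_since_last_read > cold_after_days)
    return round(100.0 * (1 - cold / total), 1)


def health_by_group(tables, cold_after_days=90):
    """Group tables by owner and score each group separately."""
    groups = {}
    for t in tables:
        groups.setdefault(t.owner_group, []).append(t)
    return {g: storage_health(ts, cold_after_days) for g, ts in groups.items()}


tables = [
    TableUsage("dwd.search_log", "search", 30.0, 5),
    TableUsage("tmp.search_backfill", "search", 10.0, 200),
    TableUsage("dwd.order_detail", "trade", 20.0, 1),
]
print(health_by_group(tables))
```

The same usage data also points at cleanup candidates: the tables counted as cold are the first place to look when deciding which data to delete.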

As the base of the middle platform, where is the data platform headed next?

In the future, as the base of the middle platform, the data platform will move from a data middle platform toward an intelligent data middle platform. The integrated lakehouse supports flexible architecture upgrades, the intelligent data warehouse addresses the management of very large-scale data, intelligent query greatly lowers the difficulty of data analysis, and cloud-native AI that is engineered, scaled, standardized, and widely accessible becomes the ultimate outlet for big data. The integration of big data and AI is accelerating.

Trend 1: The lakehouse, two sides of one architecture

As the next-generation data platform architecture, the integrated lakehouse allows flexible architecture upgrades in complex existing environments. A data warehouse focuses on enterprise-grade data, with processing that is more refined, more economical, and more efficient. Companies can build their own warehouse-based data platform, and whether for engine optimization or data management there is a set of methodologies and tools to support it. But the entry barrier is high, the cost is high, and there are barriers to use. A data lake is a technology derived from the open source ecosystem, with a low entry barrier, low cost, and high flexibility, so it is easy for an enterprise to build its own. However, beyond unified storage, enterprises still need fine-grained management of all kinds and want their data to be manageable and governable at low cost and with little operations effort. The question is how to fuse fragmented lake and warehouse systems and data so as to have both the flexibility of the lake and the enterprise-grade capabilities of the warehouse. In its lakehouse architecture, Alibaba unifies storage and metadata across lake and warehouse, connects the data systems, and uses intelligent data warehouse technology to automatically tier storage and processing for different data and workloads.

Trend 2: Data warehousing enters the "autonomous driving" era

Very large data volumes bring management problems that the traditional "DBA mode" can hardly handle. Alibaba has more than ten million tables, and many core data development engineers are each responsible for tens of thousands of tables, so they cannot govern and model them in detail. Such a system cannot scale by adding people. In the future, more and more AI technologies will therefore be integrated into big data systems, bringing them into the era of "autonomous driving".

Trend 3: Natural language-based intelligent data query

Alibaba is trying to build a super-large-scale knowledge graph on top of its data, using the knowledge graph to translate data into a semantic layer, and then using NLP (natural language processing) and related technologies as a bridge to users. For example, when a user enters "Beijing Internet customers", the corresponding data can be generated automatically. Alibaba is trying to bring natural language intelligent query to massive data at scale, so that more people who are not data professionals can do data analysis on their own.
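A toy version of this flow can show the role of the semantic layer. The sketch below is only an illustration under strong assumptions: a hand-written dictionary stands in for the knowledge graph, whitespace tokenization stands in for NLP, and the table and column names are hypothetical.

```python
# Hypothetical semantic layer: business terms mapped to (column, value).
SEMANTIC_LAYER = {
    "beijing": ("city", "Beijing"),
    "shanghai": ("city", "Shanghai"),
    "internet": ("industry", "Internet"),
    "finance": ("industry", "Finance"),
}

# Hypothetical mapping from business entities to physical tables.
ENTITY_TABLES = {"customers": "dim_customer", "orders": "fact_order"}


def to_query(question):
    """Translate a keyword-style question into parameterized SQL."""
    table = None
    filters = {}
    for token in question.lower().split():
        if token in ENTITY_TABLES:
            table = ENTITY_TABLES[token]
        elif token in SEMANTIC_LAYER:
            column, value = SEMANTIC_LAYER[token]
            filters[column] = value
    if table is None:
        raise ValueError("no known entity in question")
    sql = f"SELECT * FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return sql, list(filters.values())


sql, params = to_query("Beijing Internet customers")
print(sql)
print(params)
```

Filter values are returned as parameters instead of being spliced into the SQL text, so user input never becomes part of the statement itself.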

Trend 4: Data is intelligence, and AI engineering becomes a basic capability

Data needs intelligence to accelerate it, and AI is the ultimate outlet for big data. As we know, putting AI to real use is very difficult. The full pipeline, from data generation and data extraction through model training and model tuning to model deployment and serving, is very long. If 50,000 people can use the data directly, perhaps no more than 5,000 of them can actually use AI. How to bring AI technology, together with the data, to the business side is what is meant by AI engineering.

To sum up, this article has only touched on the four typical stages of building Alibaba's data platform base, the four major technical challenges encountered, and the four major technology trends for data platforms. These topics are far from the whole of Alibaba's data platform. Over the past 12 years, Alibaba has accumulated a great deal of technology in building its data platform. These platform capabilities keep pushing the data middle platform toward intelligence, and they will continue to evolve, serving Alibaba and being made available to society as a whole.

This article is original Alibaba Cloud content and may not be reproduced without permission.