Guide language | hot cloud native is the first in recent years by Matt Stine put forward and continue to use up to now, but it is not standard, the strict definition, is recognized as one of the four elements are: enterprise, micro service, continuous delivery, as well as containers, there are also more towards a kind of application system architecture and methodology. So how to improve the big data infrastructure on the cloud to make it conform to cloud native standards, while bringing real data analysis cost reduction and performance guarantee to enterprise customers is an open topic. This article is compiled and shared by Chen Long, expert engineer of Tencent and head of Tencent Cloud EMR technology, in Techo TVP Developer Summit "Song of Ice and Fire of Data -- From online database technology to Mass Data Analysis Technology" speech "Evolution of Big Data Foundation Technology in cloud Native Environment". To share and discuss how to realize storage computing cloud native and the next generation cloud native big data infrastructure in the cloud.
First, cloud native standards and big data basic technology
The content shared today is divided into four parts: the first part is cloud native standards and big data basic technology; The second part is how the big data basic technology to achieve cloud native; The third part is Tencent cloud big data cloud native solution; The fourth part is the basic technology of the next generation cloud native big data.
Next, look at cloud native standards and big data infrastructure technologies, and what is cloud native big data infrastructure technology.
1. The core idea of cloud native
"Cloud native" this word, these years very hot, for this word, it can be said that people have different opinions, wisdom. If a product does not add "cloud native", it seems to be behind The Times, so what is cloud native about? What are the core ideas it promotes? Let's start by looking at the definition. Cloud native was first proposed by Matt Stein and is still used today. It generally consists of four elements: DevOps, microservices, continuous delivery, and containerization.
DevOps is really a combination of development and operations. Unlike development and production, which are often at war with each other, DevOps is really agile thinking, a culture of communication, and a form of organization that provides continuous delivery of cloud-native capabilities. Continuous delivery is timely development, continuous update, small steps, and anti-traditional waterfall model development, which requires the coexistence of development version and stable version, continuous delivery needs many tools or processes to support.
Looking at microservices, almost all cloud natives contain the definition of microservices. Corresponding to microservice is the single application, microservice has a theoretical basis -- Conway's Law, guide how to dismantle the service. But anything that can be called a law, the text looks very simple, can really grasp its essence or thorough understanding of it is very difficult, sometimes the micro service is not good, but it is a disaster. How well a microservice can be dismantled is really limited by the architect's understanding of the business scenario and macro abstraction.
Containerization refers to the use of open source technology stack, which generally means that K8S and Docker are containerized. Based on micro-service architecture, the flexibility and expansibility of the entire system are improved. With the help of agile methods and DevOps, continuous iteration is supported.
So if you put these four factors together, cloud native is really about two words: cost and efficiency, that is, the industrialization of the entire software development.
2. Cloud native definition of big data
Based on the discussion above, cloud native is about making full use of the cloud computing software delivery model to build and run applications in the process of turning source code into a product, so as to industrialize the entire software production, thereby reducing costs and increasing efficiency.
How to deduce the cloud native of big data according to this principle? "Big data" is also a hot word at present. I personally understand that big data is actually the analysis and processing technology of super-large data sets. There are two official definitions of big data: First, McKinsey believes that the scale of index data exceeds the data set acquired, managed, stored and analyzed by conventional database tools, but it also emphasizes that big data is not data exceeding a certain size. Second, IDC believes that big data has four characteristics: large data scale; Fast data flow; Type; Low value density.
It can be seen that no matter which definition, the basic characteristic of big data is that it is very large in scale, which is difficult to be processed by conventional management means, which means that more complex distributed systems and parallel computing are needed to solve it, and complexity also means higher cost and lower efficiency.
Therefore, combined with the analysis just mentioned, I think big data cloud native is to make full use of cloud infrastructure to solve the acquisition, management, storage and analysis of super-large data sets, and achieve cost reduction and efficiency improvement in this process, so as to realize data-driven business.
3. How to achieve big data cloud native
After clarifying the goal of big data base cloud native, let's see what means or measures should be adopted to realize big data cloud native.
Data drives the business development of an enterprise. As one of the most important assets of an enterprise, data is generated by the enterprise's systems, which are diverse, such as CRM system, IOT device, OA system, HR system, etc. The data generated by these systems can be divided into structured and unstructured data at macro level. After analysis and transformation, these data become various indicators of enterprise operation, according to which enterprise decision makers adjust the operation direction of the whole enterprise. Combined with this analysis, what are some rules and measures we need to implement cloud native handling of these issues in this process? I think there are four points:
The first is industrial delivery, what is industrial delivery? At the present stage, it is difficult to complete a single system to deal with all data problems, so an ecology-level system is needed to deal with data problems. Delivery industrialization means that when I need a certain system, I can create these systems minute by minute, and provide control and operation and maintenance capabilities.
The second is cost quantification, cost quantification is divided into two dimensions, one is storage cost quantification, the other is computing cost quantification. The ability to quantify the storage or computing resources used by a system that processes data.
The third is load adaptation, which means that the scale of resources used by the systems that analyze and process the data should change as the data size changes.
The fourth is data-oriented. For enterprises, data is one of the most important assets of enterprises, rather than the systems or technologies that process data. Suppose I am a logistics company, I buy a plane is to make the whole logistics efficiency, rather than the aircraft manufacturing technology, so the data should be take full advantage of the cloud platform oriented ability, to solve the problem of data analysis, make enterprises more focused on the relationship between the data, the value of data mining, and better realize data driven.
Based on these criteria and measures, we combine some landing technologies to see how to achieve big data cloud native.
At the present stage, all the basic analysis of big data is built around Hadoop ecological technology. We look at this problem in terms of data flow. The log data generated by the system, the data generated by ERP, or the data generated by IOT and other data generally enter the message pipeline, which can be divided into flow scenario or batch scenario according to the scenario. There are flow processing engine and batch processing engine respectively. After the processing is completed, the data service layer is entered. The data consumption terminal then consumes the data through an OLAP engine or other data service storage component. Big data cloud should be realized to deal with these problems, that is, industrial delivery should be realized for each sub-module or system in the whole processing system, and the storage or computing resources used by each sub-module or system in the whole ecological processing link can be cost quantified. The amount of resources it takes should vary with the load of the entire data processing. This is implemented through a series of means, and we will see how each of these criteria is implemented in detail.
2. Cloud native implementation of big data basic system
At the present stage, in fact, the Hadoop ecological technology stack has become the de facto standard for big data basic processing. In order to realize the cloud native processing of big data basic problems, that is, the cloud infrastructure and Hadoop ecological technology stack should be combined to realize industrial delivery, cost quantification, load adaptive and data-oriented. Industrial delivery based on the Hadoop ecosystem is not just about cluster creation, but also governance, operation and maintenance, data apis, etc. You might not create clusters very often on a regular basis, but when the data is in the cloud, my data is in the cloud, pulling up a cluster every minute to compute as needed, releasing the cluster when it's done, and then industrial delivery becomes very important, even if you're a resident cluster, The management and operation of the daily cluster is also a part of the industrial delivery, so cost quantification refers to the cloud visibility of the storage resources or computing resources used by the entire Hadoop cluster. The third is load adaptation, which means that the IaaS layer used by the components of the Hadoop ecosystem should scale as the amount of data processed changes, with minimal human intervention.
Through the above three measures, finally let the enterprise see the flow of data, rather than the system itself, and finally realize data-oriented.
Let's look at how industrialization of delivery can be achieved. Delivery of industrialization is to take full advantage of the cloud infrastructure is a key to build the cloud data analysis system, and at the same time provides query control, task, or manage some of the API ability, through these API ability significantly reduce the cost of using the large data analysis technology and operational costs, at present data processing need a level of ecological solutions, One-click construction can be selected according to the data scale or business scenarios. Hadoop service, streaming computing service, real-time data warehouse service or data lake service can be selected on the cloud. Tasks can be submitted through API based on different solutions. For example, to submit a Spark task through the API, after Spark completes the calculation, I release the cluster directly, either directly through the API, or through our data lake service, and pay for the amount of data scanned. This industrial delivery capability can significantly reduce the resource cost of the entire big data analysis.
To quantify the cost of big data analysis, it must be improved based on the existing Hadoop architecture. For a conventional Hadoop cluster, its topology is divided into distributed coordination nodes, mainly deploying processes such as ZooKeeper and journalNode. Namenode and ResourceManager are deployed on the active node, and DataNode and Nodemanager are deployed on the computing node. However, computing resources and storage resources are not equal. Sometimes computing resources are the bottleneck, and sometimes storage resources are the bottleneck, especially when cloud storage is used. At this time, only a few storage nodes need to be reserved for storage, computing interim results and logs, etc. Based on the cost quantification criterion, we improved the topology of a Hadoop cluster, including Master, Router, core and Task. The Master node is similar to the previous Master node in the traditional mode. Deploy Namenode and ResourceManager. Router nodes can be used to deploy stateless services, such as HiveServer2 types of processes, while leveraging the infrastructure on the cloud to simplify the high availability of the entire big data analytics landscape. For example, a Presto coordinator can be deployed on a router to implement disaster recovery using cloud load balancing. When a Presto Coordinator fails, the coordinator automatically switches over. Core nodes are similar to computing nodes in traditional mode, where datanodes and NodeManagers are deployed. In the cloud, only a few Core nodes can be selected. Task node is an elastic node in which only computing processes are deployed. Based on this architecture, four kinds of basic analysis services of big data can be realized.
The first is a traditional mode, the traditional mode can fully retain the IDC architecture under the whole cluster, the cluster, there is no elastic node through the cloud provides the bootstrap of EMR and the cluster program, can greatly reduce the operational problems of using the Hadoop cluster, cloud EMR also made a lot of the optimization of the kernel level for cloud storage, When the storage capacity of the entire cluster is insufficient, data can be quickly transferred to the cloud storage system and the cluster can be rapidly expanded based on the computing resources in the cloud storage system.
The second mode is the computing and storage separation mode. In this mode, when the whole data is stored in the cloud, a cluster of thousands of nodes can be pulled up every minute for calculation and released after calculation, or the cluster can be maintained in a small scale and expanded to a larger scale every minute when necessary.
The third is the hybrid cloud solution. Before THE IDC cluster migrates to the cloud, the IDC environment and the cloud environment can be connected through VPN or dedicated line. After the connection, the EMR cluster can be built on the cloud, and the FILE system and metadata of THE IDC cluster can be identified through the EMR cluster to rapidly expand the computing power of the IDC self-built cluster. Release the EMR cluster on the cloud after the calculation is complete.
The fourth way is mixed computation, now the container cluster TKE or STKE in this cluster is the main business systems on deployment, this kind of system has a feature of it during the day when the load is very high, the night when the load is low, I have this ability, the EMR cluster load is low when the container can quickly add container cluster resources to EMR cluster.
These four computing methods can flexibly use storage resources and computing resources on the cloud, thus greatly reducing the hardware cost of big data analysis. Tencent Cloud also provides this capability.
This is our data lake solution, where we implement paid computing optimization based on the volume of scanned data, analysis systems such as BI or some visual data management tools and applications, which can connect to our service via JDBC or ODBC. Layer in the service we provide unified authentication and authorization, at the time of the query can be set at the same time used by each SQL resources situation, in the solution of DLF cloud all the metadata management, at the same time it also responsible for data warehousing service, submit SQL can be carried by the DLC to, can according to the data of DLC resources bill to pay.
Let's look at load adaptation, how to realize load adaptation based on Hadoop cluster. The diagram above on the right is a real load diagram of a cluster. Under normal circumstances, the size of the cluster should be the time line x the rectangular area of the peak value. The computing resources in the cloud are used to realize load self-adaptation, so that the size of the whole cluster changes with the change of load. The EMR supports load scaling and time scaling. Based on the load scaling mode, the EMR can automatically expand or shrink the CAPACITY based on the VOCRE and VMEM congestion on the resource scheduling component YARN. Based on the time segment mode, users can expand capacity in peak hours and reduce capacity in off-peak hours.
When implementing load adaptation, we also need to ensure the SLA of the entire business, that is, we need to be insensitive to the service during capacity reduction and control the application failure rate. Suppose there is a streaming scenario, once the AM node is assigned to the elastic node, scaling this elastic node will definitely cause the streaming task to fail. Similarly, if a Container that has failed for more than two times in YARN is allocated to an elastic node next time, the application fails again if the elastic node is offline. Therefore, a lot of optimization is made at the kernel level during load adaptation to avoid this situation. Now that we know about cost quantification and load adaptation, let's look at how to implement data orientation.
I illustrate the data-oriented problem with a conversion of data 1 to data 2. We can be in the field of data processing problem for how to efficiently abstract the efficient conversion of a data set into another data set, as the data size is more and more big, the use of tools or techniques also will be more and more complicated, we may need to database when GB or some other stand-alone systems can be completed, When it comes to TB, MPP or parallel computing may be needed; when it comes to PB, distributed storage and computing may be needed. As the system becomes more and more complex, data problems gradually evolve into technical problems, resource problems and operation and maintenance problems. Taking the example just now, if I am engaged in logistics, when my logistics is within a region, I may need a tricycle to solve the problem; when my business develops to trans-city, I may need a car; when it develops to trans-provincial or trans-international, I may need a plane or train. However, the reality in the field of data processing is that many enterprises have to solve these data problems, to deal with the data of the train and plane, thus deviating from the goal of data as one of the most important assets of the enterprise. The direction of the evolution of social productivity must be a more detailed division of labor, so my understanding of data-oriented is to make full use of cloud infrastructure to solve data analysis problems, so let's see how data-oriented is implemented.
Take the actual data processing flow as an example. For example, the application system outputs data to databases or logs, and the data in these databases or logs will enter the cloud storage or message pipeline after data synchronization tools. Incremental data, such as data in the CDC, goes into a message pipeline or into a streaming batch engine, and when it's done, it goes into a real-time data warehouse or cloud storage. In this picture each thread is processing the data automated production lines, each node was quite deal with the data of the engine, for enterprise and developers, the cloud to provide seamless docking between each engine, for the enterprise only need to pay attention to the whole data flow itself can, and don't need to pay attention to data processing technology itself. In addition to solving the problem of data-oriented, it is also necessary to ensure the high performance of the whole big data analysis. Let's see how the basic performance of Tencent cloud big data analysis is guaranteed.
In order to ensure the performance of big data processing on the cloud, Tencent Cloud big data provides multiple optimization from infrastructure hardware layer to component kernel and architecture. If you choose the traditional mode to build big data applications, cloud host provides a variety of hardware to choose from. For example, if a NoSQL HBase application requires RT in milliseconds, you can use IT cluster to build a cluster. Here I focus on performance assurance in computing-storage separation scenarios. In traditional mode, computing processes and storage processes are deployed together, especially in offline computing scenarios. Why do you want to do this? Reason is the volume of a calculating program is far less than the volume of data, so now the mainstream such as YARN scheduler will be fully considered when scheduling data of affinity, it when doing the computing tasks section and scheduling would consider the data distribution of computing tasks, and the computing task scheduling to node data, Compute programs are built and executed on these nodes for good performance; However, in the scenario of computing and storage separation, the situation is reversed. When the data is stored in the cloud, the performance of the entire data will be degraded. Therefore, CacheService is introduced and the kernel is modified to be transparent to the computing engine, so that the upper-layer applications do not need to sense the existence of CacheService. You can also intelligently load data from cloud storage to CacheService based on metadata information, such as table or partition access information, to improve performance. Next, let's look at Tencent cloud big data cloud native solution.
Iii. Tencent Cloud big data base cloud native solution
Tencent cloud big data provides perfect product capability support in each link of big data processing foundation. In terms of storage data import, data can be directly imported into the object store through the tools provided by the object store. Incremental data can be processed by streaming the CDC capability, written into Kafka directly, or imported into the cluster through SQOOP or Spark in the EMR cluster. After data enters the cloud, there are two streaming batch scenarios. In the batch scenario, we provide EMR or DLC for processing, while in the streaming scenario, we provide Oceanus, a fully managed service. At the data services layer, we provide search solutions ES, real-time data warehouse GP and ClickHouse. EMR also provides Hadoop ecosystem based service layer components to meet some of the requirements of the data services layer. These products also provide comprehensive monitoring, API, decision rights and SDK. It can greatly simplify the cost of big data analysis, so as to complete the common applications such as spot search, adhoc, olap or Nosql in the field of big data.
In the field of big data processing, computing engines mainly have two types of batch flow, Spark and Flink. At present, there is a trend of convergence, and each has its own ecology. It is not certain who will win. In order to solve the complex problem of data service layer, delta-related data lake technology and some closed source projects are also emerging to try to unify the data service layer. Here are my thoughts on the data services layer.
No matter how the data services layer changes, the business must achieve the lowest cost and high performance data analysis. For example, updating a small part of a partitioned table requires a lot of invalid calculations, which wastes a lot of IO and CPU resources and leads to a high cost. There is no single component in the Hadoop ecosystem that can solve all data problems, and as the scenario becomes more complex, more components are introduced, leading to massive data redundancy and higher costs. In a distributed system, data structures, algorithms, and data partitions are essential for high performance, but now each computing engine has hundreds of parameters, not including the PARAMETERS of the JVM or the operating system kernel itself. Exposing these parameters to developers or users can be confusing to users. On another level, for the performance of a single node, derivative, is the machine hardware ability at present situation that the code you write will be really put the machine reasonable hardware to the limit, want to make a big question mark, so I think the next generation of large data base processing engine could be like this:
Limited by the hardware architecture, the frequency of single-cpu CPU has reached the upper limit. To extend its performance for single-cpu, it must be the multi-CPU mode of NUMA architecture. In this mode, code execution across numanode will show performance jitter exponentially. In distributed scenarios, shuffling of data across nodes is inevitable. Now, the Linux kernel encapsulates and unencapsulates a large number of data packets, which also wastes a large amount of precious CPU resources. Meanwhile, in programming model, it is difficult for a programming framework to expand the resource limit of a single function call. Even cgroups can only be used at the process or thread level. It is also difficult to set quote for each IO. Especially at the storage level, you may need to provide different storage costs depending on the importance and cost of storing data. At the data organization level, there is a need to provide a unified DataFormat, but today there are a variety of data formats in the Hadoop ecosystem. So I think the next generation of computing engines is likely to look like this: first at the access layer, you can provide uniform authentication based on SQL, and you can also set up resource groups based on SQL for this SQL execution. In the computing scheduling layer, the resource cost of each operator and IO basic unit of each operator are precisely controlled according to the resource group set by SQL. When the operator is actually executed at the bottom, the programming framework ensures that no cross-node situation will occur in each function execution. At the same time, shuffle can be further improved by USING RDMA or DPDK technology to simplify the entire data service layer architecture and achieve good performance.
The lecturer introduction
Tencent Cloud EMR technology leader, expert engineer, joined Tencent in 2011, led the development of Tencent Cloud Redis successively, responsible for the technical work of Tencent Cloud database HBase and EMR, Apache HBase Contributor, I have contributed code to Apache Hive and other open source projects, and currently focus on the construction of Tencent cloud EMR technology.