Introduction: The 2021 Singles' Day (Double 11) has just ended. AnalyticDB, Alibaba Cloud's cloud native data warehouse, has steadily supported the Double 11 promotion for many years and was rock solid again this year. Beyond that stable foundation, what highlights does AnalyticDB have to offer? Read on for the details.

Author | AnalyticDB team (from the Alibaba Cloud technology official account)

1 Preface

The 2021 Singles' Day has just ended. AnalyticDB, a cloud native data warehouse, has steadily supported the Double 11 promotion for many years, and this year it was as stable as ever. Beyond that stable, smooth foundation, what highlights does AnalyticDB have to show? This article reveals them.

2 Riding the Wind and Waves | AnalyticDB Takes on Double 11 Again

This year's Double 11, AnalyticDB operated across three battlefields — Alibaba's digital economy, the public cloud, and the hybrid cloud — and remained rock solid in all of them, with remarkable results. Within Alibaba's digital economy, AnalyticDB supported businesses in nearly every BU, including almost 200 Double 11 core businesses such as Taobao order search, Cainiao, Tao, Hema, Feizhu, Maochao, and Alibaba Cloud. On the public cloud, AnalyticDB supported Shuyun, Jushuitan, and many other core e-commerce businesses; on the proprietary cloud, it mainly supported the various businesses of China Post Group. The workloads AnalyticDB carried this year were unusually diverse: real-time writes peaking at millions of TPS to a single database; high-concurrency online order retrieval and precise keyword recommendation on the core transaction link; complex real-time analysis across business scenarios; and large offline Batch/ETL jobs over crowd and label data, along with bulk data import and export tasks. Such varied workloads — sometimes executing simultaneously as mixed online/offline loads — posed great challenges to AnalyticDB.

Facing these business scenarios and technical challenges, AnalyticDB grasped the nettle. Over the past year it has fully embraced cloud native to build extreme elasticity, advanced its storage-compute separation architecture, sharply reduced storage costs through hot/warm/cold data tiering, greatly improved compute performance by upgrading to a vectorized engine and a new optimizer framework, and pushed forward its offline/online integrated architecture — further improving its ability to run online real-time queries and offline batch computation stably on a single technical stack. With this accumulated technology, AnalyticDB handled this year's Double 11 with even greater stability and composure, and its business metrics kept hitting new highs: during the Double 11 period it cumulatively ingested 21 trillion rows of data in real time, batch-imported 113 trillion rows, served 35 billion online queries, and ran 25 million offline tasks, with a cumulative 590 PB of data involved in computation.

Whether measured by the complexity of the business scenarios it supports or by its data and compute scale, AnalyticDB — as a new-generation cloud native data warehouse with an offline/online integrated architecture — has grown increasingly mature, providing general capabilities for core report computation, activity and system monitoring, real-time analysis and decision-making, and intelligent marketing across a wide range of businesses. At the same time, with technological innovation at its core, AnalyticDB this year focused on core business scenarios such as Mobile Taobao order search and recommendation and real-time order synchronization, helping the business solve many long-standing hard problems and achieve new breakthroughs in user experience, green and low-carbon operation, business innovation, and security and stability.

1 Ultimate experience: how AnalyticDB optimizes Mobile Taobao order search

Mobile Taobao order search lets users query historical orders by keyword or order number and, by linking products and stores to historical orders, serves as an important traffic entry for repeat purchases. The volume of order data is enormous — users' historical orders number in the hundreds of billions — and users often remember only fuzzy information about a product or store. As a result, imprecise keywords could fail to find the order, and short keywords could make the search take very long; both significantly hurt the Mobile Taobao user experience and drew user complaints for a long time.

Based on the online capabilities of its new real-time engine, row-store tables, unstructured indexes, and wide-table retrieval, AnalyticDB realized for the first time the unified storage and analysis of online and historical trading orders, enabling the trading business platform to run multi-dimensional searches over ten years of all users' orders across the entire network and to make precise recommendations from users' search keywords. During the promotion peak, user complaints that "my order cannot be found" dropped significantly year-on-year, and query performance also improved greatly, markedly improving the Mobile Taobao order search experience.

2 Green and low-carbon: how AnalyticDB helps reduce costs and increase efficiency

Taking AnalyticDB as the core of its cloud native data warehouse solution, the Cloud Winner 2.0 omni-domain CRM platform comprehensively upgraded core functions such as customer insight and segmentation and marketing automation during Double 11. Thanks to cloud native capabilities, resource pooling, and hot/cold data separation, the business R&D cycle was 39% shorter than usual, overall cost fell by 50%, and operational efficiency tripled — solving the problem of high procurement and implementation costs and helping merchants respond quickly to business changes, with significant cost-reduction and efficiency gains.

A core data analysis service within Alibaba Group needs to run a large number of offline ETL jobs every day while also sustaining massive real-time writes, in order to serve near-real-time crowd selection and online crowd analysis. For this mixed online/offline workload of writes and queries, AnalyticDB has always been the service of choice. During this year's Double 11, AnalyticDB handled tens of PB of data reads and writes for the service, millions of mixed online/offline queries, and ETL jobs over PB-scale data volumes. Thanks to AnalyticDB 3.0's cloud native elasticity, combined with full-link optimization across storage, compute, and the optimizer, cost fell by nearly 50% compared with last year's Double 11.

3 How the database-warehouse integrated architecture supports high-throughput real-time business

This year's Singles' Day was the first to use the AnalyticDB+DMS database-warehouse integrated architecture, replacing DRC+ESDB, to serve core scenarios such as fully real-time historical order search. It allowed real-time data links and data applications with millisecond latency to be built quickly, fully unlocking the value of real-time data and helping the business make new changes and breakthroughs in more real-time data application scenarios and a more extreme user experience.

For the trading order search business, a multi-channel data synchronization link was built to match the characteristics of the trading business, together with an AnalyticDB active-standby disaster recovery deployment. During the Double 11 promotion, the link comfortably sustained peak traffic of millions of RPS with end-to-end millisecond latency.

3 Winning Secrets | AnalyticDB's Latest Core Technologies

1 Offline/online integrated storage as a service

AnalyticDB's storage layer was turned into a service this year: one copy of data in one storage format now supports real-time updates, interactive queries, offline ETL, and point lookups on detail data in an integrated way. Built on a storage service layer, hybrid row-column storage, tiered storage, and adaptive indexing, it supports both online low-latency, strongly consistent access and offline high-throughput data reads and writes.

Storage as a service: a unified offline/online access interface

At the interface layer, AnalyticDB storage provides a unified data access interface. Data interchange uses the Apache Arrow [1] format, with efficient transfer based on zero-copy techniques; the compute layer can run CPU-friendly vectorized computation directly on the Arrow in-memory columnar interface. Metadata is compatible with the Hive Metastore Thrift protocol, so open source compute engines can connect to the AnalyticDB storage system seamlessly.

At the service layer, AnalyticDB storage adopts an LSM-like architecture [2] that splits storage into real-time data and historical data. Real-time data lives on online storage nodes as "hot" data, supporting low-latency access and strongly consistent CRUD; historical data is stored as "cold" data on low-cost distributed file systems such as OSS or HDFS, supporting high-throughput access. The storage service layer also supports pushing down predicates, projections, aggregations, Top-N, and other computation, reducing the amount of data scanned and read and further accelerating queries.
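The hot/cold split and predicate pushdown described above can be illustrated with a toy model (all class and method names here are invented for illustration; the real service layer is far more involved): recent writes land in a mutable in-memory hot tier, a flush moves them to an append-only cold tier, and scans merge both tiers while applying a pushed-down predicate so that filtering happens inside the storage layer rather than in the compute layer.

```python
class TieredStore:
    """Toy LSM-style store: a mutable hot tier plus an immutable cold tier."""

    def __init__(self, hot_limit=4):
        self.hot = {}          # key -> row; strongly consistent, low latency
        self.cold = []         # flushed (key, row) pairs on "cheap" storage
        self.hot_limit = hot_limit

    def put(self, key, row):
        self.hot[key] = row
        if len(self.hot) > self.hot_limit:
            self.flush()

    def flush(self):
        # Move hot data to the cold tier (high-throughput, append-only).
        self.cold.extend(sorted(self.hot.items()))
        self.hot.clear()

    def scan(self, predicate=lambda row: True):
        # Predicate pushdown: filter inside the storage layer.
        # Hot data shadows cold data for the same key.
        seen = set()
        for key, row in self.hot.items():
            seen.add(key)
            if predicate(row):
                yield key, row
        for key, row in self.cold:
            if key not in seen and predicate(row):
                yield key, row


store = TieredStore(hot_limit=2)
for i in range(5):
    store.put(i, {"amount": i * 10})
hits = dict(store.scan(lambda r: r["amount"] >= 20))
```

A scan here touches both tiers transparently, which is the essence of presenting one logical table over hot and cold physical storage.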

Hybrid row-column storage: a unified offline/online storage format

Since storage is offered as one integrated service, it inevitably has to serve both online low-latency queries and offline high-throughput computation. The AnalyticDB storage format therefore adopts a PAX layout [3], accommodating both the offline and the online scenario.

In online scenarios, the format works with indexes to provide efficient lookup capabilities. Each Chunk in the AnalyticDB storage format has a fixed length, so it can be integrated deeply with indexes and supports random lookups by row number, guaranteeing efficient random-read performance and meeting the needs of online multi-dimensional filtering. The format also carries rich statistics that can be combined with indexes for covering-style optimizations to further accelerate queries.

In offline scenarios, data in the AnalyticDB storage format can be read in parallel at Chunk granularity; accessing multiple Chunks in parallel improves offline read throughput. A table supports multiple partitions, and multiple Segments within a partition; splitting writes by Segment raises write parallelism and thus offline write throughput. In addition, each Chunk carries rough-set index information such as Min/Max, which can be used to reduce the amount of data scanned offline and the I/O consumed.
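The Min/Max rough-set pruning just mentioned can be sketched as follows (chunk size and function names are invented for illustration): each fixed-size chunk records the min and max of its values, and a range scan skips any chunk whose range cannot overlap the predicate, avoiding I/O for that chunk entirely.

```python
# Toy Min/Max pruning over fixed-size chunks.
CHUNK_ROWS = 4

def build_chunks(values):
    chunks = []
    for i in range(0, len(values), CHUNK_ROWS):
        rows = values[i:i + CHUNK_ROWS]
        chunks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return chunks

def scan_between(chunks, lo, hi):
    """Skip any chunk whose [min, max] range cannot overlap [lo, hi]."""
    hit, skipped = [], 0
    for c in chunks:
        if c["max"] < lo or c["min"] > hi:
            skipped += 1            # pruned: no I/O for this chunk
            continue
        hit.extend(v for v in c["rows"] if lo <= v <= hi)
    return hit, skipped

chunks = build_chunks(list(range(16)))      # 4 chunks of 4 rows each
rows, skipped = scan_between(chunks, 9, 11)
```

On sorted or time-ordered data this kind of pruning is highly effective, since most chunks fall entirely outside the queried range.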

Adaptive indexing

Another feature of AnalyticDB is its self-developed adaptive indexing framework, which supports five index types: inverted string indexes, Bitmap indexes, numeric KD-Tree indexes, JSON indexes, and vector indexes. Indexes of different types can be combined arbitrarily across multiple conditions (intersection, union, and difference). Compared with traditional databases, AnalyticDB's advantage is that it does NOT require manually built composite indexes (composite indexes need delicate design and tend to inflate index storage), and it supports pushing down more condition types such as OR/NOT. To lower the barrier for users, AnalyticDB can automatically enable indexes on all columns with one switch at table creation, and at query time it adaptively decides which indexes to push down via an index CBO with dynamic filtering; the index chains chosen for pushdown are merged progressively in a streaming, multi-way fashion by the predicate computation layer.
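The way independent single-column indexes can replace a hand-built composite index — combining per-condition results with intersection, union, and difference — can be sketched with posting lists of row ids (the index contents and names below are hypothetical):

```python
# Posting lists: each "index" maps a condition value to matching row ids.
idx_city = {"hangzhou": {1, 2, 5, 8}, "beijing": {3, 4}}
idx_status = {"paid": {2, 3, 5, 9}, "refunded": {8}}

def query(city, status, exclude_refunded=True):
    # WHERE city = ? AND status = ? (AND NOT status = 'refunded')
    rows = idx_city.get(city, set()) & idx_status.get(status, set())
    if exclude_refunded:
        rows -= idx_status.get("refunded", set())   # NOT pushed down too
    return sorted(rows)

result = query("hangzhou", "paid")

# OR across conditions is just a union of posting lists:
either = sorted(idx_city["hangzhou"] | idx_city["beijing"])
```

Because each condition resolves to a row-id set independently, any AND/OR/NOT combination reduces to set algebra — no pre-planned composite index is needed.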

Hot-cold tiering: lower user costs, pay-as-you-go billing

AnalyticDB provides hot-cold tiered storage [4] for a more cost-effective experience. Users can choose hot or cold storage media independently at table granularity or at the table's secondary-partition granularity: all of a table's data can be placed on hot storage, or all on cold storage, or some partitions on hot storage and the rest on cold — entirely as the business requires, with the hot/cold policy freely switchable. Both hot and cold storage are billed by actual space used. Based on this tiering, users can manage the lifecycle of their business data around their own access patterns: partitions that are accessed frequently are pinned to hot storage to accelerate queries, while rarely accessed partitions go to cold storage to reduce storage costs, and the partition lifecycle management mechanism automatically purges expired data.
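The partition-level lifecycle just described can be sketched as a tiny classification policy (the thresholds and names here are invented for illustration, not AnalyticDB defaults): recent partitions stay hot, older ones move to cold storage, and partitions past their TTL are dropped.

```python
from datetime import date, timedelta

def classify(partition_day, today, hot_days=7, ttl_days=30):
    """Toy hot/cold lifecycle decision for one date partition."""
    age = (today - partition_day).days
    if age >= ttl_days:
        return "drop"                      # expired: purge automatically
    return "hot" if age < hot_days else "cold"

today = date(2021, 11, 11)
partitions = [today - timedelta(days=d) for d in (0, 3, 10, 45)]
placement = {p.isoformat(): classify(p, today) for p in partitions}
```

A background job applying such a policy per partition is all it takes to keep frequently queried data fast and rarely queried data cheap.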

2 Offline/online mixed workloads

Online workloads (such as online queries) demand fast response times, requiring fast data reads and a fast compute engine. Offline workloads (such as ETL jobs) are insensitive to response time but demand high compute throughput; besides heavy computation, they may also read and write large volumes of data, and jobs can run for a long time. Running two such different workloads on one system and one platform has always been a huge challenge. The mainstream approach today is to run offline jobs on an offline big data platform (such as Hadoop/Spark/Hive) and online queries on one or more separate OLAP systems (such as ClickHouse/Doris). Under that architecture, however, the systems' internal storage and formats differ, and the expression of compute logic (e.g., the SQL dialect) differs, so data must be imported and exported between systems and compute logic must be adapted to each system separately — resulting in long data links, high computation costs, and poor data timeliness.

To solve these problems, AnalyticDB comprehensively upgraded its mixed offline/online workload capability this year. Besides the storage layer's unified storage format and unified access interface, which address the read/write side of mixed workloads, the compute layer also completed a comprehensive upgrade: the same SQL query can be executed in either Interactive or Batch mode; resource groups, read/write workload separation, multi-queue isolation, and query priorities isolate and govern the resources of different workload types; time-sharing elasticity meets scaling needs at peaks and staggered off-peak periods; and the compute engine was upgraded to a vectorized engine, greatly improving computing performance.

Two execution modes for the same SQL

AnalyticDB supports two execution modes, Interactive and Batch, to meet the needs of online queries and offline computation respectively. In Interactive mode, queries execute in Massively Parallel Processing (MPP) fashion: all tasks are scheduled and started simultaneously, data is read and results are produced in a streaming manner, and all computation stays in memory, minimizing query latency — suitable for online scenarios. In Batch mode, jobs execute in Bulk Synchronous Parallel (BSP) fashion: the job is split into multiple stages according to its semantics, stages are scheduled and executed according to their dependencies, data transferred between stages is spilled to disk, and intermediate state also spills when memory runs short during computation. The job therefore runs longer overall but needs relatively little CPU and memory, suiting offline scenarios with large data volumes and relatively limited compute resources.
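The contrast between the two modes can be sketched over one tiny logical plan, scan → filter → aggregate (the function names are hypothetical): Interactive runs it as one fully pipelined in-memory pass, while Batch materializes each stage's output (standing in for spilling to disk) before the next stage is scheduled.

```python
DATA = list(range(10))

def interactive_sum(rows):
    # MPP-style: one fully pipelined, in-memory streaming pass.
    return sum(r for r in rows if r % 2 == 0)

def batch_sum(rows):
    # BSP-style: each stage materializes its output ("spills to disk")
    # before the next stage is scheduled.
    stage1 = list(rows)                          # Stage 1: scan
    stage2 = [r for r in stage1 if r % 2 == 0]   # Stage 2: filter
    return sum(stage2)                           # Stage 3: aggregate

assert interactive_sum(DATA) == batch_sum(DATA)  # same plan, same answer
result = interactive_sum(DATA)
```

Both paths compute the same logical plan and return the same answer — which is exactly what lets one SQL statement carry both modes.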

In AnalyticDB, whether in Interactive or Batch mode, the SQL expressing the computation is identical and the generated logical execution plan is exactly the same; only the physical execution plan differs by mode, and most operator implementations in the compute engine are shared. This also laid the architectural foundation for the unified upgrade to the vectorized compute engine.

New vectorized query engine

Vectorization is one of the hot techniques in contemporary query engine performance optimization. Its ideas trace back to array programming research in scientific computing, and its exploration in the database field originated with MonetDB/X100 [6]. Every mainstream system in the industry now has its own vectorization practice, but a standard formal definition is still lacking. Broadly speaking, vectorization names a family of query engine optimizations targeting the CPU microarchitecture, spanning the batch-based iterator model [7], CodeGen, cache-aware algorithms [8], SIMD instruction set usage, and compute/storage co-design. Identifying which of these techniques are orthogonal and which depend on each other is the key to extracting significant performance gains from vectorization.

AnalyticDB has fully vectorized its core operators, including Scan, Exchange, Group-by/Agg, and Join, combining batch processing, adaptive strategies, CodeGen, and cache-aware algorithms in the solution. In cooperation with the JVM team, and based on AJDK intrinsics [9], it innovatively applied SSE2, AVX-512, and other instruction sets on algorithm-critical paths, significantly improving CPU ILP and MLP during query execution and raising the throughput of the Agg/Join operators by 2x to 15x.
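The batch-based iterator model underlying this can be sketched in a few lines (batch size and names are arbitrary): operators exchange arrays of values instead of single rows, so per-row interpretation overhead is amortized across a batch and the inner loops become tight, cache- and SIMD-friendly passes over contiguous data.

```python
BATCH = 1024

def scan_batches(values):
    # Source operator: emit data in fixed-size batches, not row by row.
    for i in range(0, len(values), BATCH):
        yield values[i:i + BATCH]

def filter_batches(batches, pred):
    for b in batches:
        yield [v for v in b if pred(v)]     # one tight loop per batch

def agg_sum(batches):
    total = 0
    for b in batches:
        total += sum(b)                     # vectorizable inner loop
    return total

data = list(range(10_000))
result = agg_sum(filter_batches(scan_batches(data), lambda v: v % 2 == 0))
```

In a real engine the batches would be columnar arrays and the inner loops compiled or SIMD code; the structural point — amortizing the iterator interface over many rows — is the same.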

Mixed workload isolation and stability assurance

When multiple workloads run under the same architecture, even at the same time, resource contention and mutual interference inevitably exist. AnalyticDB has a complete set of mechanisms and policies to keep the cluster stable and, as far as possible, meet the SLA requirements of different business workloads.

First, in mixed-workload or multi-tenant scenarios within an instance, resource groups provide effective business workload isolation: compute resources in different resource groups are physically isolated from each other, preventing different workload types or businesses from affecting one another. A business can be directed to run on its resource group by binding its account or by specifying the resource group when submitting queries.

Second, AnalyticDB automatically distinguishes write workloads (some INSERT and DELETE statements), query workloads (such as SELECT queries), and read-write workloads (some INSERT/UPDATE/DELETE statements). Different workload types are automatically routed to different queues and assigned different execution priorities and compute resources. Specifically, write requests have a dedicated accelerated write path with guaranteed resources, query workloads default to a higher execution priority, and read-write workloads default to a lower one. In addition, AnalyticDB dynamically lowers the execution priority of queries that have been running for a long time, based on current cluster load and each query's running time, to mitigate the adverse impact of slow or bad queries on other queries.
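The routing-plus-aging policy above can be sketched as follows (the priority values, demotion rate, and classification rules are made up for illustration — the text notes that the real system splits INSERT/DELETE between the write and read-write paths case by case):

```python
DEFAULT_PRIORITY = {"write": 5, "query": 8, "read_write": 3}

def classify(sql):
    """Toy classifier: route a statement to a workload queue."""
    head = sql.strip().split()[0].upper()
    if head == "SELECT":
        return "query"
    if head in ("UPDATE", "DELETE"):
        return "read_write"
    return "write"          # e.g. plain INSERT on the accelerated write path

def priority(sql, running_seconds=0):
    p = DEFAULT_PRIORITY[classify(sql)]
    # Aging: demote one level per 60s of runtime, never below 1,
    # so a runaway query cannot starve short interactive ones.
    return max(1, p - running_seconds // 60)

fast = priority("SELECT * FROM orders WHERE id = 1")
slow = priority("SELECT * FROM orders", running_seconds=300)
```

The key design point is that priority is not fixed at admission: it decays with runtime, so slow queries gracefully yield resources instead of being killed.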

Finally, many businesses have very pronounced cyclical peaks and troughs; offline computing tasks in particular are often scheduled and executed periodically. A surge in resource demand at peak can destabilize the business, while idle resources during troughs incur extra cost. AnalyticDB's time-sharing elasticity helps users scale out during peak hours to keep workloads running stably and scale in during troughs to save cost. With sensible business planning and resource group management, some resource groups can even release all their resources during off-peak hours, greatly reducing costs.

3 New optimizer framework

In recent years, autonomy has become an important direction and trend in database development. Compared with traditional databases, cloud databases provide one-stop managed services and therefore demand even more autonomy. To this end, AnalyticDB developed a new optimizer architecture, moving toward a more intelligent, autonomous database and a better user experience.

The AnalyticDB optimizer was extensively restructured and upgraded. It is split into a rule-based optimizer (RBO) and a cost-based optimizer (CBO). The RBO handles deterministic optimizations — for example, pushing filter conditions down as far as possible to reduce the work of downstream operators. The CBO handles non-deterministic optimizations — for example, reordering JOIN operations, whose benefit is uncertain and must therefore be decided by the cost module. To make these two modules work better and give users a better experience, AnalyticDB introduced four new features.

Feature 1: Histogram

Introducing histograms effectively improves the quality of cardinality estimation, letting the CBO select better plans. Histograms handle skewed distributions of user data values, replacing the "uniform distribution assumption" in cardinality estimation. To verify the histogram's effect, AnalyticDB also built a quality evaluation system; in grayscale cases, overall estimation error fell by more than 50%.
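A minimal equi-depth histogram illustrates why this beats the uniform assumption (bucket count and names are arbitrary): each bucket holds the same number of rows, so skewed data gets tight buckets where values cluster, and a range estimate only assumes uniformity *within* a bucket.

```python
def build_histogram(values, buckets=4):
    """Equi-depth histogram: (low, high, rows) per bucket."""
    vals = sorted(values)
    depth = len(vals) // buckets
    return [(vals[i * depth], vals[min((i + 1) * depth, len(vals)) - 1], depth)
            for i in range(buckets)]

def estimate_le(hist, x):
    """Estimate |{v : v <= x}|, assuming uniformity only inside a bucket."""
    rows = 0.0
    for lo, hi, depth in hist:
        if x >= hi:
            rows += depth                              # whole bucket qualifies
        elif x >= lo:
            rows += depth * (x - lo + 1) / (hi - lo + 1)  # partial bucket
    return rows

# Skewed data: the value 1 appears 40 times, then 40 distinct larger values.
data = [1] * 40 + list(range(2, 42))
hist = build_histogram(data, buckets=4)
est = estimate_le(hist, 1)       # true answer: 40 rows
```

A plain uniform assumption over [1, 41] would estimate roughly 80/41 ≈ 2 rows for the same predicate — off by 20x — while the equi-depth histogram captures the skew.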

Feature 2: Autonomous Statistics

Managing a table's statistics is also a real headache. Pushing the problem onto users — asking them to run commands to collect statistics — causes them great trouble. AnalyticDB therefore introduced an autonomous statistics framework to manage this, minimizing the impact of collection on users while providing the best possible experience. AnalyticDB decides, per column, what level of statistics to collect (collection cost varies by level), detects when statistics are out of date, and decides whether to collect them again.

Feature 3: Incremental Statistics

Traditional statistics collection requires a full table scan, which is costly and disruptive to users — at odds with the goal of improving user experience. AnalyticDB therefore introduced an incremental statistics framework: it can update statistics for a single partition only, and then derive global statistics for the entire table through a global merge. This significantly reduces collection overhead and the impact on users.
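The incremental-then-merge idea can be sketched with mergeable per-partition statistics (names hypothetical; note that exact distinct counts are *not* mergeable this way — real systems use sketches such as HyperLogLog for that):

```python
def partition_stats(rows):
    """Cheap per-partition statistics over a list of numeric values."""
    return {"count": len(rows), "min": min(rows),
            "max": max(rows), "sum": sum(rows)}

def merge(stats_list):
    """Global merge: combine partition stats without rescanning data."""
    stats_list = list(stats_list)
    return {"count": sum(s["count"] for s in stats_list),
            "min": min(s["min"] for s in stats_list),
            "max": max(s["max"] for s in stats_list),
            "sum": sum(s["sum"] for s in stats_list)}

parts = {"p1": [1, 2, 3], "p2": [10, 20], "p3": [5]}
cache = {name: partition_stats(rows) for name, rows in parts.items()}

# Only p2 changed: recompute just that partition, then re-merge globally.
parts["p2"].append(30)
cache["p2"] = partition_stats(parts["p2"])
table_stats = merge(cache.values())
```

Only the dirty partition is rescanned; the table-level view stays current at the cost of a cheap in-memory merge.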

Feature 4: Property Derivation

Property derivation plays an important role in making plans better. Like an Easter egg in a movie, it surfaces information hidden in the SQL that can be used for further optimization. For example, if a user writes the predicate "A = B" in SQL and then does "GROUP BY A, B", the grouping can be simplified to "GROUP BY A" or "GROUP BY B", because property derivation tells us that A and B are the same thing.
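One common way to implement this (a sketch, not AnalyticDB's actual code) is to build equivalence classes from equality predicates with a union-find structure, then keep only one GROUP BY key per class:

```python
def simplify_group_by(equalities, group_keys):
    """Collapse GROUP BY keys that equality predicates prove equal."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:       # walk up to the class representative
            x = parent[x]
        return x

    for a, b in equalities:         # union the two sides of each predicate
        parent[find(a)] = find(b)

    kept, seen = [], set()
    for k in group_keys:            # keep one key per equivalence class
        root = find(k)
        if root not in seen:
            seen.add(root)
            kept.append(k)
    return kept

# "A = B" plus "GROUP BY A, B, C" simplifies to "GROUP BY A, C".
keys = simplify_group_by([("A", "B")], ["A", "B", "C"])
```

Fewer grouping keys mean narrower hash-table keys and cheaper comparisons in the aggregation operator, which is where the plan improvement comes from.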

4 Intelligent diagnosis and optimization

Intelligent diagnosis

AnalyticDB's intelligent diagnosis combines the logical and physical execution plans to diagnose and analyze at three levels — query, stage, and operator — helping users quickly locate problems at each level and giving intuitive analysis of issues such as data skew, inefficient indexes, and conditions that were not pushed down, along with the corresponding tuning suggestions. More than 20 diagnosis rules are online so far, covering memory consumption, elapsed time, data skew, disk I/O, and execution plans, with more to come.
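As a flavor of what such a rule looks like, here is a toy operator-level data-skew check (the threshold and the suggestion text are invented; the real system ships 20+ rules of this shape):

```python
def diagnose_skew(task_rows, ratio=3.0):
    """Flag skew when the heaviest task handles >= ratio x the average rows."""
    avg = sum(task_rows) / len(task_rows)
    worst = max(task_rows)
    if avg > 0 and worst / avg >= ratio:
        return (f"data skew: max task handled {worst} rows, "
                f"{worst / avg:.1f}x the average; consider a better "
                f"distribution key")
    return None

balanced = diagnose_skew([100, 110, 95, 105])    # healthy operator
skewed = diagnose_skew([100, 100, 100, 900])     # one hot task
```

Each rule inspects per-task or per-operator runtime statistics and either stays silent or emits a concrete, actionable suggestion.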

Intelligent optimization

AnalyticDB's intelligent optimization provides tuning for database and table structure, with suggestions for reducing cluster usage costs and improving cluster efficiency. Based on the performance metrics of SQL queries and the data tables and indexes they use, it analyzes the collected statistics and automatically produces tuning suggestions, reducing the burden of manual tuning. Intelligent optimization currently offers three types of recommendations:

1) Hot/cold data optimization: analyze table usage and, for tables that have not been used for a long time, suggest moving them to cold storage. In about 60% of cases, following the suggestion to move tables unused for 15 days to cold storage saves more than 30% of hot storage space;

2) Index optimization: analyze index usage and suggest deleting indexes that have not been used for a long time. About 50% of instances can, per the suggestion, delete indexes unused for 15 days and save more than 30% of hot storage space;

3) Distribution optimization: analyze the gap between the distribution keys actually used in data queries and the distribution columns defined on the table; when the distribution key is poorly designed, suggest changing the table's distribution key to improve query performance.

Summary and outlook

Tempered by years of Double 11, AnalyticDB has not only withstood extreme workloads and traffic year after year, but has also matured amid increasingly rich business scenarios, continuously empowering new and old businesses and scenarios inside and outside the group, and has gradually grown into a leader among new-generation cloud native data warehouses. Going forward, AnalyticDB will continue to take "data services available to everyone" as its mission, further embrace cloud native, build a database + big data integrated architecture, and strengthen enterprise capabilities such as extreme elasticity, offline/online integration, cost-effectiveness, and intelligent autonomy, further empowering users to mine the business value behind their data.


This article is the original content of Aliyun and shall not be reproduced without permission.