Guan Tao is a senior expert in big data systems. Of his 20 years working on big data, he has spent the last 15 at Microsoft (the Internet/Azure Cloud business groups) and Alibaba (Alibaba Cloud). From the perspective of system architecture, this article gives an overview of the current hotspots in big data architecture, the development of each technology line, and the trends and unsolved problems in the field.

By Guan Tao, Researcher, Alibaba Cloud Computing Platform | Wang Cui, Project Management Expert, Alibaba

Every technology moves from rarefied to commonplace, just as our understanding of computers went from machines kept in "shoe-cover rooms" to ubiquitous smartphones. Over the past 20 years, big data technology has gone through the same process, evolving from "rocket science" into technology that everyone can benefit from.

Looking back, many open source and self-developed systems emerged in the early days of big data, and there was a long period of "Red Sea" competition within each field: Yarn vs. Mesos, Hive vs. Spark, Flink vs. Spark Streaming vs. Apex, Impala vs. Presto vs. ClickHouse, and so on. After fierce competition and elimination, the winning products scaled up and began to dominate the market and developer mindshare.

In fact, no new star open source engine has emerged in the big data field in recent years (ClickHouse was open-sourced in 2016, PyTorch in 2018). With projects such as Apache Mesos ceasing maintenance, the field has entered a "post-Red Sea" era: technology is gradually converging and entering a phase of broad accessibility and large-scale commercial application.

It is worth noting that the big data field is still developing: some technologies are converging, but new directions and new sub-fields keep emerging. The content of this article is shaped by personal experience and written from a personal perspective, so flaws and biases are inevitable, and limited space makes it hard to be comprehensive. It is offered as a starting point for discussion, and the author welcomes exchanges with peers in the industry.

I. Current hotspots in big data systems

The concept of "big data" was first proposed in the 1990s; counting from the three classic Google papers (GFS, BigTable, and MapReduce), the field has now been developing for nearly 20 years. Over those 20 years, excellent systems were born, including Google's internal big data stack, Microsoft's Cosmos system, Alibaba Cloud's Apsara (Feitian) system, and the open source Hadoop ecosystem. These systems have gradually pushed the industry into the era of "digitalization" and, more recently, "AI-ification".

Massive data and its value have attracted significant investment and greatly advanced big data technology. The rise of the cloud has made big data technology accessible to small and medium-sized enterprises. It is fair to say that big data technology is developing at just the right time.

From an architectural perspective, the evolution toward the "shared-everything" architecture, the convergence of data lake and data warehouse storage, the redesign of fundamentals brought by cloud native, and better support for AI are the four hottest topics in current platform technology.

1.1 From the perspective of system architecture, platforms are evolving toward a shared-everything architecture

The system architecture of the broader data domain has developed from the scale-up of traditional databases to the scale-out of big data. From the perspective of distributed systems, overall architectures can be divided into three types: shared-nothing (also known as MPP), shared-data, and shared-everything.

The data warehouse architecture of big data platforms originally grew out of databases, and the shared-nothing (MPP) architecture was long the mainstream. As cloud-native capabilities have matured, shared-data architectures, represented by Snowflake, have grown. Big data systems based on DFS and MapReduce principles were designed as shared-everything architectures from the beginning.

The shared-everything architecture is represented by Google BigQuery and Alibaba MaxCompute. From an architectural perspective, the shared-everything architecture, with its greater flexibility and potential, is the direction for the future.

(Figure: Three big data architectures)

1.2 From the perspective of data management, data lakes and data warehouses are converging into the lakehouse

With the high performance and management capabilities of the data warehouse on one side and the flexibility of the data lake on the other, the two systems are learning from each other and converging. In 2020, vendors each proposed their own lakehouse architectures, which have become the hottest trend in current architecture evolution. However, the lakehouse takes many forms, which are still evolving and being debated.

(Figure: Data lake and data warehouse learning from and converging with each other)

1.3 From the perspective of cloud architecture, cloud native and fully managed services are becoming mainstream

As big data platform technology matures, users are beginning to diverge. More and more small and medium-sized users no longer develop or build their own data platforms and instead embrace fully managed (usually also cloud-native) data products. Snowflake is widely recognized as the quintessential product in this field. Going forward, only a small number of very large head companies will follow the self-build (open source plus in-house improvement) model.

(Figure: Snowflake's cloud-native architecture)

1.4 From the perspective of computing paradigms, AI is gradually becoming mainstream, forming a BI+AI dual mode

BI, as statistical analysis, is mainly a summary of the past, while AI is getting better and better at predicting the future. Over the last five years, algorithmic (AI) workloads have grown from less than 5% of total data center capacity to 30%. AI has become a first-class citizen in big data.

II. Domain architecture of big data systems

Of the three architectures introduced in section 1.1 (shared-nothing, shared-data, and shared-everything), the two systems the author has worked on (Microsoft's Cosmos/Scope and Alibaba Cloud's MaxCompute) are both shared-everything architectures. Therefore, mainly from the perspective of the shared-everything architecture, the author divides the big data field into six stacked sub-fields and three horizontal fields, nine in total, as shown in the figure below.

(Figure: Domain architecture based on a shared-everything big data system)

Each of these areas has evolved over the years; the following sections give an overview of the evolution history, driving forces, and future direction of each sub-area.

2.1 Evolution of distributed storage toward multi-layer, intelligent systems

Distributed storage here refers specifically to general-purpose massive distributed storage for big data: a typical stateful distributed system whose core optimization directions are high throughput, low cost, disaster recovery, and high availability. (Note: the following "generations" are for convenience of description only and do not represent a strict architectural evolution.)

**In the first generation, the typical representatives of distributed storage were Google's GFS and Apache Hadoop's HDFS, both append-only file systems supporting multiple replicas.** Because the early HDFS NameNode could not meet users' requirements for high data availability, many large companies developed their own storage systems, such as Microsoft's Cosmos (which later evolved into Azure Blob Storage) and Alibaba's Pangu system. As the foundation of open source storage, the HDFS interface has become a de facto standard; at the same time, HDFS has plug-in capabilities that allow other systems to serve as its backing storage.

In the second generation, on top of this foundation, the surging demand for massive object storage (for example, huge numbers of photos) led to a metadata service layer for massive small objects being built on top of the general append-only file system, forming object-based storage. Typical representatives include AWS S3 and Alibaba Cloud OSS. It is worth mentioning that both S3 and OSS can act, through standard plug-ins, as de facto storage backends for HDFS.
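
As a small, hedged illustration of object storage serving as an HDFS-style backend, the sketch below reads the same Parquet layout through either an hdfs:// path or an s3a:// path from PySpark. It assumes the hadoop-aws connector is on the classpath; the bucket, paths, and credentials are placeholders.

```python
# A minimal sketch: the same engine-facing API works whether the data sits
# on HDFS or on an object store exposed through the s3a:// connector.
# Assumes hadoop-aws is available; keys and paths below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("object-storage-as-storage-backend")
    # In practice credentials usually come from instance roles or environment
    # configuration rather than being set inline like this.
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

df_hdfs = spark.read.parquet("hdfs:///warehouse/events/")              # HDFS backend
df_s3 = spark.read.parquet("s3a://example-bucket/warehouse/events/")   # object-store backend
df_s3.printSchema()
```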

**The third generation is represented by the data lake.** With the development of cloud computing and advances in network technology (after 2015), the integrated storage-compute architecture is gradually being replaced by a new architecture of cloud-native storage (managed storage) plus storage-compute separation. This is also the starting point of data lake systems. At the same time, the bandwidth problems caused by storage-compute separation have not been completely solved, and caching services such as Alluxio have emerged to fill this niche.

The fourth generation is also the current trend. As storage systems become cloud-managed and their lower layers become transparent to users, storage has the opportunity to adopt more sophisticated designs. Storage systems are therefore evolving into multi-layer integrated storage: from systems based on a single tier of SATA disks to multi-tier systems such as Mem/SSD + SATA (3x replication) + SATA (1.375x erasure coding) + archival storage (AWS Glacier being a typical example).

How to tier data intelligently and transparently, and how to find the right trade-off between cost and performance, are the key challenges for multi-tier storage systems. The field is still young: there is no notably strong product in the open source world, and the state of the art is led by the self-developed storage systems of a few large companies.

(Figure: Multi-layer integrated storage system of Alibaba MaxCompute)
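
To make the tiering trade-off concrete, here is a toy sketch of a policy that maps a file's access pattern to a storage tier. The tier names, thresholds, and statistics are illustrative assumptions, not the policy of MaxCompute or any other real system.

```python
# A toy tiering policy for a multi-layer storage system. All thresholds and
# tier names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    size_gb: float
    days_since_last_read: int
    reads_last_30d: int

def choose_tier(f: FileStats) -> str:
    """Map a file's access pattern to a storage tier, hottest to coldest."""
    if f.reads_last_30d > 100:
        return "mem/ssd-cache"                 # hot, latency-sensitive data
    if f.days_since_last_read <= 7:
        return "sata-3x-replication"           # warm, throughput-oriented data
    if f.days_since_last_read <= 90:
        return "sata-1.375x-erasure-coding"    # cool data, cheaper durability
    return "archive"                           # cold data, Glacier-like storage

files = [
    FileStats("logs/2021/01/01", 120.0, days_since_last_read=200, reads_last_30d=0),
    FileStats("dwd/orders/ds=latest", 30.0, days_since_last_read=0, reads_last_30d=450),
]
for f in files:
    print(f.path, "->", choose_tier(f))
```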

On top of these systems, there is a File Format layer, which is orthogonal to the storage system itself.

**The first generation of storage formats** covers file formats, compression and encoding techniques, and index support. The two mainstream storage formats are Apache Parquet and Apache ORC, from the Spark and Hive ecosystems respectively. Both are columnar formats adapted to big data: ORC is strong in compression and encoding, while Parquet is better at supporting semi-structured data. In addition, there is the in-memory format Apache Arrow; in terms of design it also belongs to the format layer, but it is mainly optimized for in-memory data exchange.
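
As a small illustration of these formats, the sketch below uses pyarrow (assuming a recent build with ORC support) to hold data in Arrow's in-memory columnar form and write it out as Parquet and ORC; file names and data are illustrative.

```python
# Arrow as the in-memory columnar representation; Parquet and ORC as on-disk
# columnar formats with compression/encoding and statistics for pruning.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# Build an Arrow table (columnar, in memory, designed for zero-copy exchange).
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "event": pa.array(["click", "view", "click"]),
})

# Persist as Parquet (with zstd compression) and as ORC.
pq.write_table(table, "events.parquet", compression="zstd")
orc.write_table(table, "events.orc")

# Column pruning on read is the core benefit of columnar formats.
print(pq.read_table("events.parquet", columns=["event"]))
```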

**The second generation of storage formats: near-real-time formats represented by Apache Hudi and Delta Lake.** Early storage formats used large columnar files optimized for throughput rather than latency. As both major storage lines evolved toward real-time support, Databricks launched Delta Lake, which brings near-real-time ACID data operations to Apache Spark, and Uber launched Apache Hudi, which provides near-real-time data upsert capabilities.

Although the two differ slightly in the details (such as merge-on-read versus merge-on-write), the overall approach is the same: support incremental files so that the data update cycle becomes much shorter (avoiding the traditional undifferentiated full merge of Parquet/ORC files for every update), thereby achieving near-real-time storage. Because the near-real-time direction usually involves more frequent file merges and finer-grained metadata support, the interfaces are more complex, and Delta and Hudi are not mere formats but full sets of services.
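
As a hedged illustration of these second-generation formats, here is a minimal sketch of a near-real-time upsert using Delta Lake's MERGE API from PySpark. It assumes a Spark environment with the delta-spark package installed; the table path, schema, and data are illustrative.

```python
# Upserting rows into a Delta table instead of rewriting the whole dataset.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

updates = spark.createDataFrame([(1, "paid"), (4, "created")],
                                ["order_id", "status"])

target = DeltaTable.forPath(spark, "/data/lake/orders")   # existing Delta table

# MERGE rewrites only the data files that contain matching rows and commits
# the change atomically through the transaction log, avoiding the
# undifferentiated full merge described above.
(target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```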

Storage formats will continue to evolve toward support for real-time updates and will be combined with real-time indexes. They will no longer serve merely as file storage formats but will be integrated with in-memory structures to form complete solutions. The mainstream real-time update implementations are based either on the log-structured merge tree (LSM tree, used by almost all real-time data stores) or on the Lucene index (the format used by Elasticsearch).
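
To show why log-structured designs suit frequent updates, here is a toy LSM-tree sketch: writes land in an in-memory memtable and are periodically flushed as immutable sorted runs, and reads consult the memtable before the runs. It is purely didactic and omits compaction, write-ahead logging, and bloom filters.

```python
# A didactic LSM-tree: updates are just new writes; sorting happens at flush time.
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # mutable; absorbs random writes in memory
        self.runs = []           # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:                 # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs):          # newer runs before older ones
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

    def _flush(self):
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        # A real system would also merge/compact runs in the background.

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k3", 333)                   # an update is simply another write
print(db.get("k3"), db.get("k9"))   # 333 9
```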

From the perspective of a storage system's interfaces and internal functionality, simpler interfaces and functions correspond to more open capabilities (such as GFS/HDFS), while more complex and efficient functionality usually means a more closed system that gradually turns into an integrated storage-compute system (such as AWS Redshift, which manages its own storage). Technology in both directions is converging.

Looking ahead, we see the following possible directions/trends:

1) At the platform level, storage-compute separation will become the standard within two to three years, and platforms will move toward managed, cloud-native services. Inside the platform, fine-grained tiering will become the key means of balancing performance and cost (something current data lake products are still far from achieving), and AI will play a bigger role in tiering algorithms.

2) At the format level, evolution will continue, but big breakthroughs and generational changes will likely depend on new hardware (there is limited room left for encoding and compression optimization on general-purpose processors).

3) The further merging of data lakes and data warehouses will make storage more than just a file system. How thick the storage layer should be, and where its boundary with the compute layer lies, remains a key question.

2.2 Distributed scheduling: building on cloud native, evolving toward unified frameworks and diversified algorithms

Computing resource management is a core capability of distributed computing. Its essence is to optimally match different kinds of workloads to resources. In the "post-Red Sea era", Google's Borg system and the open source Apache Yarn are still the key products in this field, while Kubernetes (K8s) is still catching up in big data computing and scheduling.

Common cluster scheduling architectures are as follows:

  • Centralized scheduling architecture: MapReduce in early Hadoop 1.0, and the later Borg and Kubernetes, are all centrally designed scheduling frameworks in which a single scheduler is responsible for assigning tasks to machines in the cluster. In particular, within the centralized scheduler, most systems adopt a two-level framework that separates resource scheduling from job scheduling, allowing job scheduling logic to be customized for specific applications while still sharing cluster resources across jobs. Yarn and Mesos are both of this kind (a minimal sketch of the two-level idea appears after the figure below).
  • Shared-state scheduling architecture: a semi-distributed pattern in which each application-level scheduler holds a copy of the cluster state and updates that copy independently. Google Omega and Microsoft Apollo are examples of this architecture.
  • Fully distributed scheduling architecture: first proposed in the Sparrow paper, this design is even more decentralized: there is no coordination between schedulers, and many independent schedulers handle the different workloads.
  • Hybrid scheduling architecture: combines centralized scheduling with the shared-state design. There are generally two scheduling paths: distributed scheduling designed for part of the load, and centralized job scheduling for the remaining load.

(Figure: The evolution of cluster scheduler architectures, by Malte Schwarzkopf)
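
To make the two-level idea of the centralized architecture concrete, here is a toy Python sketch, loosely in the spirit of Yarn/Mesos: a resource manager owns cluster state and grants capacity, while a per-job scheduler decides how to use what it is granted. All names and numbers are illustrative.

```python
# Level 1 (resource scheduling) and level 2 (job scheduling) separated.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cores: int

@dataclass
class ResourceManager:
    """Level 1: tracks cluster capacity and grants it to jobs."""
    nodes: list = field(default_factory=list)

    def allocate(self, cores: int):
        for node in self.nodes:
            if node.free_cores >= cores:
                node.free_cores -= cores
                return node.name
        return None                          # request has to wait

@dataclass
class JobScheduler:
    """Level 2: application-specific placement logic for one job."""
    job: str
    tasks: list                              # (task_name, cores) pairs

    def run(self, rm: ResourceManager):
        placements = {}
        for task, cores in self.tasks:
            node = rm.allocate(cores)
            if node is not None:
                placements[task] = node
        return placements

rm = ResourceManager([Node("worker-1", 8), Node("worker-2", 4)])
etl = JobScheduler("daily-etl", [("map-0", 4), ("map-1", 4), ("reduce-0", 2)])
print(etl.run(rm))
# {'map-0': 'worker-1', 'map-1': 'worker-1', 'reduce-0': 'worker-2'}
```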

No matter which architecture a big data scheduling system is based on, the following scheduling capabilities are required when processing massive amounts of data:

  • Data scheduling: multi-datacenter, cross-region services create global data scheduling problems that require optimal use of storage space and network bandwidth.
  • Resource scheduling: the move of IT infrastructure to the cloud brings greater technical challenges for resource scheduling and isolation. At the same time, as physical clusters grow further, decentralized scheduling architectures become the trend.
  • Computing scheduling: the classic MapReduce computing framework has gradually evolved into an era of fine-grained scheduling that supports dynamic adjustment, global optimization of data shuffle, and full use of hardware resources such as memory and the network.
  • Single-machine scheduling: guaranteeing SLAs under high resource pressure has long been a focus of academia and industry. Borg and the related open source explorations all assume that, when resources conflict, the system yields unconditionally to online services. However, offline services also have hard SLA requirements and should not be sacrificed arbitrarily.

Looking ahead, we see the following possible directions/trends:

  1. K8s as the unified scheduling framework: Google Borg has long proven that unified resource management benefits optimal matching and peak shaving and valley filling. Although K8s still faces challenges in scheduling offline (batch) workloads, its precise positioning and flexible plug-in design should make it the eventual winner. Big data schedulers such as kube-batch are currently a hot area of investment.
  2. Diversified and intelligent scheduling algorithms: as different resources are decoupled (for example, through storage-compute separation), scheduling algorithms can be optimized more deeply along individual dimensions, and AI-based optimization is a key direction (in fact, Google Borg used Monte Carlo simulation years ago to predict the resource requirements of new tasks; a toy illustration appears after this list).
  3. Scheduling support for heterogeneous hardware: multi-core ARM architectures have become a hotspot in general-purpose computing, and AI accelerator chips such as GPUs and TPUs have become mainstream. Scheduling systems need to support this variety of heterogeneous hardware and abstract it behind simple interfaces; here, K8s's plug-in design has obvious advantages.
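
As a toy illustration of the Monte Carlo idea mentioned in trend 2, the sketch below resamples historical peak-memory observations from similar tasks to size the request for a new task. The distribution, percentile, and numbers are assumptions for illustration only, not Borg's actual model.

```python
# Resample observed peaks and pick a high percentile as the resource request.
import random

random.seed(7)

# Pretend these are observed peak-memory samples (GiB) from similar past tasks.
history = [random.lognormvariate(1.0, 0.4) for _ in range(500)]

def simulate_request(samples, n_trials=10_000, quantile=0.95):
    """Size the request so that roughly `quantile` of simulated runs fit."""
    trials = sorted(random.choice(samples) for _ in range(n_trials))
    return trials[int(quantile * n_trials)]

print(f"suggested memory request: {simulate_request(history):.2f} GiB")
```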

2.3 Unification of metadata services

Metadata services underpin the big data platform and the operation of all of its computing engines and frameworks. A metadata service is an online service with high request frequency and high throughput; it must provide high availability and high stability, and it needs capabilities such as compatibility, hot upgrades, and multi-cluster (replica) management. It mainly covers the following three functions:

  • DDL/DML business logic, guaranteeing ACID properties and data integrity and consistency
  • Authorization and authentication, guaranteeing the security of data access
  • Highly available storage and querying of metadata, guaranteeing the stability of jobs

**The first-generation metadata system for data platforms is the Hive MetaStore (HMS).** In earlier versions, the HMS metadata service was built into Hive: metadata updates (DDL) and the read/write consistency of DML jobs were strongly coupled to the Hive engine, and metadata was usually stored in a relational database engine such as MySQL.
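
To make this coupling concrete, here is a minimal sketch of how an engine interacts with an HMS-backed catalog from PySpark: DDL is recorded in the metastore, while DML reads and writes the data files whose locations the metastore resolves. It assumes a Spark session built with Hive support and a reachable metastore; the URI, database, and table names are placeholders.

```python
# DDL goes through the metastore; DML resolves locations through it, then
# touches the data files directly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-sketch")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder
    .enableHiveSupport()
    .getOrCreate()
)

# DDL: persisted in the metastore's backing RDBMS (e.g. MySQL).
spark.sql("CREATE DATABASE IF NOT EXISTS ods")
spark.sql("CREATE TABLE IF NOT EXISTS ods.events (id BIGINT, ts STRING) "
          "PARTITIONED BY (ds STRING) STORED AS PARQUET")

# DML: the engine asks the metastore for table/partition locations, then reads
# the files; keeping these two steps consistent is exactly the coupling above.
spark.sql("SELECT ds, COUNT(*) FROM ods.events GROUP BY ds").show()

# Pure metadata queries are served by the metastore service alone.
print(spark.catalog.listTables("ods"))
```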

As customers' requirements have grown for consistent data processing (ACID), openness (multiple engines and data sources), real-time capability, and large-scale extensibility, the limitations of the traditional HMS (single cluster, single tenant, Hive-only use within a single enterprise, and high operational costs to keep data secure and reliable) have been gradually exposed in mass production environments.

The second generation of metadata systems is represented by Apache Iceberg, an open source system, and by Alibaba's big data platform MaxCompute, a cloud-native system.

Iceberg is an open source "metadata system" for big data platforms that has emerged in the last two years, independent of any particular engine or storage. The core problems it solves are ACID for big data processing and the metadata performance bottleneck of tables and partitions at scale. In terms of implementation, Iceberg's ACID relies on the POSIX semantics of the file system, and partition metadata is stored as files. Meanwhile, Iceberg's table format is independent of the Hive MetaStore's metadata interface, so adoption by engines is costly: every engine must be modified.
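
As a hedged illustration, the sketch below shows Spark using Iceberg as an engine-independent table format through a Hadoop catalog, with table metadata (schema, snapshots, manifests) kept as files under the warehouse path. It assumes the iceberg-spark-runtime package is on the classpath; the catalog name, warehouse path, and table are illustrative.

```python
# Creating, writing, and inspecting snapshots of an Iceberg table from Spark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/data/lake/warehouse")
    .getOrCreate()
)

# Table and partition metadata live as files under the warehouse path,
# independent of any Hive MetaStore.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, status STRING) "
          "USING iceberg PARTITIONED BY (status)")

spark.sql("INSERT INTO demo.db.orders VALUES (1, 'created'), (2, 'paid')")

# Every commit produces a new snapshot; readers see a consistent snapshot,
# which is how Iceberg provides ACID on top of file/object storage.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.orders.snapshots").show()
```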

Based on this analysis of future hotspots and trends, open, managed, and unified metadata services are becoming more and more important, and several cloud vendors have started to provide Data Catalog services to support multi-engine access to the lake and warehouse storage tiers.

(Table: Comparison of first-generation and second-generation metadata systems)

Looking ahead, we see the following possible directions/trends:

  1. Trend 1: as lake and warehouse converge further, metadata will be unified and the ability to access metadata and data on the lake will be built up, for example a unified metadata interface based on a single account system that supports metadata access across both lake and warehouse. Combining the ACID capabilities of multiple table formats (supporting Delta, Hudi, and Iceberg tables) will be a challenge for platform products as lake data-writing scenarios become richer.
  2. Trend 2: metadata permission systems are shifting toward enterprise tenant identity and permission systems and are no longer limited to the scope of a single engine.
  3. Trend 3: metadata models are starting to move beyond relational, structured models toward richer metadata: support for tagging and classification, the ability to express custom types and metadata formats, support for AI computing engines, and more.

This article is original content from Alibaba Cloud and may not be reproduced without permission.