Summary: This article is a translation of a series of technical articles by Databricks on Delta Lake, a storage layer for data lakes. It is well known that Databricks is behind many popular technologies in the open source big data community, including Apache Spark, Delta Lake, and MLflow. As a core storage engine solution for data lakes, Delta Lake brings many advantages to enterprises. This series of articles covers Delta Lake in detail.

Preface

This article is a translation of a series of technical articles on Delta Lake by Databricks, a big data technology company. It is well known that Databricks is behind many popular technologies in the open source big data community, including Apache Spark, Delta Lake, and MLflow. As the core storage engine solution for data lakes, Delta Lake brings many advantages to enterprises. In addition, Alibaba Cloud (Aliyun) has teamed up with Databricks, the original engine team behind Apache Spark and Delta Lake, to launch Databricks DataInsight, a fully managed enterprise Spark product on Alibaba Cloud. The product natively integrates the enterprise edition of the Delta Engine and provides high-performance computing power without additional configuration. Interested readers can search for "Databricks DataInsight" on the Alibaba Cloud website, or go directly to https://www.aliyun.com/produc… to learn more.

Translator: Han Zongze (Zongze), technical expert in Alibaba Cloud's Computing Platform Division, responsible for R&D in the open source big data ecosystem enterprise team.

Delta Lake Technology Series – Lakehouse

Integrating the best of the data lake and the data warehouse

Contents

• Chapter-01: What is a Lakehouse?
• Chapter-02: Diving into the inner workings of Lakehouse and Delta Lake
• Chapter-03: Exploring the Delta Engine

Introduction to this article

The Delta Lake series of e-books, published by Databricks and translated by the big data ecosystem enterprise team of Alibaba Cloud's Computing Platform Division, aims to help leaders and practitioners understand the full capabilities of Delta Lake and the context in which it exists. This installment of the series focuses on the Lakehouse.

What you will learn

By the end of this article, you will understand not only what features Delta Lake offers, but also how they can lead to substantial performance improvements.

What is Delta Lake?

Delta Lake is a unified data management system that brings data reliability and fast analytics to cloud data lakes. It runs on top of an existing data lake and is fully compatible with the Apache Spark API. At Databricks, we have seen how Delta Lake brings reliability guarantees, performance optimization, and lifecycle management to the data lake. Delta Lake can be used to solve problems such as malformed data, compliance-driven deletions, and modifications to individual records. At the same time, high-quality data can be written to the data lake quickly, and the underlying storage can be deployed through secure, scalable cloud services to improve the efficiency of data utilization.
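
Because Delta Lake is fully compatible with the Apache Spark API, a Delta table can be created and queried with ordinary Spark code. The following is a minimal sketch, assuming the open source delta-spark package is installed; the table path is purely illustrative.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # from the delta-spark package

# Build a Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a DataFrame to a Delta table on the data lake (the path is illustrative).
spark.range(0, 1000).write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back with the ordinary Spark API.
spark.read.format("delta").load("/tmp/delta/events").show(5)
```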

Chapter-01: What is a Lakehouse?

Over the past few years, the Lakehouse has emerged as a new data management paradigm, appearing independently across many Databricks users and application cases. In this chapter, we explain the new paradigm and its advantages over previous approaches.

Data warehouses have a long history in decision support and business intelligence applications. Since its creation in the late 1980s, data warehouse technology has evolved, and the MPP architecture enables systems to handle much larger amounts of data.

Although warehouses are great for structured data, many modern enterprises must deal with unstructured and semi-structured data, as well as data with high variety, velocity, and volume. Data warehousing is not suitable for many of these scenarios and is not cost-effective.

As companies began to collect large amounts of data from many different sources, architects began to envision a single system to accommodate the data generated by many different analysis products and work tasks.

About ten years ago, we started building data lakes: repositories of raw data in a variety of formats. Data lakes, while good for storing data, lack some key features: they do not support transactions, do not enforce data quality, and lack consistency/isolation guarantees, which makes it nearly impossible to mix appends and reads, or batch and streaming jobs. For these reasons, many of the promised capabilities of data lakes have gone unrealized and, in many cases, their advantages have been lost.

The need for flexibility and high-performance systems for a wide range of data applications, including SQL analytics, real-time monitoring, data science and machine learning, has not diminished in many companies. Most of the latest advances in AI are based on models that better deal with unstructured data (like text, images, video, audio), but these are precisely the types of data that data warehousing is not optimized for. A common solution is to use a system that combines data lakes, multiple data warehouses, and other systems such as streams, time series, graphs, and image databases. However, maintaining this whole system is very complex (and relatively expensive). In addition, data professionals often need to move or replicate data across systems, which in turn leads to delays.

The Lakehouse integrates the advantages of both the data lake and the data warehouse

The Lakehouse is a new paradigm that combines the strengths of the data lake and the data warehouse while addressing the limitations of the data lake. The Lakehouse uses a new system design: data structures and data management features similar to those in a data warehouse are implemented directly on the low-cost storage used for the data lake. It is what you would arrive at if you had to redesign the data warehouse today, now that cheap and highly reliable storage (in the form of object storage) is available.

LakeHouse has the following key features:

  1. Transaction support: In enterprise applications, many data pipelines often read and write data concurrently, typically via SQL. The Lakehouse supports ACID transactions to guarantee consistency when multiple parties read or write data at the same time.
  2. Schema enforcement and governance: The Lakehouse should support schema enforcement and evolution, including DW schema designs such as star/snowflake schemas. The system should be able to reason about data integrity and should have robust governance and auditing mechanisms.
  3. BI support: The Lakehouse lets BI tools work directly on the source data. This reduces staleness and wait times, improves recency, and avoids the cost of operating two copies of the data in both the data lake and the warehouse.
  4. Separation of storage and compute: In practice this means storage and compute use separate clusters, so these systems can scale to more concurrent users and larger data volumes. Some modern data warehouses also have this property.
  5. Openness: The storage formats used by the Lakehouse are open and standardized, such as Parquet, and it provides a variety of APIs, including machine learning and Python/R libraries, so that a variety of tools and engines can access the data directly and efficiently.
  6. Support for diverse data types, from unstructured to structured: The Lakehouse can be used to store, refine, analyze, and access the many data types needed by new data applications, including images, video, audio, semi-structured data, and text.
  7. Support for diverse workloads: including data science, machine learning, and SQL analytics. These workloads may rely on multiple different tools, but they all depend on the same data repository.
  8. End-to-end streaming: Real-time reporting is a daily requirement for many businesses. Support for stream processing eliminates the need for a separate system dedicated to serving real-time data applications (see the sketch after this list).
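
As a hedged illustration of the last point, the sketch below (assuming a Delta-enabled Spark session like the one shown earlier, with purely illustrative paths) uses Delta tables as both a Structured Streaming source/sink and an ordinary batch source, so real-time and batch consumers share the same data.

```python
# Assumes a Delta-enabled Spark session (as in the earlier sketch); paths are illustrative.

# Continuously copy new records from one Delta table into a reporting table.
query = (
    spark.readStream.format("delta").load("/tmp/delta/events")       # streaming source
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/reports")
    .start("/tmp/delta/reports")                                      # streaming sink
)

# Meanwhile, batch SQL and BI tools can query the same table directly
# (once the stream has committed its first batch).
spark.read.format("delta").load("/tmp/delta/reports") \
     .createOrReplaceTempView("reports")
spark.sql("SELECT count(*) AS events FROM reports").show()
```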

These are the key characteristics of the Lakehouse. Enterprise-grade systems require additional functionality: security and access control tools are basic requirements, and, particularly in light of recent privacy regulations, data governance capabilities including auditing, retention, and lineage have become critical, as have data discovery tools such as data catalogs and data usage metrics. With the Lakehouse, these enterprise features only need to be deployed, tested, and managed for a single system.

Read the following research paper on Delta Lake: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Abstract:

Cloud object storage services (such as Alibaba Cloud OSS) are some of the largest and most cost-effective storage systems available, and they are the primary choice for storing large data warehouses and data lakes. The limitation is that their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations (such as listing objects) are expensive, and consistency guarantees are limited. In this paper, we introduce Delta Lake, an open source ACID table storage layer over cloud object storage, originally developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (for example, the ability to quickly search billions of table partitions relevant to a query). It also leverages this design to provide advanced features such as automatic data layout optimization, upserts, caching, and audit logging. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.

Authors: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hövell, Adrian Ionescu, Alicja Łuszczak, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia

Inner workings of the lakehouse

Early examples

The Databricks Unified Data Platform architecturally supports the Lakehouse. Alibaba Cloud's DDI service, built in partnership with Databricks, implements a similar Lakehouse model. Other managed services (such as BigQuery and Redshift Spectrum) have some of the Lakehouse features listed above, but they primarily target BI and other SQL applications. For companies that want to build and operate their own systems, there are open source file formats suitable for building a Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi).

Combining the data lake and the data warehouse into one system means data teams can move faster, because they can work with the data without having to access multiple systems. In these early Lakehouse deployments, SQL support and integration with BI tools are usually sufficient to meet the needs of most enterprise data warehouses. Materialized views and stored procedures are available, but users may need to adopt mechanisms that differ from those found in traditional data warehouses. The latter is particularly important for "lift and shift" scenarios, which require the semantics of the new system to be almost identical to those of the old commercial data warehouse. What about support for other types of data applications? Lakehouse users have access to a variety of standard tools (Apache Spark, Python, R, machine learning libraries) for non-BI workloads such as data science and machine learning. Data exploration and refinement are standard in many analytics and data science applications. Delta Lake is designed to let users gradually improve the quality of the data in the Lakehouse until it is ready for consumption, as sketched below.
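
As a sketch of this gradual refinement, the steps below (with hypothetical paths and column names, and a Delta-enabled Spark session assumed) land raw data in one Delta table and progressively clean it into a table that BI and ML workloads can consume.

```python
# A minimal sketch of gradual data refinement on the Lakehouse; all paths and
# column names are hypothetical, and a Delta-enabled Spark session is assumed.

# Land raw data as-is in a "raw" Delta table.
raw = spark.read.json("/data/landing/clickstream")
raw.write.format("delta").mode("append").save("/delta/raw/clickstream")

# Refine it step by step until it is ready for BI and ML consumption.
refined = (
    spark.read.format("delta").load("/delta/raw/clickstream")
    .dropDuplicates(["event_id"])
    .filter("user_id IS NOT NULL")
)
refined.write.format("delta").mode("overwrite").save("/delta/refined/clickstream")
```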

Although a distributed file system can be used for the storage layer, object storage is more suitable for Lakehouse. Object storage provides low-cost, highly available storage that performs well in massively parallel reads, a basic requirement of modern data warehousing.

From BI to AI

The Lakehouse is a new data management architecture that can radically simplify enterprise data infrastructure and accelerate innovation in an era in which machine learning is being applied across all industries. In the past, most of the data that fed a company's products or decisions was structured data from operational systems. Today, many products incorporate AI in the form of computer vision and speech models, text mining, and more. Why use a Lakehouse instead of a data lake for AI? Because the Lakehouse provides data versioning, governance, security, and ACID properties, even for unstructured data.

Today, Lakehouses reduce cost, but their performance can still lag behind dedicated systems (such as data warehouses) that have had years of investment and real-world deployment. Users may also prefer certain tools (BI tools, IDEs, notebooks), so the Lakehouse needs to improve its UX and its connectors to popular tools in order to attract more users. As the technology matures, the Lakehouse will close these gaps while retaining its core attributes of being simpler, more cost-effective, and better suited to a wide variety of data applications.

Chapter-02: Diving into the inner workings of Lakehouse and Delta Lake

Databricks has written a blog post outlining the growing number of businesses adopting the Lakehouse model. The blog generated a lot of interest among technology enthusiasts. While many hail it as the next generation of data architecture, some consider a Lakehouse to be the same thing as a data lake. Recently, several of our engineers and founders wrote a research paper describing some of the core technical challenges and solutions that distinguish the Lakehouse architecture from a data lake. The paper, "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores," was accepted and presented at the International Conference on Very Large Data Bases (VLDB) 2020.

More than a decade ago, the cloud opened a new direction for data storage. Cloud object stores like Amazon S3 have become some of the largest and most cost-effective storage systems in the world, which makes them attractive platforms for data warehouses and data lakes. However, their nature as key-value stores makes it difficult to provide the ACID transaction guarantees that many companies require. Moreover, expensive metadata operations (such as listing objects) and limited consistency guarantees hurt performance.

Given the characteristics of cloud object storage, three approaches have emerged:

Data Lakes

The data lake approach stores a table as a directory of objects (that is, a data lake), typically in a columnar format (for example, Apache Parquet). This approach is attractive because a table is just a collection of objects that can be accessed by many tools without another data storage system, but it leads to performance and consistency problems: hidden data corruption caused by failed transactions is common, resulting in inconsistent queries, long latencies, and the absence of basic management features such as table versioning and audit logging.

Custom Storage Engines

The second approach is to build a custom storage engine, such as a proprietary system designed for the cloud like the Snowflake data warehouse. These systems provide a single source of truth and avoid the consistency challenges of the data lake by managing metadata in a separate, strongly consistent service. However, all I/O operations must go through this metadata service, which increases cloud resource costs and reduces performance and availability. In addition, considerable engineering work is required to implement connectors for existing computing engines such as Apache Spark, TensorFlow, and PyTorch, which is a challenge for data teams that use a variety of computing engines. Unstructured data exacerbates these engineering challenges, because such systems are usually optimized for traditional structured data types. Worst of all, proprietary metadata services lock customers into a specific vendor; if customers decide to adopt a new service in the future, they face persistently high prices and time-consuming migration costs.

Lakehouse (lake and warehouse in one)

Delta Lake is an open source ACID table storage layer on top of cloud object storage. It is like building a car instead of breeding a faster horse: the Lakehouse is a new architecture that combines the advantages of the data lake and the data warehouse, offering not only better storage performance but also a fundamental change in how data is stored and used. A new system design makes the Lakehouse possible: data structures and data management features similar to those in a data warehouse are implemented directly on the low-cost storage used for the data lake. This is the storage engine design you would choose today, given inexpensive and reliable storage in the form of object storage.

Delta Lake maintains information about which objects belong to a table in an ACID manner, using a write-ahead log that is itself stored in the cloud object store and compacted into Parquet. This design allows clients to update multiple objects at once and to replace one subset of objects with another in a serializable manner, while still achieving high parallel read/write performance. The log also provides significantly faster metadata operations for large tabular datasets.
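
Concretely, the log lives in a _delta_log/ directory alongside the table's data files. The snippet below sketches what it looks like for the illustrative table used earlier: numbered JSON commit files, periodically compacted into a Parquet checkpoint so clients can load large tables without replaying every commit.

```python
# What the transaction log looks like on storage for the illustrative table above:
#
#   /tmp/delta/events/_delta_log/00000000000000000000.json
#   /tmp/delta/events/_delta_log/00000000000000000001.json
#   ...
#   /tmp/delta/events/_delta_log/00000000000000000010.checkpoint.parquet
#
# Each JSON file is one atomic commit; checkpoints summarize the log so far.
import os

for name in sorted(os.listdir("/tmp/delta/events/_delta_log")):
    print(name)
```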

Delta Lake also offers time travel (data versioning with rollback), automatic optimization of small files, upsert support, caching, and audit logs. Together, these capabilities improve the manageability and performance of data processing on cloud object storage, and ultimately open the door to the Lakehouse architecture, which combines the key capabilities of the data warehouse and the data lake to create a better, simpler data architecture.
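
A hedged example of two of these features, time travel and the audit log, again using the illustrative table path from earlier:

```python
# A sketch of time travel and the audit log; the table path is illustrative
# and a Delta-enabled Spark session is assumed.
from delta.tables import DeltaTable

# Read the table as it was at an earlier version (rollback / reproducibility).
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show(5)

# Inspect the commit history (operation, timestamp, user) recorded in the log.
DeltaTable.forPath(spark, "/tmp/delta/events").history().show(truncate=False)
```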

Today, Delta Lake is used by thousands of Databricks customers and many organizations in the open source community, processing exabytes of structured and unstructured data every day. These use cases cover a wide variety of data sources and applications. The data types stored include Change Data Capture (CDC) logs from enterprise OLTP systems, application logs, time series data, graphs, aggregate tables for reporting, and image or feature data for machine learning. The applications include SQL analytics (the most common), business intelligence, stream processing, data science, machine learning, and graph analysis. Overall, Delta Lake has proven to be a good fit for most data lake applications that would otherwise use structured storage formats such as Parquet or ORC, as well as for many traditional data warehouse workloads.

Across these use cases, we found that customers often use Delta Lake to dramatically simplify their data architecture, running more workloads directly against cloud object storage. Frequently, they replace some or all of the functionality previously provided by message queues (such as Apache Kafka), data lakes, or cloud data warehouses (such as Snowflake and Amazon Redshift) by building a Lakehouse that adds transactional capabilities to the data lake.

In the research paper mentioned above, the authors also cover the following topics:

• Characteristics and challenges of object storage
• Delta Lake's storage format and access protocols
• Current features, advantages, and limitations of Delta Lake
• Core and specialized use cases commonly seen today
• Performance experiments, including TPC-DS performance

In this article, you’ll get a better understanding of Delta Lake and how it enables DBMS-like performance and management capabilities for data in low-cost cloud storage. You’ll also see how Delta Lake’s storage format and access protocol help make it easy to operate, highly available, and able to provide high-bandwidth access to object storage.

Chapter-03: Exploring the Delta Engine

The Delta Engine ties a 100% Apache Spark-compatible vectorized query engine to improved query optimizer and caching capabilities built on Spark 3.0, taking advantage of modern CPU architectures. These features were introduced as part of Databricks Runtime 7.0. Taken together, they significantly improve query performance on the data lake, especially data lakes backed by Delta Lake, making it easier for customers to adopt and scale out the Lakehouse architecture.

Scaling execution performance

One of the biggest hardware trends of the past few years is that CPU clock speeds have plateaued. The exact reasons are beyond the scope of this chapter, but the important point is that we need new ways to process data faster than raw compute improvements alone allow. One of the most effective approaches is to increase the amount of data that can be processed in parallel. However, data processing engines need to be designed specifically to take advantage of this parallelism.

In addition, as the pace of business accelerates, data teams have less time for careful data modeling, and modeling shortcuts made in the name of business agility lead to poor query performance. This is not an ideal state, so we want ways to maximize both agility and performance.

Introducing the Delta Engine for high query performance

The Delta Engine improves the performance of Delta Lake's SQL and DataFrame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and cloud object storage, and a native vectorized execution engine written in C++.

The improved query optimizer extends the functionality already present in Spark 3.0 (cost-based optimization, adaptive query execution, and dynamic runtime filters) with more advanced statistics, delivering up to an 18x performance increase for star schema workloads.
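
These optimizer features also exist in open source Spark 3.0; the sketch below shows the corresponding session flags (a Spark 3.0+ session is assumed, and default values vary by version and platform):

```python
# Enable the Spark 3.0 optimizer features that the Delta Engine builds upon.
spark.conf.set("spark.sql.cbo.enabled", "true")             # cost-based optimizer
spark.conf.set("spark.sql.adaptive.enabled", "true")        # adaptive query execution
spark.conf.set(
    "spark.sql.optimizer.dynamicPartitionPruning.enabled",  # dynamic runtime filters
    "true",
)
```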

The Delta Engine's caching layer automatically chooses which input data to cache for the user and transcodes it into a more CPU-efficient format, to better take advantage of the higher storage speeds of NVMe SSDs. Scan performance can be up to 5x faster for almost all workloads.
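
On Databricks Runtime, this disk cache can be toggled per cluster or per session; a minimal, hedged sketch:

```python
# Enable the Databricks Runtime disk cache, which keeps transcoded copies of
# input data on local NVMe SSDs (only meaningful on Databricks Runtime clusters).
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```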

The biggest innovation in the Delta Engine, however, is its native execution engine, which we call Photon (an engine within the engine) and which addresses the challenges faced by today's data teams. This fully rebuilt Databricks execution engine was designed to get maximum performance out of modern cloud hardware. It brings performance improvements across all workload types, while remaining fully compatible with the open source Spark API.

Introduction to the Delta Engine

Thinking of these three components together makes it easier to understand how Databricks combines improvements across the stack to greatly improve the performance of analytical workloads on the data lake.

We are excited about the value the Delta Engine brings to our customers. It delivers significant savings in time and cost. More importantly, in the Lakehouse pattern it supports data teams in designing simpler, more unified data architectures, and it enables many new advances.

For more information on Delta Engine, check out the Spark + AI Summit 2020 keynote: Delta Engine: High-Performance Query Engine for Delta Lake.

What's next

Now that you’ve learned about Delta Lake, its features, and how to optimize performance, there’s more to this series:

• Delta Lake Technology Series – Fundamentals and Performance
• Delta Lake Technology Series – Features
• Delta Lake Technology Series – Streaming
• Delta Lake Technology Series – Customer Use Cases

The original link

This article is original content from Alibaba Cloud (Aliyun) and may not be reproduced without permission.