Since the concept of the data lake was introduced in 2011, its positioning, architecture design, and related technologies have developed rapidly, and many practical implementations have been carried out. The data lake has also evolved from a single data storage pool into a next-generation foundational data platform that covers ETL, analysis, data conversion, and data processing.

A data lake is a large repository that holds a wide variety of raw enterprise data and makes it available to be accessed, processed, analyzed, and transferred. A data lake is fundamentally a storage architecture, so it is usually built on classic object storage; for example, Tencent Cloud Object Storage (COS) can serve as the foundation of a data lake.

Data lakes ingest raw data from multiple sources across the enterprise, and the same raw data may exist in several copies, each conforming to a particular internal model for a different purpose. The data processed in a data lake can therefore be any type of information, from fully structured data to completely unstructured data.

It is therefore very important for enterprises to build data pipelines that stably and reliably move data from their various sources into the data lake.

This article describes in detail the ingestion pipeline of the COS data lake combined with a Serverless architecture.

01. Analyzing the data link of a data lake

To better understand how to build a data lake, we can first look at the data life cycle in the context of a data lake.

The stages of this life cycle can also be seen as the different phases data goes through in the lake, and each phase requires different data and analytical methods. Broadly, there are two ways of processing data: batch and real-time. For example, if you want to store data and access it with SQL queries, the upstream system must expose a SQL interface; if you want to pull data directly from Kafka, the downstream consumer needs a Kafka Consumer to fetch it.
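The Kafka pull path mentioned above amounts to consuming a message stream and grouping it into batches before writing to storage. Below is a minimal sketch of the batching step; the consumer wiring (e.g. `KafkaConsumer` from the kafka-python library) is only indicated in a comment, and all names are illustrative, not part of the solution described here.

```python
from itertools import islice

def batch_records(messages, size):
    """Group a (possibly unbounded) message stream into fixed-size batches.

    In the real-time path `messages` would be a consumer iterator, e.g.
    KafkaConsumer("topic", bootstrap_servers="...") from kafka-python
    (illustrative wiring); here any iterable works.
    """
    it = iter(messages)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each yielded batch can then be serialized and written to the lake as one object, which keeps the number of small files down.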

The traditional data lake architecture is divided into two parts: ingestion (into the lake) and consumption (out of the lake). In the link above, data storage is the axis: data acquisition and data processing belong to ingestion, while data analysis and data delivery belong to consumption.

  • The ingestion part is the data-source entrance of the whole data lake architecture. Because a data lake emphasizes convenience and extensibility, it must accept many kinds of data: tables from databases (relational or non-relational), files in various formats (CSV, JSON, documents, etc.), data streams, data converted by ETL tools (Kafka, Logstash, DataX, etc.), and data collected through application APIs (such as logs);
  • The consumption part covers the data access and data search side of the data lake, and is closer to the data lake's applications. The scenarios are wide-ranging: various external computing engines can provide rich computing modes, such as SQL-based interactive batch processing, Spark-based computing via EMR, and the stream computing and machine learning capabilities that Spark provides.

In conclusion, within the overall data lake link, the part that requires the most customization and carries the highest usage and maintenance cost is data ingestion (that is, data acquisition and the processing performed before data enters the lake). This data link is often the core of a data lake implementation, so whether there is a better way to build it is the key question for the whole data lake.

02. COS + Serverless architecture data lake solution

The overall capability points and schemes of the Serverless-based lake construction are shown in the figure below. The related solutions cover three capability points: data ingestion, data consumption, and data processing, and provide further extensibility for ingestion and consumption through Serverless encapsulation.

Taking the data ingestion scheme as the entry point, the following sections introduce the COS data lake solution based on the Serverless architecture in detail.

03. Technical architecture of COS + Serverless data ingestion

The ingestion scheme under the Serverless architecture is in fact a batch scheme: a cloud-native function trigger, or a Cron/API Gateway trigger, invokes the function to pull data. The function captures and records the batch data information, and closes the loop on structural transformation, data format conversion, data compression, and related capabilities inside the function itself.

The function then calls the COS PUT Object interface to upload the pulled data. The related architecture and processing flow are shown in the following figure:
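The per-invocation flow described above (transform, serialize, compress, upload) can be sketched as below. This is a minimal sketch under stated assumptions: the field names and the object-key layout are hypothetical, and the upload callable is an injected stand-in for the COS upload call (the real Tencent COS Python SDK exposes `CosS3Client.put_object(Bucket=..., Key=..., Body=...)`, indicated in the docstring).

```python
import gzip
import json
import time

def transform(record):
    # Illustrative structural transformation: map the raw record onto
    # the fields a hypothetical downstream lake schema expects.
    return {"ts": record.get("timestamp"), "payload": record.get("data")}

def ingest_batch(records, upload):
    """Transform, serialize, and compress a batch, then upload it as one object.

    `upload(key, body)` is an injected stand-in for the COS upload, e.g.
    CosS3Client.put_object(Bucket="...", Key=key, Body=body) from the
    qcloud_cos SDK. Returns the object key that was written.
    """
    # One JSON document per line (JSON Lines), then gzip the whole batch.
    lines = "\n".join(json.dumps(transform(r)) for r in records)
    body = gzip.compress(lines.encode("utf-8"))
    # Hypothetical key layout: partition raw objects by ingestion time.
    key = f"lake/raw/{int(time.time())}.jsonl.gz"
    upload(key, body)
    return key
```

Injecting the uploader keeps the transformation logic testable locally, independent of cloud credentials.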

04. Advantages of COS + Serverless data ingestion

1. Easy to use

Relying on Serverless computing, data ingestion can be created with one click, and all ingestion logic can be configured through a visual interface.

2. Efficient

Each ingestion module runs, deploys, and scales independently, which makes the management of ingestion logic more efficient.

3. Stable and reliable

If an availability zone fails, the cloud function module automatically runs on the infrastructure of another availability zone, avoiding the risk of a single-zone failure. Event-triggered workloads can be implemented with cloud functions that integrate different cloud services to meet different business scenarios and requirements, making the data lake architecture more robust.

4. Keep your expenses low

A function incurs no cost when it is not executing, so for business processes that do not need a resident service, the overhead drops significantly. Functions are billed by the number of requests and the running time of the computing resources, which is clearly cheaper than deploying a self-built ingestion cluster.

5. Cloud native

Serverless provides a more cloud-native ingestion solution. All resources are deployed and used on the cloud, which is more convenient and efficient.

6. Customizable

Users can use templates to quickly create common ingestion scenarios, or customize the ETL processing of their data streams according to their own business, which is more convenient and flexible.
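The custom ETL mentioned above is typically a chain of small per-record steps, some of which may drop a record entirely. A minimal sketch of such a composable pipeline is below; the step names and the `"tdmq"` source tag are purely illustrative, not part of the product's templates.

```python
def pipeline(*steps):
    """Compose per-record ETL steps; a step returning None drops the record."""
    def run(record):
        for step in steps:
            record = step(record)
            if record is None:
                return None
        return record
    return run

# Illustrative custom steps: filter out empty payloads, tag the source.
def drop_empty(record):
    return record if record.get("data") else None

def add_source(record):
    return {**record, "source": "tdmq"}

etl = pipeline(drop_empty, add_source)
```

Composing small pure functions this way keeps each step independently testable and lets a template supply common steps while the user appends business-specific ones.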

05. Using COS + Serverless data ingestion

The current scheme has been integrated into the Serverless ingestion section of the COS application integration console. You can go directly to console.cloud.tencent.com/cos5/applic… to configure the related capabilities.

Take TDMQ message backup as an example. Click Configure Backup Rule > Add Function to enter the configuration page:

After configuration, you can directly manage related function contents on the console:

06. Summary of the Serverless data lake scheme

Overall, the COS data lake scheme based on the Serverless architecture offers higher usability and lower cost. Compared with self-built clusters, it is also easier to manage: the data flow is simple and unified, service governance is straightforward, and monitoring and querying are easy.

In the future, the Serverless and COS teams will continue to refine the batch architecture and explore more possibilities for a real-time framework. Stay tuned. Click here to experience COS + Serverless data ingestion immediately.

One More Thing

Experience the Tencent Cloud Serverless demo immediately and receive a Serverless new-user package 👉 Tencent Cloud Serverless beginner experience

Welcome to: Serverless Chinese!