Abstract: What is a data lake, and what does it do? In this article, Huawei Cloud technical experts unpack these questions layer by layer, from both a theoretical and a technical perspective.

What is a data lake

A data lake is a large repository that stores a wide variety of an enterprise's raw data, where that data can be accessed, processed, analyzed, and transmitted.

A data lake ingests raw data from multiple sources in the enterprise, and it may hold several copies of the same raw data, each shaped to a particular internal model format for a different purpose. The data processed in a data lake can therefore be any type of information, from structured data to completely unstructured data.

Companies have high hopes for data lakes: they expect them to give users quick access to useful information that can feed data analysis and machine learning algorithms and yield insight into how the business is running.

The relationship between data lakes and enterprises

A data lake can bring many capabilities to an enterprise. Centralized data management is one example, and it is a foundation on which the enterprise can develop capabilities that were not available before.

In addition, combined with advanced data science and machine learning technologies, a data lake can help enterprises build better-optimized operating models and provide further capabilities, such as predictive analytics and recommendation models, that drive the subsequent growth of the enterprise.

Many capabilities are hidden in enterprise data. However, until the important data reaches people who have insight into the business, it cannot be harnessed to improve business performance.

How can data lakes help businesses

For a long time, enterprises have been trying to find a unified model to represent all entities in the enterprise. This task is extremely challenging for a number of reasons, some of which are listed below:

1. An entity may have multiple representations in an enterprise, so there may not be a complete model to represent the entity uniformly.

2. Different enterprise applications may process entities based on specific business goals, which means that certain enterprise processes are adopted or excluded when working with entities.

3. Different applications may have different access patterns and storage structures for each entity.

These problems have plagued enterprises for years and have hindered standardization of business processes, service definitions, and nomenclature.

A data lake approaches the problem differently. With a data lake, a better unified data model emerges implicitly, with no material impact on the business applications, which remain the “experts” at solving their specific business problems. The data lake makes its representation of each entity as rich (“plump”) as possible, based on the full set of data captured from all the systems associated with the entity's owner.

Because its representation of entities is better and more complete, the data lake is a great help to enterprise data processing and management. It gives enterprises more insight into their growth and helps them achieve their business goals.
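As a rough illustration of how one entity can be represented more fully by combining the views held by several systems, here is a minimal Python sketch. The system names (`crm`, `billing`, `support`), the `customer_id` key, and all field names are assumptions made up for this example, not part of any particular product.

```python
from collections import defaultdict

# Hypothetical records for the same customer as seen by three source systems.
crm_record = {"customer_id": "C001", "name": "Alice Chen", "email": "alice@example.com"}
billing_record = {"customer_id": "C001", "billing_address": "1 Main St", "balance": 120.5}
support_record = {"customer_id": "C001", "open_tickets": 2}


def merge_entity_views(records_by_system, key="customer_id"):
    """Combine per-system views into one record per entity.

    Each attribute keeps the name of the system it came from, so no source
    application is overridden or forced into a single shared schema.
    """
    merged = defaultdict(dict)
    for system, records in records_by_system.items():
        for record in records:
            entity_id = record[key]
            for field_name, value in record.items():
                if field_name == key:
                    continue
                merged[entity_id].setdefault(field_name, {})[system] = value
    return dict(merged)


unified = merge_entity_views({
    "crm": [crm_record],
    "billing": [billing_record],
    "support": [support_record],
})

# The unified view of C001 now carries attributes from all three systems.
print(sorted(unified["C001"]))
```

Keeping the originating system next to each value is one possible design choice: the source applications stay untouched, while the lake accumulates the fuller picture.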

Advantages of data lakes

Enterprises generate huge amounts of data across their multiple business systems, and as they grow in size, they need to become smarter about how they process data across multiple systems.

One of the most basic strategies is to use a single domain model that accurately describes the data and represents the data that is most valuable to the overall business. This data is the enterprise data mentioned earlier.

An enterprise with well-defined enterprise data will also have some way of managing it, so that changes to the definition of enterprise data stay consistent and it is clear within the enterprise how the information is shared between systems.

In this case, systems are divided into data owners and data consumers. Each piece of enterprise data needs a corresponding owner, which defines how the other systems, acting as consumers, obtain that data.

Once an enterprise has a clear definition of data and systems, it can leverage a large amount of enterprise information through this mechanism. A common implementation strategy for this mechanism is to provide a unified enterprise data model by building an enterprise-level data lake, which is responsible for capturing, processing, analyzing, and serving data to consumer systems.

Data lakes can help businesses in the following ways:

1. Implement data governance and data lineage.

2. Implement business intelligence through the application of machine learning and artificial intelligence technologies.

3. Enable predictive analytics, such as domain-specific recommendation engines.

4. Provide information tracking and consistency assurance.

5. Generate new data dimensions based on historical analysis.

6. Maintain a centralized data center that stores all enterprise data, which makes it easier to build a data service optimized for data transmission.

7. Help organizations or businesses make more flexible decisions about business growth.

This section has discussed what capabilities a data lake should have. The next section looks at how a data lake works.

How does the data lake work

To understand exactly what benefits a data lake can bring to an enterprise, it is important to understand how a data lake works and what components are required to build a fully functional data lake. Before diving into the details of the data lake architecture, it’s useful to understand the data life cycle in the context of the data lake.

At a high level, the data life cycle of a data lake is shown in the figure below.

The stages of this life cycle can also be seen as the different phases that data passes through in the data lake. Each phase requires different data and different analytical methods. Processing and analysis can run in either batch or near-real-time mode.

A data lake implementation needs to support both modes, because each serves different scenarios. The choice of mode (batch or near-real-time) also depends on how much computation a processing or analysis task requires: many complex calculations cannot be performed in near-real-time mode, while in other cases a longer processing cycle is not acceptable.
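The difference between the two processing modes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `events` list, the `batch_totals` function, and the `StreamingTotals` class are hypothetical names, and a real data lake would run such logic on dedicated batch and stream processing engines.

```python
from collections import defaultdict

# Hypothetical order events: (customer_id, amount).
events = [("C001", 20.0), ("C002", 5.0), ("C001", 7.5)]


def batch_totals(all_events):
    """Batch mode: recompute totals from the full data set in one pass."""
    totals = defaultdict(float)
    for customer_id, amount in all_events:
        totals[customer_id] += amount
    return dict(totals)


class StreamingTotals:
    """Near-real-time mode: update a running total as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, customer_id, amount):
        # Only a small, cheap update happens per event.
        self.totals[customer_id] += amount
        return self.totals[customer_id]


stream = StreamingTotals()
for customer_id, amount in events:
    stream.on_event(customer_id, amount)

# For a simple aggregate like a sum, both modes agree on the result.
print(batch_totals(events) == dict(stream.totals))
```

A sum is easy to maintain incrementally. Heavier computations, such as retraining a model over the full history, generally fit batch mode better, which is the trade-off described above.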

The choice of storage system also depends on data access requirements. For example, if you want to store data for easy access through SQL queries, you must select a storage system that supports SQL interfaces.

If data access requires data views, the data can be stored in a corresponding form. That is, the data can be exposed externally as views, which makes it easier to manage and to access.
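As a small illustration of SQL access and views, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite stands in here for whatever SQL-capable storage system an enterprise might choose, and the `orders` table and `customer_totals` view are hypothetical names invented for the example.

```python
import sqlite3

# An in-memory database stands in for a SQL-capable storage layer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [("C001", 20.0), ("C002", 5.0), ("C001", 7.5)],
)

# A view exposes a derived shape of the data without copying it.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id"
)

# Consumers query the view like any other table.
rows = conn.execute(
    "SELECT customer_id, total FROM customer_totals ORDER BY customer_id"
).fetchall()
print(rows)
conn.close()
```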

A recent and increasingly important trend is to provide data through services, which involves exposing data on a lightweight service layer. Each exposed service must accurately describe service functionality and provide data externally. This pattern also supports service-based data integration so that other systems can consume data provided by data services.
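The idea of a self-describing data service can be sketched as follows. This is a deliberately simplified, in-process illustration: the `DataService` class, its `describe` and `fetch` methods, and the sample data are all assumptions made for the example, and a real service layer would normally sit behind a network API.

```python
import json


class DataService:
    """A hypothetical lightweight service exposing one data set."""

    def __init__(self, name, description, fields, rows):
        self.name = name
        self.description = description
        self.fields = fields
        self._rows = rows

    def describe(self):
        """Return a description of what the service offers."""
        return {
            "name": self.name,
            "description": self.description,
            "fields": self.fields,
        }

    def fetch(self, **filters):
        """Return the rows whose fields match all of the given filters."""
        return [
            row for row in self._rows
            if all(row.get(key) == value for key, value in filters.items())
        ]


service = DataService(
    name="customer-orders",
    description="Orders per customer, served from the data lake",
    fields=["customer_id", "amount"],
    rows=[
        {"customer_id": "C001", "amount": 20.0},
        {"customer_id": "C002", "amount": 5.0},
        {"customer_id": "C001", "amount": 7.5},
    ],
)

# A consuming system first reads the description, then requests data.
print(json.dumps(service.describe()))
result = service.fetch(customer_id="C001")
print(result)
```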

As data flows from the collection point into the data lake, its metadata is captured and then managed throughout its life cycle for data traceability, data lineage, and data security, with security applied according to the sensitivity of the data.

A data lineage is defined as the life cycle of data, including its origin and how it moves over time. It describes how the data changes during various processes, helps provide visibility into the pipeline of data analysis, and simplifies error tracing. Traceability is the ability to verify the history, location, or application of data items by identifying records. — Wikipedia
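To make the idea of lineage concrete, here is a minimal Python sketch in which each transformation appends a step to the data set's recorded history. The `TrackedDataset` and `LineageStep` classes, the operation names, and the sample rows are hypothetical and only illustrate the concept. Real data lakes typically rely on dedicated metadata and catalog tooling for this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    operation: str
    recorded_at: str


@dataclass
class TrackedDataset:
    name: str
    source: str  # where the data originally came from
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, operation, transform):
        """Return a new data set with the transform applied and logged."""
        step = LineageStep(operation, datetime.now(timezone.utc).isoformat())
        return TrackedDataset(
            self.name, self.source, transform(self.rows), self.lineage + [step]
        )


raw = TrackedDataset(
    "orders", "billing-system",
    [{"amount": 20.0}, {"amount": -1.0}, {"amount": 7.5}],
)
cleaned = raw.apply(
    "drop_negative_amounts",
    lambda rows: [r for r in rows if r["amount"] >= 0],
)
enriched = cleaned.apply(
    "add_currency",
    lambda rows: [{**r, "currency": "USD"} for r in rows],
)

# The origin and every processing step can be read back from the result.
print(enriched.source, [step.operation for step in enriched.lineage])
```

Because each step returns a new object instead of modifying the old one, the raw data set keeps an empty history, which makes it easier to trace an error back to the step that introduced it.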

Differences between a data lake and a data warehouse

A data lake is often considered the equivalent of a data warehouse. In fact, data lakes and data warehouses represent different goals that enterprises want to achieve.

The key differences are shown in the table below.


From the table, the difference between a data lake and a data warehouse is clear. However, the two play complementary roles in the enterprise, and a data lake should not be seen as a replacement for the data warehouse. After all, their roles are very different.

Data lake construction method

Different organizations have different preferences, so they build data lakes differently. The construction method depends on the organization's business, processes, and existing systems.

A simple data lake implementation is almost equivalent to defining a central data source that all systems can use to meet all their data needs. While this approach may be simple and cost-effective, it may not be a very practical approach for the following reasons:

1. This approach only works if the organization rebuilds its information systems from scratch.

2. This approach does not solve problems associated with existing systems.

3. Even if an organization decides to build a data lake in this way, the result lacks clear responsibilities and separation of concerns.

4. Such systems often try to do everything at once, but eventually fall apart as the demands of data transactions, analysis, and processing increase.

A better strategy for building a data lake is to treat the enterprise and its information systems as a whole, classify data ownership, and define a unified enterprise model.

This approach may present process-related challenges and may require more effort to define the system elements. Even so, it provides flexibility, control, clear data definitions, and separation of concerns between the different system entities in the enterprise.

Such a data lake can also have independent mechanisms to capture, process, and analyze data, and to provide data services to consumer applications.
