The challenges of finding and using big data

An enterprise usually has two types of data: operational data and analytical data. The former is generated and used during business operations to keep the business running; the latter is derived from those operations and used to support business decisions. Operational data is the source of analytical data.

Figure 1: From operational data to analytical data

As digital technology becomes deeply integrated with business scenarios, people and things are widely connected over high-speed networks, information is exchanged ever faster, and the scale and complexity of data reach unprecedented levels. At this point, enterprises face two prominent problems:

  1. Organizations often know which part of the business generates data, but cannot find that data when they need it most. Data assets that are not well organized and managed become a “data swamp”, turning once valuable assets into a burden for the enterprise.
  2. Data technology is still evolving and iterating rapidly. Without forward-looking design and systematic thinking, technical limitations lead to data fragmentation across multiple big data and AI engines. Business users have to copy data back and forth between engines before it can be analyzed, resulting in duplicated storage and processing that not only increases cost but also greatly reduces performance.

At Huawei, as IT and big data have moved fully to the cloud, the data Huawei manages on the cloud has reached a scale, computational complexity, and business criticality rarely seen elsewhere. Together with our customers, we have explored how to manage the hardest-to-integrate data assets so that they can be “seen clearly” and “found quickly”, and how to let data flow freely between multiple analytics and compute engines to enable converged analysis of data and AI. Based on this project practice, this article introduces the fused metadata solution for data intelligence.

Huawei’s metadata fusion solution

Metadata stores important information about data (e.g., table name, field names, timestamp, version, table size, format, access control lists) and the relationships between data (i.e., the lineage of data flows). It provides centralized data management across multiple clouds, business domains, and systems, so that data can be found, understood, and analyzed quickly.
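
For illustration, the kind of record such a metadata service might keep for one table can be sketched as a simple data structure. The field names and values below are hypothetical, not Huawei’s actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TableMetadata:
    """One catalog entry: descriptive metadata plus lineage pointers."""
    table_name: str
    columns: Dict[str, str]          # column name -> data type
    format: str                      # e.g. "parquet", "orc"
    size_bytes: int
    version: int
    updated_at: str                  # ISO-8601 timestamp
    acl: List[str] = field(default_factory=list)       # principals allowed to read
    upstream: List[str] = field(default_factory=list)  # tables this one is derived from

# Example entry for a derived sales summary table (illustrative values only)
orders_summary = TableMetadata(
    table_name="dw.orders_summary",
    columns={"day": "date", "region": "string", "revenue": "decimal(18,2)"},
    format="parquet",
    size_bytes=12_884_901_888,
    version=42,
    updated_at="2023-04-01T02:00:00Z",
    acl=["analyst_group"],
    upstream=["ods.orders", "ods.customers"],
)
```

Keeping both the descriptive fields and the upstream pointers in one record is what lets a single service answer “what is this table?” and “where did it come from?” at the same time.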

Figure 2: Metadata fusion solution of a multinational Internet enterprise

As shown in the figure above, the fused metadata solution of a multinational Internet enterprise achieves “five unifications” across big data, data warehouse, machine learning, and other scenarios:

Unified catalog: Establish a unified, complete inventory of data assets so that enterprises can grasp their data assets from a global perspective. As shown in Figure 2, a unified Metastore Service uses a unified data view to connect big data and AI engines with data analysis teams and administrators, so that big data in the production system can be seen in real time and what you see is what you get. It also keeps metadata from heterogeneous data sources synchronized in a timely way through collection jobs and engine hooks.
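
As a rough illustration of the hook mechanism mentioned above, an engine-side listener could push metadata changes to the central catalog roughly as follows. The endpoint, payload format, and event shape are assumptions for this sketch, not Huawei’s actual API:

```python
import json
import urllib.request

CATALOG_ENDPOINT = "https://metastore.example.com/api/v1/tables"  # hypothetical endpoint

def on_table_created(event: dict) -> None:
    """Engine-side hook: push a table-creation event to the unified catalog.

    `event` is whatever the compute engine hands to its DDL listener,
    e.g. {"name": "dw.orders_summary", "columns": {...}, "format": "parquet"}.
    """
    payload = json.dumps({
        "table_name": event["name"],
        "columns": event["columns"],
        "format": event.get("format", "unknown"),
        "source_engine": event.get("engine", "spark"),
    }).encode("utf-8")

    req = urllib.request.Request(
        CATALOG_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # synchronous for simplicity
        resp.read()
```

Because the hook fires inside the engine at DDL time, the catalog reflects new tables as they appear in production rather than waiting for a periodic scan.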

Unified permissions: Establish unified permission management so that the right people operate the right data assets. As shown in Figure 2, the Metadata Admin provides fine-grained permission management, not only at the table level but also at the column and row level, and not only for data but also for AI models. The permission system is integrated with the cloud’s IAM account and authentication systems, so a single authorization covers all usage scenarios and the administrator’s work is simplified.
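
To make “column and row level” concrete, a policy in such a system could be expressed roughly as below. The policy format and the check function are illustrative only, not the actual Metadata Admin interface:

```python
# Illustrative row/column-level policy; not Huawei's actual policy format.
policy = {
    "principal": "analyst_group",
    "table": "dw.orders_summary",
    "allowed_columns": ["day", "region", "revenue"],   # column-level grant
    "row_filter": "region = 'EU'",                     # row-level restriction
    "actions": ["SELECT"],
}

def is_allowed(principal: str, table: str, column: str, action: str) -> bool:
    """Check one (principal, table, column, action) request against the policy."""
    return (
        principal == policy["principal"]
        and table == policy["table"]
        and column in policy["allowed_columns"]
        and action in policy["actions"]
    )

print(is_allowed("analyst_group", "dw.orders_summary", "revenue", "SELECT"))      # True
print(is_allowed("analyst_group", "dw.orders_summary", "customer_id", "SELECT"))  # False
```

Tying the `principal` field to cloud IAM identities is what allows one grant to apply consistently across every engine that consults the same policy store.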

Figure 3: Unified permission management

Unified index: Create unified metadata indexes and data indexes. Metadata indexes let metadata performance scale linearly and support low-latency, highly concurrent access to tables with very large numbers of partitions. Data indexes pinpoint the data needed during analysis, reducing I/O and improving performance. By analyzing users’ day-to-day data usage with the data brain, indexes and materialized views suited to each application scenario are recommended automatically; users then choose which ones to create, and they are refreshed incrementally, further improving the hit rate of every data access.
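
The intuition behind the data index is pruning: consult statistics kept in metadata before touching any data files. A minimal sketch, assuming per-file min/max statistics (the statistics layout is an assumption of this example):

```python
# Assumed per-file statistics kept by the metadata service: min/max per column.
file_stats = [
    {"path": "part-000.parquet", "min": {"day": "2023-01-01"}, "max": {"day": "2023-01-31"}},
    {"path": "part-001.parquet", "min": {"day": "2023-02-01"}, "max": {"day": "2023-02-28"}},
    {"path": "part-002.parquet", "min": {"day": "2023-03-01"}, "max": {"day": "2023-03-31"}},
]

def prune(stats, column, lo, hi):
    """Return only the files whose [min, max] range overlaps the query range."""
    return [
        s["path"]
        for s in stats
        if not (s["max"][column] < lo or s["min"][column] > hi)
    ]

# A query for February only needs to scan one of the three files.
print(prune(file_stats, "day", "2023-02-10", "2023-02-20"))  # ['part-001.parquet']
```

The less data a query has to scan, the lower the I/O; recommending and incrementally refreshing such indexes and materialized views is what keeps the hit rate high as workloads change.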

Figure 4: Unified index

Unified transactions: Establish an ACID (atomicity, consistency, isolation, durability) transaction mechanism across big data, data warehouse, and machine learning workloads, so that warehouse developers, analysts, data scientists, and other users can work together in a reliable concurrent system. It gives users multi-version and multi-branch management: at any time they can use a historical version to reproduce data or a model, or roll a version back to repair a data problem. Meanwhile, thanks to fine-grained metadata management, multiple versions can share a single copy of the underlying storage without storage bloat, and users can control the overall storage cost by setting the version retention period.
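
A minimal way to picture the multi-version mechanism: each commit records a snapshot of which files make up the table, and a rollback simply re-commits an earlier snapshot, while unchanged files are shared between versions. The class below is a toy model, not the actual implementation:

```python
class VersionedTable:
    """Toy model of snapshot-based versioning: commits share unchanged files."""

    def __init__(self):
        self.snapshots = []          # snapshot i = list of data file paths at version i

    def commit(self, files):
        self.snapshots.append(list(files))
        return len(self.snapshots) - 1   # new version id

    def current(self):
        return self.snapshots[-1]

    def rollback(self, version):
        """Repair a bad write by re-committing an earlier snapshot."""
        return self.commit(self.snapshots[version])

table = VersionedTable()
v0 = table.commit(["a.parquet", "b.parquet"])
v1 = table.commit(["a.parquet", "b.parquet", "bad.parquet"])  # faulty load
table.rollback(v0)
print(table.current())  # ['a.parquet', 'b.parquet'] -- bad.parquet no longer visible
```

Because snapshots only reference files rather than copy them, keeping several versions costs little extra storage, and the retention period bounds how far back the references are kept.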

Figure 5: Data & model with multiple versions and branches

Unified access records: Establish lineage management for data and AI model pipelines, mapping the flow relationships between tables and between tables and models. As shown in the “Lineage”, “Access”, and “Computational Cost” parts of Figure 2, real-time instrumentation in the compute engines collects each team’s accesses to data and models, so the whole processing chain can be traced, reproduced, and compared. In a typical data pipeline, the lifecycle cost of each table and model (that is, how much compute and storage it consumes) is clearly visible to the business consumer, who can retire ineffective tasks based on the input-output ratio. For example, if a real-time report consumes a large amount of analysis and storage resources but producing the report a day later has no business impact, the Flink real-time pipeline can be switched to a Spark offline pipeline. By fully recording this information and combining it with business knowledge, fused metadata gives the enterprise a clear usage ledger and optimization plan.
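
As a sketch of how such access records become lineage, each logged job can contribute edges from the assets it reads to the assets it writes. The record format and names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical access records collected from the compute engines.
access_log = [
    {"job": "etl_orders",   "reads": ["ods.orders"],        "writes": ["dw.orders_summary"]},
    {"job": "train_model",  "reads": ["dw.orders_summary"], "writes": ["model.churn_v3"]},
    {"job": "daily_report", "reads": ["dw.orders_summary"], "writes": ["rpt.daily_sales"]},
]

def build_lineage(log):
    """Build a downstream map: asset -> set of assets derived from it."""
    downstream = defaultdict(set)
    for rec in log:
        for src in rec["reads"]:
            downstream[src].update(rec["writes"])
    return downstream

lineage = build_lineage(access_log)
print(sorted(lineage["ods.orders"]))         # ['dw.orders_summary']
print(sorted(lineage["dw.orders_summary"]))  # ['model.churn_v3', 'rpt.daily_sales']
```

Attaching per-job compute and storage cost to the same records is what turns this graph into the usage ledger described above: the business consumer can walk downstream from any asset and see what it costs and whether anything still depends on it.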

Figure 6: Typical data links

Fused metadata is essentially the guidance and control of how data is used; it is a process of systematic thinking rather than a single activity. Good metadata management therefore requires combining business experience with technical development.

At present, Huawei Cloud is also exploring, based on its own needs and those of its customers, how to balance performance and cost, lower the barrier to use, and gain insight into the unknown. We want to break down the “data walls” between storage and compute and between engines, letting one copy of data serve the whole pipeline and eliminating the performance and consistency problems caused by “moving data”. We want to manage data and models like code, so that data and AI development is efficient and collaboration is seamless, and, with the support of AI algorithms, the value of data can be released without limit. We want to enable intelligent, automated data governance, reduce the cost of data development, and let every system “talk” to the others to eliminate “data islands”.

Metadata is the foundation for solving these problems. It provides a unified view and catalog of enterprise data, delivers data services to data applications, data engineers, data scientists, and business operations, and, in business scenarios with massive volumes of data, offers enterprise partners a clear map on the endless road of data governance.
