What has been the hottest technology direction in the IT industry in recent years? It has to be ABC: AI + Big Data + Cloud, that is, artificial intelligence, big data, and cloud computing.

In recent years, as the Internet boom has cooled and traditional enterprises have pursued digital transformation one after another, almost every company is thinking about how to further mine the value of its data and improve operational efficiency. Against this backdrop, big data technology is becoming more and more important. In the era of AI, if you don't know big data, you will find yourself out of date before you know it!

Compared with AI and cloud computing, big data has lower technical barriers and is more relevant to the business. My personal feeling is that in a few years, big data technology will become as basic a skill requirement as distributed technology is today.

A few days ago, I gave a technical talk on big data to my team, focused on foundational knowledge and a study guide. This article is a systematic write-up of that talk, and I hope it will be helpful to readers interested in big data. The content is divided into the following five parts:

1. The history of big data

2. Core concepts of big data

3. General architecture and technical system of big data platform

4. General processing process of big data

5. Data warehouse architecture under big data

01 The history of big data

Before explaining the concept of “big data”, let’s look at the development of big data over the past 30 years or so, which has gone through four stages. What was the historical position of big data at each stage, and what were the pain points?

1.1 Enlightenment Stage: the emergence of data warehouse

In the 1990s, business intelligence (better known as BI) was born, turning existing business data into knowledge that helps business leaders make decisions. For example, in a retail scenario, sales and inventory data must be analyzed in order to make a reasonable purchasing plan.

Obviously, business intelligence is inseparable from data analysis. It needs to aggregate data from multiple business systems (such as the trading system and the warehouse system) and then run range queries over large volumes of data. Traditional databases, however, are designed for the simple create/read/update/delete operations of a single business and cannot meet this demand, which prompted the emergence of the data warehouse concept.

The traditional data warehouse, for the first time, defined data analysis as a distinct application scenario and implemented it with a separate solution, independent of the business databases.

1.2 Technological change: Hadoop was born

Around 2000, the advent of the PC Internet era brought massive amounts of information, with two typical characteristics:

  • Data scale: Internet giants such as Google and Yahoo generate hundreds of millions of user-behavior records every day.

  • Diversified data types: in addition to structured business data, there is massive user-behavior data as well as multimedia data such as images and videos.

It is clear that traditional data warehouses could not support business intelligence in the Internet age. Between 2003 and 2006, Google published three seminal papers (often called Google's “three carriages”): the distributed file system GFS, the distributed processing framework MapReduce, and the wide-column store BigTable. These three papers laid the theoretical foundation of modern big data technology.

Google did not open-source these three systems, only their detailed design papers. Around 2005-2006, with Yahoo's sponsorship, Hadoop emerged as an open-source implementation based on these papers, officially kicking off the era of big data.

Hadoop has the following advantages over traditional data warehouses:

  • It is fully distributed and can be deployed as a cluster of cheap commodity machines, which easily meets the storage requirements of massive data.
  • It weakens data-format constraints and decouples the data model from data storage, meeting the need to analyze heterogeneous data.

As Hadoop technology matured, the concept of the “data lake” was proposed at the Hadoop World conference in 2010.

A data lake is a system that stores data in raw format.

An enterprise can build a data lake based on Hadoop and treat data as a core asset. The data lake concept thus marked the beginning of Hadoop's commercialization.

1.3 Data Factory Era: The rise of big data platforms

A commercial Hadoop stack involves more than a dozen technologies, and the data development process is very complex. Fulfilling a single data requirement involves data extraction, data storage, data processing, data warehouse construction, multidimensional analysis, data visualization, and so on. Such a high technical threshold obviously restricts the adoption of big data technology.

At this point, the big data platform (as PaaS) emerged as the times required. It is a full-link solution for data development scenarios that greatly improves the efficiency of data R&D, letting data be processed as if on an assembly line: raw data is turned into metrics that appear in various reports or data products.

1.4 Data value era: Alibaba proposed the data middle platform

Around 2016, the mobile Internet era was in full swing. With the popularity of big data platforms, many big data application scenarios emerged.

New problems then appeared: to meet business requirements quickly, a siloed (“chimney”) development model left the data of different business lines completely fragmented. This led to a large number of data metrics being developed repeatedly, which not only lowered R&D efficiency but also wasted storage and computing resources, making big data applications more and more expensive.

At this point, the far-sighted Jack Ma put forward the concept of the “data middle platform”, and the slogan “One Data, One Service” began to resound across the big data field. The core idea of the data middle platform is to avoid redundant computation of data, improve data sharing, and empower the business through data as a service.

02 Core concepts of big data

After understanding the development history of big data, let’s explain some core concepts of big data.

2.1 What is big data?

Big data refers to massive, fast-growing, and diversified information assets that require new storage and computing models in order to provide stronger decision-making power and process-optimization capabilities.

Here are four typical characteristics of big data:

  • Volume: massive data volume, reaching the PB or even EB level.
  • Variety: Heterogeneous data types, including not only structured data, but also semi-structured and unstructured data such as log files, images, audio and video, etc.
  • Velocity: Fast data flow, fast data generation and processing.
  • Value: The Value density is low and the proportion of valuable data is small. Artificial intelligence and other methods are needed to mine new knowledge.

2.2 What is a data warehouse?

A data warehouse is a subject-oriented, integrated, time-variant, and relatively stable (non-volatile) collection of data.

Put simply, a data warehouse is a way of organizing big data that makes massive data easier to maintain and further analyze.

  • Subject-oriented: data is organized by subject or business scenario.
  • Integrated: data from multiple heterogeneous data sources is extracted, processed, and integrated.
  • Time-variant: key data is tagged with time attributes.
  • Relatively stable: data is rarely deleted or modified; it is almost always appended.

2.3 Traditional data Warehouse vs new generation data warehouse

With the advent of the big data era, there are naturally many differences between the traditional data warehouse and the new generation of data warehouses. The two generations can be compared, in their similarities and differences, across multiple dimensions.

03 General architecture of big data platform

Dozens of big-data-related technologies were mentioned above. The following general architecture of a big data platform helps make sense of the whole technical system.

3.1 Data Transfer Layer

  • Sqoop: supports bidirectional data migration between an RDBMS and HDFS. It is usually used to extract data from business databases (such as MySQL, SQL Server, and Oracle) into HDFS (a minimal sketch of this extraction pattern follows this list).
  • Canal: an open-source data synchronization tool from Alibaba that subscribes to incremental data and achieves near-real-time synchronization by listening to the MySQL binlog.
  • Flume: collects, aggregates, and transports massive logs, saving the collected data to HDFS or HBase.
  • Flume + Kafka: enables real-time streaming log processing; with Spark Streaming or other stream-processing engines, logs can be parsed and applied in real time.
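
As an illustration of the extraction pattern that Sqoop automates, here is a minimal, hedged PySpark sketch that pulls a table from a business MySQL database over JDBC and lands it on HDFS. The host, credentials, table, and path names are placeholders, and the MySQL JDBC driver is assumed to be on the Spark classpath; this sketches the pattern rather than replacing Sqoop itself.

```python
# Minimal sketch: read a business table over JDBC and land it on HDFS as Parquet.
# All connection details, table names, and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mysql-to-hdfs-extract")
         .getOrCreate())

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/shop")  # hypothetical source DB
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Land a snapshot on HDFS, partitioned by date so downstream batch jobs can prune.
(orders.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("hdfs:///warehouse/ods/ods_orders"))

spark.stop()
```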

3.2 Data Storage Layer

  • HDFS: a distributed file system and the foundation of data storage and management in distributed computing. It is an open-source implementation of Google GFS and can be deployed on cheap commodity machines, offering high fault tolerance, throughput, and scalability.
  • HBase: a distributed, column-oriented NoSQL key-value database. It is the open-source implementation of Google BigTable and uses HDFS as its file storage system. It suits real-time queries over big data, such as IM scenarios (a short read/write sketch follows this list).
  • Kudu: a big data storage engine positioned between HDFS and HBase, supporting both random reads/writes and OLAP analysis (addressing the pain point that HBase is not well suited to batch analysis).
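
To make the HBase entry above concrete, here is a minimal, hedged sketch using the happybase Python client. The Thrift host, table, and column names are placeholders; the table is assumed to already exist with an "info" column family, and the HBase Thrift server is assumed to be running.

```python
# Minimal sketch: write and read one row in HBase via the happybase client.
# Host, table, and column names are hypothetical; the "user_profile" table
# with column family "info" is assumed to exist.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # requires the HBase Thrift server
table = connection.table("user_profile")

# Write one row keyed by user id; columns live under the "info" column family.
table.put(b"user:10001", {b"info:name": b"alice", b"info:city": b"hangzhou"})

# Random read by row key, the access pattern HBase is optimized for.
row = table.row(b"user:10001")
print(row[b"info:city"])

connection.close()
```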

3.3 Resource Management Layer

  • Yarn: the Hadoop resource manager. It centrally manages and schedules Hadoop cluster resources, providing computing resources (CPU and memory) to MR tasks. Yarn supports multiple frameworks such as MR, Spark, and Flink (a small example of requesting YARN resources from a Spark job follows this list).
  • Kubernetes: open-sourced by Google, a container orchestration engine for cloud platforms. It manages containerized applications, which can be migrated across different clouds and operating-system versions. Spark and Storm already support Kubernetes (K8s).
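
As a small illustration of how a compute job asks YARN for resources, here is a hedged PySpark sketch that declares executor count, memory, and cores. The figures are illustrative only; in practice such a job is usually launched with spark-submit on a machine whose Hadoop/YARN configuration is available.

```python
# Minimal sketch: a Spark job declaring the resources it wants YARN to allocate.
# The resource figures are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-resource-demo")
         .master("yarn")                           # let YARN schedule the executors
         .config("spark.executor.instances", "4")  # number of executors requested
         .config("spark.executor.memory", "4g")    # memory per executor
         .config("spark.executor.cores", "2")      # CPU cores per executor
         .getOrCreate())

print(spark.sparkContext.applicationId)  # the YARN application id once accepted
spark.stop()
```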

3.4 Data computing layer

The big data computing engine determines computing efficiency and is the core of the big data platform. After four generations of development, computing frameworks can be roughly divided into offline (batch) computing frameworks and real-time (stream) computing frameworks.

3.4.1 Offline computing framework

  • MapReduce: A computing model, framework, and platform for parallel processing of big data (a clever design that brings computing closer to data and reduces data transfer).
  • Hive: a data warehouse tool that manages data stored in HDFS, maps structured data files onto database tables, and provides full SQL query capability (it translates HiveQL into MapReduce jobs). It suits offline, non-real-time data analysis.
  • Spark SQL: introduces the RDD (Resilient Distributed Dataset) abstraction, converts SQL into RDD computations, and keeps intermediate results in memory. It therefore performs far better than Hive and suits analysis scenarios with tighter latency requirements (a minimal Spark SQL sketch follows this list).
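
Here is a minimal Spark SQL sketch of the pattern described above: a DataFrame is registered as a temporary view and queried with SQL, with the plan executed in memory rather than as MapReduce jobs. Table and column names are made up for illustration.

```python
# Minimal Spark SQL sketch: register a DataFrame as a view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-06-01", "book", 35.0),
     ("2024-06-01", "toy", 12.5),
     ("2024-06-02", "book", 20.0)],
    ["dt", "category", "amount"])

sales.createOrReplaceTempView("sales")

# The SQL below is compiled into an in-memory DataFrame plan, not MapReduce jobs,
# which is where Spark SQL's speed advantage over Hive comes from.
daily = spark.sql("""
    SELECT dt, category, SUM(amount) AS total_amount
    FROM sales
    GROUP BY dt, category
""")
daily.show()

spark.stop()
```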

3.4.2 Real-time computing framework

  • Spark Streaming: a real-time stream-processing framework (data is split into micro-batches by time slice, giving second-level latency). It can consume real-time input from sources such as Kafka, Flume, and HDFS and, after processing, write the results to HDFS, an RDBMS, HBase, Redis, or a dashboard.

  • Storm: a real-time stream-processing framework with true record-at-a-time processing; every incoming record triggers computation, giving millisecond-level latency.

  • Flink: a more advanced real-time stream-processing framework, with lower latency and higher throughput than Storm, plus support for out-of-order (event-time) data and adjustable latency (a minimal streaming sketch follows this list).
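
To ground the micro-batch model described above, here is a hedged Spark Structured Streaming sketch that consumes a Kafka topic and keeps a running count per message. The broker address and topic name are placeholders, and the spark-sql-kafka package is assumed to be available; a Flink or Storm job would follow the same consume-process-sink shape.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# maintain a running count per distinct message. Broker and topic are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-streaming-demo").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka-host:9092")
          .option("subscribe", "app_logs")
          .load())

# Kafka values arrive as bytes; cast to string before aggregating.
counts = (events.select(F.col("value").cast("string").alias("line"))
          .groupBy("line")
          .count())

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated counts each micro-batch
         .format("console")
         .start())

query.awaitTermination()
```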

3.5 Multidimensional analysis layer

  • Kylin: a distributed analysis engine that can query huge Hive tables with sub-second latency. Through pre-computation (trading space for time), it stores the results of multi-dimensional combinations as cubes in HBase; when a user issues a SQL query, the system rewrites it into a cube query, enabling fast, highly concurrent queries (a hedged query sketch follows this list).
  • Druid: a highly fault-tolerant, high-performance, open-source distributed system for real-time data analysis that can run arbitrary aggregations over billion-row tables within seconds.
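
As a hedged illustration of how a pre-computed Kylin cube is queried, the sketch below sends SQL to Kylin's REST query endpoint with the requests library. The host, project, table, and credentials are placeholders (ADMIN/KYLIN are only the demo defaults), and the endpoint path and response fields follow Kylin's documented REST API, so verify them against your version.

```python
# Minimal sketch: run a SQL query against a Kylin cube over its REST API.
# Host, project, table, and credentials below are placeholders.
import requests

resp = requests.post(
    "http://kylin-host:7070/kylin/api/query",
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
    auth=("ADMIN", "KYLIN"),  # demo credentials; never hard-code real ones
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("results", [])[:5])  # pre-computation keeps this sub-second
```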

04 General processing flow of big data

After understanding the general architecture and technical system of the big data platform, let’s take a look at how to use big data technology to process offline data and real-time data.

A general big data processing flow mainly includes the following steps:

  • Data collection: the first step of big data processing. Data sources fall into two main categories: the relational databases of the various business systems, which are extracted periodically or synchronized in near real time by tools such as Sqoop or Canal; and all kinds of event-tracking (buried-point) logs, which are collected in real time by Flume.
  • Data storage: once collected, the data is stored in HDFS or, for real-time log streams, handed to the downstream streaming engine via Kafka.
  • Data analysis: the core of data processing, covering both offline (batch) processing and stream processing. The corresponding computing engines include MapReduce, Spark, and Flink, and the results are written to the pre-designed data warehouse or to storage systems such as HBase, Redis, or an RDBMS.
  • Data application: data visualization, business decision support, AI, and other data application scenarios (a minimal batch sketch of these steps follows this list).
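
Here is a hedged offline sketch of the collect-store-analyze steps above: raw JSON event logs that have already landed on HDFS are cleaned and written into a partitioned warehouse table with PySpark. Paths and field names are illustrative only.

```python
# Minimal sketch of a daily batch job: clean raw event logs on HDFS and write
# them into a partitioned warehouse table. Paths and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-log-batch").getOrCreate()

raw = spark.read.json("hdfs:///logs/app_events/2024-06-01/")

cleaned = (raw
           .filter(F.col("user_id").isNotNull())        # drop broken records
           .withColumn("dt", F.to_date("event_time")))  # derive the partition column

(cleaned.write
 .mode("overwrite")
 .partitionBy("dt")
 .parquet("hdfs:///warehouse/dw/dw_app_events"))

spark.stop()
```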

05 Data warehouse architecture under big data

The data warehouse is a way of organizing data from a business perspective and is the foundation of big data applications and the data middle platform. A data warehouse system generally adopts the layered structure described below.

The data warehouse system is divided into four layers: the source data layer (ODS), the data warehouse layer (DW), the data mart layer (DM), and the data application layer (ADS). This layering follows the same idea as layered software design: simplify a complex problem so that each layer has a single responsibility, improving maintainability and reusability. The specific functions of each layer are as follows:

  • ODS: the source data layer, holding the raw source tables.
  • DW: the data warehouse layer, containing dimension tables and fact tables. Wide tables are built by cleaning the source tables, for example a city table, a commodity-category table, back-end and front-end event-tracking tables, a user wide table, and a commodity wide table.
  • DM: the data mart layer, holding lightly aggregated data jointly built by the various business teams, such as a user-group analysis table and a full-link transaction table.
  • ADS: the data application layer, which produces the various data tables required by actual applications (a hedged layering sketch follows this list).
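
To illustrate the layered flow, here is a hedged Spark SQL sketch that aggregates a DW-layer wide table into a lightly summarized DM table. The table and column names are hypothetical, and a Hive metastore with the DW table already registered is assumed.

```python
# Minimal sketch: build a DM-layer summary table from a DW-layer wide table.
# Table and column names are hypothetical; a Hive metastore is assumed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dw-to-dm")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS dm_user_order_stats AS
    SELECT user_id,
           dt,
           COUNT(order_id) AS order_cnt,
           SUM(pay_amount) AS pay_amount
    FROM dw_trade_order_wide   -- hypothetical DW-layer wide table
    GROUP BY user_id, dt
""")

spark.stop()
```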

In addition, the data tables at each layer follow uniform naming rules for standardized management; a table name carries the layer, subject domain, business process, and partition information. For example, for an exposure table in the transaction domain, the name might follow a pattern like the hypothetical example below.
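
As a purely hypothetical illustration of such a convention, the table might be named dw_trade_exposure_di, where dw marks the warehouse layer, trade the subject domain, exposure the business process, and di a daily incremental partition.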

Final thoughts

The history, core concepts, general architecture, and technical system of big data have been systematically summarized above. If you want to go deeper into big data technology, you are advised to use this article together with the accompanying study guide.

I will continue to share more in-depth big data content in the future. If you are interested, please follow me.

About the author: master's degree from a 985 university, former Amazon engineer, now a technical director at 58 Group

Welcome to follow my personal WeChat official account: IT career advancement