Students often ask me, "There are so many big data components in the Hadoop ecosystem; how should we learn them? After all, our energy is limited, and we need to focus." I think the following components are the most important ones, because they are foundational and most people working with big data need to learn them.

  • Hadoop
  • HBase
  • Hive
  • Spark
  • Flink
  • Kafka

Hadoop

Hadoop is the foundational component of big data; many other components rely on its distributed storage and computing. It mainly consists of HDFS, MapReduce (MR), and YARN. Find some good learning materials (my home page links to resources), learn how to use each piece, and once you are familiar with them, dig into the principles behind them.

You need to know how to install Hadoop (including distributions such as CDH/HDP), the role of each background process after startup, NameNode high availability based on ZooKeeper, how the NameNode persists metadata (fsimage), the DataNode, the JobHistoryServer, and so on.

  • Hdfs: know the basic file operations; the commands resemble Linux commands, and they become second nature if you practice them whenever you get the chance. Cover upload and download, the recycle bin (trash), multi-replica fault tolerance (Hadoop 3 adds erasure coding, which avoids full replicas and saves space), file splitting, distributed storage, file storage formats, and compression. A small API sketch follows this list.
  • MR: master the principles of the classic Map -> Shuffle -> Reduce programming model, basic MR programming, input splits for map tasks, reduce output, map-join optimization, memory tuning, and so on.
  • Yarn: as the resource scheduler, YARN allocates resources in units of containers (a container is a fixed amount of CPU and memory), exposes common parameters (such as the maximum resources per application, the resources per container, and the virtual memory ratio), and keeps tasks from interfering with one another. Resource pools are used to isolate resources, and you should understand the scheduling modes: the FIFO, Fair, and Capacity schedulers.
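
For the HDFS bullet above, here is a minimal sketch of the same basic operations through the Hadoop FileSystem API, driven from Scala; the paths are placeholders, and the Hadoop client libraries and a core-site.xml are assumed to be on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBasics {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    val dir = new Path("/tmp/demo")                     // hypothetical directory
    if (!fs.exists(dir)) fs.mkdirs(dir)

    // Upload (like `hdfs dfs -put`) and download (like `hdfs dfs -get`).
    fs.copyFromLocalFile(new Path("data.csv"), new Path("/tmp/demo/data.csv"))
    fs.copyToLocalFile(new Path("/tmp/demo/data.csv"), new Path("data_copy.csv"))

    // List the directory and show each file's replication factor.
    fs.listStatus(dir).foreach { st =>
      println(s"${st.getPath}\treplication=${st.getReplication}")
    }

    // Recursive delete; unlike `hdfs dfs -rm`, this bypasses the trash.
    fs.delete(dir, true)
    fs.close()
  }
}
```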

HBase

HBase is a distributed, column-oriented database built on top of HDFS. It is widely used in enterprises, and you need to learn its usage, its principles, and how to tune it.

Cover the basics and principles of HBase: the logical row and column model, column-family storage, and the underlying HFile format on HDFS, which together give high concurrency and horizontal scalability. Understand the roles of the HMaster, the HRegionServers, and ZooKeeper, the client read and write paths, caches such as the MemStore and BucketCache, region splitting, table pre-splitting, rowkey design, and the HLog (write-ahead log) used for fault tolerance.
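
To make the write and read paths concrete, here is a minimal HBase client sketch in Scala. It assumes the hbase-client dependency and an hbase-site.xml on the classpath; the table "user", column family "info", and the rowkey are hypothetical.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseBasics {
  def main(args: Array[String]): Unit = {
    // Reads the ZooKeeper quorum and other settings from hbase-site.xml.
    val conf  = HBaseConfiguration.create()
    val conn  = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("user"))     // hypothetical table

    // Write: the rowkey ("20240101_1001") determines which region receives the data.
    val put = new Put(Bytes.toBytes("20240101_1001"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"))
    table.put(put)   // recorded in the HLog (WAL) and MemStore before being flushed to HFiles

    // Read: a Get by rowkey is served from the MemStore, block cache, or HFiles.
    val result = table.get(new Get(Bytes.toBytes("20240101_1001")))
    val name   = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
    println(s"name = $name")

    table.close()
    conn.close()
  }
}
```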

Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a simple SQL query interface, and translates SQL statements into MapReduce or Spark jobs. Its advantage is a low learning curve: simple statistics over massive data can be produced with SQL-like statements instead of purpose-built MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

  1. Creating Hive tables: regular, partitioned, and external tables (a sketch follows this list);
  2. Developing and using UDFs, and accessing Hive through HiveServer2;
  3. Integrating with HBase and using Spark as the execution engine;
  4. Join optimization;
  5. Data skew;
  6. Common window functions;
  7. Building a data warehouse on Hive.
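
As a small illustration of items 1 and 6, here is a sketch run through Spark SQL with Hive support enabled, so the DDL lands in the Hive metastore; the table, columns, and path are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object HiveBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-basics")
      .enableHiveSupport()               // use the Hive metastore as the catalog
      .getOrCreate()

    // An external, partitioned table: dropping it removes metadata only, not the files.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS ods_orders (
        |  order_id BIGINT,
        |  user_id  BIGINT,
        |  amount   DOUBLE
        |) PARTITIONED BY (dt STRING)
        |STORED AS PARQUET
        |LOCATION '/warehouse/ods/ods_orders'""".stripMargin)

    // A common window function: rank each user's orders by amount.
    spark.sql(
      """SELECT user_id, order_id, amount,
        |       row_number() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn
        |FROM ods_orders
        |WHERE dt = '2024-01-01'""".stripMargin).show()

    spark.stop()
  }
}
```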

Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is written in Scala, offers the strengths of Hadoop MapReduce, and can also do micro-batch processing in near real time. Unlike MapReduce, the intermediate output of a job can be kept in memory rather than written back to HDFS, so Spark is better suited to algorithms that iterate over the same data, such as data mining and machine learning.
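
The sketch below shows that in-memory advantage: the parsed data is cached once and reused across iterations instead of being re-read from HDFS each pass. The input path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterative-demo").getOrCreate()
    val sc = spark.sparkContext

    // Parse once and keep the result in memory for the later passes.
    val points = sc.textFile("hdfs:///tmp/points.txt")   // hypothetical input
      .map(_.split(",").map(_.toDouble))
      .cache()

    // Each iteration reuses the cached RDD instead of re-reading HDFS,
    // which is where Spark beats a chain of MapReduce jobs.
    for (i <- 1 to 5) {
      val threshold = i * 10.0
      val above = points.filter(p => p.sum > threshold).count()
      println(s"iteration $i: $above points above $threshold")
    }

    spark.stop()
  }
}
```

Topics to cover: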

  1. Spark architecture, the DAG scheduler, running on standalone/YARN/Kubernetes resource managers, memory management;
  2. Spark Core: SparkContext, RDDs, and the action and transformation operators (a sketch of this and item 4 follows the list);
  3. Spark Streaming: checkpoints, manually maintained offsets, consuming Kafka data sources;
  4. Spark SQL, and conversion between DataFrames and RDDs;
  5. Structured Streaming, which is comparatively weak; enterprises usually use Flink for event-based real-time processing;
  6. ML (machine learning), which contains the most commonly used algorithms: classification, regression, clustering, and so on;
  7. GraphX, graph computing, used in some industries, for example social relationship mining and fraudulent transaction detection;
  8. SparkR, which lets data scientists use R for large-scale distributed computing, though it has been criticized for high memory consumption;
  9. PySpark, which makes Spark very approachable by providing a Python API.
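
A short sketch of items 2 and 4: a few RDD transformations and actions, then converting an RDD to a DataFrame and back; the sample data is made up.

```scala
import org.apache.spark.sql.SparkSession

object SparkCoreAndSql {
  case class Word(word: String, cnt: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("core-and-sql").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Transformations (flatMap, map, reduceByKey) are lazy; count() is an action.
    val counts = sc.parallelize(Seq("a b a", "b c"))
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    println(s"distinct words: ${counts.count()}")

    // RDD -> DataFrame via a case class, query it with SQL, then back to an RDD.
    val df = counts.map { case (w, c) => Word(w, c) }.toDF()
    df.createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC").show()

    val rows = df.rdd      // back to an RDD of Row objects
    println(rows.first())

    spark.stop()
  }
}
```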

Flink

Flink is a real-time data processing component commonly used by enterprises (see also the long article on Flink state management and state consistency). Apache Flink is a framework and distributed processing engine for stateful computation over both unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.

Topics to cover include:

  1. Flink architecture, submitting applications on YARN;
  2. The DataSet, DataStream, and Table APIs;
  3. Flink stream processing and connecting to Kafka data sources (see the sketch after this list);
  4. Fault tolerance: checkpoints, savepoints, two-phase commit, etc.;
  5. Source/sink operations (HBase, MySQL, Redis);
  6. The watermark mechanism;
  7. Window functions;
  8. Joins between different data streams;
  9. Side outputs (out-of-order data, stream splitting);
  10. Flink asynchronous I/O.
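
A minimal sketch of items 3 and 7: consuming a Kafka topic and counting words in tumbling processing-time windows. It assumes a Flink 1.x Scala API with the flink-connector-kafka dependency; the topic, broker address, and group id are placeholders.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object FlinkKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(60000)     // checkpoint every 60 s for fault tolerance

    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker1:9092")   // placeholder broker
    props.setProperty("group.id", "flink-demo")

    val source = new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props)

    env.addSource(source)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .print()

    env.execute("flink-kafka-word-count")
  }
}
```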

Kafka

Apache Kafka® is a distributed streaming platform. It lets you publish and subscribe to streams of records, much like a message queue or enterprise messaging system; it stores streams of records with good fault tolerance; and it lets you process records as soon as they are produced.

  1. Producers, consumers, consumer groups, brokers, leaders, followers, leader election, etc.;
  2. Topics, partitions, multiple replicas, data synchronization, the high watermark, etc.;
  3. Absorbing traffic peaks, the data lifecycle (retention);
  4. Acks as the write guarantee, consumption semantics, global and per-partition ordering of data (see the sketch after this list);
  5. How Kafka stays fast: sequential writes, zero copy, partition-level parallelism, compressed transfers, the time wheel, etc.;
  6. Kafka removing its dependency on ZooKeeper (KRaft).
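
A minimal producer/consumer sketch with the standard Java client, usable from Scala. The broker address, topic, and group id are placeholders; acks=all ties in with item 4.

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaBasics {
  def main(args: Array[String]): Unit = {
    // Producer: acks=all waits for all in-sync replicas, the strongest write guarantee.
    val pProps = new Properties()
    pProps.put("bootstrap.servers", "broker1:9092")
    pProps.put("acks", "all")
    pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](pProps)
    // Records with the same key go to the same partition, which preserves their order.
    producer.send(new ProducerRecord("events", "user-1", "clicked"))
    producer.close()

    // Consumer: instances sharing a group.id divide the topic's partitions among themselves.
    val cProps = new Properties()
    cProps.put("bootstrap.servers", "broker1:9092")
    cProps.put("group.id", "demo-group")
    cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](cProps)
    consumer.subscribe(Collections.singletonList("events"))
    val records = consumer.poll(Duration.ofSeconds(1))
    records.forEach(r => println(s"${r.partition()}/${r.offset()}: ${r.key()} -> ${r.value()}"))
    consumer.close()
  }
}
```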

Data warehouse

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decisions. Building a data warehouse on a data platform has two parts: constructing the warehouse itself and then putting it to use. The history of data warehouse architectures is also worth studying.

  1. The classic three normal forms, dimension tables and fact tables (see the sketch after this list);
  2. Star schema, snowflake schema, and constellation schema;
  3. Normal-form modeling and dimensional modeling;
  4. Data warehouse layering: ODS/DWD/DWS/DIM, the ADS layer, the DM layer;
  5. Data lakes, which are more flexible and cost-effective than data warehouses.
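
A minimal star-schema sketch using Spark SQL DDL, with one dimension table, one fact table, and the usual layer prefixes; every name and column here is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object StarSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("star-schema-demo")
      .enableHiveSupport()
      .getOrCreate()

    // Dimension table (DIM layer): descriptive attributes, one row per user.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS dim_user (
        |  user_id BIGINT,
        |  gender  STRING,
        |  city    STRING
        |) STORED AS PARQUET""".stripMargin)

    // Fact table (DWD layer): measures plus foreign keys pointing at the dimensions.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS dwd_order_fact (
        |  order_id BIGINT,
        |  user_id  BIGINT,
        |  amount   DOUBLE
        |) PARTITIONED BY (dt STRING)
        |STORED AS PARQUET""".stripMargin)

    // A typical DWS-layer aggregation: join the fact table to the dimension and summarize.
    spark.sql(
      """SELECT d.city, SUM(f.amount) AS gmv
        |FROM dwd_order_fact f JOIN dim_user d ON f.user_id = d.user_id
        |WHERE f.dt = '2024-01-01'
        |GROUP BY d.city""".stripMargin).show()

    spark.stop()
  }
}
```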

Impala

Impala is Cloudera's query engine, which provides SQL semantics for querying petabytes of data stored in HDFS and HBase. Hive also provides SQL semantics, but it runs queries as batch MapReduce jobs, which makes interactive querying difficult. By contrast, Impala's biggest feature and selling point is its speed.

ClickHouse

ClickHouse is an open-source, column-oriented database for real-time data analysis from Yandex (Russia's largest search engine); it is reported to process data 100 to 1,000 times faster than traditional approaches.

Kylin

Apache Kylin™ is an open-source, distributed analytical data warehouse that provides SQL query interfaces and multidimensional analysis (OLAP) capabilities on Hadoop/Spark for extremely large datasets. Originally developed by eBay and contributed to the open-source community, it can query huge tables with sub-second latency.

Apache Kylin™ lets users run sub-second queries on very large data sets in just three steps:

  • Define a star or snowflake model on the dataset;
  • Build cubes on the defined tables;
  • Query with standard SQL through ODBC, JDBC, or the RESTful API, and get results with sub-second response times (a JDBC sketch follows).
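
As an example of the third step, here is a query over JDBC in Scala; it assumes the Kylin JDBC driver (org.apache.kylin.jdbc.Driver) is on the classpath, and the host, project, credentials, and table are placeholders drawn from Kylin's sample data.

```scala
import java.sql.DriverManager
import java.util.Properties

object KylinQuery {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.kylin.jdbc.Driver")   // load the Kylin JDBC driver

    val info = new Properties()
    info.put("user", "ADMIN")        // demo credentials; replace in practice
    info.put("password", "KYLIN")

    // The JDBC URL points at the Kylin REST server and a project.
    val conn = DriverManager.getConnection("jdbc:kylin://kylin-host:7070/demo_project", info)
    val stmt = conn.createStatement()

    // Standard SQL; Kylin answers it from the pre-built cube.
    val rs = stmt.executeQuery(
      "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt")
    while (rs.next()) {
      println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
    }

    rs.close(); stmt.close(); conn.close()
  }
}
```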

Docker / Kubernetes

Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services that facilitates declarative configuration and automation. Kubernetes has a large and rapidly growing ecosystem. Kubernetes’ services, support and tools are widely available.

Kudu

Kudu is a columnar storage system initiated by Cloudera and a member of the Apache Hadoop ecosystem. It is designed for fast analytics on rapidly changing data and fills a gap in the Hadoop storage layer.

CDH/HDP

Apache Hadoop's open-source license allows anyone to modify it and distribute or sell it as an open-source or commercial version. As a result there are many Hadoop distributions, including Huawei's distribution (paid), Intel's distribution (paid), Cloudera's CDH (free), and Hortonworks' HDP (free), all of which are derived from Apache Hadoop.

Of course, there are many other components, such as Sqoop and Oozie, each with its own application scenarios.