This article comes from a talk on big data that I recently gave at my company. I am used to all-English keynotes, so the original title was “Introduction to Bigdata”, but an English title on WeChat always feels a bit awkward, so I went with a Chinese name instead.

This article is meant as an introduction for people who are not yet familiar with big data but are interested in it. If you are an old hand, feel free to skip it, unless you want to see a different take.

When we learn something new, the first step should be to give it a clear definition. That’s how you know what you’re learning, what to learn, and what to leave alone.

However, although big data is very popular, it is actually not a well-defined concept, and different people may understand it differently.

We won’t talk about the four V’s and six C’s. We won’t even talk about the evolution of architectures or the various tuning tricks, which you might not understand, remember, or ever use.

Nor will we pay attention to the cool application layers of AI and machine learning; instead we will look at what the foundation of the house of big data looks like. For reasons of space, many technical details stop at the surface; interested readers can dig deeper as needed, which is exactly what an introduction is for.

one

First question: big data, big data, but how big does data have to be before it counts as “big”? Or to put it another way, when do we actually need big-data-related technologies?

This is a question with no single right answer. Some people may think tens of gigabytes already counts, while others may think that amount is still nothing special. When you cannot answer the question, or when you don’t know whether you need big data technology at all, you usually don’t need it yet.

But when your data is too much to store on a single machine, or even on several machines, and even if stored is hard to manage and use; when you find that with traditional programming methods, even after throwing multi-processing, multi-threading, and coroutines at the problem, the processing speed is still far from ideal; when you run into other practical problems caused simply by having too much data, then you may want to consider trying big-data-related technologies.

two

It is easy to abstract two typical application scenarios of big data from the examples above:

  • Storing large amounts of data, to solve the problem of the data not fitting.

  • Computing over large amounts of data, to solve the problem of computation being too slow.

Therefore, the foundation of big data is composed of two parts: storage and computation.

three

On a single machine, or when the amount of data is not that large, we have two kinds of storage needs:

  • File storage

  • Storage in database form

Storage in the form of files is the most basic need, for example logs generated by each service, data fetched by crawlers, multimedia files such as images and audio, and so on. This corresponds to raw data.

Database-style storage usually holds processed data that business programs can use directly, such as visitor IPs and user agents parsed out of access log files and stored in a relational database, so that a web program can display them on a page directly. This corresponds to processed, ready-to-use data.

Big data is simply a lot of data, and the same two needs remain. Though not entirely rigorous, the former can be called offline storage and the latter online storage.

four

For offline storage, the Hadoop Distributed File System (HDFS) is basically the de facto standard. As the name suggests, it is a distributed file system. In fact, “distributed” is the general approach to big data problems: only a distributed system that supports unlimited horizontal scaling can, in theory, keep up with unlimited growth in data volume. Of course, “unlimited” belongs in quotes.

This is a simple HDFS architecture diagram, which is still not very intuitive, but there are only a few key points:

  • Files are split in blocks and stored on different servers, with each block making multiple copies on different machines.

  • There are two roles: NameNode and DataNode. The former stores metadata, that is, where each block is stored, and the latter stores the actual data.

  • To read and write data, first obtain the metadata of the corresponding file from NameNode, and then obtain the actual data from the corresponding DataNode.

As you can see, HDFS achieves a distributed design by centrally recording metadata. When the data volume grows, you only need to add new DataNodes, and you are no longer limited by the capacity of a single machine.
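To make this flow concrete, here is a minimal sketch, my own illustration rather than part of the original talk, of reading a file through the standard Hadoop FileSystem API (Scala is my choice of language, and the path /logs/access.log is hypothetical). The open() call is where the client consults the NameNode for block locations before streaming the blocks from the DataNodes.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsReadExample {
  def main(args: Array[String]): Unit = {
    // picks up core-site.xml / hdfs-site.xml from the classpath,
    // including fs.defaultFS, the address of the NameNode
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // open() asks the NameNode for the block locations of this file,
    // then reads the actual blocks from the DataNodes that hold them
    val in = fs.open(new Path("/logs/access.log")) // hypothetical path
    Source.fromInputStream(in).getLines().take(10).foreach(println)

    in.close()
    fs.close()
  }
}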

To keep data highly available, for example when a server suddenly dies and never comes back, HDFS relies on redundancy (usually three replicas). This is also the most common approach to high availability in distributed systems, although the cost can be high.

High availability only makes sense at the level of the whole system, so besides the data itself, the metadata must be highly available too. The idea is the same: keep extra copies. HDFS ships a Secondary NameNode which, despite the name, mainly checkpoints metadata rather than acting as a hot standby; it is better to use NameNode HA mode, where an active/standby group of NameNodes keeps the metadata read and write service uninterrupted.

Similarly, so far scalability only covers horizontal scaling of the data itself. What about metadata? When the data volume is large enough, the metadata can also become very large, similar to the index bloat we run into in traditional relational databases. The solution is NameNode Federation. The idea is simple: split the original single active/standby NameNode group into multiple groups, each managing only a portion of the metadata, split underneath but serving as a whole, similar to how we mount disks in Linux. These NameNode groups are independent of each other. HDFS 2.x achieves transparent access to multiple NameNode groups on the client side, through the ViewFS abstraction and client-side configuration; HDFS 3.x adds Router-based Federation, which makes access to multiple NameNode groups transparent on the server side.

As you can see, horizontal scaling of metadata follows exactly the same idea as horizontal scaling of the data itself: split it up and distribute it.

five

Online storage is the counterpart of offline storage, and you can understand it by analogy with traditional databases such as MySQL and Oracle. In big data, HBase is the most commonly used choice.

There are many ways to classify databases. HBase can be described as a NoSQL database, a column-oriented database, and a distributed database.

It is a NoSQL database because HBase lacks many features of traditional relational databases. Reading and writing data through SQL is not supported; it can be added by integrating third-party solutions such as Apache Phoenix, but there is no native support. Secondary indexes are not supported either; only the ordered rowkey serves as a primary key. Secondary indexes can be built with coprocessors, and Apache Phoenix can also create them from SQL statements, but again there is no native support. The schema is loose: HBase only provides column families to group columns, and it places no limit on the number or type of columns within each column family. In these respects HBase hardly looks like a DataBase at all; it is more like a DataStore.

It is a column-oriented database because the underlying storage is organized by column family: the same column family from different rows is stored together, while different column families of the same row are not. This makes things like column-based filtering much easier.

It is a distributed database because it offers strong horizontal scalability, which is exactly why HBase has become a mainstream solution for online big data storage.
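To give a feel for what this non-SQL, column-family-oriented model looks like in practice, here is a minimal sketch against the standard HBase client API, again my own illustration: the table test, the column family cf, the column pv, and the rowkey row1 are all hypothetical. Every read and write is addressed by rowkey, column family, and qualifier rather than by SQL.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseExample {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create() // reads hbase-site.xml from the classpath
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("test")) // hypothetical table

    // write one cell: rowkey "row1", column family "cf", qualifier "pv", value "42"
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("pv"), Bytes.toBytes("42"))
    table.put(put)

    // read the same cell back
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("pv"))))

    table.close()
    conn.close()
  }
}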

HBase can store and serve huge amounts of data because it is built on HDFS: all data is stored in HDFS as files, which naturally brings horizontal scalability, high availability, and other properties.

With data storage solved, all that remains is to provide a set of database-like API services on top of it, and to make sure that this set of API services can itself scale horizontally with ease.

The architecture diagram for online storage is just as simple; again, let’s list only a few key points:

  • Each node has a program called RegionServer that reads and writes data

  • Each table is divided into multiple regions and evenly distributed on each RegionServer

  • A separate HMaster service is responsible for table operations such as creation, modification, and deletion, and for assigning Regions.

RegionServer is similar to DataNode, HMaster is similar to NameNode, and Region is similar to Block.

By adding DataNodes at the HDFS layer and RegionServers at the HBase layer, HBase can easily scale horizontally to handle more data and more read/write requests.

If you look carefully, you’ll notice there is no metadata in the diagram. In HDFS, metadata is managed by the NameNode; likewise, metadata in HBase is managed by the HMaster. HBase’s metadata records which Regions each table is split into and which RegionServer serves each Region.

HBase stores this metadata in a clever way: the metadata itself is kept in an HBase table, just like application data. That way metadata can be operated on as an ordinary table, which makes the implementation much simpler.

And because the granularity of HBase metadata, the Region, is much coarser than an HDFS block, the volume of metadata rarely becomes a performance bottleneck, so horizontal scaling of metadata is usually unnecessary.

As for high availability, the storage layer already gets it from HDFS; at the service layer, you simply run multiple HMasters.

six

So much for storage. Now let’s look at computation.

Similar to storage, computation can be divided into two types, regardless of data size:

  • Offline computing, or batch computing/processing

  • Online computing, also called real-time computing or stream computing/processing

Another way to distinguish batch from stream processing is by the boundedness of the data. Batch processing works on bounded data, which has a finite and known size. Stream processing, by contrast, works on unbounded data: the amount of data has no end, and the program never stops running.

In the field of big data, batch computing is generally used for scenarios that are not sensitive to response time or latency. Some are insensitive because of the business itself, such as day-level reports; others are forced to sacrifice response time and latency by the sheer volume of data. Stream computing, in contrast, is used for latency-sensitive scenarios, such as real-time page view counting.

Batch latency is generally large, on the order of minutes, hours, or even days. Stream processing generally needs to finish handling each piece of data within milliseconds or seconds. In between, it is worth noting, sits quasi-real-time computing, with latencies from seconds to tens of seconds, which naturally aims to strike an acceptable balance between latency and the amount of data processed.

seven

The oldest batch computing framework in big data is MapReduce. Together with the HDFS mentioned earlier, MapReduce makes up Hadoop, which has been the de facto standard big data infrastructure for years. This also shows that dividing big data technology into storage and computation is the most fundamental way to look at it.

As a distributed computing framework (recall that distribution is the default way to solve big data problems), MapReduce derives its programming model from the Map and Reduce functions in functional programming languages.

Because HDFS has already cut the data into blocks, the MapReduce framework can easily hand the initial processing of the data to many Map functions running in parallel on different servers. The output of the map stage is then distributed to different servers in the same way, so that the second-stage reduce operation can also run in parallel and produce the desired result.

Take the “Hello World” of big data, “Word Count”, as an example. To count how many times each word occurs in 100 files totalling 10 TB, the map stage might run 100 mappers, each splitting its assigned data into words in parallel. Occurrences of the same word are then “shuffled” to the same reducer to be aggregated and summed. These reducers also run in parallel, and each writes its output independently; taken together, those outputs form the final, complete result.

As shown in the figure above, shuffle turns the output of the map stage into the input of the reduce stage; it is the key step connecting the two. This step is handled automatically by the MapReduce framework, but it involves a lot of I/O, and I/O is usually the most expensive part of data processing, so shuffle is where performance tuning tends to focus.

As you can see, it is the idea of distributed computing, processing in parallel across many servers and many cores, that lets us get through massive amounts of data fast enough.
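To show what the map and reduce stages above look like in code, here is a minimal Word Count sketch against the Hadoop MapReduce API. It is my own illustration, written in Scala for consistency with the other sketches in this article, with the job-submission boilerplate omitted.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Map stage: split each input line into words and emit (word, 1)
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one) // shuffled so that the same word lands on the same reducer
    }
  }
}

// Reduce stage: sum all the 1s shuffled to this reducer for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}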

eight

As a distributed computing framework, MapReduce provides the basic computing capability for big data. However, as it was used in more and more scenarios, some problems gradually surfaced.

Resource coordination

Similarly, distributed computing needs to have a unified management and coordination role. More specifically, this role is responsible for coordinating computing resources, allocating computing tasks, monitoring task progress and status, and rerunning failed tasks.

Early MapReduce, known as MR1, did all of this itself: a central role called the JobTracker, plus a TaskTracker on each node that collects local resource usage and reports it back to the JobTracker as the basis for allocating resources.

The main problem with this architecture is that the JobTracker has so many responsibilities that, once the cluster reaches a certain size and runs enough tasks, it easily becomes the bottleneck of the whole system.

That led to the refactored second generation of MapReduce, MR2, which got a new name: YARN (Yet Another Resource Negotiator).

The two core functions of the JobTracker, resource management and task scheduling/monitoring, were split into the ResourceManager and the ApplicationMaster. The ResourceManager only allocates computing resources (and also initializes and monitors ApplicationMasters), so it is far less likely to become a bottleneck, and it can be made highly available by deploying multiple instances. Each application gets its own ApplicationMaster, which handles resource requests, scheduling and execution, status monitoring, and the rerunning of failed tasks for all of that application’s jobs. In this way the heaviest work is spread across the many ApplicationMasters, and the bottleneck disappears.

Development costs

To use the MapReduce framework, you need to write a Mapper class that extends the framework’s parent class and implement a map method that does the actual data processing; reduce is similar. As you can see, the development and debugging costs are not low, especially for roles such as data analysts, where programming is not the main skill.

The natural idea was to turn to SQL; for basic data processing, there is probably no more widely known language.

The early SQL-ification of MapReduce was mainly done by two frameworks, Apache Pig and Apache Hive. The former is relatively niche; the latter is the choice of the vast majority of companies.

Hive essentially translates your SQL statements into MapReduce jobs and runs them.

For example, let’s create a test table like this:

create table test2018(id int, name string, province string);

Then use the explain command to view the execution plan for the following SELECT statement:

explain select province, count(*) from test2018 group by province;

You can see that Hive parses the SQL statement into a MapReduce job. This SQL is simple, but if it is complex, it might be parsed into a number of consecutive dependent MapReduce tasks.

In addition, since it speaks SQL, Hive naturally provides abstractions such as databases and tables to organize your data better, and even traditional data warehouse methodology works well on top of Hive. That is another big topic that we won’t get into here.

Computing speed

The result of each MapReduce stage has to be written to disk and then read back for the next stage. And since it is a distributed system, there are also many cases where data has to travel over the network. Disk I/O and network I/O are both very expensive operations. The latter can be mitigated by data locality: assigning tasks to the machines where the data already lives greatly reduces network transfer and speeds up execution.

One of the best-known answers to this problem is Apache Spark. Spark is a memory-based distributed computing framework: computation happens in memory and spills to disk only when memory is insufficient, so disk operations are kept to a minimum.

Spark also organizes execution as a Directed Acyclic Graph (DAG). Knowing the ordering and dependencies of all the execution steps makes it possible to optimize the physical execution plan and eliminate unnecessary, repeated I/O.

Another important point is that the MapReduce programming model is so bare-bones that even uncomplicated operations often need several chained MapReduce jobs, which drags performance down badly. Spark, by contrast, offers a rich set of operators, which greatly improves performance.
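For contrast with the MapReduce Word Count sketch earlier, here is the same job written against the Spark RDD API; the input and output paths are hypothetical.

import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/input")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // this is where the shuffle happens

    counts.saveAsTextFile("hdfs:///data/output")     // hypothetical output path
    spark.stop()
  }
}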

Programming model

The rich Spark operators mentioned above improve performance, but another benefit is much lower development complexity. The expressiveness of the MapReduce programming model, by contrast, is very weak; after all, squeezing many operations into a map followed by a reduce is extremely cumbersome.

Take grouped averages as an example: with the MapReduce framework you have to write the two classes described above, and possibly even chain two MapReduce jobs.

In Spark, you just write ds.groupBy(key).avg(). No comparison, no harm, as the saying goes.
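Spelled out a little more fully, and assuming a hypothetical DataFrame with province and score columns, the whole grouped-average job is a single expression:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("GroupAvg").getOrCreate()

// hypothetical table with columns: province, score
val df = spark.read.parquet("hdfs:///warehouse/scores")

df.groupBy("province").agg(avg("score").alias("avg_score")).show()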

nine

There is no doubt that everyone wants results from their data as soon as possible, so real-time computing, or stream computing, has always been a hot area of research and adoption.

As mentioned earlier, Hadoop is the standard big data infrastructure, providing HDFS for storage and MapReduce as the computing engine, but it offers no support for stream processing. As a result, the stream processing space has many competitors, rather than the dominance MapReduce enjoyed in its early years.

Here’s a quick look at three of the most popular Streaming frameworks: Spark Streaming, Storm and Flink.

Each of these frameworks has its own strengths and weaknesses. Space is limited, so let’s pick a few typical dimensions for comparison.

  • Programming paradigm

Generally speaking, programming paradigms, or more plainly, ways of writing programs, fall into two camps: imperative and declarative. The former spells out “how” to do things step by step and is closer to the machine; the latter only states “what” is wanted and is closer to the human. As the Word Count example suggested, MapReduce is imperative and Spark is declarative.

As you can see, imperative is more verbose but gives the programmer more control, whereas declarative is more concise but may lose some control.

  • Latency

The concept of latency is simple: the time from when a piece of data is generated to when its processing is finished. Note that this is the elapsed time the data experiences, not necessarily the time actually spent processing it.

  • Throughput

Throughput is the amount of data processed per unit of time. Together with latency, these are generally considered the two most important performance metrics of a stream processing system.

Let’s look at the three stream processing frameworks along these dimensions.

Spark Streaming

Spark Streaming is built on the same computing engine as the offline batch Spark described earlier. In essence it is still batch processing, so-called micro-batch, but the batches can be made very small, which gives a near-real-time effect.

  • Programming paradigm

Like Spark for offline batch processing, it is declarative and provides rich operations with very simple code.

  • Latency

Because it is micro-batch, the latency is higher than that of engines that process records one by one, usually on the order of seconds. You can of course make the batches smaller to reduce latency, but at the cost of lower throughput.

Because it is batch-based, Spark Streaming’s latency cannot be tuned low enough for some scenarios. To address this, Spark 2.3 (not yet officially released at the time of writing) will support Continuous Processing, with latency comparable to native streaming. Continuous Processing drops micro-batch pseudo-streaming and instead processes data with long-running tasks, just like native streaming engines, and users choose between micro-batch and continuous mode through configuration.

 

  • Throughput

Because it is micro-batch, the throughput is considerably higher than that of a strictly real-time processing engine. From this we can also see that micro-batch is a choice with pros and cons.
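Here is a minimal sketch of what micro-batch looks like in code, assuming a hypothetical socket source on localhost:9999 and a 5-second batch interval; shrinking that interval lowers latency at the cost of throughput, as discussed above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // each micro-batch covers 5 seconds of data
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}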

Storm

Storm’s programming model is somewhat like MapReduce’s: a Spout handles input and a Bolt holds the processing logic, and multiple Spouts and Bolts are wired together, depending on one another, to form a Topology. As with MapReduce, both Spouts and Bolts are written as classes that implement specific methods.
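A rough sketch of what that looks like with the Storm API, again my own illustration in Scala: the SentenceSpout and CountBolt referenced below are hypothetical classes you would also have to write, and only the word-splitting Bolt and the topology wiring are shown.

import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// a Bolt that splits incoming sentences into words, emitting one tuple per word
class SplitBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    input.getString(0).split("\\s+").filter(_.nonEmpty)
      .foreach(w => collector.emit(new Values(w)))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

// wiring Spouts and Bolts into a Topology
val builder = new TopologyBuilder()
builder.setSpout("sentences", new SentenceSpout(), 1) // hypothetical input spout
builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences")
builder.setBolt("count", new CountBolt(), 4)          // hypothetical counting bolt
  .fieldsGrouping("split", new Fields("word"))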

  • Programming paradigm

It’s clearly imperative.

  • Latency

Storm is strictly native streaming, processing data one at a time, so the latency is very low, usually in milliseconds.

  • Throughput

Conversely, because data is processed one record at a time and per-message ACKs are used for fault tolerance, the throughput is much lower than that of micro-batch engines such as Spark Streaming.

It is worth adding that Trident, a higher-level layer on top of Storm, changes things quite a bit: it uses a micro-batch mode similar to Spark Streaming, so compared with plain Storm its latency is higher (worse) and its throughput higher (better), and its programming paradigm shifts to declarative, which lowers development cost.

Flink

Flink actually aims to be a general-purpose computing engine like Spark, which makes it more ambitious than Storm. But Flink takes the opposite route from Spark to back that ambition: as a native streaming framework, it treats batch processing as a special case of stream processing, unifying the two under one logical abstraction.

  • Programming paradigm

Clearly declarative.

  • Latency

Flink, like Storm, handles one item at a time, ensuring very low latency.

  • Throughput

In real-time processing systems, low latency often comes with low throughput (Storm), and high throughput with high latency (Spark Streaming). The two metrics are a classic trade-off, and you usually have to choose.

Yet Flink manages to deliver both low latency and high throughput. A key reason is that, instead of Storm’s per-message ACKs, Flink uses checkpoints for fault tolerance, which minimizes the impact on throughput.
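To round out the comparison, here is a minimal Flink DataStream sketch of windowed word counting using the Scala API; the socket source and the 5-second window are my own assumptions. Records are processed one by one as they arrive, which is what keeps latency low.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val counts = env.socketTextStream("localhost", 9999) // hypothetical source
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(w => (w, 1))
      .keyBy(0)                    // partition the stream by word
      .timeWindow(Time.seconds(5)) // aggregate into 5-second windows
      .sum(1)

    counts.print()
    env.execute("FlinkWordCount")
  }
}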

ten

So far we have covered both batch and stream processing, and for different application scenarios a suitable solution can always be found.

However, there are situations where we have to implement both batch and stream solutions in the same business:

  • A stream processing job fails, and recovery takes longer than the retention period of the stream data, so some data is lost.

  • For things like multi-dimensional monthly active user counts, exact computation requires keeping every user ID for deduplication, which costs too much storage, so we can only fall back on approximate algorithms such as HyperLogLog (a small sketch follows this list).

  • …
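To illustrate the approximate route: Spark’s approx_count_distinct, which is backed by HyperLogLog++, can produce such counts without keeping every user ID. The event schema (user_id, province) and the path below are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().appName("ApproxMAU").getOrCreate()

// hypothetical monthly event log with columns: user_id, province
val events = spark.read.parquet("hdfs:///events/monthly")

// approximate distinct users per province with ~2% relative error,
// instead of storing and deduplicating every user_id exactly
events.groupBy("province")
  .agg(approx_count_distinct("user_id", 0.02).alias("mau"))
  .show()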

In these scenarios, stream processing is typically used to guarantee real-time results, with batch processing added to guarantee correctness. This gave rise to an architecture called the Lambda Architecture.

The stream layer and the batch layer run independently and each produces its own results; a query layer then decides, based on time, which results to show the user. For example, for the most recent day it serves the streaming results, which are real-time but not exact; for anything older it serves the batch results, which are not real-time but are accurate.

The Lambda Architecture does solve the problem, but developing and maintaining both a streaming application and a batch application, plus the query service in front of them, is not cheap.

Therefore, in recent years the Kappa Architecture has also been proposed, offering another way to unify the two, but it has obvious shortcomings of its own. For reasons of space, I won’t go into it here.

eleven

Computing frameworks are one of the most hotly contested areas in big data. Besides the ones mentioned above, there are Impala, Tez, Presto, Samza, and many more. Many companies like to build their own wheels, and each claims its wheel is better than everyone else’s. And looking at how each framework has evolved, they very obviously borrow from one another.

Among the many options, Spark’s ambition and capability as a general-purpose distributed computing framework have been well demonstrated and proven, and it is still evolving quickly: Spark SQL supports Hive-like, pure-SQL data processing; Spark Streaming supports real-time computing; Spark MLlib supports machine learning; and Spark GraphX supports graph computation. Sharing one underlying execution engine for both stream and batch processing also keeps development and maintenance affordable under the Lambda Architecture.

Therefore, weighing technical risk against the cost of use and maintenance, Spark is our first choice for big data computing. Of course, for scenarios where Spark is not a good fit, other computing frameworks can serve as supplements; for example, Storm and Flink are well worth considering for latency-sensitive workloads.

twelve

As mentioned at the beginning, big data is a very broad field and beginners can easily get lost. I hope this article has given you a general picture of the foundations of big data. Due to limited space, many concepts and techniques are only touched on briefly; interested readers can explore them further. I believe that with this foundation, learning other technologies in the big data field will come much more easily.

If you found this article worth your while, feel free to follow the public account. I stick to original content; I can’t promise a regular update schedule, but I’ll try to keep writing from angles you won’t see elsewhere.