From Storm to Spark Streaming and then to Flink, stream computing has made great progress. Spark Streaming, which relies on the Spark platform, has gone its own way: it builds a real-time processing framework on top of Spark's batch architecture by processing data in small batches. To dig deeper into Spark Streaming, Feimar.com invited Wang Fuping, a former senior engineer of Baidu Big Data, to share Spark Streaming's advanced features and NDCG computing practice in an online live broadcast.

Here is the main content of the live broadcast:

I. Spark Streaming

1. What is Spark?

Spark is a batch processing framework whose advantages are high performance and a rich ecosystem.

How did we do big data analytics before Spark? We used the Hadoop-based MapReduce framework for data analysis. Even now, traditional MapReduce jobs have not completely disappeared from the market, and MapReduce remains fairly stable in some scenarios with very large data volumes.

2. What is Spark Streaming?

Spark Streaming is a framework that divides incoming data into batches by time and processes each batch. The advantages of the Spark platform make Spark Streaming easy to develop and widely used.

Spark Streaming is implemented based on Spark's batch processing concept, so it can directly use the tool components provided by the Spark platform.

As the figure above shows, we can treat the input of Spark Streaming as a data stream that is split into batches over time, according to our own business needs.

3. Example of WordCount:

Here's an example of WordCount. As you can see, a WordCount is implemented in just a few lines of code. Because the Spark platform connects directly to Hadoop, data can easily be saved to HDFS or a database. With a single Spark platform, both real-time tasks and offline analysis tasks can be run, which is very convenient.
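The code from the broadcast is not reproduced here; below is a minimal sketch of the classic Spark Streaming WordCount, assuming a text stream arriving on a local socket (the host, port, master, and 1s batch interval are illustrative choices):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[2]" and the 1s batch interval are only for local testing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source: lines of text from a socket on port 9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```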

II. Spark Streaming advanced features

1. Window feature:

Building on the simple WordCount example above, let's upgrade it: assume that, every ten seconds, we need to count the words that appeared in the previous minute. This is not possible with the simple WordCount, so we use the Window mechanism provided by Spark Streaming.

The Window feature of Spark Streaming has three parameters: Batch Interval, Window Length, and Slide Interval. For the requirement just described, the window length is 60s, the window slide interval is 10s, and the batch interval is 1s. Note that the window length and the slide interval must both be integer multiples of the batch interval.

Creating a Window stream is actually quite simple. The following two diagrams show how to create a Window stream and its related computation functions.
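Since the diagrams are not reproduced here, the following sketch shows the idea for the requirement above: a word count over the last 60 seconds, recomputed every 10 seconds, on a 1s batch interval (the source and variable names are assumptions):

```scala
// Batch interval 1s; window length 60s; slide interval 10s.
val ssc = new StreamingContext(conf, Seconds(1))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Count each word over the last minute, recomputed every ten seconds.
val windowedCounts = words.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
windowedCounts.print()
```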

The following image shows how to calculate the request failure rate over a 30s window. Looking at its parameters, the window length is set to 30s and the slide interval to 2s. The whole code is very simple: with just one more line of code you get a window stream, on which you can then do ordinary calculations.

To read this function briefly: first create a window stream, then count the number of failed requests in the window and divide it by the total number of requests to get the request failure rate.
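The original code is shown only as an image; here is a rough sketch of the same idea under assumptions: each input line is one request log, failures are marked with the substring "FAIL", and the window is 30s with a 2s slide:

```scala
// window(windowLength, slideInterval) is the "one more line" that turns an
// ordinary stream into a window stream. `requests` is an assumed
// DStream[String] of request log lines.
val windowed = requests.window(Seconds(30), Seconds(2))

windowed.foreachRDD { rdd =>
  val total  = rdd.count()                          // all requests in window
  val failed = rdd.filter(_.contains("FAIL")).count() // failed requests
  if (total > 0) println(s"failure rate = ${failed.toDouble / total}")
}
```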

2. SQL feature:

The second feature of Spark Streaming is the SQL feature, which comes naturally once Spark Streaming data is encapsulated as a DataFrame.

To write SQL against a stream, we first need to register temporary tables. A registered temporary table can also be joined with other temporary tables we have created, which is quite practical.
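As a sketch of registering a micro-batch as a temporary table and joining it with another one (the stream name, schema, and the `users` table are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// logStream is assumed to be a DStream[(String, String, Int)] of
// (user, query, clicks) tuples.
logStream.foreachRDD { rdd =>
  val spark = SparkSession.builder
    .config(rdd.sparkContext.getConf)
    .getOrCreate()
  import spark.implicits._

  // Register this batch as the temporary table "logs".
  rdd.toDF("user", "query", "clicks").createOrReplaceTempView("logs")

  // "users" is another temporary table assumed to be registered elsewhere.
  spark.sql(
    """SELECT l.query, u.region, SUM(l.clicks) AS clicks
      |FROM logs l JOIN users u ON l.user = u.user
      |GROUP BY l.query, u.region""".stripMargin).show()
}
```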

With SQL, custom functions give us a lot of extensibility. There are two ways to define UDFs: loading JAR packages and dynamically defining UDFs.
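For the second way, dynamically defining a UDF looks roughly like this (the function name and logic are illustrative, and `spark` and the `logs` table carry over from the sketch above):

```scala
// Register a Scala function under a SQL-visible name.
spark.udf.register("normalize", (s: String) => s.trim.toLowerCase)

// The UDF can then be used like any built-in function in SQL.
spark.sql(
  "SELECT normalize(query) AS q, COUNT(*) AS c FROM logs GROUP BY normalize(query)"
).show()
```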

3. The CheckPoint mechanism:

Spark uses CheckPoint to save processing state and even the data currently being processed. If a task fails, Spark uses the CheckPoint to restore the data. When we process data, reliability is very important: we must ensure that data is not lost, and Spark's CheckPoint mechanism helps ensure data safety.

There are two main CheckPoint mechanisms: metadata checkpointing and data checkpointing.

So how do you implement CheckPoint?

There are three conditions:
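The slide listing the conditions is not reproduced here. As a sketch, the standard implementation pattern is to build the StreamingContext inside a factory function and recover it with getOrCreate (the checkpoint path is an assumption):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // assumed path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir) // enable checkpointing for this context
  // ... build the DStream graph here ...
  ssc
}

// Recover the context (and its state) from the checkpoint directory if one
// exists; otherwise create it from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```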

Let's compare the two diagrams, with and without WAL. With WAL, received data is first written to HDFS and the task logic is backed up before processing proceeds. When a job fails, the data stored in HDFS is read back, based on the CheckPoint data, to restore the job. In practice, however, this has drawbacks: on the one hand it reduces the performance of the receivers, and on the other it only guarantees at-least-once, not exactly-once, semantics.

To address WAL's shortcomings, Spark Streaming optimizes its Kafka integration and provides the Kafka Direct API for better performance.
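A sketch of the Direct API using the Kafka 0.10 integration (the broker address, group id, and topic name are assumptions):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",        // assumed broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-job", // assumed group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// No receiver and no WAL: executors read Kafka directly and Spark tracks
// the offsets itself.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("events"), kafkaParams)
)
```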

III. NDCG index calculation

1. What is NDCG?

The following two images are concrete examples of NDCG computation.
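The example images are not reproduced here. For reference, NDCG (Normalized Discounted Cumulative Gain) is the standard ranking-quality metric; one common formulation (there are several variants) is:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here rel_i is the relevance of the result at position i, and IDCG@k is the DCG@k of the ideal (best possible) ordering, so NDCG always falls in [0, 1].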

2. NDCG streaming implementation in Spark:

How do we implement NDCG computation with Spark Streaming? First we did a data survey.

Then we started the NDCG calculation.
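The implementation details were shown in slides; the following is only a rough sketch under assumptions: each record has already been parsed into (query, (position, relevance)), and NDCG@10 is computed per query within each micro-batch:

```scala
// Discounted cumulative gain of a relevance list, in rank order.
def dcg(rels: Seq[Double]): Double =
  rels.zipWithIndex.map { case (rel, i) =>
    (math.pow(2, rel) - 1) / (math.log(i + 2) / math.log(2))
  }.sum

val k = 10 // assumed cut-off

// events: DStream[(String, (Int, Double))] keyed by query (an assumption).
val ndcgPerQuery = events.groupByKey().mapValues { hits =>
  val byPosition = hits.toSeq.sortBy(_._1).map(_._2).take(k)    // actual order
  val ideal      = hits.toSeq.map(_._2).sortBy(r => -r).take(k) // best order
  val idealDcg   = dcg(ideal)
  if (idealDcg == 0) 0.0 else dcg(byPosition) / idealDcg
}
ndcgPerQuery.print()
```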

3. Performance guarantee of NDCG:

Developing a data task is not a static, one-off piece of work. To keep the pipeline stable, we need to make a capacity estimate based on the characteristics of the data so that performance can be guaranteed. Capacity estimation is an essential step.

These are our most common capacity adjustments.

During the calculation of the NDCG index we also ran into a problem: NDCG supports combined calculation across four dimensions, and these combinations make the computation considerably more complex.

In this case, multidimensional analysis relies on an OLAP engine; we currently use Druid.

The above three parts are the main content of this live broadcast. At the end, Mr. Wang also answered questions from the audience. What were the questions? Let's take a look.

1. A batch of data is read every 5s, the daily data needs to be traversed for various calculations and analyses, and the calculated results need to be cached as a reference for the next calculation. How can this be achieved?

Mr. Wang: This is a real-time task, and if you need to store state data there are several ways to do it. The first is Spark Streaming's own mechanism for storing state data. The second is to store the state data yourself in an external KV database. The key is how you do it.
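As a sketch of the first approach, updateStateByKey carries a value across batches (the stream name and checkpoint path are assumptions):

```scala
// Stateful transformations require checkpointing to be enabled.
ssc.checkpoint("hdfs:///tmp/state-checkpoint") // assumed path

// batchResults: DStream[(String, Long)] produced by the per-batch analysis.
val runningTotals = batchResults.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    // Fold this batch's values into the cached state from previous batches.
    Some(newValues.sum + state.getOrElse(0L))
}
runningTotals.print()
```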

2. Is there a recommended way to get started?

Mr. Wang: First, learn how to write a stream in Java 8. Then you can understand the computational logic behind Spark Streaming; the main difference is that Spark Streaming is distributed.

3. Spark Streaming is most likely to be replaced by what technology?

Mr. Wang: Each platform has its own advantages and disadvantages. Although Flink is popular at present, Storm still exists; Spark has its own scenarios, and Flink has its own advanced mechanisms, so each platform has its strengths.

In the end, Mr. Wang recommended the most classic book on Scala: Programming in Scala. This live broadcast on Spark Streaming was concise and targeted, and I believe you will gain a lot from it. For more details, you can follow the service account FMI Pegasus and click Pegasus Live in the menu bar.