What information a Kafka message contains

A Kafka message consists of a fixed-length header and a variable-length body. The header consists of a one-byte magic (format version) and a four-byte CRC32 (used to check whether the body is intact). When magic is 1, there is an extra byte between magic and CRC32: attributes (which records things such as whether the message is compressed and the compression format). If magic is 0, the attributes field is absent.

The body is an n-byte message body that contains the actual key/value payload.

How to check the Kafka offset

For Kafka 0.9 and later, the new consumer client can be used: consumer.seekToEnd() / consumer.position() return the current latest offset.
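
A minimal sketch of that approach, assuming a recent (0.10.1+) Kafka Java client on the classpath; the broker address and topic name below are placeholders, not values from the article:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object LatestOffsets {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("group.id", "offset-checker")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      // Build a TopicPartition for every partition of the topic
      val partitions = consumer.partitionsFor("my-topic").asScala
        .map(p => new TopicPartition(p.topic, p.partition))

      consumer.assign(partitions.asJava)
      consumer.seekToEnd(partitions.asJava)              // move to the log end
      partitions.foreach { tp =>
        // position() now reports the latest (log-end) offset of each partition
        println(s"$tp -> ${consumer.position(tp)}")
      }
    } finally {
      consumer.close()
    }
  }
}
```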

Hadoop shuffle process

1. Shuffle on the Map end

The Map side processes the input data and produces intermediate results, which are written to local disk instead of HDFS. The output of each Map is written to the memory buffer first. When the written data reaches a preset threshold, the system starts a thread to write the buffer data to the disk, a process called spill.

Before spill data is written, it is sorted by the partition it belongs to, and within each partition it is sorted by key. The purpose of partitioning is to divide records among the Reducers to achieve load balancing; each Reducer later reads only the data of its own partitions. Then the combiner runs (if one is set). A combiner is essentially a Reducer: by performing a reduce on the data about to be spilled, it cuts down the amount of data written to disk. Finally, the data is written to local disk as a spill file. Spill files are stored in the directory specified by {mapred.local.dir} and are deleted after the Map task completes.

Finally, each Map task may generate multiple spill files. Before a Map task completes, a multi-way merge is used to combine these spill files into a single file. At this point, the Map-side shuffle is complete.

2. Shuffle on the Reduce end

Shuffle on the Reduce end consists of three phases: copy, sort (merge), and reduce.

First, the output files generated on the Map side must be copied to the Reduce side. But how does each Reducer know which data it should process? When the Map side performs partitioning, it effectively assigns every record to a Reducer (partitions correspond to Reducers), so a Reducer only needs to copy the data of its own partitions. Each Reducer handles one or more partitions, but it must first copy the data of those partitions from the output of every Map.

Next comes the sort stage, also known as the merge stage, because the main job of this stage is to perform merge sort. Data copied from the Map end to the Reduce end is ordered, so it is suitable for merge sort. Eventually, a large file is generated on the Reduce side as input to Reduce.

Finally, there is the Reduce process, where the final output is produced and written to HDFS.

Spark cluster computing modes

Spark has a variety of cluster computing modes, such as Standalone, on YARN, on Mesos, and on cloud platforms. Standalone mode is sufficient for most cases, and if the enterprise already has a YARN or Mesos environment, deploying on it is also convenient.

Standalone (cluster mode): a typical Master/slave architecture, which also means the Master is a single point of failure; Spark supports ZooKeeper-based HA for the Master

On YARN (cluster mode): runs on the YARN resource manager framework. YARN is responsible for resource management, and Spark is responsible for task scheduling and computation

On Mesos (cluster mode): runs on the Mesos resource manager framework. Mesos is responsible for resource management, and Spark is responsible for task scheduling and computation

On cloud (cluster mode): for example, AWS EC2, which makes it easy to access Amazon S3. Spark supports multiple distributed storage systems, such as HDFS and S3
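
The mode is usually selected through the master URL. A hedged sketch (for example in spark-shell); the host names and ports here are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Standalone: point the master URL at the Spark Master (placeholder host and port)
val spark = SparkSession.builder()
  .appName("cluster-mode-demo")
  .master("spark://master-host:7077")
  .getOrCreate()

// For the other modes only the master URL changes; the application code stays the same:
//   .master("yarn")                      // YARN manages the resources
//   .master("mesos://mesos-host:5050")   // Mesos manages the resources
// In practice the master is usually passed via spark-submit --master rather than hard-coded.
```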

HDFS reads and writes data

Read:

1. The client communicates with the Namenode to query metadata and find the Datanode servers where the file blocks reside

2. The client selects a Datanode (nearest first, then random) and requests to establish a socket stream

3. The Datanode starts sending data (reading it from disk into the stream, with checksum verification performed per packet)

4. The client receives the packets one by one, caches them locally, and then writes them to the target file
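
From the client's point of view, the steps above are hidden behind the FileSystem API. A minimal read sketch, assuming the Hadoop client libraries are on the classpath; the Namenode address and path are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020")  // placeholder Namenode address

    val fs = FileSystem.get(conf)
    // open() asks the Namenode for block locations, then streams data from the Datanodes
    val in = fs.open(new Path("/data/input.txt"))           // placeholder path
    try {
      IOUtils.copyBytes(in, System.out, 4096, false)        // read packets and write them out
    } finally {
      in.close()
      fs.close()
    }
  }
}
```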

Write:

1. The client communicates with the Namenode to request uploading a file, and the Namenode checks whether the target file already exists and whether the parent directory exists

2. Namenode returns whether the file can be uploaded

3. The client asks the Namenode which Datanode servers the first block should be transferred to

4. The Namenode returns three Datanode servers: A, B, and C

5. The client requests one of the three Datanodes, A, to upload data (essentially an RPC call that establishes a pipeline). A forwards the request to B, and B forwards it to C, completing the pipeline; the acknowledgement is then returned to the client step by step

6. The client starts uploading the first block to A in packet units (reading data from disk into a local memory cache). Each packet A sends is placed into a reply queue to wait for acknowledgement

7. When one block has been transferred, the client again asks the Namenode which Datanode servers the next block should be uploaded to.
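
The write path is likewise wrapped by the FileSystem API. A hedged sketch with placeholder address and path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020")  // placeholder Namenode address

    val fs = FileSystem.get(conf)
    // create() asks the Namenode for permission and target Datanodes,
    // then the client streams packets through the Datanode pipeline
    val out = fs.create(new Path("/data/output.txt"))       // placeholder path
    try {
      out.write("hello hdfs\n".getBytes("UTF-8"))
      out.hflush()                                           // push buffered packets to the pipeline
    } finally {
      out.close()
      fs.close()
    }
  }
}
```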

Which is better for an RDD, reduceByKey or groupByKey? Why

reduceByKey: reduceByKey merges values locally on each mapper before sending results to the reducers, similar to a combiner in MapReduce. Because a reduce is performed on the map side first, the amount of data is greatly reduced, cutting transfer costs and letting the reduce side compute the result faster.

groupByKey: groupByKey gathers all the values for each key into an Iterator. This aggregation happens on the reduce side, so all the data is transmitted over the network, resulting in unnecessary waste. It may also cause OutOfMemoryError if the amount of data is very large.

Based on the comparison above, reduceByKey is recommended for reduce operations over large amounts of data. It is not only faster, but it also avoids the memory problems that groupByKey can cause.
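
A small sketch contrasting the two; the sample data is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reduce-vs-group").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

    // reduceByKey: partial sums are computed on each map-side partition before the shuffle
    val counted = pairs.reduceByKey(_ + _)

    // groupByKey: every value crosses the shuffle, then we sum on the reduce side
    val grouped = pairs.groupByKey().mapValues(_.sum)

    println(counted.collect().toList)  // e.g. List((a,3), (b,2))
    println(grouped.collect().toList)  // same result, more shuffle traffic

    spark.stop()
  }
}
```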

How to obtain the difference set of data in Spark SQL

On the SQL side Spark SQL accepts the EXCEPT keyword; on the API side, a DataFrame provides except() and an RDD provides subtract(), both of which compute the difference set.
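
A sketch of both approaches, with sample data invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object DiffSetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("diff-set").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq(1, 2, 3, 4).toDF("id")
    val right = Seq(3, 4, 5).toDF("id")

    // DataFrame API: rows in left that are not in right
    left.except(right).show()   // 1 and 2

    // RDD API: subtract gives the same difference set on plain RDDs
    val leftRdd  = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    val rightRdd = spark.sparkContext.parallelize(Seq(3, 4, 5))
    println(leftRdd.subtract(rightRdd).collect().toList)   // e.g. List(1, 2)

    spark.stop()
  }
}
```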

Understanding of Spark 2.0

Simpler: ANSI SQL support and more reasonable APIs

Faster: Spark acts as a compiler, generating optimized code at runtime (whole-stage code generation)

More intelligent: Structured Streaming

How RDD wide and narrow dependencies are distinguished

Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operators such as groupByKey, reduceByKey, and sortByKey generate wide dependencies and cause a shuffle

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. Operators such as map, filter, and union generate narrow dependencies
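
One way to see the difference is the shuffle boundary in the RDD lineage. A sketch using toDebugString, with made-up data:

```scala
import org.apache.spark.sql.SparkSession

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("deps").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val narrow = pairs.map { case (k, v) => (k, v * 2) }   // narrow dependency: no shuffle
    val wide   = pairs.reduceByKey(_ + _)                  // wide dependency: introduces a ShuffledRDD

    // The lineage printout shows an extra stage only for the wide dependency
    println(narrow.toDebugString)
    println(wide.toDebugString)

    spark.stop()
  }
}
```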

Two ways Spark Streaming reads Kafka data

The two methods are:

Receiver-based

This approach uses Kafka’s high-level consumer API. The data the Receiver fetches from Kafka is stored in Spark executor memory, and the jobs started by Spark Streaming then process that data. However, with the default configuration this approach can lose data when the underlying node fails. To achieve high reliability and zero data loss, Spark Streaming must enable the Write Ahead Log (WAL), which synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS. Then, even if the underlying node fails, the data can be recovered from the write-ahead log.
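
A hedged sketch of the receiver-based approach using the spark-streaming-kafka-0-8 integration; the ZooKeeper address, group id, topic, and checkpoint path are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("receiver-based")
      // Enable the write-ahead log so received data survives executor failures
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://namenode-host:8020/checkpoints")   // WAL needs a checkpoint directory

    // High-level consumer API: ZooKeeper quorum, consumer group, topic -> receiver thread count
    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```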

Direct

Spark 1.3 introduced the Direct approach to replace the Receiver. In this approach, Kafka is periodically queried to obtain the latest offset of each topic+partition, which defines the offset range of each batch. When the batch’s processing job starts, Kafka’s simple consumer API is used to read the data in the specified offset ranges.
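
A corresponding sketch of the Direct approach, again with placeholder broker and topic names:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-approach")
    val ssc = new StreamingContext(conf, Seconds(5))

    // No receiver: each batch reads a computed offset range straight from the brokers
    val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```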

Does Kafka’s data reside in memory or disk

At the heart of Kafka is the idea of using disk rather than memory. Most people would assume memory is faster than disk, and I was no exception. But after studying Kafka’s design, reviewing benchmark data, and running my own tests, I found that sequential disk reads and writes can match memory speed.

Linux also optimizes disk reads and writes with read-ahead, write-behind, and the page cache. Meanwhile, Java objects carry a large memory overhead, and JVM GC pauses grow as the heap gets larger. Using the disk therefore has several advantages:

The disk cache is maintained by the Linux system, which saves a lot of programmer work.

The sequential disk read/write speed exceeds the random memory read/write speed.

The JVM’s GC is inefficient and has a large memory footprint. Using disks can avoid this problem.

The disk cache is still available after the system is cold booted.

How to prevent Kafka data loss

The producer side:

At a high level, ensuring data reliability and safety requires replication: set an appropriate replication factor so that each partition has backup copies.
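
A hedged sketch of common producer-side reliability settings; the broker address is a placeholder and these are standard Kafka client configs rather than values from the article:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ReliableProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker-host:9092")   // placeholder broker
    props.put("acks", "all")                              // wait for all in-sync replicas to acknowledge
    props.put("retries", "3")                             // retry transient send failures
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // get() blocks until the send is acknowledged, so errors surface to the caller
      producer.send(new ProducerRecord[String, String]("my-topic", "key", "value")).get()
    } finally {
      producer.close()
    }
  }
}
```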

The broker side:

Create the topic with multiple partitions and spread the partitions across machines. The number of partitions should be greater than the number of brokers so that partitions are distributed evenly among them.

A partition is the unit of parallelism for Kafka reads and writes, and it is the key to increasing Kafka’s throughput.

The consumer side:

Message loss on the consumer side is simple to explain: if the offset is committed before the message is fully processed, data can be lost. Since the Kafka consumer automatically commits offsets by default, you must ensure the message has been properly processed before the offset is committed in the background. Heavy processing logic is therefore not recommended; if processing takes a long time, put the logic in another thread. To avoid data loss, two suggestions (see the sketch after this list):

Set enable.auto.commit=false to disable automatic offset commits

Manually commit the offset after the message has been fully processed
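
A minimal sketch of that pattern, assuming a 2.x or newer Kafka client (where poll takes a Duration); broker, group, and topic names are placeholders:

```scala
import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ManualCommitConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker-host:9092")   // placeholder broker
    props.put("group.id", "my-group")
    props.put("enable.auto.commit", "false")              // turn off automatic offset commits
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Seq("my-topic").asJava)

    try {
      while (true) {
        val records = consumer.poll(Duration.ofMillis(500))
        records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))  // process the batch
        consumer.commitSync()                              // commit only after processing succeeds
      }
    } finally {
      consumer.close()
    }
  }
}
```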

A technical article every day; you are welcome to follow us. WeChat official account: jianglogo521

We’ll continue with the interview questions tomorrow

Interview Question Analysis (II) preview:

Redis performance optimization: whether increasing the number of CPU cores on a single machine improves performance

Why choose Kafka for data collection

What problems were encountered in the project? Was there any data loss? How was it solved

How stages are divided for an RDD

Difference between reduceByKey and groupByKey in an RDD

Whether restarting Kafka causes data loss

Whether restarting Spark Streaming causes data loss

Questions about checkpoints