I believe most readers are no strangers to Flink: ranked No. 1 for three consecutive years, it is the most active Apache open source project in the world, and its popularity in China has stayed high. In recent years, driven by the community, the Flink technology stack has been adopted by more and more companies, and in big data recruiting, Flink skills carry more and more weight. In this article I summarize the questions you are most often asked about Flink in interviews. If you find it helpful, please give it a like after reading!


1. Application architecture

Q: How does your company submit real-time jobs, and how many JobManagers does the cluster have?

Answer:

1. We submit jobs in YARN Session mode. Each submission creates a new Flink cluster, providing a dedicated yarn-session for each job; jobs are independent of one another, which makes them easy to manage. The cluster created for a job disappears once the job completes. Our online submission script is as follows:

bin/yarn-session.sh -n 7 -s 8 -jm 3072 -tm 32768 -qu root.*.* -nm *-* -d

This requests 7 TaskManagers (-n 7) with 8 slots each (-s 8), 3072 MB of memory for the JobManager (-jm 3072), and 32768 MB of memory per TaskManager (-tm 32768); -qu sets the YARN queue, -nm the application name, and -d runs the session detached.

2. By default the cluster has only one JobManager, but to avoid a single point of failure we configure high availability. Our company typically runs one active JobManager and two standby JobManagers, with ZooKeeper coordinating the failover.
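For reference, this is roughly what the ZooKeeper HA settings look like. These are the standard high-availability keys (they normally live in conf/flink-conf.yaml, shown here programmatically); the host names and paths are placeholders, not our actual environment:

```java
import org.apache.flink.configuration.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("high-availability", "zookeeper");
        conf.setString("high-availability.zookeeper.quorum",
                "zk1:2181,zk2:2181,zk3:2181");      // the ZooKeeper ensemble
        conf.setString("high-availability.storageDir",
                "hdfs:///flink/ha/");               // JobManager metadata needed for recovery
        conf.setString("high-availability.cluster-id",
                "/my-flink-cluster");               // ZK root node for this cluster
        System.out.println(conf);
    }
}
```

ZooKeeper only stores pointers for leader election; the actual JobManager metadata goes to the storageDir, which is why it should be a durable file system such as HDFS.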

2. Stress testing and monitoring

Question: How do you do stress testing and monitoring?

Answer: The pressure we run into generally comes from the following aspects:

First, if data flows in faster than the downstream operators can consume it, backpressure arises. Backpressure can be monitored visually in the Flink Web UI (localhost:8081), so it is noticed as soon as it appears. In general, backpressure is often caused by an under-optimized sink operator; for example, when writing to ElasticSearch you can switch to batched writes, increase the ElasticSearch bulk queue size, and so on.
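To make the ElasticSearch point concrete, here is a sketch of batching with the ElasticsearchSink builder, assuming the flink-connector-elasticsearch7 dependency; `stream`, the host, and the index name are placeholders:

```java
import java.util.Collections;
import java.util.List;

import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

List<HttpHost> hosts = Collections.singletonList(new HttpHost("es-host", 9200, "http"));

ElasticsearchSink.Builder<String> builder = new ElasticsearchSink.Builder<>(
        hosts,
        (ElasticsearchSinkFunction<String>) (element, ctx, indexer) ->
                indexer.add(Requests.indexRequest()
                        .index("my-index")
                        .source(Collections.singletonMap("data", element))));

// Batch instead of one request per record: flush every 1000 actions,
// every 5 MB, or at least once a second, whichever comes first.
builder.setBulkFlushMaxActions(1000);
builder.setBulkFlushMaxSizeMb(5);
builder.setBulkFlushInterval(1000);

stream.addSink(builder.build());   // stream: an assumed DataStream<String>
```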

Second, the maximum delay (out-of-orderness) configured for the watermark: setting it too high can create memory pressure. You can keep the delay small and send late elements to a side output stream, updating the results later. Alternatively, use a state backend such as RocksDB, which opens up off-heap (and on-disk) storage at the cost of slower I/O; it is a trade-off.
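A sketch of that "small watermark delay plus side output" pattern; the Tuple2-typed `events` stream and its fields are assumptions, not any specific production job:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// events: an assumed DataStream<Tuple2<String, Long>> with timestamps/watermarks assigned.
OutputTag<Tuple2<String, Long>> lateTag = new OutputTag<Tuple2<String, Long>>("late-data") {};

SingleOutputStreamOperator<Tuple2<String, Long>> result = events
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.minutes(1))   // keep window state open a bit longer for updates
        .sideOutputLateData(lateTag)        // anything later still ends up in the side output
        .sum(1);

DataStream<Tuple2<String, Long>> late = result.getSideOutput(lateTag);
late.print();  // in practice: patch the earlier results with these downstream
```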

Third, sliding windows: if the window length is long and the slide is very short, Flink's performance degrades badly. We mainly use time sharding, so that each element is stored in only one non-overlapping "pane" instead of every overlapping window, greatly reducing state writes during window processing. (Details link: Flink sliding window optimization)
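Roughly, the idea looks like the following two-stage sketch (my illustration, not the exact code from the linked article): count inside non-overlapping 10-second panes first, then slide the one-hour window over the much smaller pane counts:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// events: an assumed DataStream<Tuple2<String, Long>> of (key, 1L) records.
DataStream<Tuple2<String, Long>> paneCounts = events
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))   // pane size == slide
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));         // each raw element touches state once

DataStream<Tuple2<String, Long>> hourly = paneCounts
        .keyBy(t -> t.f0)
        .window(SlidingEventTimeWindows.of(Time.hours(1), Time.seconds(10)))
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));         // merges ~360 small pane counts
```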

Fourth, we use RocksDB as the state backend, and we have not yet run into problems with the state size overwhelming it.

3. Why Flink

Question: Why Flink instead of Spark?

Answer: The main considerations are Flink's low latency, high throughput, and better support for streaming application scenarios. In addition, Flink handles out-of-order data well and can guarantee exactly-once state consistency.

4. Understanding checkpoints

Q: How do you understand Flink's checkpoints?

Answer: Checkpointing is the core of Flink's fault-tolerance mechanism. According to its configuration, Flink periodically snapshots the state of each operator/task in the stream and persists that state data durably. If the Flink program crashes unexpectedly, you can selectively recover from these snapshots when rerunning it, correcting data anomalies caused by the failure. Snapshots can be stored in memory, in a file system, or in RocksDB.
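As a quick illustration, enabling checkpointing in a job looks roughly like this; the intervals below are examples for the sketch, not recommendations:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);   // snapshot all operator state every 60s
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000); // breathing room between snapshots
        env.getCheckpointConfig().setCheckpointTimeout(120_000);         // abandon checkpoints that take too long
    }
}
```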

5. Exactly-once guarantee

Question: How does Flink guarantee exactly-once if the underlying storage does not support transactions?

Answer: End-to-end exactly-once places high demands on the sink. There are two approaches: idempotent writes and transactional writes. Idempotent writes depend on the business logic, so transactional writes are used more often. Transactional writes come in two flavors: write-ahead logging (WAL) and two-phase commit (2PC).

If the external system does not support transactions, you can use a write-ahead log: buffer the result data as state, and write it to the sink system only once the checkpoint-complete notification arrives.
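A minimal sketch of that write-ahead idea (this is my illustration, not Flink's built-in GenericWriteAheadSink), assuming Flink 1.12+ and a hypothetical ExternalClient. Note that if the job crashes in the middle of the flush itself, some records may be written twice, so strictly speaking WAL gives "almost" exactly-once:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class WalSinkSketch extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private transient ListState<String> pending;          // the "log": results parked in state
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);                                // do NOT write out yet
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        pending.update(buffer);                           // persist buffered results with the checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        pending = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending", String.class));
        if (ctx.isRestored()) {
            buffer.clear();
            for (String v : pending.get()) buffer.add(v); // re-load unflushed results after a failure
        }
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        ExternalClient.bulkWrite(buffer);                 // flush only now, after the checkpoint succeeded
        buffer.clear();
    }

    /** Hypothetical external-system client, standing in for e.g. a DB bulk API. */
    static class ExternalClient {
        static void bulkWrite(List<String> records) { /* write the batch out */ }
    }
}
```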

6. State mechanism

Q: Can you talk about Flink's state mechanism?

Answer: Many of Flink's built-in operators, including sources and sinks, are stateful. In Flink, state is always associated with a specific operator. Flink snapshots the state of each task via checkpoints to guarantee state consistency during failure recovery. State and checkpoint storage are managed through a state backend, which offers several configuration options.
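For example, here is a sketch of keyed state using ValueState, the most basic form of managed state (the class and state names are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// A running count per key; the state is snapshotted automatically at each checkpoint.
public class CountPerKey extends RichFlatMapFunction<String, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value();                    // scoped to the current key
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        out.collect(next);
    }
}
// Usage: stream.keyBy(s -> s).flatMap(new CountPerKey());
```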

7. Deduplicating a huge number of keys

Question: How do you deduplicate? Consider a real-time scenario: Double Eleven, a sliding window 1 hour long with a 10-second slide, and 100 million users. How do you compute UV (unique visitors)?

Answer: A data structure like Scala's Set or Redis's set obviously will not work, because with hundreds of millions of keys it will not fit in memory. So consider using a Bloom filter for deduplication.
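A sketch of the core deduplication step using Guava's BloomFilter (an assumed dependency here); in a real Double Eleven job this would more likely be a Redis bitmap with hand-rolled hash functions, so the filter is shared across tasks and survives restarts:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class UvSketch {
    public static void main(String[] args) {
        // 100M expected ids at a 1% false-positive rate costs on the order of
        // 115 MB, versus many GB for an exact set of the ids themselves.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000_000, 0.01);

        List<String> userIds = List.of("u1", "u2", "u1");   // stand-in for the window's ids
        long uv = 0;
        for (String id : userIds) {
            if (!seen.mightContain(id)) {   // "definitely not seen" -> a new user
                seen.put(id);
                uv++;
            }                               // a false positive slightly undercounts: UV is approximate
        }
        System.out.println("approximate UV = " + uv);
    }
}
```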

8. Compare Checkpoint with Spark

Q: How is Flink’s checkpoint mechanism different and better than Spark’s?

Answer: Spark Streaming's checkpoint is merely a checkpoint of data and metadata used for driver failure recovery. Flink's checkpoint mechanism is much more sophisticated: it uses lightweight distributed snapshots (barrier-based, in the Chandy-Lamport style), capturing the state of each operator together with the data in flight.

9. Watermark

Question: Explain more about Flink’s Watermark mechanism.

Answer: When processing stream data with EventTime, you run into out-of-order data. From the moment an event occurs, through the Source and on to the Operators, some time elapses. In most cases the data reaches an Operator in event-time order, but network latency and similar factors can produce out-of-order data; in particular, with Kafka, ordering cannot be guaranteed across multiple partitions. A Window computation cannot wait indefinitely, so there must be a mechanism guaranteeing that after a certain time the Window is triggered to compute; that particular mechanism is the Watermark. Watermarks are how out-of-order events are handled.

During Flink's window processing, if it is confirmed that all data have arrived, the Window computation (such as aggregating and grouping) can run over all the window's data; otherwise, it keeps waiting for the rest of the window's data. This is where WaterMarks come in: they measure the progress of data processing (expressing how complete the arrival of data is), so that Flink either knows the event data have (all) arrived, or still computes correct and continuous results as expected even when data arrive out of order or late.
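In code (the Flink 1.11+ API), a bounded-out-of-orderness watermark looks like this; Event and its timestamp field are assumed types for the sketch:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// events: an assumed DataStream<Event>, where Event has a long "timestamp" field (ms).
DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // tolerate 5s of disorder
                .withTimestampAssigner((event, previous) -> event.timestamp));
// The watermark trails the largest event time seen by 5s; a window fires
// once the watermark passes the window's end, instead of waiting forever.
```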

10. How to implement exactly-once

Question: How is the exactly-once semantics implemented in Flink and how is the state stored?

Answer: Flink relies on its checkpoint mechanism to implement exactly-once semantics within the job. To achieve end-to-end exactly-once, the external source and sink must also meet certain conditions (a replayable source, and a transactional or idempotent sink). State storage is managed through state backends, and different state backends can be configured in Flink.
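On the storage side, here is a sketch of configuring RocksDB together with durable checkpoint storage (the Flink 1.13+ API; the HDFS path is a placeholder):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));   // true = incremental checkpoints
env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
```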

11. CEP

Question: In Flink CEP programming, where is the data kept while the events needed to complete a match have not yet arrived?

Answer: In the stream, of course. CEP supports EventTime, so it also supports late data via the watermark processing logic. CEP treats event sequences that have not yet matched similarly to late data: in Flink CEP's processing logic, both unsatisfied partial matches and late data are buffered in a Map-like data structure. In other words, if we bound the match window for an event sequence to 5 minutes, up to 5 minutes of data are held in memory, which in my view is one of the bigger hits to memory.
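That 5-minute bound is exactly what the within(...) clause expresses; a sketch assuming a LoginEvent type with status and userId fields:

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// loginStream: an assumed DataStream<LoginEvent> with event-time watermarks assigned.
Pattern<LoginEvent, ?> twoFails = Pattern.<LoginEvent>begin("first")
        .where(new SimpleCondition<LoginEvent>() {
            @Override public boolean filter(LoginEvent e) { return "fail".equals(e.status); }
        })
        .next("second")
        .where(new SimpleCondition<LoginEvent>() {
            @Override public boolean filter(LoginEvent e) { return "fail".equals(e.status); }
        })
        .within(Time.minutes(5));   // partial matches are buffered for at most 5 minutes

PatternStream<LoginEvent> matches = CEP.pattern(loginStream.keyBy(e -> e.userId), twoFails);
```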

12. Three time semantics

Question: What are Flink's three time semantics, and what are their respective application scenarios?

Answer:

  • Event Time: the time at which the event itself was created. This is the most common time semantics in practice, and it is usually combined with watermarks.

  • Processing Time: the local system time of the machine executing each time-based operator; it is machine-dependent. Scenario: there is no usable event time, or the latency requirements are extremely strict.

  • Ingestion Time: the time at which the data enters Flink. Scenario: with multiple Source operators, each Source can stamp records at ingestion using its own local system clock, and that timestamp is then used by all subsequent time-based operations. (A sketch of selecting the time semantics follows this list.)
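Selecting the semantics is a one-liner; before Flink 1.12 it had to be chosen explicitly (since 1.12 event time is the default and this setter is deprecated):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Alternatives: TimeCharacteristic.ProcessingTime, TimeCharacteristic.IngestionTime
```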

13. Processing of data peaks

Question: How does the Flink program handle data spikes?

Answer: Put a high-capacity Kafka in front: data first lands in the message queue, which Flink then consumes as its source. This smooths out the spikes, though it costs a little real-time latency.
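A sketch of Kafka as that buffer, using the classic FlinkKafkaConsumer (topic and broker addresses are placeholders; newer Flink versions offer KafkaSource instead):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker1:9092");
props.setProperty("group.id", "flink-consumer");

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer<>("events-topic", new SimpleStringSchema(), props));
// During a spike the backlog simply accumulates in Kafka and Flink drains it
// at its own pace: stability at the cost of a little extra latency.
```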

Summary

These Flink interview questions are not difficult, but they are enough to test a data engineer's fundamentals. Keep up with the technology if you want your career to keep moving! That's it for this article; more good material will follow, so stay tuned! The more you know, the more you realize you don't know. I'm Alice, and I'll see you next time!

Bonus

To encourage everyone to summarize what they learn, you can post your own mind map here. If you would like a copy, follow the blogger's personal WeChat public account [Simman Fungus] and reply "Mind map" in the background to get it.

There is also a large collection of interview experience write-ups to give away.