The article directories
Hello everyone, I'm Yunqi!
Before, I did technology sharing of Flink and Realtime Compute in Ali Cloud for my team mates. Today, I sorted out the PPT content and shared it with you (multi-picture warning) ?.
Flink originated as a research project at the Technical University of Berlin, Stratosphere, until it was donated to the Apache Software Foundation in April 2014...
Remember, back in 2015, Filnk was almost unknown, much less widely used ?.
Alibaba is one of the first companies in the world to use Flink to develop big data computing engine. It was introduced into the company in 2015, but at the earliest Flink can only support data processing in small traffic Internet scenarios. Ali thought Flink had so much potential that he decided to revamp it and named the in-house version Blink, which means "Blink, everything is calculated!
On November 11, 2017, Blink successfully supported the real-time computing task of all transaction data of the whole group (Alibaba, Aliyun and Cainiao), and also verified that Flink could transform the scene supporting large-scale data computing of enterprises ?.
At present, many domestic Internet companies have fully embraced Flink. This sharing is to analyze the knowledge points (capabilities, limitations, typical scenarios, differences) related to real-time computing Flink and Alibaba Cloud Realtime Compute.
What is Apache Flink?
What is the lifeblood of Apache Flink in one sentence?
My answer would probably be that Apache Flink designed its system with the understanding that a batch is a special case of flow.
In terms of Apache Spark and Apache Flink, the two most popular streaming computing engines, who will eventually become No1?
In terms of "low latency", Spark is in Micro Batching mode with a delay of 0.5 to 2 seconds, while Flink is in Native Streaming mode with a delay of microseconds.
It was obviously Apache Flink who was a relative latecomer. So why is Apache Flink so "fast"? The root cause is that Apache Flink was designed from the very beginning to think that "a batch is a special case of Streaming" and that the entire system is Native Streaming designed to trigger calculations with each incoming piece of data. Compared with the Micro Batching mode, which requires time to accumulate data, it has an absolute advantage in architecture.
So why are there two modes of flow computation?
The fundamental reason is that the cognition of flow computing is different, which is two different cognitive products: "flow is a special case of batch" and "batch is a special case of flow".
First of all, I think Flink application development requires an understanding of Flink's Streams, State, Time and other basic processing semantics, as well as Flink's multi-level API that combines flexibility and convenience.
To those who are bounded by a bounded stream is to have unlimited data Streams. And bounded stream is restricted to the size of the stead fast data collection, namely the limited data streams, the difference between the two infinite data streams of data will be deduction and continue to increase with time, the state of ongoing is calculated and there is no end, relatively limited data flow data size is fixed, computing will eventually completed and is in a state of end.
In Spark's worldview, everything is made up of batches. Offline data is a batch, while real-time data is made up of an infinite number of small batches.
In Flink's world view, everything is made up of streams. Offline data is a flow with boundaries, while real-time data is a flow without boundaries, which is called bounded flow and unbounded flow.
State: The State is the data information during calculation, which plays an important role in fault-tolerant recovery and Checkpoint. Stream computing is essentially Incremental Processing, so continuous query is required to maintain the State. In addition, to ensure exact-once semantics, data needs to be written to the state; Persistent storage, on the other hand, guarantees Exactly once in case the entire distributed system fails or dies, which is another value of state.
Streaming computing can be divided into stateless and stateful cases. Stateless calculations observe each independent event and output results based on the last event. - For example, the flow processing application receives a temperature reading from the sensor and issues a warning when the temperature exceeds 90 degrees.
Stateful computation outputs results based on multiple events. Here are some examples:
- All types of Windows. For example, calculating the average temperature over the past hour is a stateful calculation
- All state machines for complex event processing. For example, a warning is issued if two temperature readings that differ by more than 20 degrees within a minute are received, which is stateful calculation
- All operations associated with flows, as well as with static or dynamic tables, are stateful calculations
Time can be divided into Event Time, Ingestion Time and Processing Time. Flink's infinite data flow is a continuous process, and Time is an important basis for us to judge whether the business status is lagging behind and whether the data Processing is timely.
EventTime, because we are counting logs based on when they were generated.
- There are different application scenarios at different semantic times
- We tend to care more about EventTime
API is usually divided into three layers, which can be divided into SQL/Table API, DataStream API and ProcessFunction from top to bottom. The EXPRESSION ability and service abstraction ability of API are very strong, but the closer to the SQL layer, the expression ability will gradually weaken, and the abstraction ability will be enhanced. On the other hand, the ProcessFunction layer API is very expressive and can do a variety of flexible and convenient operations with relatively little abstraction.
In fact, most applications do not require the underlying abstractions described above and instead program against Core APIs such as DataStream (bounded or unbounded streaming data) and DataSet API (bounded data sets). These apis provide common building blocks for data processing, like user-defined transformations, joins, aggregations, Windows, and so on. The DataSet API provides additional support for bounded datasets, such as loops and iterations. The data types handled by these apis are represented by their respective programming languages in the form of classes.
- First, Flink has a unified framework to deal with both bounded and unbounded data flows.
- Second: flexible deployment, Flink bottom support a variety of resource schedulers, including Yarn, Kubernetes, etc. Flink's own Standalone scheduler is also flexible in deployment.
- Third: extremely high scalability. Scalability is very important for distributed systems. Alibaba Double 11 large screen uses Flink to process massive data, and the peak value of Flink can reach 1.7 billion/second.
- Fourth: Extreme streaming performance. Compared with Storm, Flink's biggest feature is that it completely abstractions state semantics into the framework, supports local state reading, avoids a lot of network IO, and greatly improves the performance of state access.
Here are three common applications of Flink:
- Number of real-time warehouse
When the downstream is building a real-time warehouse, the upstream may need a real-time Stream ETL. In this process, the data will be cleaned or expanded in real time. After the cleaning, the data will be written to the whole link of the downstream real-time data warehouse, which can ensure the timeliness of data Query and form real-time data collection, real-time data processing and real-time Query in the downstream.
- Search engine recommendations
The search engine takes Taobao as an example. When a seller launches a new product, the background will generate a real-time message flow, which will be processed and expanded through the Flink system. The processed and expanded data is then indexed in real time and written to the search engine. In this way, when taobao sellers put new products online, they can achieve search engine search in seconds or minutes.
- User behavior analysis in mobile applications
- AD hoc queries of real-time data in consumer technology
After triggering some rules, Data Driven will process or give early warning, and these early warning will be sent to the downstream to generate business notification. This is the application scenario of Data Driven, and Data Driven is more applied to the processing of complex events in the application.
- Real-time recommendations (e.g. product recommendations while the customer is browsing the merchant's page)
- Pattern recognition or complex event processing (e.g. fraud identification based on credit card transaction records)
- Exception detection (e.g. computer network intrusion detection)
Here are some of the advantages of Apache Flink:
As a distributed processing engine, Flink performs multiple local state operations in a distributed scenario and generates only one globally consistent snapshot. If a globally consistent snapshot needs to be generated without interrupting the operation value, it involves distributed state fault tolerance.
If there are a lot of beads on the necklace, you don't want to count them all over again, especially if the three are trying to work together at different speeds (for example, if you want to keep track of how many beads you counted in the previous minute, think of the one-minute scroll window.
A better idea is to tie a colored rubber band loosely at intervals around the necklace to separate the beads. When the bead is plucked, the rubber band can also be plucked; Then you arrange for an assistant to record the total while you and your friend dial the rubber band. In this way, when there are wrong numbers, there is no need to start from the beginning. Instead, you alert the others to the error, and you both repeat the number starting at the last rubber band, while the assistant tells everyone the starting number.
Flink periodically maintains consistent checkpoints of state during the execution of the flow application
In the event of a failure, Flink will use the most recent checkpoint to consistently restore the state of the application and restart the processing process after encountering a failure
- The first step is to reboot
- The second step is to read the state from checkpoint and reset the state. When the application is restarted from checkpoint, its internal state is exactly the same as when the checkpoint is complete
- Step 3: Start spending and deal with all the data between checkpoints to malfunction This checkpoint save and restore mechanism can provide a "precision" for the application (exactly - once) consistency, because all the operator will save all the checkpoints and restore the state, so that all the input stream will be reset to the location of the checkpoint is complete
- One simple idea is to pause the application, save the state to a checkpoint, and then resume the application again
- Flink's improved implementation of distributed snapshot based on Chandy-Lamport algorithm separates checkpoint preservation from data processing without suspending the entire application
- Flink's checkpoint algorithm uses a special form called a barrier to separate data on a stream at different checkpoints
- State changes caused by data coming before the boundary are included in the check point of the current boundary. Any changes made based on the data after the boundary are included in subsequent checkpoints
When a data source receives a Checkpoint barrier N, it stores its state. For example, when a data source reads a Kafka partition, the state of the data source is its current location in the Kafka partition. This state is also written into the table above.
Downstream Operator 1 will start working on data belonging to Checkpoint barrier N, When Checkpoint barrier N follows the flow to Operator 1,Operator 1 reflects all data belonging to Checkpoint barrier N in the state. If a Checkpoint barrier N is received, a snapshot is created for the Checkpoint.
Distributed snapshots are used for status fault tolerance. When a node fails, it can be recovered at its previous Checkpoint. Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink Job Manager Flink For example, Checkpoint N + 1, Checkpoint N + 2, etc., can be synchronized. By using this mechanism, Checkpoint can be continuously generated without blocking the operation.
Complete tables can be fault-tolerant.
The scope of the operator state is limited to the operator task. This means that all data processed by the same parallel task can access the same state, which is shared for the same task. Operator state cannot be accessed by another task of the same or different operator.
Keying state is maintained and accessed based on keys defined in the input data stream. Flink maintains a state instance for each key value and partitions all data with the same key into the same operator task that maintains and processes the corresponding state for that key.
Flink provides three basic data structures for operator states... . Keyed State supports four data types...
MemoryStateBackend / FsStateBackend / RocksDBStateBackend
The JVM Heap state can be read or written using Java Object Read/writes for every value that needs to be read. But when Checkpoint calls the local state of each operation to the Distributed Snapshots, it needs to be serialized.
The local state backend of the Runtime allows the user to read the state through disk, which is equivalent to maintaining the state on disk, at the cost of serialization and deserialization every time the state is read. When a snapshot is required, the application is serialized, and the serialized data is directly transferred to the central shared DFS.
Flink actually uses Watermarks to realize the function of Event-time.
Watermarks is also a special event in Flink, the essence of which is that when an operand receives Watermarks with a timestamp "T" it means that it will not receive new data. The advantage of using Watermarks is that you can accurately predict the deadline for receiving data.
For example, suppose that the difference between the expected time of receiving data and the time of outputing results is delayed by 5 minutes, then all Windows operators in Flink search for data from 3 o 'clock to 4 o 'clock, but need to wait another 5 minutes until the collection is complete because of the delay 4: Data at 05:00, at which point the data collection at 4pm is judged to have been completed, and then the data results at 3pm to 4pm will be produced. The results for this time period correspond to the Watermarks section.
Watermark is the "window closing time" of the previous window, once the closing time is triggered, all data within the window based on the current time is added to the window. As long as the water level is not reached then no matter how long the actual time advances the window will not trigger the closing.
The difference between Savepoint and Checkpoint is that Checkpoint is generated periodically by Flink for a stateful application using distributed snapshots. Savepoint is generated manually. Savepoint records the state of all operators in a streaming application.
Savepoints are generated by manually inserting Checkpoint barriers into all pipelines to generate distributed snapshots. These distributed snapshot points are savepoints. Savepoint can be stored anywhere, and when changes are complete, they can be recovered and executed directly from Savepoint.
Flink vs. Blink
The disadvantages of Spark Streaming and others come to the surface when designing a low-latency, exactly once, unified Streaming and batch engine capable of supporting complex computationally large enough.
Spark Streaming is essentially a microbatch-based computing engine. One inherent drawback of this engine is the high scheduling overhead per microbatch, and the lower the latency we require, the higher the additional overhead. This results in Spark Streaming actually not being particularly well suited for second or even subsecond calculations.
Kafka Streams was built from a journalling system that was designed to be lightweight and simple enough to use. This is hardly enough to satisfy our need for complex computations with large volumes.
Storm is a data stream processor with no batch processing capability. Besides, Storm only provides a very low-level API, requiring users to implement a lot of complex logic themselves.
To put it simply, Blink is alibaba's enterprise version of a computing engine based on open source Flink. As mentioned above, although Flink has many innovations in theoretical model and architecture, there are still many problems in engineering implementation.
Since 2015, Alibaba team has focused on solving the problems of Blink runtime stability and scalability.
After having a stable runtime, I began to focus on improving Blink's ease of use. Therefore, from the end of 2016 to now, Alibaba team has made great efforts to develop Blink real-time computing SQL to serve various complex businesses through SQL as a unified API.
From standardizing the semantics and standards of Streaming SQL, to implementing a series of most important SQL operators such as UDX, Join, aggregation, window, etc., it has almost single-handed created a complete Streaming SQL, and pushed these work back to FLink community. Recognized by the Flink community.
See you next Time