• Data Streaming
  • Jakob Jenkov
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: steinliber
  • Proofread by: Endone, Xionglong58

Data streaming is a data distribution technique in which data producers write data records into an ordered data stream, and data consumers read those records from the stream in the same order. Here is a simple diagram illustrating data producers, a data stream, and data consumers:

There are many variations of data streaming

On the surface, data streaming looks like a simple concept: data producers store records in a data stream, and consumers later read them. Under the surface, however, there are many details that affect the performance, behavior, and functionality of the whole data streaming system.

Each data streaming product makes a set of assumptions about the use cases and the processing techniques it supports. These assumptions drive design choices, which in turn affect the kinds of stream processing behavior you can implement. This data streaming tutorial explores many of these design choices and discusses their consequences for the users of the products built on them.

Data streaming decouples producers and consumers

Data streaming decouples data producers and data consumers from each other. A data producer simply writes its data to the data stream and does not need to know which consumers read that data. Consumers can be added and removed independently of the producers. Consumers can also start and stop, or pause and resume, their consumption without the producers needing to know about it. This decoupling simplifies the implementation of both data producers and data consumers.
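
As an illustration of this decoupling, here is a minimal in-memory sketch. The DataStream class and the record format are hypothetical, not the API of any real data streaming product:

```python
# A minimal in-memory sketch of producer/consumer decoupling.
# The DataStream class and the record format are hypothetical.

class DataStream:
    def __init__(self):
        self.records = []            # ordered log of records

    def append(self, record):
        self.records.append(record)  # producers only ever append

    def read(self, offset):
        # Consumers read from an offset they track themselves.
        return self.records[offset:]

stream = DataStream()

# A producer writes without knowing who will read.
stream.append({"user": "alice", "action": "login"})
stream.append({"user": "bob", "action": "logout"})

# Each consumer tracks its own offset; one can pause and
# resume later without affecting the producer or the other.
consumer_a_offset = 0
for record in stream.read(consumer_a_offset):
    print("consumer A saw:", record)
consumer_a_offset = len(stream.records)
```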

Data streaming as a data sharing mechanism

In larger distributed systems, data streaming is an efficient mechanism for storing and sharing data. As mentioned earlier, a data producer only needs to send its data to the data streaming system; it does not need to know anything about the consumers, and consumers can be added and removed without affecting the producers.

Large companies like LinkedIn use data streaming extensively internally, and so does Uber. Many established enterprises are adopting, or have already adopted, data streaming internally, as have many startups.

Persistent data streams

A data stream can be persistent, in which case it is sometimes referred to as a log or a journal. The advantage of a persistent data stream is that the records in the stream survive a shutdown and restart of the data stream service, so no data records are lost.

Persistent data stream services can typically hold more historical data than services that keep records only in memory. Some data stream services can even retain the complete history, all the way back to the first record written to the stream. Others keep historical data only for a limited period, such as a few days.

With the complete history preserved in a persistent data stream, a consumer can replay all of the records and rebuild its internal state from that history. If a bug is found in the consumer's code, you can fix the code and replay the data stream to recreate the consumer's internal database.
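
To make replay concrete, here is a minimal sketch in which the persisted stream is modeled as a plain Python list. The deposit records and the balance aggregate are made up for illustration, not taken from any particular product:

```python
# A sketch of state reconstruction by replay. The persisted
# stream is modeled as a plain list of records; the deposit
# records and the balance aggregate are illustrative only.

persisted_stream = [
    {"account": "acme", "deposit": 100},
    {"account": "acme", "deposit": 250},
    {"account": "acme", "deposit": 50},
]

def rebuild_balance(records):
    # Replaying every record from the beginning of the stream
    # reconstructs the consumer's internal state from scratch.
    balance = 0
    for record in records:
        balance += record["deposit"]
    return balance

# After fixing a bug in the consumer code, replay the whole
# persisted stream to recreate the correct internal state.
print(rebuild_balance(persisted_stream))  # 400
```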

Data streaming use cases

Data streaming is a very general concept that can be used to support many different use cases. In this section I'll describe some of the most common ones.

Data streams are used in event-driven architectures

Data streams are often used to implement event-driven architectures. Event producers write events as records to a data stream, and event consumers read those events from the stream and react to them.
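
Here is a small sketch of that pattern: a consumer reads event records from a stream and dispatches them to handlers by event type. The event names and handler functions are made up for illustration:

```python
# A sketch of an event-driven consumer reacting to event
# records by type. Event names and handlers are hypothetical.

events = [
    {"type": "order_placed", "order_id": 1},
    {"type": "order_shipped", "order_id": 1},
]

def on_order_placed(event):
    print("reserve inventory for order", event["order_id"])

def on_order_shipped(event):
    print("email shipping confirmation for order", event["order_id"])

# Dispatch each event record to the handler registered for its type.
handlers = {
    "order_placed": on_order_placed,
    "order_shipped": on_order_shipped,
}

for event in events:
    handlers[event["type"]](event)
```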

Data streams are used in smart cities and the Internet of Things

Data streams can be used to transfer data from sensors installed in smart cities, from sensors in smart factories, or from other IoT devices. Values such as temperature or pollution levels can be sampled periodically on the devices and written into a data stream, and data consumers can read the samples from the stream as needed.
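
Here is a minimal sketch of such periodic sampling. The read_temperature() function stands in for a real sensor driver and is simulated here, and the one-second interval is an arbitrary choice:

```python
# A sketch of periodic sampling from a sensor into a stream.
# read_temperature() simulates a sensor driver; the interval
# and the sensor name are arbitrary illustrative choices.

import random
import time

def read_temperature():
    # Placeholder for an actual sensor reading.
    return round(20.0 + random.uniform(-2.0, 2.0), 2)

stream = []  # stands in for a real data stream service

for _ in range(3):
    sample = {
        "sensor": "city-block-7",
        "celsius": read_temperature(),
        "timestamp": time.time(),
    }
    stream.append(sample)  # the producer writes one sample per interval
    time.sleep(1)

print(stream)
```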

Data streams are used to periodically sample data

Sensors in smart cities and IoT devices are just two examples of data sources that can be sampled at regular intervals and distributed through data streams. Many other types of data can be sampled and streamed periodically as well. For example, currency exchange rates or stock prices can be sampled and streamed, and so can the running count of votes in a poll.

Data streams are used for individual data points

In the vote count example, you could decide to stream each individual poll response rather than periodic samples of the running total. When a total is made up of individual data points (as with polls), it can make more sense to stream the individual data points than a computed total. Whether it does depends on the use case and on other factors, such as whether the individual data points are anonymous or contain private information that should not be shared.
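
The following sketch contrasts the two approaches using made-up poll data: streaming each individual response versus streaming a periodically sampled total:

```python
# A sketch contrasting individual data points with periodic
# totals. The poll responses are illustrative only.

responses = ["yes", "no", "yes", "yes"]  # individual data points

# Approach 1: stream each response as its own record. Consumers
# can compute any aggregate they want, at any time.
point_stream = [{"response": r} for r in responses]

# Approach 2: stream only a periodically sampled total. Consumers
# see the aggregate but can never recover the individual points.
total_stream = [{"yes": 3, "no": 1}]

# From the point stream, a consumer can rebuild the total itself:
total = {"yes": 0, "no": 0}
for record in point_stream:
    total[record["response"]] += 1
print(total)  # {'yes': 3, 'no': 1}
```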

Records, messages, events, samples, etc.

Records in a data stream are sometimes referred to as messages, events, samples, objects, or by other terms. Which term is used depends on the use case the data stream serves and on how producers and consumers process and respond to the data. This is usually reasonable: it makes sense to name the data in the stream after the terminology of the use case.

It is important to note that the use case also affects what a given record represents. Not all data records are created equal: an event and a sample value mean different things, so they cannot be used in the same way. I'll cover this in more detail later in this tutorial (and in other tutorials).
