This article is published by NetEase Cloud.


In the era of data, big data computing has penetrated every industry. Business activity accumulates data, data computation generates new business value, and in this way big data computing continually drives business forward. Behind the carnival of the e-commerce Double 11 festival, merchants and consumers alike depend on the value created by big data computing, and especially by "real-time computing", whose applications keep spreading.

In the real world, data is continuously generated, and can be collected and computed in real time

To compute over data and mine a product's commercial value, the first problem to solve is the data itself. In the real world, data is usually generated continuously over time. When a user browses goods, a series of mouse clicks produces a series of backend events. When a driver uses mobile navigation, the GPS position refreshes at regular intervals and log data accumulates constantly. Browsing a news feed, searching for songs, surveillance cameras periodically uploading images to cloud storage, live video streaming: in all of these scenarios the underlying data is produced continuously. Collecting this continuous business data in real time forms a data stream.

As soon as streaming data is collected it can be fed into computation, and the results can be applied to the business immediately: this is real-time computing. Real-time computation has in fact already entered every corner of daily life. Take the weather forecast: people used to receive one forecast bulletin per day, but now they can watch the forecast update live, and the prediction for a given point in time grows more and more accurate as that time approaches. This is the effect of continuously collecting monitoring data and computing on it in real time.

Tailored to individual interests, real-time computing lets products understand their users better and better

There are more and more sources of real-time data, and the volume of data grows exponentially every year. This is good news for real-time computing itself: it enables more application scenarios and better results, and may lead to some revolutionary changes. So what else can real-time big data computing do?

During events such as Kaola's Double 11 and the 618 Online Shopping Festival, a NetEase big-screen dashboard displays the latest total sales, the sales share of each product category, order growth trends, the geographic distribution of active users, and more, with metrics across many dimensions dancing on a single screen. The effect of every user's order is reflected on the big screen in real time. Beyond adding to the carnival atmosphere, this kind of visual real-time application makes it easier to discover the value in the data, guide marketing operations, and support business decisions.



Financial risk control is another typical application of real-time computing. For a risk-sensitive business like finance, merely visualizing the data is not enough. The streaming system must apply risk-model matching rules to massive volumes of user-behavior data, analyze abnormal events in real time, judge the risk level, and take the corresponding risk-control measures automatically: raising alarms, sending notifications, or changing the business flow. Doing risk control through real-time computing is faster, more accurate, and broader in coverage. Many other event-driven computing scenarios besides risk control can be addressed by real-time computing in the same way.

Real-time computing is also well established in the field of recommendations. Whether it is news, music, or books, the content pushed to each of thousands of users is customized to personal interests and preferences, and those preferences are continuously updated through real-time computation. Take news push as an example: when a user taps a pushed message, the product behind it analyzes the user's behavior in real time, updates the user's interest profile, keeps discovering new points of interest, gets to know the user better and better, and ultimately pushes content the user finds more engaging. Or take music recommendation: if a user saves several sad songs within a short period, the system can detect this through real-time analysis and push some well-chosen songs to comfort the user. Such a scenario can only be served by real-time computing, and it shows the value of real-time computing at its best.

More and more real-time computing scenarios will be developed, and the sense that “everything is in flux” will grow in the future.

From "store first, compute later" to "compute while storing": real-time computing is no longer afraid of "big" data

Real-time computing sounds great, but what does it take to implement, and what difficulties and challenges must be overcome?

First of all, from the overall architecture, data computing boils down to three things: data input → computation → data output. Consider the traditional model, taking a database as an example: data is first stored in a table, a user's query statement triggers the database to compute, and the database finally completes the calculation and returns the results. This "store first, compute later" model breaks down in big-data real-time scenarios. The volume of data to compute over is huge: the source data for a single computation may be a full day's worth, perhaps hundreds of billions of records. If every batch of newly arrived data forced a recomputation over all the data, the cost would be enormous and the end result would be "slow", not real-time.

The reasonable approach is to "compute while storing": as data enters the real-time computing system it does not necessarily need to be persisted first; it can participate in computation directly. The computation here is an "incremental computation": new data is folded into the result previously computed over the historical data, so the same records are never recomputed. Once the computation is done, only the result is kept for the business to use, which also greatly reduces pressure on storage. At the same time, "big" also means high concurrency: tens of millions of new records may need to be computed every second, which no single machine can sustain. Real-time big data computing therefore has to solve a series of technical problems within a distributed system architecture.
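The idea of incremental computation can be made concrete with a minimal Python sketch (illustrative only, not any platform's actual implementation; the class and field names here are assumptions): each arriving record updates a stored aggregate in constant time, instead of triggering a rescan of all historical data.

```python
from collections import defaultdict

class IncrementalSum:
    """Keeps per-key running totals; each new record updates the
    stored aggregate instead of recomputing over all history."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, key, value):
        # O(1) incremental step: new result = old result + delta
        self.totals[key] += value
        self.counts[key] += 1
        return self.totals[key]

# Two order events arrive; the second reuses the first's result.
agg = IncrementalSum()
agg.update("electronics", 100.0)
total = agg.update("electronics", 250.0)
assert total == 350.0
assert agg.counts["electronics"] == 2
```

The contrast with "store first, compute later" is that no query ever touches more than the stored aggregate plus the newly arrived record.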

The challenges of distributed real-time computing are many. The whole pipeline, from ingestion through computation to output, must be low latency. Besides the compute nodes themselves using the "incremental computation" model, the upstream data-transport module must offer high throughput and the ability to cache data, acting as a buffer under heavy traffic; the downstream output module needs optimizations such as data compression and batched writes to keep results flowing out in real time. The low-latency requirement in turn raises the bar for the system's other properties. At midnight on Double 11, for example, huge numbers of consumers place and pay for orders simultaneously, and an enormous volume of data pours into the real-time computing system in an instant. The system needs strong parallel-processing capability to distribute this instantaneous surge across hundreds or thousands of compute nodes, then gather the nodes' partial results into one overall result, guaranteeing low latency under high throughput.
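The batched-output optimization mentioned above can be sketched in a few lines of Python (a hypothetical sink, not a real library API): results are buffered and flushed in groups, trading a small amount of latency for far fewer downstream writes.

```python
class BatchedSink:
    """Buffers output records and flushes them in batches, so a
    high-throughput stream produces few, large downstream writes."""

    def __init__(self, flush_size, write):
        self.flush_size = flush_size
        self.write = write          # downstream write callback (assumed)
        self.buffer = []

    def emit(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # In a real system this would also run on a timer, so a quiet
        # stream still sees its last few records within a latency bound.
        if self.buffer:
            self.write(list(self.buffer))
            self.buffer.clear()

batches = []
sink = BatchedSink(3, batches.append)
for i in range(7):
    sink.emit(i)
sink.flush()  # push out the final partial batch
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```

A production sink would flush on both a size threshold and a time threshold, which is how the latency/throughput trade-off is kept bounded.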



From “batch computing” to “incremental computing”, the biggest challenges are accuracy and ease of use

As critical as low latency is the challenge of accuracy. The "incremental computing" model differs from the traditional "batch computing" model, so past technical experience cannot simply be copied over without causing correctness problems. One must consider how newly arrived data is folded into the old result; in some scenarios it is even necessary to retract some values from the old result to keep the final answer accurate.
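The need to remove old values shows up naturally in sliding-window aggregates. A minimal sketch (illustrative names, counting a window by number of events for simplicity): adding a new value must also retract the value that just expired, or the running total drifts away from the correct answer.

```python
from collections import deque

class SlidingWindowSum:
    """Incremental sum over the last `size` events: each new value is
    added, and the expired oldest value is *retracted* from the total."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # retract expired value
        return self.total

w = SlidingWindowSum(3)
results = [w.add(v) for v in (1, 2, 3, 4)]
# After the 4th event the window holds [2, 3, 4]: 1 was retracted.
assert results == [1, 3, 6, 9]
```

Copying batch-style logic here (just keep adding) would silently produce 10 instead of 9 on the fourth event, which is exactly the kind of accuracy bug the incremental model forces you to design around.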

Node failures are commonplace in distributed systems, so the failure-recovery capability of a real-time streaming system is also crucial. When a failure occurs the system must recover quickly, or its output will stall and real-time behavior becomes impossible. At the same time, recovery must not break the "incremental computation" model; otherwise the system degrades to "batch computation", cannot deliver results in real time, and the accuracy of those results becomes hard to guarantee.

In fact, NetEase Big Data encountered and overcame the technical difficulties above while building Sloth, its self-developed streaming computing platform. As a platform product, Sloth has invested heavily in ease of use and multi-tenant isolation. Ease of use is one of the more contested aspects of real-time computing.



Writing a distributed program is harder than writing a single-machine program, and writing a distributed real-time computing program is harder still. The good news is that open-source streaming engines exist, so developers can build streaming tasks without worrying about how tasks are distributed across compute nodes or how data moves between them; they need only focus on the computation logic itself and control the degree of parallelism at each stage.

Take counting the words in an article as an example. The distributed program may have three parts: first, several compute nodes split each line of text into words; second, other compute nodes count word occurrences (given the data volume, more than one node is needed); third, a single compute node aggregates the partial counts from the upstream nodes into a total. Even this simple scenario takes roughly 200 lines of code to develop. In real business scenarios the data flows through far more than three kinds of compute nodes, and the computation is far more complex than basic summation, so even with a streaming engine, developing a distributed real-time program remains fairly difficult. Moreover, even after development is done, a lot of time goes into debugging, maintaining the computing framework, and so on; whenever the requirements change, all of that work must be repeated, which is a painful process. Making streaming programs easier to write is the challenge real-time computing platforms must meet.
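The three stages above can be simulated in a single Python process (a sketch of the dataflow shape only; in a real engine each stage would run on separate nodes, and the function names here are invented for illustration):

```python
from collections import Counter

def split_stage(lines):
    # Stage 1: tokenize each line into words (parallel across nodes).
    for line in lines:
        yield from line.lower().split()

def count_stage(words, num_partitions=2):
    # Stage 2: partition words by hash so each "node" counts one shard.
    partials = [Counter() for _ in range(num_partitions)]
    for w in words:
        partials[hash(w) % num_partitions][w] += 1
    return partials

def merge_stage(partials):
    # Stage 3: a single node aggregates the partial counts into a total.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

lines = ["real time computing", "real value"]
result = merge_stage(count_stage(split_stage(lines)))
assert result["real"] == 2
assert result["computing"] == 1
```

The hash partitioning in stage 2 is what guarantees all occurrences of the same word land on the same counting "node", so the partial counts can be merged without double counting.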

Setting aside for a moment how real-time streaming systems solve ease of use, consider how similar problems have been solved over the history of computer science. People wanted programming to be easier, so ever higher-level programming languages were invented. People wanted data computation to be easier, so databases appeared, and then SQL, the Structured Query Language. In the big data era, when people were still struggling with offline batch computing and the complexity of programming directly against compute engines, the problem was ultimately solved by bringing SQL to distributed offline computing systems. Now that real-time computing is developing rapidly, can SQL solve this problem too? The answer is yes, but there are many details to hammer out.

The data flow in real-time flow computing can be understood as a dynamic data table

As mentioned above, the offline batch computing model and the real-time incremental computing model differ, so when SQL is applied to batch computing and to streaming computing its semantics must differ as well. The key difference is that batch computing operates on finite data, while streaming computing operates on unbounded data that is continuously collected into the system. When a SQL query runs over a batch of offline data, the computation completes and the results are output. When a SQL query triggers a streaming computation, the computation never ends, because data keeps flowing in; under offline SQL semantics, no results would be output until the query finished, which is clearly not what streaming SQL should do. The essence of streaming SQL, then, is to define a set of long-running streaming tasks that emit results continuously as they execute.

Offline SQL operates on static tables, while streaming SQL operates on data streams, which raises the question of whether SQL's computational semantics (summing, averaging, joining tables, and so on) are even well defined over a stream. Understanding this requires a conceptual shift. Offline SQL transforms one static table into another static table. A data stream in real-time streaming computation can be understood as a dynamic table: a table whose data keeps growing. The table looks different at different moments, and running the SQL at different moments gives different results; string those results together, like the frames of a film, and you get another dynamic table. What streaming SQL does, then, is transform one dynamic table into another dynamic table. This makes the computational semantics of streaming SQL easier to grasp, and it narrows the core problem of the real-time streaming system down to "how to implement computation over dynamic tables".
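The "dynamic table" idea can be illustrated with a small Python generator (a conceptual sketch, not how any streaming SQL engine is actually implemented): the streaming analogue of `SELECT category, SUM(amount) ... GROUP BY category` emits an updated result row after every arriving record, rather than one final table at the end.

```python
from collections import defaultdict

def continuous_query(stream):
    """Streaming analogue of a GROUP BY / SUM query: instead of one
    final result table, emit the updated row after every input record."""
    totals = defaultdict(float)
    for category, amount in stream:
        totals[category] += amount
        yield category, totals[category]  # one update per record

stream = [("books", 10.0), ("books", 5.0), ("toys", 7.0)]
updates = list(continuous_query(stream))
assert updates == [("books", 10.0), ("books", 15.0), ("toys", 7.0)]
```

Each emitted pair is one "frame" of the dynamic result table; a downstream consumer that keeps only the latest value per key always holds the current snapshot.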

Automatic optimization of the streaming SQL engine is the main direction for technical breakthroughs today

The usability problem of real-time streaming systems can indeed be solved with SQL, and the production experience of Sloth, NetEase's streaming computing platform, confirms it. Users no longer need to learn the programming interfaces of various compute engines, debug distributed programs, or maintain their own streaming systems; they simply move the SQL that ran on the original offline platform over to the real-time streaming platform, and it expresses their complex real-time computation logic.

Work saved on the client side inevitably becomes work added on the platform side. The hard part is translating SQL queries into actual computation logic: implementing a SQL engine that supports streaming, analogous in role to a database engine, whose computation logic, as discussed earlier, must follow the "incremental computation" model. At the same time, so that real-time results can feed all kinds of business scenarios, the engine must interconnect with various storage roles such as databases, message queues, and offline storage.

The Double 11 big screen is just one application of real-time streaming computation over big data. More scenarios keep emerging: real-time text analytics, image and speech computation, online machine learning, real-time Internet of Things computation, and so on. As the types of real-time data and the range of streaming scenarios grow exponentially, real-time computing engines will face considerable challenges. SQL-based descriptions of streaming computation are also evolving, adding more and more features unique to streaming, such as output triggering, handling of expired data, and multi-rule data-window partitioning. Automatic optimization of the streaming SQL engine is likewise a major direction for technical breakthroughs; I believe that as the technology progresses, real-time streaming computation will be applied ever more deeply and widely.

NetEase offers several enterprise-class big data visualization and analysis platforms. The self-service agile analysis platform for business users adopts a PPT-style mode for building reports that is easy to learn and use, with powerful exploration and analysis features that genuinely help users gain insight into their data and discover its value.




Learn more about NetEase Cloud:

The official website of NetEase Cloud: www.163yun.com/

New user package: www.163yun.com/gift

NetEase Cloud community: sq.163yun.com/