I. Common methods of big data processing

 

Collection → Cleaning → Processing: Offline Data Analysis Based on MapReduce

 

 

 

 

  • How to build a real-time processing system step by step (Flume+Kafka+Storm+Redis)
  • Processing the website's user access logs in real time and computing the site's PV and UV
  • Dynamically displaying the real-time PV and UV results on a front-end page

 

 

 

II. Real-time processing system architecture

 

 

 

 

  • Flume cluster
  • Kafka cluster
  • Storm cluster

 

 

III. Flume+Kafka integration

 

 

 

 

 

 

Flume cluster configuration

 

 

 

  • Flume Agent01

 

 

 

 

 

 

 

 

 

  • Flume Agent02

 

 

 

 

 

 

 

 

 

  • Flume Consolidation Agent

 

 

 

 

 

 

 

 

 

 

Kafka configuration

 

 

 

 

 

 

 

 

 

IV. Kafka+Storm integration

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

V. Storm+Redis integration

 

 

 

 

 

 

 

 

Log analysis

 

 

 

 

  • IP: indicates the IP address of the user
  • Mid: a unique ID. It is generated and stored in the browser's cookie on the first visit; if the cookie already exists, it is not generated again. It serves as a unique browser identifier. On a mobile phone or tablet, the device's machine code is used directly.
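The extraction of these two fields can be sketched as follows. This is only an illustration: the tab-separated layout, the column positions, and the names `AccessLogParser` and `Visit` are assumptions made for the example, not the actual log format or code of the project.

```java
import java.util.Optional;

/** Sketch: pull the IP and MID out of one raw access-log line. */
final class AccessLogParser {
    // Hypothetical layout (an assumption): tab-separated, IP in column 1, MID in column 3.
    static final int IP_INDEX = 1;
    static final int MID_INDEX = 3;

    /** Holds the two fields the later statistics need. */
    static final class Visit {
        final String ip;
        final String mid;

        Visit(String ip, String mid) {
            this.ip = ip;
            this.mid = mid;
        }
    }

    /** Returns the parsed fields, or empty if the line is malformed. */
    static Optional<Visit> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        String[] fields = line.split("\t");
        if (fields.length <= MID_INDEX) {
            return Optional.empty();
        }
        String ip = fields[IP_INDEX].trim();
        String mid = fields[MID_INDEX].trim();
        if (ip.isEmpty() || mid.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new Visit(ip, mid));
    }

    public static void main(String[] args) {
        // Made-up sample line that follows the hypothetical layout above.
        String sample = "1001\t61.172.249.96\t80\tmid-0001\tGET /index";
        parse(sample).ifPresent(v -> System.out.println(v.ip + " " + v.mid));
    }
}
```

Returning an empty `Optional` for malformed lines lets the caller skip bad records instead of failing the whole stream.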

 

 

 

 

  • The first Bolt preprocesses the data: it extracts the IP and MID we need and looks up the province that the IP belongs to.
  • The second Bolt counts PV and UV, and periodically writes the PV and UV data to Redis.
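The counting done by the second Bolt can be sketched in plain Java, without the Storm or Redis APIs. This is a minimal sketch under assumptions: the class name `PvUvCounter` and its methods are hypothetical, PV is taken as the number of visits per province, UV as the number of distinct MIDs per province, and the periodic write to Redis is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: per-province PV and UV counters kept in memory. */
final class PvUvCounter {
    // PV: every visit increments the province's counter.
    private final Map<String, Long> pvByProvince = new HashMap<>();
    // UV: distinct MIDs seen per province; the set size is the UV.
    private final Map<String, Set<String>> midsByProvince = new HashMap<>();

    /** Records one visit that the first Bolt has already mapped to a province. */
    void record(String province, String mid) {
        pvByProvince.merge(province, 1L, Long::sum);
        midsByProvince.computeIfAbsent(province, p -> new HashSet<>()).add(mid);
    }

    long pv(String province) {
        return pvByProvince.getOrDefault(province, 0L);
    }

    long uv(String province) {
        Set<String> mids = midsByProvince.get(province);
        return mids == null ? 0L : mids.size();
    }

    public static void main(String[] args) {
        PvUvCounter counter = new PvUvCounter();
        counter.record("Guangdong", "mid-1");
        counter.record("Guangdong", "mid-1"); // same browser again: PV grows, UV does not
        counter.record("Guangdong", "mid-2");
        System.out.println(counter.pv("Guangdong") + " " + counter.uv("Guangdong"));
    }
}
```

Keeping a set of MIDs per province is the simplest way to get an exact UV; for very high traffic, an approximate structure would use less memory at the cost of precision.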

 

 

Write the first Bolt: ConvertIPBolt

 

 

 

 

 

 

 

 

 

 

Write the second Bolt: StatisticBolt

 

 

 

 

 

 

 

 

 

 

 

 

Write the Topology

 

 

 

 

 

 

 

 

 

 

 

VI. Data visualization

 

 

Data visualization currently requires two pieces of work:

 

  • Develop a web project that queries the data in Redis and serves the pages that display it

  • Develop or find a front-end UI that meets our needs and displays the data queried by the web project

 

For the web project, the choice of language and framework depends on your own technology stack; anything that achieves the final visualization goal is fine. This project only needs to display PV and UV data, which is not difficult, so you could choose a Java web technology such as Servlet or SpringMVC, or a Python web framework such as Flask or Django. I personally like Flask because it is very fast to develop with, but since I have been using Java throughout, I will stick with SpringMVC.

 

As for the UI, my front-end skills are average: ordinary development is no problem, but building a map-style interface like the one above to display data is beyond me. Fortunately, there are many third-party UI frameworks for chart display, such as Highcharts and ECharts. ECharts was open-sourced by Baidu, has extensive Chinese documentation, and is very easy to use, so I chose it for the UI here. It also happens to have a map component that meets our needs.

 

Because it is not difficult, the development process is not described in detail here; interested readers can refer directly to the source code I provide. Here we will go straight to the result.

 

There is in fact very little code in this part of the project. With SpringMVC, once the three-layer JavaEE architecture is set up and the dependencies are added, the remaining development is not difficult; with Flask or Django, the project would be even easier to build and code.

 

After starting the web project, enter its address in a browser to open the data display page:

 

 

As you can see, the ECharts UI looks quite good and meets our needs. The two dots of different colors on each province represent the two kinds of data we currently show, PV and UV, which are also listed in the upper left corner; the depth of the color reflects the magnitude of the PV or UV value.

 

On this page, clicking UV in the upper left corner hides the UV data, so only the PV data is shown:

 

 

Of course, you can also view only the UV data:

 

 

When you hover over a province, you can see its PV or UV value. For example, hovering over “Guangdong” shows a PV value of 170. The same applies to the other provinces:

 

 

 

So the data can be viewed, but how do we make the display dynamic?

 

There are two ways to refresh the page data dynamically: refresh the whole page periodically, or periodically request the data from the back end asynchronously.

 

For now I use the first method, refreshing the page periodically. If you are interested, you can try the second method; you only need to develop a back-end API that returns the data as JSON.
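For the second method, the back end mainly has to turn the per-province statistics into JSON that the page can poll. The following is a rough sketch of that step only, using no web framework or JSON library. The class name `PvUvJson`, the field names `province`, `pv`, and `uv`, and the input shape (province mapped to a two-element array of PV and UV) are all assumptions for illustration; they are not taken from the project's source code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: render province statistics as a JSON array for an asynchronous request. */
final class PvUvJson {
    /**
     * stats maps a province name to {pv, uv} (hypothetical input shape).
     * Provinces are emitted in sorted order so the output is deterministic.
     */
    static String toJson(Map<String, long[]> stats) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (Map.Entry<String, long[]> e : new TreeMap<>(stats).entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append("{\"province\":\"").append(escape(e.getKey())).append("\",")
              .append("\"pv\":").append(e.getValue()[0]).append(',')
              .append("\"uv\":").append(e.getValue()[1]).append('}');
        }
        return sb.append(']').toString();
    }

    // Escapes backslashes and double quotes so the name stays a valid JSON string.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, long[]> stats = new HashMap<>();
        stats.put("Guangdong", new long[] {170, 42}); // sample values for the demo only
        System.out.println(toJson(stats));
    }
}
```

In a real SpringMVC project, a controller method would normally return an object and let the framework's JSON converter do this serialization; the hand-written version here just keeps the example dependency-free.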

 

VII. Summary

 

At this point, we have completed everything from building the real-time big data processing system to the final data visualization. The whole process involves a lot of knowledge, but I personally think that as long as the core principles are firmly understood, environment setup and business-specific development can be handled well in most cases.

 

I wrote this article partly to summarize my own practice and partly to share a reasonably good project case with you. In short, I hope it is helpful.

 

I have uploaded the code for this project case to GitHub in two repositories, one for the Storm project and one for the data visualization:

 

storm-statistic:

https://github.com/xpleaf/storm-statistic

dynamic-show:

https://github.com/xpleaf/dynamic-show