Abstract

In today’s highly developed Internet, intelligent terminal devices such as iPads and mobile phones are everywhere, running countless apps and websites. Collecting terminal data for analysis is very important for improving software quality, for example PV/UV statistics and the collection and analysis of user behavior data. Although the scenario looks simple, the data volume, system throughput, real-time processing, analysis capability, and query capability all face demanding requirements, so such a system is not easy to build. In this article we introduce a data collection and analysis solution based on Alibaba Cloud TableStore and related big data products.

Click here to see the original article

TableStore

TableStore is a professional-grade distributed NoSQL database developed independently by Alibaba Cloud. Built on shared storage, it is a semi-structured data storage platform that offers high performance, low cost, easy scaling, and full management, and it supports efficient computation and analysis of Internet and Internet of Things data.

At present, thousands of systems use it, both inside Alibaba Group and among external public cloud users. They range from throughput-heavy offline applications to stability- and performance-sensitive online applications. The specific features of TableStore are shown in the following image.

Data acquisition and analysis system based on TableStore

A typical data collection, analysis, and statistics platform mainly consists of the following five data-processing steps:

There are many reference implementations of the above process on the Internet. After data is collected on the client side, if the volume is relatively small, we may simply pass it through a back-end API and persist it to an RDBMS, where it can be analyzed with SQL. If the data volume is large, middleware is needed to help collect and upload it, and the data is then written to the online and offline systems separately. For example, the data is uploaded to Flume, which collects and aggregates it; Flume then acts as a message producer and publishes the data to Kafka through its Kafka sink. Kafka serves as the message queue connecting the back-end online and offline computing platforms. As shown below:
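To make the Kafka handoff in this conventional pipeline concrete, here is a minimal sketch of publishing a collected event to a Kafka topic with the kafkajs client; the broker address, topic name, and event fields are illustrative assumptions rather than part of the original architecture.

```typescript
// Minimal sketch: publish one collected client event to Kafka, standing in for
// what Flume's Kafka sink would do in the pipeline above.
// Broker address, topic name, and event fields are illustrative assumptions.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "event-collector", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishEvent(): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "user-events", // consumed downstream by Spark Streaming / Storm
    messages: [
      {
        key: "user-1001", // partition by user id to keep per-user ordering
        value: JSON.stringify({
          userId: "user-1001",
          page: "/home",
          action: "page_view",
          ts: Date.now(),
        }),
      },
    ],
  });
  await producer.disconnect();
}

publishEvent().catch(console.error);
```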

Flume and Kafka are introduced for a number of reasons, such as their ability to handle large volumes of data, aggregate data, and guarantee that no data is lost, but the most important reason is their high throughput. After Spark Streaming/Storm finishes its analysis, the resulting data still has to be stored by other components, such as HBase or MySQL. If MySQL is introduced, Redis may also be needed to cache hot data, which makes the system even more complicated.

We try a new solution based on TableStore and other Alibaba Cloud big data products. Let’s first look at the architecture diagram:

Critical path analysis in the figure:
1. Web pages, apps, and other clients first collect data through the tracking (buried point) system, then write it into the raw data table of TableStore through the TableStore SDK.
2. MaxCompute reads the data directly from the raw TableStore table and analyzes it.
3. QuickBI creates a cloud data source and reads the MaxCompute table data directly for report display.
4. The data in TableStore can also be incrementally synchronized to Blink/Flink for real-time analysis. After the analysis, the results are written back to the result data table of TableStore, and DataV reads the result data table for display.
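As a rough illustration of step 1, the sketch below writes one tracking event into the raw data table using the TableStore Node.js SDK. The table name, primary key schema, attribute columns, and endpoint are assumptions made for this example; check the SDK reference for the exact client options.

```typescript
// Sketch of step 1: a client writes one tracking event into the TableStore raw data table.
// Table name, primary key schema, columns, and endpoint are illustrative assumptions.
const TableStore = require("tablestore");

// In production these would be the temporary STS credentials described in the
// security section below, not long-lived account keys embedded in the client.
const client = new TableStore.Client({
  accessKeyId: "<access-key-id>",
  secretAccessKey: "<access-key-secret>",
  endpoint: "https://your-instance.cn-hangzhou.ots.aliyuncs.com",
  instancename: "your-instance",
});

const params = {
  tableName: "raw_events", // assumed raw data table
  condition: new TableStore.Condition(TableStore.RowExistenceExpectation.IGNORE, null),
  primaryKey: [{ user_id: "user-1001" }, { event_time: TableStore.Long.fromNumber(Date.now()) }],
  attributeColumns: [{ page: "/home" }, { action: "page_view" }],
};

client.putRow(params, (err: Error | null, data: unknown) => {
  if (err) {
    console.error("putRow failed:", err);
    return;
  }
  console.log("event written:", data);
});
```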

Advantages of the new architecture:
1. Clients read and write TableStore directly, so there is no need to introduce an API layer just to pass data through; this reduces complexity and, for large applications, saves considerable server cost.
2. TableStore is already integrated with a rich set of big data components, including both Alibaba Cloud big data products and open source ones, so data synchronization, reading, and writing are very easy.
3. The results of both real-time and offline analysis are written back to TableStore, and DataV reads the result data directly for display. Because TableStore offers high performance and high throughput, there is no need to introduce caching components such as Redis, which simplifies the whole system.

Security issues with direct reads and writes

AccessKey and AccessId: since the client reads and writes TableStore directly, does the account’s AccessKey and AccessId have to be handed out to the client? The answer is no; instead we use an STS token to grant the client temporary access to TableStore. The process is shown in the figure below:
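As a sketch of the server-side half of this flow, the example below uses the generic Alibaba Cloud RPC SDK (@alicloud/pop-core) to call STS AssumeRole and return temporary credentials to the client. The role ARN, session name, and policy are placeholder assumptions; consult the STS documentation for the exact request fields.

```typescript
// Sketch of the server-side STS step: exchange long-lived account credentials
// for a short-lived token that the client can use to write to TableStore.
// Role ARN, session name, and policy below are placeholder assumptions.
const RPCClient = require("@alicloud/pop-core").RPCClient;

const sts = new RPCClient({
  accessKeyId: process.env.ALIYUN_ACCESS_KEY_ID,
  accessKeySecret: process.env.ALIYUN_ACCESS_KEY_SECRET,
  endpoint: "https://sts.aliyuncs.com",
  apiVersion: "2015-04-01",
});

async function issueTableStoreToken() {
  const result: any = await sts.request(
    "AssumeRole",
    {
      RoleArn: "acs:ram::<account-id>:role/tablestore-writer", // placeholder role
      RoleSessionName: "web-client",
      DurationSeconds: 900,
      // Restrict the token to the write interfaces the client actually needs.
      Policy: JSON.stringify({
        Version: "1",
        Statement: [
          {
            Effect: "Allow",
            Action: ["ots:PutRow", "ots:BatchWriteRow"],
            Resource: ["acs:ots:*:*:instance/your-instance/table/raw_events"],
          },
        ],
      }),
    },
    { method: "POST" },
  );
  // result.Credentials contains AccessKeyId, AccessKeySecret, SecurityToken, Expiration.
  return result.Credentials;
}
```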

The SDKs provided by TableStore support access with STS authorization; see, for example, the TableStore Node.js SDK, which accepts an STS token. When accessing TableStore through STS, you need to control the authorization policy carefully and avoid granting the client interfaces it does not need.
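On the client side, the temporary credentials can then be wired into the TableStore SDK. The sketch below assumes a hypothetical /api/sts-token endpoint on the app server that returns the AssumeRole credentials, and a client runtime (a Node.js client, or a bundled browser script) where the SDK is available; option names should be checked against the current SDK README.

```typescript
// Sketch: the client fetches temporary credentials from its own app server
// (the /api/sts-token endpoint is a hypothetical example), then builds a
// TableStore client with them instead of embedding the account AccessKey.
const TableStore = require("tablestore");

async function createTableStoreClient() {
  const resp = await fetch("/api/sts-token"); // returns the AssumeRole Credentials
  const cred = await resp.json();

  return new TableStore.Client({
    accessKeyId: cred.AccessKeyId,
    secretAccessKey: cred.AccessKeySecret,
    stsToken: cred.SecurityToken, // STS security token
    endpoint: "https://your-instance.cn-hangzhou.ots.aliyuncs.com",
    instancename: "your-instance",
  });
}
```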

Cross-domain access to TableStore from the browser: if the browser accesses TableStore directly, the browser’s same-origin policy may cause cross-domain problems, because the endpoint domain of TableStore differs from the domain of the user’s website. There are two ways to solve this. First, the web front end does not access TableStore directly; instead it sends requests to its own web server, which then uses the TableStore SDK to make the actual request. This is effectively back-end access: it solves the problem, but loses the advantage of direct reads and writes. Second, the TableStore server can support cross-domain JavaScript requests directly through the CORS protocol; we are adding this support, and it is currently in development. If you need the browser to access TableStore directly, you can contact us, and support can be provided quickly.
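A minimal sketch of the first workaround, assuming an Express-based web server: the browser posts events to a same-origin endpoint and the server writes them to TableStore with the SDK, so no cross-domain request ever reaches the TableStore endpoint. The /track route, table name, and key schema are illustrative assumptions.

```typescript
// Sketch of workaround 1: the browser posts to its own web server (same origin),
// and the server forwards the event to TableStore, avoiding CORS entirely.
// The /track route, table name, and key schema are illustrative assumptions.
import express from "express";
const TableStore = require("tablestore");

const app = express();
app.use(express.json());

const ots = new TableStore.Client({
  accessKeyId: process.env.OTS_ACCESS_KEY_ID,
  secretAccessKey: process.env.OTS_ACCESS_KEY_SECRET,
  endpoint: "https://your-instance.cn-hangzhou.ots.aliyuncs.com",
  instancename: "your-instance",
});

app.post("/track", (req, res) => {
  const { userId, page, action } = req.body;
  ots.putRow(
    {
      tableName: "raw_events",
      condition: new TableStore.Condition(TableStore.RowExistenceExpectation.IGNORE, null),
      primaryKey: [{ user_id: userId }, { event_time: TableStore.Long.fromNumber(Date.now()) }],
      attributeColumns: [{ page }, { action }],
    },
    (err: Error | null) => (err ? res.status(500).json({ ok: false }) : res.json({ ok: true })),
  );
});

app.listen(3000);
```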

Conclusion

Because of its high performance, high throughput, and high reliability, TableStore is well suited for data collection scenarios with demanding back-end throughput requirements. Clients read and write TableStore directly, which removes the intermediate data relay service on the back end, reduces complexity, and saves cost. In addition, TableStore connects to rich computation, analysis, and presentation tools that cover almost all data collection and analysis scenarios. The peripheral components described in this article cover only part of the data collection and analysis picture; for more examples and instructions, please refer to the TableStore user guide.