Abstract:

1. Overview

Because our company runs a B2B SaaS service, accurate billing is an unavoidable problem. After evaluating and comparing several solutions, we settled on the following approach to calculating traffic on our own business platform. Since this article involves concrete operations, I assume you already have experience with the relevant Aliyun products.

The current traffic composition is mainly as follows:

1. Page response traffic from users accessing products on the service platform, i.e. the size of the returned HTTP pages.
2. Image traffic from users accessing products on the service platform. All images are stored in OSS and exposed to the public network through CDN.
3. Response traffic generated by users' API requests, i.e. the size of the returned JSON payload.

Our goals:

1. Our billing granularity is daily, so we only need to be able to obtain the previous day's data on the following day; the real-time requirement is not strict.
2. All data must be included in the calculation, and multiple log sources must be supported.
3. Recalculation must be supported: for a given period, re-running the job must produce exactly the same result as before (provided the original logs are unchanged).
4. Costs must stay low; ideally no extra servers and not much money spent.
5. All traffic results are stored in a MySQL database by year, month, day, and hour for real-time lookup and display to customers (a sketch of such a table follows this list).
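
As a sketch of goal 5, the customer-facing store can simply be a MySQL table keyed by client and hour. The table and column names below are assumptions for illustration only; the ODPS summary table that feeds it is built later in the article.

CREATE TABLE IF NOT EXISTS product_traffic (
	clientcode VARCHAR(64) NOT NULL COMMENT 'Client code',
	year SMALLINT NOT NULL,
	month TINYINT NOT NULL,
	day TINYINT NOT NULL,
	hour TINYINT NOT NULL,
	traffic BIGINT NOT NULL COMMENT 'Traffic in bytes for this hour',
	PRIMARY KEY (clientcode, year, month, day, hour)
) COMMENT 'Hypothetical destination table for hourly per-client traffic';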

2. The traffic statistics process

Get access logs

All access today goes over HTTP, and both Apache and Nginx have their own access-log facilities, so one option is to clean up these text logs yourself and load them into a database. A simpler way is to collect the logs with Aliyun Log Service and have it deliver them to ODPS tables on a schedule.

The Log Service product page is here: https://www.aliyun.com/product/sls

Compared with collecting and cleaning the logs myself, I still recommend Aliyun's Log Service: queries are fast and stable, it is cheap, customer support is excellent, and the product iterates quickly.

It is worth mentioning that the Data Plus platform supports one-click import of CDN logs. If your business happens to be fully accelerated by CDN, you only need to collect the CDN logs to get more accurate traffic statistics.

The format of the access log we collected looks something like this:

[12/Nov/2017:00:09:15 +0800] 112.28.172.8-0"http://www.xxx.com/Article/86892.html" "GET http://www.xxx.com/Public/css/style.css" 200 393 21427 HIT Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "text/css"

The fields are: response time, Referer, request method, access URL, HTTP code, request size, response size, cache hit status, UA header, and file type.

The two most important fields are the URL and the response size. The URL field is the full access path and can be used to derive file types and client platforms. The response size is the amount of data returned for the request. You can also combine the HTTP code and the cache hit status for more accurate calculations; I won't cover that below, but once you understand the approach you can adjust the ODPS SQL statements yourself.
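
As a rough sketch of what such a refinement might look like (not the statement used later in this article): count only successful responses and separate out cache hits, using the columns of the cdn_logs table created in the next section. The hit_info value 'HIT' is an assumption based on the sample log line above.

SELECT domain
	, SUM(CASE WHEN return_code BETWEEN 200 AND 399 THEN response_size ELSE 0 END) AS ok_bytes
	, SUM(CASE WHEN hit_info = 'HIT' THEN response_size ELSE 0 END) AS cdn_hit_bytes
FROM cdn_logs
WHERE log_date = '20171112'
GROUP BY domain;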

3. Import logs into ODPS

To import log data into ODPS, we first need to create the ODPS table, which can be done from the Data Plus platform. It is similar to creating MySQL or MSSQL tables in normal development, though the data types differ somewhat. ODPS tables also support an important and useful concept called partitioning.

When creating an ODPS table, you can designate one or more fields as partition columns. Partitioning is usually done by time; for example, the log time is truncated to the hour so that each hour's data lands in its own partition. The main purpose of partitions is to make later computation, overwriting, and re-running easier.
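
As a minimal illustration of why this helps (the table name here is hypothetical, for demonstration only): one hour of data can be dropped and re-imported without touching any other partition.

CREATE TABLE IF NOT EXISTS demo_access_log (
	uri STRING COMMENT 'Requested URI',
	response_size BIGINT COMMENT 'Response size'
)
PARTITIONED BY (
	log_hour STRING COMMENT 'Log time truncated to the hour, e.g. 2017111200'
)
LIFECYCLE 100000;

-- Re-running the 2017-11-12 00:00 hour only affects that partition:
ALTER TABLE demo_access_log DROP IF EXISTS PARTITION (log_hour = '2017111200');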

There are several ways to import logs into ODPS:

1. If your logs are already in Aliyun Log Service, simply configure delivery to ODPS. The documentation covers this and it is straightforward, so I won't go into detail.
2. If your log data lives in a self-built system, consider the Data Plus platform's "Data Integration" feature to import and synchronize it. A large number of database types are supported, roughly 90% of the databases on the market; see the documentation for details: https://help.aliyun.com/document_detail/53008.html

Here we have an ODPS table named cdn_logs, which holds all CDN access logs. The name is just for demonstration; you can use your own. My DDL statement is as follows:

CREATE TABLE cdn_logs (
	station_node STRING COMMENT 'nodes',
	unix_time BIGINT COMMENT 'Access Time',
	method STRING COMMENT 'Request method',
	procotol STRING COMMENT 'Request protocol',
	domain STRING COMMENT 'domain',
	uri STRING COMMENT 'Requested URI',
	uri_parameter STRING COMMENT 'Request Parameters',
	imgsrc_type BIGINT COMMENT 'Image compression type ID',
	return_code BIGINT COMMENT 'Return code',
	rt BIGINT COMMENT 'delay',
	request_size BIGINT COMMENT 'Request size',
	response_size BIGINT COMMENT 'Response size',
	hit_info STRING COMMENT 'Hit information',
	client_ip BIGINT COMMENT 'Real user IP',
	proxy_ip STRING COMMENT 'Intermediate proxy IP',
	remote_address BIGINT COMMENT 'Client IP',
	bacend_server STRING COMMENT 'Back-to-source IP',
	refer_procotol STRING COMMENT 'Reference protocol',
	refer_domain STRING COMMENT 'Reference domain name',
	refer_uri STRING COMMENT 'uri',
	refer_parameter STRING COMMENT 'Reference parameters',
	content_type STRING COMMENT 'Request Return Type',
	user_agent STRING COMMENT 'UA',
	port BIGINT COMMENT 'Customer port',
	transfer_success BIGINT COMMENT 'Transfer successful',
	last_modify STRING COMMENT 'Last modification time of file on server',
	user_info STRING COMMENT 'User Information',
	traceid STRING COMMENT 'Access Error Check trace ID',
	sent_http_via STRING COMMENT 'HTTP via the head',
	reserve_1 STRING COMMENT 'Reserved field 1',
	reserve_2 STRING COMMENT 'Reserved field 2',
	reserve_3 STRING COMMENT 'Reserved field 3'
)
PARTITIONED BY (
	log_date STRING COMMENT 'Date partition',
	domain_id BIGINT COMMENT 'ID partitions'
)
LIFECYCLE 100000;

4. Configure offline computing tasks

We already have one basic data table: cdn_logs. In addition, we need to create two more tables. The first is cdn_logs_format, which holds the data obtained after further cleaning and formatting cdn_logs; this step is closely tied to your business, so tailor it to your own needs. Plan this table's lifecycle according to log volume: if the volume is small, make it permanent; otherwise give it a periodic lifecycle. The DDL is as follows:

CREATE TABLE cdn_logs_format (
	unix_time BIGINT COMMENT 'Access Time',
	uri_parent STRING COMMENT 'Level 1 directory name',
	method STRING COMMENT 'Request method',
	procotol STRING COMMENT 'Request protocol',
	domain STRING COMMENT 'domain',
	uri STRING COMMENT 'Requested URI',
	return_code BIGINT COMMENT 'Return code',
	rt BIGINT COMMENT 'delay',
	response_size BIGINT COMMENT 'Response size',
	client_ip BIGINT COMMENT 'Real user IP',
	remote_address BIGINT COMMENT 'Client IP',
	bacend_server STRING COMMENT 'Back-to-source IP',
	refer_domain STRING COMMENT 'Reference domain name',
	refer_uri STRING COMMENT 'uri',
	content_type STRING COMMENT 'Request Return Type',
	user_agent STRING COMMENT 'ua'
)
PARTITIONED BY (
	log_date STRING COMMENT 'Date partition'
)
LIFECYCLE 100000;

The aggr_product_traffic table is the final traffic summary table. It holds relatively little data, so a permanent lifecycle is recommended.

CREATE TABLE aggr_product_traffic (
	clientcode STRING COMMENT 'Client code',
	year BIGINT COMMENT 'years',
	month BIGINT COMMENT 'month',
	day BIGINT COMMENT 'day',
	hour BIGINT COMMENT 'hour',
	mi BIGINT COMMENT 'minutes',
	traffic BIGINT COMMENT 'Traffic',
	create_at BIGINT COMMENT 'Timestamp when the summary was created',
	targetdate STRING COMMENT 'Summary result time'
)
COMMENT 'Product Flow Summary'
PARTITIONED BY (
	aggr_date STRING COMMENT 'Partition time',
	product STRING COMMENT 'Product Code'
)
LIFECYCLE 100000;

After the preparation work is done, we need to configure the corresponding offline tasks in Data Plus. Before configuring the offline tasks, you can test your SQL in the Data Plus "Data Development IDE". Note that testing does cost money, but it is very cheap; the price of a bottle of cola will let you play for a whole day.

Before we begin, we need to understand a few key words:

Virtual node: a node with no actual function; it exists only to make the overall logical relationships easier to understand, and you do not have to create one.

Task node: a node that performs a concrete operation, such as running a SQL statement or synchronizing data.

Scheduling period: this is very important because it determines the granularity at which your task runs. It can be set to minutes, hours, days, weeks, or months, which is enough for most cases.

Offline tasks: it is important to understand that all offline tasks on the Data Plus platform are non-real-time, and all ODPS table queries are non-real-time as well. Although the SQL looks much like what you would run against an ordinary database, you cannot connect your application directly to ODPS to display business data in real time. So at the end of the workflow we generally add a data-synchronization node that copies the ODPS results into our own database, where we can work with them however we like.

Enter the "Data Development" module of the Data Plus platform and create a task on the left. Here we choose a workflow task so that the overall logic is easier to follow. The following is my implementation logic; it may not suit everyone, so adjust it as needed (the picture is not clear, please enlarge it):

I format the CDN logs with SQL statements combined with a few small tricks. Since we are a B2B SaaS provider, we have our own conventions for file storage, for example:

http://www.xxx.com/client123/logo.png

http://www.xxx.com/client999/logo.png

These represent the file URLs of two clients, and we need to extract the client identifiers, namely client123 and client999 in the links, by splitting the string. Here is my SQL statement:

ALTER TABLE cdn_logs_format DROP IF EXISTS PARTITION (log_date='${bdp.system.bizdate}');
INSERT INTO TABLE cdn_logs_format PARTITION (log_date='${bdp.system.bizdate}')
SELECT unix_time
	, TOLOWER(SPLIT_PART(uri, '/', 2)) AS uri_parent
	, method, procotol, domain, uri, return_code
	, rt, response_size, client_ip, remote_address, bacend_server
	, refer_domain, refer_uri, content_type, user_agent
FROM cdn_logs
WHERE log_date = '${bdp.system.bizdate}';

The statement above has three core parts.

The first part maintains the target partition of the cdn_logs_format table: if the partition already exists (i.e. there may be dirty data), its contents are dropped. ${bdp.system.bizdate} is a built-in variable that is replaced with the business date at run time.

The second part uses the SPLIT_PART function to cut the first-level directory out of the URI, which in our business is the client identifier.

The third part inserts the query results into the cdn_logs_format table.
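
If you want to sanity-check the split before wiring it into the task, a quick query like the following works (assuming, as above, that uri holds paths such as '/client123/logo.png'):

SELECT uri
	, TOLOWER(SPLIT_PART(uri, '/', 2)) AS clientcode
FROM cdn_logs
WHERE log_date = '${bdp.system.bizdate}'
LIMIT 10;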

After formatting the basic logs, we need to merge and summarize the access requests of each customer with the following SQL statement:

ALTER TABLE aggr_product_traffic DROP IF EXISTS PARTITION (aggr_date = '${bdp.system.bizdate}', product = 'emp');

INSERT INTO TABLE aggr_product_traffic PARTITION (aggr_date = '${bdp.system.bizdate}', product = 'emp')
SELECT uri_parent AS clientcode
	, datepart(TO_DATE(log_date, 'yyyymmdd'), 'yyyy') AS year
	, datepart(TO_DATE(log_date, 'yyyymmdd'), 'mm') AS month
	, datepart(TO_DATE(log_date, 'yyyymmdd'), 'dd') AS day
	, datepart(TO_DATE(log_date, 'yyyymmdd'), 'hh') AS hour
	, datepart(TO_DATE(log_date, 'yyyymmdd'), 'mi') AS mi
	, SUM(response_size) AS traffic
	, UNIX_TIMESTAMP(TO_DATE('${bdp.system.bizdate}', 'yyyymmdd')) AS create_at
	, to_char(TO_DATE('${bdp.system.bizdate}', 'yyyymmdd'), 'yyyy-mm-dd') AS targetdate
FROM cdn_logs_format
WHERE log_date = '${bdp.system.bizdate}'
GROUP BY uri_parent, log_date;

The statement is essentially a GROUP BY plus a SUM. At this point we have completed the processing of the basic logs and the merging of the traffic data with these two tasks.

5. Synchronize the results to our own database

I won't show screenshots for this step; it is very simple: configure the data source, map the database fields one to one, and it is practically foolproof. Then ask a fellow developer to build a nice chart, and once it is done your customers will see something like this:
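
On the database side, assuming the sync node writes into the hypothetical product_traffic MySQL table sketched near the beginning of this article (adjust the names to your own schema), the customer-facing view can be driven by a query like this:

-- Daily traffic for one client across a month; group by hour instead for an hourly view.
SELECT year, month, day, SUM(traffic) AS bytes
FROM product_traffic
WHERE clientcode = 'client123'
	AND year = 2017 AND month = 11
GROUP BY year, month, day
ORDER BY day;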

6. Afterword

Because this covers only one narrow scenario, it may be too shallow to help people who already use the Data Plus platform regularly. For those who have never touched Data Plus products, though, many of them are actually quite interesting. The only catch is that big data use cases overlap little across fields and scenarios, which makes it hard for shared experience like this to resonate widely.