I. Business scenario analysis:

Because of the complexity of the business, developers have to deal with different types of data from different sources. This data needs to be extracted into the data platform, and the key business fields extracted according to the designed data model to form a two-dimensional table, so that subsequent statistical analysis and correlation calculations can be performed on the big data platform / in the data warehouse.

This article uses a specific case to introduce how to process and convert JSON-formatted log data with MaxCompute.

1. Data source: the application writes log data in real time to log files in a specified directory on an ECS host.

2. Data format: each log entry in the log file contains device information and one or more sessions (the data in the example is simplified and desensitized). An example log entry is shown below:
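An illustrative sketch of the log structure, assembled from the field names referenced by the SQL and UDTF later in this article (all values are placeholders):

{
    "DeviceID": "device-001",
    "UniqueIdentifier": "uid-0001",
    "GameID": "game-01",
    "Device": "Android",
    "Sessions": [
        {
            "SessionID": "session-001",
            "Events": [
                { "Name": "login", "Timestamp": "2018-07-25 12:00:00", "NetworkStatus": "WiFi" }
            ]
        },
        {
            "SessionID": "session-002",
            "Events": [
                { "Name": "logout", "Timestamp": "2018-07-25 12:30:00", "NetworkStatus": "4G" }
            ]
        }
    ]
}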



3. Data processing requirements: collect the log data, parse and convert it, and perform statistical analysis on the converted data in MaxCompute. The log data is in JSON format and contains multiple business fields. Subsequent business statistics (such as PV/UV by time segment, device type, device ID, and member information) can only be performed in MaxCompute after these business fields have been extracted into columns. So the key requirement of this article is how to parse the key information of JSON-formatted data into a two-dimensional table of business fields.

II. Solution:

The solution in this article uses the Alibaba Cloud Log Service + MaxCompute product combination to meet the above business requirements. Log Service is only responsible for log collection and delivery; it does not parse or transform the data.

1. Log collection: use Log Service to collect the log data into a Logstore (for details, see the Log Service help documentation).

2. Use the log shipping (delivery) feature of Log Service (see its help documentation) to archive the logs into an original log table in MaxCompute. The full content of each log is written into the content field of the original log table.
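A minimal sketch of what the original log table might look like; the table name log_target_json and partition column pt match the SQL used later, while any other fields added by Log Service delivery are omitted here as an assumption:

create table if not exists log_target_json (
    content string -- the full JSON text of one log entry
) partitioned by (pt string);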



3. Use MaxCompute to parse and extract the original data.

1) Use the built-in function get_json_object for data extraction

select
get_json_object(content,'$.DeviceID') as DeviceID,
get_json_object(content,'$.UniqueIdentifier') as UniqueIdentifier,
get_json_object(content,'$.GameID') as GameID,
get_json_object(content,'$.Device') as Device,
get_json_object(content,'$.Sessions[0].SessionID') as Session1_ID,
get_json_object(content,'$.Sessions[0].Events[0].Name') as Session1_EventName,
get_json_object(content,'$.Sessions[1].SessionID') as Session2_ID,
get_json_object(content,'$.Sessions[1].Events[0].Name') as Session2_EventName
from log_target_json where pt='20180725' limit 10;

The extraction results are as follows:

Solution summary: the processing logic above extracts the business fields of one log into the fields of one row. The information in each JSON record is fixed and can be mapped directly to table fields; for example, the information of Session1 and Session2 is extracted into separate columns. However, if the number of sessions contained in each log is dynamic rather than fixed, this logic no longer meets the requirements. For example, if the next log contains three sessions, Session3_ID and Session3_EventName must be added to the parsing SQL; and what if another log contains 100 sessions? That becomes much harder to handle.

Such a case can be handled with a custom UDTF (user-defined table-valued function).

2) Develop a MaxCompute UDTF to process the logs

Based on the characteristics of the data, each log contains multiple sessions, i.e. a 1:N relationship. When the log is converted into a two-dimensional table in the data warehouse, the session information needs to be parsed down to the finest granularity, converting 1 row into N rows so that all session information is extracted. The business objective is as follows:
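A minimal sketch of the target mapping, using the field names assumed above: a log containing two sessions, each with one event, is converted into two rows, each carrying the shared device-level fields plus one session/event:

Row 1: DeviceID, UniqueIdentifier, GameID, Device, SessionID (session 1), Name, Timestamp, NetworkStatus (event of session 1)
Row 2: DeviceID, UniqueIdentifier, GameID, Device, SessionID (session 2), Name, Timestamp, NetworkStatus (event of session 2)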

In MaxCompute, converting one input record into multiple output records is implemented with a UDTF.

Using a Java UDTF as an example, we parse each JSON record in the content field and return the business fields that need to be extracted. The processing logic of this UDTF drills down to the third level of the JSON structure, loops over the finest-grained data, and returns multiple records.


package com.aliyun.odps;

import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.annotation.Resolve;
import com.google.gson.Gson;

import java.io.IOException;
import java.util.List;
import java.util.Map;

@Resolve("string->string,string,string,string,string,string,string,string")
public class get_json_udtf extends UDTF {
    @Override
    public void process(Object[] objects) throws UDFException, IOException {
        String input = (String) objects[0];
        Map map = new Gson().fromJson(input, Map.class);

        Object deviceID = map.get("DeviceID");
        Object uniqueIdentifier = map.get("UniqueIdentifier");
        Object gameID = map.get("GameID");
        Object device = map.get("Device");

        List sessions = (List) map.get("Sessions");
        for (Object session : sessions) {
            Map sMap = (Map) session;
            Object sessionID = sMap.get("SessionID");
            List events = (List) sMap.get("Events");
            for (Object event : events) {
                String name = (String) ((Map) event).get("Name");
                String timestamp = (String) ((Map) event).get("Timestamp");
                String networkStatus = (String) ((Map) event).get("NetworkStatus"); forward(deviceID, uniqueIdentifier,gameID,device, sessionID,name,timestamp,networkStatus); }}}}Copy the code
Note: for details on writing, packaging, and uploading UDFs and creating functions, please refer to the official documentation at https://help.aliyun.com/document_detail/27867.html.
After the program is written, package it, upload the JAR, and create the function.
In the MaxCompute client (odpscmd), upload the resource:
add jar maxcompute_demo-1.0-SNAPSHOT.jar -f;
Then create the function from the command line:
create function get_json_udtf as 'com.aliyun.odps.get_json_udtf' using 'maxcompute_demo-1.0-SNAPSHOT.jar';

View the function after creation:
Test verification:
To query the table containing the original logs, use the created get_json_udtf on the content field:
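A minimal sketch of such a query, assuming the original log table and partition from the earlier example (the output column aliases are illustrative):

select get_json_udtf(content) as (device_id, unique_identifier, game_id, device, session_id, event_name, event_timestamp, network_status)
from log_target_json
where pt = '20180725'
limit 10;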
The query result is as follows. The UDTF processes each JSON record and generates multiple records, which meets expectations:
At the same time, if you need to solidify the processing logic, you can use the insert into syntax to write the parsed results into a new table, and implement periodic data conversion through job scheduling.
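A minimal sketch, assuming a pre-created target table (here named log_parsed, an illustrative name) whose columns match the UDTF output:

insert into table log_parsed partition (pt='20180725')
select get_json_udtf(content) as (device_id, unique_identifier, game_id, device, session_id, event_name, event_timestamp, network_status)
from log_target_json
where pt = '20180725';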


III. Summary:

Taking a common JSON log processing requirement in a log analysis scenario as an example, this article has introduced how to collect logs into MaxCompute through Log Service, and how to use MaxCompute built-in functions and UDFs to parse and convert JSON-formatted log data, extracting the key business fields and generating log tables for subsequent analysis.