Introduction

This tutorial introduces the data discovery and data exploration functions of Data Lake Formation (DLF) by using DLF to analyze a sample of Taobao user behavior data. The course covers:

1. Service activation: activate an Alibaba Cloud account and the DLF/OSS services.
2. Sample dataset download and import: download the sample data (CSV files) and upload it to OSS.
3. DLF data discovery: use DLF to automatically identify the file schema and create metadata tables.
4. DLF data exploration: use DLF data exploration to analyze user behavior, including user activity and a funnel model.

Dataset description

The dataset for this test comes from the Taobao user behavior dataset used in the Aliyun Tianchi competition. To improve performance, we trimmed it down somewhat. The dataset stores user behavior and product sample data in CSV format. Data range: December 1, 2014 to December 7, 2014. Data format:
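The original format table is not reproduced here; the following sketch is inferred from the SQL used later in this tutorial, and the field comments and sample value are assumptions rather than the official specification:

-- Fields of the user table, inferred from the queries in section 4
-- (descriptions and the example value are assumptions):
--   user_id        -- anonymized user ID
--   item_id        -- product ID
--   behavior_type  -- action type: 1 = click, 2 = collect, 3 = cart, 4 = pay
--   item_category  -- product category ID
--   time           -- action time, e.g. '2014-12-01 18' (hypothetical sample)
-- The item table carries at least item_id and item_category.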

Activate the DLF and OSS services (if already activated, skip directly to Step 2)

1.1 Log in to your Alibaba Cloud account and click to enter the DLF console page.

1.2 Activate DLF and its dependent OSS service and complete the authorization. (If they are already activated, skip this step.)

If the DLF service has not been activated before, you will be prompted to activate it. Click “Open Service” to enter the service activation page.



After the service has been activated, return to the DLF console page. You will be prompted to activate the OSS service and grant DLF access to its dependent data sources. Click the button to complete the OSS activation and authorization, then go back to the DLF console page and click Refresh to check the status. The page automatically jumps to the DLF console homepage.



1.3 After activation, enter the DLF console homepage:

Import the data to be analyzed into OSS

2.1 Click the link to download the sample data to your local disk.

The user_behavior_data folder contains the item and user folders, each containing its respective CSV data files. This analysis focuses mainly on the user data, whose content looks as follows.



2.2 Upload the files to OSS. Enter the OSS console and upload the files to an existing bucket, or create a new bucket.

Decompress the user_behavior_data folder and upload it. After uploading, the directory structure looks as shown below; item and user are the data folders for the two tables.

Extract metadata with DLF

In DLF, a metadata database can be understood as a database in a relational database; the next level down is the table.

3.1 Create the metadata database. In the DLF console, enter the metadata database page and create the metadata database.



Enter the database name and select the OSS path containing the user behavior data that was just uploaded.



As shown in the figure below, the metadata database was created successfully.



3.2 Discover metadata table information from the OSS files

Enter the DLF metadata extraction page and click “New Extraction Task”.



Fill in the data source configuration and click Next.



For the RAM role, use “AliyunDLFWorkFlowDefaultRole”, which was authorized by default during the service activation process. For the run mode, select “Manual Execute”. For the extraction strategy, select “fast mode” to complete metadata discovery as quickly as possible.



After checking the information, click “Save and execute now”.



The system will jump to the metadata extraction list page, and the new task will be created and run automatically.



After about 10 seconds, the task finishes. Hover over the question mark icon in the status bar to see that two metadata tables have been successfully created.



Click on the “Metadata Database” link in the floating layer to view the relevant table information in the library directly.



Click into the table details to confirm whether the extracted table structure meets expectations.
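If you want to double-check from the query side as well, a minimal sketch in the Spark SQL query box (assuming the metadata database is named dlf_demo, as in the rest of this tutorial):

-- List the tables discovered by the extraction task
SHOW TABLES IN `dlf_demo`;
-- Inspect the columns of the discovered user table
DESCRIBE TABLE `dlf_demo`.`user`;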

At this point, we have used DLF to automatically discover the schema of the CSV files in the data lake. Next, we start analyzing the data in the data lake.

User behavior data analysis

4.1 Overview of data analysis

From the DLF console page, click the menu “Data Exploration” – “Spark SQL” to enter the Data Exploration page.



The process of data analysis is mainly divided into three steps:

  1. Preview and review the data information.
  2. Simple data cleaning.
  3. Analyze user activity, funnel model and product heat.

4.2 Preview and check the data. Enter the following statements in the query box to view the data in the files.

SELECT * FROM `dlf_demo`.`user` LIMIT 10;
SELECT * FROM `dlf_demo`.`item` LIMIT 10;
SELECT COUNT(DISTINCT user_id) FROM `dlf_demo`.`user`;
SELECT COUNT(DISTINCT item_id) FROM `dlf_demo`.`item`;
SELECT COUNT(*) FROM `dlf_demo`.`user`;

The data content is as follows:



4.3 Data Preprocessing

We do some preprocessing of the original data to improve its readability and the performance of the analysis: convert behavior_type into a more understandable string; split the combined date + time format into two separate columns, plus day-of-week information, to facilitate day-level and hour-level analysis; and filter out unneeded fields. The result is stored in a new table, user_log, in PARQUET format, partitioned by date. Subsequent user behavior analysis is based on this new table.

-- Data conversion takes about 40 seconds
-- Create a new table user_log in Parquet format, partitioned by date
-- Import the data, splitting out date and hour, and convert behavior_type
-- to a more readable value: 1 = click; 2 = collect; 3 = cart; 4 = pay
CREATE TABLE `dlf_demo`.`user_log`
USING PARQUET
PARTITIONED BY (date)
AS SELECT
  user_id,
  item_id,
  CASE
    WHEN behavior_type = 1 THEN 'click'
    WHEN behavior_type = 2 THEN 'collect'
    WHEN behavior_type = 3 THEN 'cart'
    WHEN behavior_type = 4 THEN 'pay'
  END AS behavior,
  item_category,
  time,
  date_format(time, 'yyyy-MM-dd') AS date,
  date_format(time, 'H') AS hour,
  date_format(time, 'u') AS day_of_week
FROM `dlf_demo`.`user`;

SELECT * FROM `dlf_demo`.`user_log` LIMIT 10;
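As a quick sanity check (a hedged extra, not one of the original steps), you can confirm that the table really was partitioned by date:

-- List the date partitions of the new table; with this trimmed dataset it
-- should show one partition per day from 2014-12-01 to 2014-12-07
SHOW PARTITIONS `dlf_demo`.`user_log`;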

4.4 Analysis of user behavior. First, based on the funnel model, we calculate the conversion of all users from click, to cart/collect, to pay.

-- Funnel analysis takes about 13 seconds
SELECT behavior, COUNT(*) AS total
FROM `dlf_demo`.`user_log`
GROUP BY behavior
ORDER BY total DESC;

The results are as follows:



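As a hedged extension of the funnel (not part of the original tutorial), each stage can also be expressed as a ratio of the click count, using a window function over the aggregated totals:

-- Express each funnel stage relative to clicks; in Spark SQL the division
-- of two bigints yields a double
SELECT
  behavior,
  total,
  ROUND(total / MAX(CASE WHEN behavior = 'click' THEN total END) OVER (), 4) AS ratio_vs_click
FROM (
  SELECT behavior, COUNT(*) AS total
  FROM `dlf_demo`.`user_log`
  GROUP BY behavior
) t
ORDER BY total DESC;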
Next, we perform a statistical analysis of user behavior on each day of the week.

SELECT
  date,
  day_of_week,
  COUNT(DISTINCT user_id) AS uv,
  SUM(CASE WHEN behavior = 'click' THEN 1 ELSE 0 END) AS click,
  SUM(CASE WHEN behavior = 'cart' THEN 1 ELSE 0 END) AS cart,
  SUM(CASE WHEN behavior = 'collect' THEN 1 ELSE 0 END) AS collect,
  SUM(CASE WHEN behavior = 'pay' THEN 1 ELSE 0 END) AS pay
FROM `dlf_demo`.`user_log`
GROUP BY date, day_of_week
ORDER BY date;

The analysis results are as follows (because the dataset has been trimmed, the results for working days versus non-working days are somewhat distorted):


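The preprocessing in section 4.3 also produced an hour column that the original analysis leaves unused; a hedged extra query sketching the hourly activity distribution:

-- Hourly activity distribution; hour was stored as a string by date_format,
-- so cast it to an integer for a natural sort order
SELECT
  hour,
  COUNT(DISTINCT user_id) AS uv,
  COUNT(*) AS actions
FROM `dlf_demo`.`user_log`
GROUP BY hour
ORDER BY CAST(hour AS INT);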

Finally, joining with the item table, we analyze the most popular product categories in the dataset.

SELECT
  item.item_category,
  COUNT(*) AS times
FROM `dlf_demo`.`item` item
JOIN `dlf_demo`.`user_log` log
  ON item.item_id = log.item_id
WHERE log.behavior = 'pay'
GROUP BY item.item_category
ORDER BY times DESC
LIMIT 10;

The results are as follows:



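Note that the user_log table created in section 4.3 already carries item_category, so per-category statistics can also be computed without the join; a hedged alternative that compares clicks and payments per category:

-- Per-category clicks vs. payments straight from user_log; no join is needed
-- because item_category was kept during preprocessing
SELECT
  item_category,
  SUM(CASE WHEN behavior = 'click' THEN 1 ELSE 0 END) AS clicks,
  SUM(CASE WHEN behavior = 'pay' THEN 1 ELSE 0 END) AS pays
FROM `dlf_demo`.`user_log`
GROUP BY item_category
ORDER BY pays DESC
LIMIT 10;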
4.5 Download the analysis results

DLF can download analysis results as CSV files. To enable this function, the save path (an OSS path) for the analysis results must be set in advance. Once set, query results are saved to that path. You can download the files directly via the “Download” button on the DLF page, or access and download them directly through OSS. Click the “Path Set” button on the page to configure it.



You can select an existing folder or create a new one.



4.6 Save the SQL

By clicking the “Save” button, you can save the SQL used in this analysis and later open it directly from “Save Query” for further use and modification.

Through a simple Taobao user behavior analysis case, this tutorial has introduced and practiced the metadata discovery and data exploration functions of Data Lake Formation (DLF).

If you have any questions, or want to discuss data lake technology further, please join the data lake technical group and follow the data lake technology circle.

References

Tianchi Taobao user behavior dataset: https://tianchi.aliyun.com/da…

Spark SQL date type conversion: https://spark.apache.org/docs…

If you complete the scenario on a weekday between June 24, 2021 and July 2, 2021, you will receive a polo shirt, first come, first served. Claim address: https://developer.aliyun.com/…