Introduction

This tutorial introduces the metadata discovery and data exploration functions of the Data Lake Formation (DLF) product through an analysis of a sample Taobao user-behavior data set. The steps are:

1. Service activation: activate an Alibaba Cloud account and the DLF and OSS services.

2. Data import: upload the data to be analyzed into OSS.

3. DLF metadata discovery: use DLF to automatically identify the file schema and create metadata tables.

4. DLF data exploration: use DLF data exploration to analyze user behavior, including user activity and a funnel model.

Data set description

The data set used in this tutorial comes from the Taobao user-behavior data set used in the Aliyun Tianchi competition. To improve performance, we trimmed it down. The data set stores user behavior and sample product data in CSV format. Data range: December 1, 2014 to December 7, 2014.

1. Enable the DLF and OSS services (if already enabled, skip to Step 2)

1.1 Log in to your Alibaba Cloud account and click to enter the DLF console page.

1.2 Enable the DLF service and the OSS service it depends on, and complete the authorization. (Skip this step if the services are already enabled.)

If the DLF service has not been enabled, the system prompts you to enable it; click “Enable Service” to go to the activation page. After the service is enabled, return to the DLF console page, where you are prompted to enable OSS and to grant DLF access to the data sources it depends on. Click the button to complete the OSS activation and authorization. Back on the DLF console page, click Refresh to check the status; once both services are ready, the DLF console page is displayed.

1.3 After the services are enabled, the DLF console home page is displayed:

2. Import the data to be analyzed into OSS

2.1 Click the link to download the sample data to your local disk.

After decompression, you get the folder user_behavior_data, which contains the item and user folders and their CSV files. This analysis mainly focuses on the user file; its data content is shown below.

2.2 Upload the files to OSS. Log in to the OSS console and upload the files to an existing bucket, or create a new bucket.

Upload the decompressed user_behavior_data folder. After the upload, the directory structure is as follows; item and user are the data folders of the two tables.

3. Extract metadata on DLF

3.1 Create a metadata database. A DLF metadata database can be regarded as a database in a relational database system.

In the DLF console, go to the metadata database page and create a metadata database. Fill in the database name and select the OSS path where the user-behavior data was just stored. As shown in the following figure, the metadata database is created successfully.
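If you prefer SQL over the console UI, an equivalent DDL sketch follows. This is our assumption that the Data Exploration Spark SQL endpoint (introduced in section 4) accepts DDL; the path oss://your-bucket/user_behavior_data/ is a hypothetical placeholder for the OSS location you chose.

-- Hypothetical DDL equivalent of the console steps above (a sketch).
-- 'oss://your-bucket/user_behavior_data/' is a placeholder path.
CREATE DATABASE IF NOT EXISTS dlf_demo
LOCATION 'oss://your-bucket/user_behavior_data/';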

3.2 Discover metadata table information in the OSS files

Go to the DLF metadata extraction page and click “New Extraction Task”. Fill in the data source configuration and click Next. A RAM role is required; the default role “AliyunDLFWorkFlowDefaultRole” was already authorized during the service activation step. For the run mode, select Manual. For the extraction strategy, select “Fast mode” to complete metadata discovery at the fastest speed.

After verifying the information, click Save and Execute Now.

The metadata extraction list page is displayed, and the newly created task runs automatically.

About 10 seconds later, the task is complete. Hover over the question mark icon in the status column to see that two metadata tables have been created successfully.

Click the “metadata database” link in the floating layer to directly view the table information in the database.

Click into the table details to confirm whether the extracted table structure meets expectations.
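Besides the console view, once you reach the Data Exploration page in section 4, you can also verify the discovered schema with standard Spark SQL; a quick sketch, assuming the database is named dlf_demo as in this tutorial:

-- List the tables created by the extraction task, then inspect one schema.
SHOW TABLES IN `dlf_demo`;
DESCRIBE `dlf_demo`.`user`;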

At this point, we have used DLF to automatically discover the schema of the CSV files in the data lake. Next, we start analyzing the data in the data lake.

4. User behavior data analysis

4.1 Overview of the data analysis. On the DLF console page, click “Data Exploration” > “Spark SQL” to enter the data exploration page. The data analysis process is divided into three steps:

  1. Preview and review the data information.

  2. Simple data cleaning.

  3. Conduct user activity, funnel model, and product popularity analysis.

4.2 Preview and check the data. Enter the following statements in the query box to view the data in the files.

SELECT * FROM `dlf_demo`.`user` LIMIT 10;
SELECT * FROM `dlf_demo`.`item` LIMIT 10;
SELECT COUNT(DISTINCT user_id) FROM `dlf_demo`.`user`;
SELECT COUNT(DISTINCT item_id) FROM `dlf_demo`.`item`;
SELECT COUNT(*) FROM `dlf_demo`.`user`;

The data content is as follows:

4.3 Data preprocessing. To improve readability and analysis performance, we process the original data: convert behavior_type into a more understandable string; flatten the date + time format into two separate columns, plus day-of-week information, to facilitate separate date- and hour-level analysis; and filter out unneeded fields. The result is stored in a new table, user_log, in Parquet format, partitioned by date. The user behavior analysis that follows is based on this new table.

-- Create a new table user_log in Parquet format, partitioned by date.
-- While importing, split date and hour, and convert behavior_type to a more
-- readable value: 1-click; 2-collect; 3-cart; 4-pay.
CREATE TABLE `dlf_demo`.`user_log`
USING PARQUET
PARTITIONED BY (date)
AS SELECT
  user_id,
  item_id,
  CASE
    WHEN behavior_type = 1 THEN 'click'
    WHEN behavior_type = 2 THEN 'collect'
    WHEN behavior_type = 3 THEN 'cart'
    WHEN behavior_type = 4 THEN 'pay'
  END AS behavior,
  item_category,
  time,
  date_format(time, 'yyyy-MM-dd') AS date,
  date_format(time, 'H') AS hour,
  date_format(time, 'u') AS day_of_week
FROM `dlf_demo`.`user`;

SELECT * FROM `dlf_demo`.`user_log` LIMIT 10;
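Before moving on, it is worth sanity-checking the new table; a small sketch of our own, not part of the original steps:

-- One partition per date should exist for 2014-12-01 through 2014-12-07.
SHOW PARTITIONS `dlf_demo`.`user_log`;
-- Row counts per partition.
SELECT date, COUNT(*) AS cnt
FROM `dlf_demo`.`user_log`
GROUP BY date
ORDER BY date;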

4.4 User behavior analysis. First, based on the funnel model, we analyze the conversion of all users from click, to cart/collect, to purchase.

-- Funnel analysis (about 13 seconds)
SELECT behavior, COUNT(*) AS total
FROM `dlf_demo`.`user_log`
GROUP BY behavior
ORDER BY total DESC;

The results are as follows:
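The raw counts can also be turned into conversion rates. A sketch of our own, assuming the user_log table from section 4.3; dividing the pay count by the click count gives the overall click-to-pay conversion:

-- Click-to-pay conversion rate (a sketch, not from the original tutorial).
WITH funnel AS (
  SELECT behavior, COUNT(*) AS total
  FROM `dlf_demo`.`user_log`
  GROUP BY behavior
)
SELECT
  MAX(CASE WHEN behavior = 'pay' THEN total END) * 1.0
    / MAX(CASE WHEN behavior = 'click' THEN total END) AS click_to_pay_rate
FROM funnel;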

Next, we perform a statistical analysis of user behavior for each day of the week.

-- User behavior analysis (about 14 seconds)
SELECT date, day_of_week,
       COUNT(DISTINCT user_id) AS uv,
       SUM(CASE WHEN behavior = 'click' THEN 1 ELSE 0 END) AS click,
       SUM(CASE WHEN behavior = 'cart' THEN 1 ELSE 0 END) AS cart,
       SUM(CASE WHEN behavior = 'collect' THEN 1 ELSE 0 END) AS collect,
       SUM(CASE WHEN behavior = 'pay' THEN 1 ELSE 0 END) AS pay
FROM `dlf_demo`.`user_log`
GROUP BY date, day_of_week
ORDER BY date;

The analysis results are as follows. (Due to the trimming of the data set, the results for working days and non-working days are distorted.)
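The hour column created in section 4.3 also makes hour-level analysis possible. A sketch of our own, using the same table; note that date_format stored hour as a string, so we cast it for ordering:

-- Hourly activity (a sketch, not from the original tutorial).
SELECT hour,
       COUNT(DISTINCT user_id) AS uv,
       SUM(CASE WHEN behavior = 'pay' THEN 1 ELSE 0 END) AS pay
FROM `dlf_demo`.`user_log`
GROUP BY hour
ORDER BY CAST(hour AS INT);

Finally, we join the commodity (item) table to analyze the most popular commodity categories in the data set.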

-- Most popular commodity categories among paid orders
SELECT item.item_category,
       COUNT(*) AS times
FROM `dlf_demo`.`item` item
JOIN `dlf_demo`.`user_log` log
  ON item.item_id = log.item_id
WHERE log.behavior = 'pay'
GROUP BY item.item_category
ORDER BY times DESC
LIMIT 10;

The results are as follows:

4.5 Download the analysis results. DLF allows you to download analysis results as CSV files. To enable this function, you must first set the OSS path where analysis results are saved; after the configuration, query results are saved to that directory. You can download the result files directly through the download button on the DLF page, or access and download them through OSS. Set the path by clicking the “Path Settings” button on the page.

You can select an existing folder or create a new one.

4.6 Save the SQL. You can click the “Save” button to save the SQL used in this analysis, and open it later in “Saved Queries” for further invocation and modification.

Summary

This article introduces and practices the metadata discovery and data exploration functions of the Data Lake Formation (DLF) product through a simple Taobao user behavior analysis case.

If you have any questions, or want to explore data lake technology further, please join the data lake technology group and follow the data lake technology circle.

References

Tianchi Taobao user behavior data set: tianchi.aliyun.com/dataset/dat…

Spark SQL date type conversion: spark.apache.org/docs/latest…

During the Data Lake event from June 24 to July 2, 2021, you can receive a Polo shirt after completing the scenario, first come, first served. Address: developer.aliyun.com/adc/series/…