Pay attention to my public number [big data], more dry goods waiting for you

1.1. Hierarchical structure of data warehouse

Advantages of layering: simplify complex problems, clear data structure (easy to manage), increase data reuse, isolation of original data (decoupling)

The hierarchy function
ods The raw data layer stores the original data, keeping the original appearance without processing
dwd The detail data layer cleans the data in the ODS layer (removes empty values, dirty data, data that exceeds the limit range)
dws The service data layer is lightly aggregated
ads Apply data layer specific requirements

All the tables in the storehouse are external tables

1.2. Basic format of buried point behavior data (basic fields)

Public fields: fields that are included in almost all Android phones. Business fields: fields reported by buried sites, with specific business types. Here is an example, which represents the uploading of business fields. Behavioral Data Launch Log/Event Log Table Key fields:

{" ap ":" XXXXX ", / / project data source app PC "cm" : {/ / public fields "mid" : "", / / a unique identifier (String) equipment" uid ":" ", / / (String) user id "vc" : "1", // (String) versionCode, program version number "vn": "1.0", // (String) versionName, program versionName "l": "zh", // (String) system language "sr": "", // (String) channel number, which channel the application came from. "OS" : "7.1.1, / / Android version" ar ": (String)" CN "/ /" md ": (String) area" BBB100-1 ", / / phone models "ba" : (String) "Blackberry", / / (String) mobile phone brand "sv" : "V2.2.1", / / (String) sdkVersion "g" : "", / / (String) gmail" hw ": // (String) height-width: "1506047606608", // (String) height-width: "1506047606608", // (String) height-width: "1506047606608", "Wi-fi," / / "ln" (String) network model: 0, / / (double) LNG longitude "la" : 0} / / (double) lat latitude, "et" : [{/ / event "ett" : "1506047605364", // client event generation time "en": "display", // event name startup and event log are based on the event name "kv": {// event result, with key-value self-defined "goodsid": "236", "action": "1", "extend1": "1", "place": "2", "category": "75" } } ] }

It can be divided into different log tables depending on the event label

1.3. Table introduction of each layer

1.3.1, ods layer

1) ODS_START_LOG starts the log table

  • There is only one field, LINE (which holds JSON), partitioned by date DT, table format: LZO

2) ODS_EVENT_LOG event log table (same format as startup log table)

  • There is only one field, LINE, partitioned by date DT, table format: LZO

1.3.2, DWD layer

1) DWD_START_LOG start table

  • Key fields: mid_id, user_id, dt(partitioned field, partitioned by date) (actually this is a common field in the startup table and event table)
  • Line from ods_start_logGet_json_object mid_id (line, '$. Mid)The way to get the field

1.3.2.1 Custom UDF/UDTF (application in project)

  • Custom UDF functions (parsing public fields, one in one out)
  • Custom UDTF functions (parsing specific event fields, one in, many out)

    • Custom UDF: Inherit the UDF, overriding the evaluate method
    • Custom UDTF: inherits from GenericUDTF and overwrites three methods: initialize(the column name and type of the custom output), process (return the result to a forward(result)), and close
  • Why to customize the UDF/UDTF, because the custom function, can bury their own Log print Log, error or data exception, convenient debugging.

1.3.2.2 Event log basic list

DWD_BASE_EVENT_LOG basic list of events

  • 1) Key fields:

    • Public fields: mid_id, user_id, dt(partition field) and event_name, event_json, server_time
  • 2) Get public field and server_time from ods_event_log line with UDF, Event_name and Event_json with UDTF.

1.3.2.3 Commodity click table

Dwd_display_log merchandise click list

  • Key fields: public field + unique field
  • Get public fields and server_time directly from dwd_base_event_log, get specific fields from event_json of dwd_base_event_log,where event_name = "display"
  • get_json_object(event_json,'$.kv.action') action

1.3.2.4 Schedule of other specific events

similar

Show that Table annotation
dwd_newsdetail_log Product details page sheet
dwd_loading_log Product list page table
dwd_ad_log Advertising table
dwd_notification_log Message notification table
dwd_active_foreground_log User foreground active table
dwd_active_background_log User background active table
dwd_comment_log Comments on the table
dwd_favorites_log Collect the table
dwd_praise_log Thumb up table
dwd_error_log Error log table

A total of 11 specific event lists can be obtained from an event base list dwd_base_event_log

Second, demand analysis

2.1. Active topics of users

2.1.1 DWS layer daily activity list

Daily active device analysis

2.1.2 Schedule of DWS Layer Periods

Weekly active equipment analysis

2.1.3 Monthly Activity List of DWS Layer

Monthly active device analysis

2.1.4 Table of daily, weekly and monthly active devices in ADS layer

Active device analysis

2.2. Users add new themes

2.2.1 DWS layer daily new list

2.2.2 Table of Daily Newly Added Devices in ADS Layer

2.3 Retention Themes

2.3.1 User Retention Introduction

2.3.2 User Retention Analysis

2.3.3 Schedule of daily retention of DWS layer

2.3.4 Table of retained users in ADS layer

2.3.5 Table of retained user rate of ADS layer

2.4 Silent users

2.5 Number of backflow users this week

2.6 Number of lost users

2.7 Number of active users in the last 3 consecutive weeks

2.8 Number of active users for three consecutive days in the last seven days

2.9. Demand logic

2.9.1 How to analyze user activity?

Count the occurrence of different device IDs in the startup log.

2.9.2 How to analyze new users?

ADD TABLE (LEFT JOIN) ADD TABLE (LEFT JOIN) ADD TABLE (LEFT JOIN) ADD TABLE (LEFT JOIN)

2.9.3 How to analyze 1-day retention of users?

Today’s Active User Retention Rate = Retained Users/Added Last Day

2.9.4 How to analyze silent users?

(7 days ago, and only once) Group the daily list by device ID. The number of login is 1, and the login is one week ago.

2.9.5 How to analyze this week’s backflow users?

A left join was active last week. A left join was active last week

2.9.6 How to analyze lost users?

(Login time is 7 days ago) Group the calendar by device ID and have not logged in within 7 days.

2.9.7 How to analyze the number of active users in the last three consecutive weeks?

The weekly activity is grouped according to device ID, and the number of statistics is greater than 3 times.

2.9.8 How to analyze the number of active users for three consecutive days in the last seven days?

  • 1) Search the active users in the last 7 days and rank the active dates of users
  • 2) Calculate the difference between user active dates and rank
  • 3) Group the same users and differences, and count the number of differences
  • 4) Take out the data with the same difference value of more than or equal to 3, and then repeat it (what weight should be removed??). , i.e., active users for 3 consecutive days or more