Article directory

  • 0x00 Tutorial content
  • 0x01 Project Analysis
          • 1. Project background
          • 2. Learn and gain
          • 3. Data source introduction
          • 4. Overall project process
          • 5. Final data structure
  • 0x02 Programming implementation
          • 1. Build the Maven project
          • 2. Preparation before coding
          • 3. Obtain source data
          • 4. Parse log source data
          • 5. Clean logs
  • 0x03 Thought Review
  • 0xFF Summary

0x00 Tutorial content

  1. Project analysis
  2. Programming implementation

Basic knowledge and environment preparation: 1. You can build a Maven project and have learned Scala programming; 2. The Scala plug-in is installed in IDEA.

0x01 Project Analysis

1. Project background

When we browse a website, a lot of our information is collected in its backend. There are many ways to collect it; I may write a detailed tutorial on collection later if I get the chance. This tutorial focuses on the data after collection and walks through a cutting process on it. For background on collection, you can refer to the article "Three ways to collect website analytics data (i.e. user behavior data) explained in detail" to get an overview; if time permits, I will put together a corresponding hands-on tutorial.

Session cutting: a pre-processing step of user behavior analysis.

2. Learn and gain

After this tutorial, you will know how to implement the whole session-cutting process as done at work, understand what the data looks like before and after cutting, and become more familiar with the Scala API and other development skills.

3. Data source introduction

So far, we have collected three kinds of data: the website user click log, the user cookie labels, and the website domain-name labels.

I. Website user click log (stored in HDFS), in the following format:

#type|server time|cookie|ip|url
pageview|2017-09-04 12:00:00|cookie1|127.0.0.3|https://www.baidu.com
click|2017-09-04 12:00:02|cookie1|127.0.0.3|https://www.baidu.com
pageview|2017-09-04 12:00:01|cookie2|127.0.0.4|https://www.baidu.com
click|2017-09-04 12:00:04|cookie1|127.0.0.3|https://www.baidu.com
pageview|2017-09-04 12:00:02|cookie2|127.0.0.4|http://news.baidu.com
click|2017-09-04 12:00:03|cookie2|127.0.0.4|http://news.baidu.com
pageview|2017-09-04 12:00:04|cookie2|127.0.0.4|http://music.baidu.com/?fr=tieba
pageview|2017-09-04 12:45:01|cookie1|127.0.0.3|https://tieba.baidu.com/index.html
click|2017-09-04 12:45:02|cookie1|127.0.0.3|https://tieba.baidu.com/index.html
click|2017-09-04 12:45:03|cookie1|127.0.0.3|https://tieba.baidu.com/index.html
hhhh|2017-09-04 12:45:03|cookie1|127.0.0.3|https://tieba.baidu.com/index.html
3333ss|2017-09-04 12:45:03|cookie1|127.0.0.3|https://tieba.baidu.com/index.html

Explanation: this data can be collected in the website's backend. When we visit a page once, that is one behavior, which we call a pageview behavior; when we click a button on a page, that is a click behavior; scrolling a page, selecting text and so on are also behaviors. Every behavior generates a record, and what that record contains you can see in the data above; all of it can be collected in the backend. For simplicity, I will only use pageview and click as examples, both of which are inseparable from the session mechanism. In the data, the first line gives the meaning of each column; every other line is separated into columns by "|". Because there can be a huge amount of this behavior data, we store it in HDFS.

Session: In computers, especially in network applications, it is called “Session control”. The Session object stores properties and configuration information required for a specific user Session. This way, variables stored in the Session object will not be lost when the user jumps between Web pages of the application, but will persist throughout the user Session. When a user requests a Web page from an application, the Web server automatically creates a Session object if the user does not already have a Session. When a session expires or is abandoned, the server terminates the session. One of the most common uses of the Session object is to store user preferences. For example, if the user indicates that he does not like to view graphics, that information can be stored in the Session object. For more information about using Session objects, see “Managing Sessions” in the “ASP Applications” section. Note that session state is reserved only in browsers that support cookies. — From Baidu Encyclopedia

Simple description: if you know a bit about browsers, you probably already know what a session is: a session is a conversation. When we browse a website, we start a session with it, just as opening an XShell window to access a server is also a session; and you will notice that if the window is left untouched for a long time, it disconnects automatically. The same is true when browsing a website: the session duration is 30 minutes by default. The session-cutting job we do here follows that rule: based on the times recorded in the website's backend, records of the same user that are more than 30 minutes apart are cut into different sessions. We already have a rough idea that a session is a bit like a conversation with the site, and for this conversation the web server can automatically create a Session object holding various properties; the behaviors recorded within it include pageview, click, and so on.

II. User cookie labels (stored in HDFS), in the following format:

cookie1|stubborn
cookie2|biased
cookie3|persistent
cookie4|strong execution

Explanation: each cookie has been tagged with a corresponding type as a label. This labeling can be done by the data team based on machine-learning algorithms; we don't need to worry about it here, and I may give a tutorial on it later. Because this data can also be large, we store it in HDFS as well.

Cookies, sometimes used in the plural, refer to data stored (usually encrypted) on a user’s local terminal by some web sites for identification and session tracking. — From Baidu Encyclopedia

III. Website domain-name labels (stored in a configuration store, such as a database), in the following format:

"www.baidu.com" -> "level1",
"www.taobao.com" -> "level2", 
"jd.com" -> "level3", 
"youku.com" -> "level4"

Explanation: this is our own ranking of websites; each domain has been labeled with a level. Of course, some domains may not be classified. There are far fewer of these entries than behavior records, so we can consider storing them in a traditional database. A small lookup sketch follows below.
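For illustration only (this Map and the fallback value are my own assumptions, not project code), loading these labels into a Scala Map and looking a domain up with a default for unclassified domains might look like:

// Minimal sketch (assumption, not project code): domain labels loaded into a Map,
// with a default value for domains that have not been classified.
val domainLabelMap = Map(
  "www.baidu.com" -> "level1",
  "www.taobao.com" -> "level2",
  "jd.com" -> "level3",
  "youku.com" -> "level4")
println(domainLabelMap.getOrElse("news.baidu.com", "-")) // unclassified domains fall back to "-" (value chosen arbitrarily here)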

4. Overall project process



In fact, the above is an ETL process. In the end we get two tables, a TrackerLog table and a TrackerSession table, both stored in HDFS; the storage format we use here is Parquet. TrackerLog corresponds to the original log table, and TrackerSession is the session information table produced after cutting.

We can roughly eyeball the ten valid lines of data above and cut them into sessions with the 30-minute rule: there are three sessions. If our program also cuts out three sessions, our hands-on result is correct.

In this tutorial, however, what we mainly implement is the middle part of the flow above: the session-cutting and session-generation logic. (A small sketch of the eyeball check above follows below.)
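To make the 30-minute rule and the eyeball check concrete, here is a minimal, self-contained Scala sketch (my own illustration, not the project's implementation; all names are made up) that counts sessions per cookie by starting a new session whenever two consecutive events are more than 30 minutes apart:

import java.text.SimpleDateFormat

// Illustrative sketch only: count sessions for one cookie by the 30-minute gap rule.
object SessionCountSketch {
  private val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  private val thirtyMinutesMs = 30 * 60 * 1000L

  def countSessions(sortedTimes: Seq[String]): Int = {
    if (sortedTimes.isEmpty) 0
    else {
      val ts = sortedTimes.map(fmt.parse(_).getTime)
      // one session to start with, plus one more for every gap longer than 30 minutes
      1 + ts.zip(ts.drop(1)).count { case (prev, curr) => curr - prev > thirtyMinutesMs }
    }
  }

  def main(args: Array[String]): Unit = {
    // cookie1's times from the sample log: a 12:00 burst and a 12:45 burst -> 2 sessions
    val cookie1 = Seq("2017-09-04 12:00:00", "2017-09-04 12:00:02", "2017-09-04 12:00:04",
      "2017-09-04 12:45:01", "2017-09-04 12:45:02", "2017-09-04 12:45:03")
    // cookie2's times: one 12:00 burst -> 1 session, giving 3 sessions in total
    val cookie2 = Seq("2017-09-04 12:00:01", "2017-09-04 12:00:02",
      "2017-09-04 12:00:03", "2017-09-04 12:00:04")
    println(countSessions(cookie1) + countSessions(cookie2)) // 3
  }
}

Running it prints 3, which matches the count we expect from the sample log.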

5. Final data structure

A. Determine the table fields

Fields in the TrackerLog table (ignore the data for now):



Fields in the TrackerSession table (ignore the data for now):



Field explanations

session_id: uniquely identifies a session

session_server_time: the session's creation time, i.e. the time of the earliest access in the session

landing_url: the first URL accessed, i.e. the URL of the earliest record belonging to the session

domain: the domain name

pageview_count and click_count: the number of pageview and click records in the session; the second table has 10 fields in total.

Since we are going to use the Parquet format for storage, we need to define our Schema, such as:

{"namespace": "com.shaonaiyi.spark.session"."type": "record"."name": "TrackerLog"."fields": [{"name": "log_type"."type": "string"},
     {"name": "log_server_time"."type": "string"},
     {"name": "cookie"."type": "string"},
     {"name": "ip"."type": "string"},
     {"name": "url"."type": "string"}}]Copy the code
{"namespace": "com.shaonaiyi.spark.session"."type": "record"."name": "TrackerSession"."fields": [{"name": "session_id"."type": "string"},
     {"name": "session_server_time"."type": "string"},
     {"name": "cookie"."type": "string"},
     {"name": "cookie_label"."type": "string"},
     {"name": "ip"."type": "string"},
     {"name": "landing_url"."type": "string"},
     {"name": "pageview_count"."type": "int"},
     {"name": "click_count"."type": "int"},
     {"name": "domain"."type": "string"},
     {"name": "domain_label"."type": "string"}}]Copy the code

Save the above schema information into two files: TrackerLog.avsc and TrackerSession.avsc. Now that we are ready, we can start building the project.

0x02 Programming implementation

1. Build the Maven project

There are many tutorials on this in my blog, so it will not be repeated here; I assume readers who have got this far are already comfortable with these basics.

Note:

1. The package name must be consistent with the namespace in the schema above, e.g. mine is com.shaonaiyi.spark.session. Of course, you can also skip creating the package for now and continue.

2. Preparation before coding

A. Introduce dependencies and plug-ins (the full pom.xml file is below)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.shaonaiyi</groupId>
    <artifactId>spark-sessioncut</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.8.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <testExcludes>
                        <testExclude>/src/test/**</testExclude>
                    </testExcludes>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.avro</groupId>
                <artifactId>avro-maven-plugin</artifactId>
                <version>1.7.7</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>schema</goal>
                        </goals>
                        <configuration>
                            <sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
                            <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind the jar merge to the package phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

B. Import the Avro schema files: put the TrackerLog.avsc and TrackerSession.avsc files into the project's avro folder (create it yourself, matching the src/main/avro path configured in the plug-in above).

C. Import the data source file and the Scala SDK

Copy the data source files (visit_log.txt, cookie_label.txt) into the project's data folder (create it yourself). During development we usually put a small portion of the test data into the project for testing, and once the test passes we change the path to the one on HDFS; that is the normal development flow. Also create the Scala source folder and import the Scala SDK. The current project structure is shown below for reference:

3. Obtain source data

A. Generate Java classes from the schema files (because the avro-maven-plugin has been added to Maven, just run Maven compile to generate the corresponding classes)



After compiling, you can see that the corresponding files have been generated and the package has been created automatically:



B. Create a utility class RawLogParserUtil:

package com.shaonaiyi.session

import com.shaonaiyi.spark.session.TrackerLog

/**
  * @Author: [email protected]
  * @Date: 2019/9/12 09:40
  * @Description: Parses each line of the raw log into a TrackerLog object
  */
object RawLogParserUtil {
  def parse(line: String): Option[TrackerLog] = {
    if (line.startsWith("#")) None   // skip the header line
    else {
      val fields = line.split("\\|")
      val trackerLog = new TrackerLog()
      trackerLog.setLogType(fields(0))
      trackerLog.setLogServerTime(fields(1))
      trackerLog.setCookie(fields(2))
      trackerLog.setIp(fields(3))
      trackerLog.setUrl(fields(4))
      Some(trackerLog)
    }
  }
}
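As a quick sanity check of the parser (a throwaway demo of my own, not part of the project), feed it the header line and one data line:

package com.shaonaiyi.session

// Illustrative demo only: the header line yields None, a data line yields Some(TrackerLog).
object RawLogParserUtilDemo {
  def main(args: Array[String]): Unit = {
    val header = "#type|server time|cookie|ip|url"
    val row = "pageview|2017-09-04 12:00:00|cookie1|127.0.0.3|https://www.baidu.com"
    println(RawLogParserUtil.parse(header)) // None
    println(RawLogParserUtil.parse(row))    // Some({"log_type": "pageview", ...})
  }
}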

C. Create a project entry class SessionCutETL:

package com.shaonaiyi.session

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @Author: [email protected]
  * @Date: 2019/9/12 10:09
  * @Description: Main program entry for session cutting
  */
object SessionCutETL {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    conf.setAppName("SessionCutETL")
    conf.setMaster("local")
    val sc = new SparkContext(conf)

    val rawRDD: RDD[String] = sc.textFile("data/rawdata/visit_log.txt")
    rawRDD.collect().foreach(println)

    sc.stop()
  }

}

After running it, you can see that the raw log data has been loaded:



PS: Hadoop is not configured, so this error can be ignored:

4. Parse log source data

A. Continue to add code to SessionCutETL to parse log source data

// (also add to the imports: import com.shaonaiyi.spark.session.TrackerLog)
val parsedLogRDD: RDD[Option[TrackerLog]] = rawRDD.map(line => RawLogParserUtil.parse(line))
parsedLogRDD.collect().foreach(println)

Running it now throws an error:



The error occurs because the TrackerLog class is not serializable; the fix is to make the class implement the Serializable interface by appending it to the end of the class declaration.
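As an aside, the 0xFF summary below mentions Spark's built-in Kryo serialization as a later optimization point. For reference only (this is an assumption of mine, not the approach taken in this step), registering the class with Kryo in the SparkConf might look like:

// Hedged sketch: Kryo registration as an alternative to implementing Serializable
// (the optimization mentioned in the 0xFF summary, not the approach used here).
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[TrackerLog]))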

B. Make the TrackerLog class implement Serializable:



C. Run it again to see the result; at this point our log parsing works:



D. Adjust the output format using the flatMap API

Check out the code above:

val parsedLogRDD: RDD[Option[TrackerLog]] = rawRDD.map( line => RawLogParserUtil.parse(line))

The type returned is RDD[Option[TrackerLog]], so the output still contains the None and Some(...) wrappers, which is not what we want; the result should be cleaner. So change this line to:

val parsedLogRDD: RDD[TrackerLog] = rawRDD.flatMap( line => RawLogParserUtil.parse(line))

The calculation results are as follows:



The difference between map and flatMap should be clear here. In short, map is one-to-one: each input line produces exactly one output element, so every line becomes an Option (either None or Some(TrackerLog)). flatMap, on the other hand, flattens the container returned for each line: Some(trackerLog) is unwrapped to the TrackerLog itself and None produces nothing, so only the valid records remain.
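As a small illustration (my own addition, not part of the project code), applying both operators to two sample lines shows the difference:

// Illustration only: map keeps the Option wrappers, flatMap unwraps them and drops None.
val sample = Seq(
  "#type|server time|cookie|ip|url",
  "pageview|2017-09-04 12:00:00|cookie1|127.0.0.3|https://www.baidu.com")
sample.map(RawLogParserUtil.parse)     // List(None, Some(TrackerLog(...)))
sample.flatMap(RawLogParserUtil.parse) // List(TrackerLog(...)) -- header dropped, record unwrapped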

5. Clean logs

A. Description of invalid logs

Looking carefully at the earlier log, the last two records have the log types hhhh and 3333ss, which are obviously not valid, so we need to filter them out. In real logs the dirty data may look different, but the handling is similar.

hhhh|2017-09-04 12:45:03|cookie1|127.0.0.3|https://tieba.baidu.com/index.html
3333ss|2017-09-04 12:45:03|cookie1|127.0.0.3|https://tieba.baidu.com/index.html

B. Define the set of valid log types (at the same level as the main method)

private val logTypeSet = Set("pageview", "click")



C. Modify the parsing code and add filtering conditions

val parsedLogRDD: RDD[TrackerLog] = rawRDD.flatMap(RawLogParserUtil.parse(_))
  .filter(trackerLog => logTypeSet.contains(trackerLog.getLogType.toString))

D. Run it again; the two dirty records no longer appear.
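A quick sanity check (my own addition, not in the original code): the sample file contains ten valid records, so counting the cleaned RDD should print 10.

println(parsedLogRDD.count()) // expected: 10 valid records after filtering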

0x03 Thought Review

1. First of all, we need to be clear that the big project is really website user behavior analysis, and session cutting is just a small part of it. As for what results behavior analysis can deliver, we can list a few examples here: by analyzing what content a person clicks on the site, what they open, how long they stay, and so on, we can make a preliminary judgment about whether that person is particularly interested in that content, and with that conclusion we can do personalized recommendation for them, and so on; of course this is only a small example. We started from the project development process, but in practice we should first know what information and data we can get, and then, with that information and data, mine the functionality we want. For example, suppose there is a product and you need to find, on your website or other websites, the people who are especially interested in products of the same type; you only need to pick them out and you can recommend your product to them. People with a high match are your prime customers, and selling the product to them will certainly go more smoothly. So we work in that direction, and the goal is to screen out the people who match your product very well.

2. In this process we can obtain the data we need in various ways; of course, the data may also already exist. You may only decide, after fixing the goal, whether you need it, what data you want and how to obtain it, and then where to store it. Data like these web page events is bound to be very large: every click, open, scroll and so on is a record, and each person can generate many, many records. Large volumes we can store in HDFS; small volumes we can store elsewhere. This is not fixed and can be chosen flexibly.

3. This tutorial does not cover collecting the data; readers who have followed this blog should already know how the data is obtained, for example with Flume, among many other ways. We skip those steps here. Once the analysis is settled, we can code it up, because the stored data is just text files. What we need to do first is clean the data and filter out illegal records. But filter on what conditions? With traditional databases such as MySQL we can use WHERE to express the filter; for example, to retrieve students older than 18, age is the key field. In our text files, however, no fields have really been defined; only the first line gives the schema information. Without splitting into fields, every row would have to be handled ad hoc, which is obviously cumbersome. So we split the text uniformly into columns and decide which column means what and corresponds to which field. Each field forms a vertical column, and obviously we can build an object for it: a class corresponds to a row of data, and each field in the row corresponds to a property of the object. Class -> row, property -> field, so we need to define a class first.

4. Our Spark job needs to serialize data before transferring it, so we need to make our class serializable; the serialization interface in Java is Serializable. After that we can load our data source and clean and filter it.

0xFF Summary

  1. In fact, Spark has a built-in Kryo serialization interface, which performs better and suits our current scenario well; this optimization point will be applied later in the project.
  2. During testing, a lot of log messages are printed at startup, which gets in the way of viewing the output. We can suppress them during development (see the one-line sketch after this list); this will also be covered in a later tutorial.
  3. As the saying goes, "one minute on stage takes ten years of practice off stage." Writing a tutorial is the opposite: half a minute of operation, half a day of thinking, a whole day of writing. That is half a minute off stage and ten years on stage! Thank you, readers, for your support; please like, follow, and comment. Keep going.
  4. Website User Behavior Analysis Project series: Website User Behavior Analysis Project — Session Cutting (1), Session Cutting (2), Session Cutting (3)
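Regarding point 2 above, a minimal way to quiet the startup logs during local development (an assumption of mine, not necessarily the approach of the later tutorial) is a single call right after creating the SparkContext:

sc.setLogLevel("WARN") // assumed approach: suppress INFO-level log noise during local development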

About the author: Shao Naiyi, full-stack engineer, market insight | column editor | WeChat official account | Weibo | CSDN | Jianshu

Bonus: Shao Naiyi's technical blog navigation. Original content takes effort; if you reproduce it, please credit the source.