Hands-on: Building an Excellent E-commerce Search Engine (1) – Category Prediction

Introduction

E-commerce plays a pivotal role in our lives today, and search is the main traffic entrance of an e-commerce system: the search experience is crucial to the whole product. Optimizing search so that users find what they want better and faster has become a required course for any mature e-commerce system.

The essence of search

What is a good e-commerce search engine?

At its core, an excellent e-commerce search engine understands the user's needs, helps the user quickly find the goods they want, and closes the deal.

So how do search engines understand users’ intentions? There are three common methods:

1. Rule-based approach using dictionaries and templates

Manually map users' search terms to categories; for example, when a user searches for "iPhone", manually assign the query to the "mobile phone" category. This method is accurate and works very well for popular goods and special nouns, but as users and keywords grow, especially long-tail words, the queries become too numerous and complex for manual labeling to keep up.
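A minimal sketch of this rule-based lookup (the dictionary entries here are made up for illustration):

```python
# A hand-maintained dictionary mapping popular search terms to categories.
# The entries are illustrative, not from a real system.
CATEGORY_DICT = {
    "iphone": "mobile phone",
    "nike": "sports shoes",
}

def predict_by_dict(query):
    """Return the manually assigned category, or None for unknown terms."""
    return CATEGORY_DICT.get(query.strip().lower())

print(predict_by_dict("iPhone"))  # -> mobile phone
print(predict_by_dict("iphoe"))   # -> None: typos and long-tail words fall through
```

The fall-through case is exactly the weakness described above: anything not in the dictionary gets no category.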

2. Statistical methods based on user behavior

Use statistics over user behavior: for example, when users search for "apple", most click on mobile phones and a few click on fruit, so we can rank the categories for "apple" as mobile phone > fruit. This method relies on accumulated user behavior and, like method 1, struggles with complex queries. This is especially true in Chinese, where users often scramble word order in ways that don't hurt human understanding but do defeat simple statistics.
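The click-statistics idea can be sketched like this, using a tiny made-up click log:

```python
from collections import Counter, defaultdict

# Hypothetical click log: (search term, category of the clicked item).
click_log = [
    ("apple", "mobile phone"),
    ("apple", "mobile phone"),
    ("apple", "fruit"),
    ("apple", "mobile phone"),
]

# Count clicks per category for each search term.
stats = defaultdict(Counter)
for term, category in click_log:
    stats[term][category] += 1

# Rank categories for "apple" by click count: mobile phone > fruit.
ranking = [cat for cat, _ in stats["apple"].most_common()]
print(ranking)  # -> ['mobile phone', 'fruit']
```

In a real system the log would come from search and click tracking, and the counts would be smoothed and decayed over time.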

3. Distinguish the user’s intention based on machine learning model

With the development of artificial intelligence, Natural Language Processing (NLP) has become one of the most important components of the AI era. Here, through machine learning and deep learning, we train on a labeled corpus from the target domain to obtain an intent recognition model. Given new input, the model quickly predicts the corresponding category along with a confidence score. One advantage of this approach is that model accuracy improves as the corpus grows richer.

Today we’re going to focus on the third approach.

Search term processing steps

When an e-commerce search engine receives a user's search keywords, it generally performs the following processing steps:

1. Text normalization

Common operations are as follows:

(1) Remove stop words, such as special symbols and punctuation marks carelessly entered by users

(2) Unify case, e.g. nike/Nike, iphone xr/iPhone XR

(3) Normalize character variants, e.g. full-width vs. half-width forms (ｉＰｈｏｎｅ/iPhone, Adidas/adidas)

2. Text correction, such as iphoe => iPhone

3. Word segmentation (分词), e.g. "Men's sports hoodie Li Ning" => "Men's / sports / hoodie / Li Ning"

4. Intent recognition / head-word recognition, e.g.

For "Men's sports hoodie Li Ning", segmentation result: "Men's / sports / hoodie / Li Ning"

Recognition results:

Audience: men; Category word: hoodie; Modifier: sports; Brand: Li Ning

5. Category prediction/text classification, e.g.

Men’s sportswear hoodie Li Ning => sportswear

Pajama woman autumn/winter => lingerie/home wear
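Of these steps, step 1 (text normalization) is easy to sketch in a few lines; the stop-character pattern and examples below are placeholders, not a production rule set:

```python
import re
import unicodedata

# Placeholder set of symbols to strip; a real system would use a curated stop list.
STOP_CHARS_RE = re.compile(r"[!?,.;:~@#$%^&*()\[\]{}<>|/\\\"']+")

def normalize(query):
    # (3) Normalize character variants, e.g. full-width letters to half-width (NFKC).
    query = unicodedata.normalize("NFKC", query)
    # (1) Strip stray punctuation and special symbols.
    query = STOP_CHARS_RE.sub(" ", query)
    # (2) Unify case and collapse whitespace.
    return " ".join(query.lower().split())

print(normalize("ｉＰｈｏｎｅ ＸＲ!!"))  # -> 'iphone xr'
```

Correction (step 2) and segmentation (step 3) need dictionaries and models, so they are covered separately below.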

As one of the most classic application scenarios in NLP, text classification has accumulated many implementation methods, such as Facebook's open source fastText, convolutional neural networks (CNN), recurrent neural networks (RNN), etc. Here we focus on text classification based on deep learning.

We won’t go into the details of the principles, but today we will mainly use FastText to practice category prediction.

Let’s do it!

FastText category prediction practice

System architecture

Data preparation

So let’s start collecting data.

As we all know, one of the major difficulties in machine learning is collecting a large number of labeled samples, and it is not easy to ensure that samples are easy to process, updated in time, and comprehensive in coverage.

Since the main source of goods in our system is Taobao, I'll take Taobao goods as the example: the categories use Taobao's first-level categories, and the text data uses Taobao product titles.

Something like this:

category: Beauty care/body/essential oil
title: Genuine Skin Care Lancome Moisturizer Soothing Moisturizer 50ml Medium sample Super Moisturizer Gel Gel

We need a word segmentation tool. There are many options, from third-party services such as Ali Cloud and Tencent Cloud to open source tools such as jieba ("stutter") segmentation. Here we use jieba to build an HTTP word segmentation interface.

$ npm init
$ npm install nodejieba --save
$ vim fenci.js

fenci.js contains the following:


"use strict";

(function () {
    const queryString = require('querystring'),
        http = require('http'),
        nodejieba = require('nodejieba')

    const port = process.env.PORT || 8800
    const host = process.env.HOST || '0.0.0.0'

    const requestHandler = (request, response) => {
        let query = request.url.split('?')[1]
        let queryObj = queryString.parse(query)
        let result = nodejieba.cut(queryObj.text, true)
        let res = {
            rs: result
        }
        response.setHeader('Content-Type', 'application/json')
        response.end(JSON.stringify(res, null, 2))
    }

    const server = http.createServer(requestHandler)
    server.listen(port, host, (error) => {
        if (error) {
            console.error(error)
        }
        console.log(`server is listening on ${port}`)
    })
}).call(this)

Copy the code

Running with PM2:

$ pm2 start fenci.js


Verify the word segmentation interface (note that the parameter must be URL-encoded once):

$ curl 'http://127.0.0.1:8800/?text=%E4%BF%9D%E6%9A%96%E5%86%85%E8%A1%A3'


If the following output is displayed, the interface is working properly:

{
  "rs": [
    "Warm",
    "Underwear"
  ]
}

The title of the product looked something like this:

Authentic Skin Care Lancome Moisturizer Moisturizer 50 ml Medium sample Super Moisturizer Gel Gel

If the segmentation is not accurate, we can collect brand words and terms commonly used in e-commerce, save them in jieba's user-defined dictionary (the user.dict.utf8 file), and restart the segmentation service.
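For reference, a jieba-style user dictionary is a plain UTF-8 text file with one entry per line: the word, optionally followed by a frequency and/or part-of-speech tag. The entries below are placeholders; check the exact columns supported by your jieba build:

```
Lancome
hoodie 5 n
nodejieba nz
```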

Text sample library

First, create a sample table in MySQL:

CREATE TABLE `tb_text_train` (
    `item_id` bigint(20) NOT NULL COMMENT 'Taobao commodity ID',
    `title` varchar(255) DEFAULT NULL COMMENT 'Taobao product title',
    `level_one_category_name` varchar(255) DEFAULT NULL COMMENT 'Taobao commodity category',
    `title_split` varchar(1000) DEFAULT NULL COMMENT 'Title after segmentation',
    `done` tinyint(1) DEFAULT '0' COMMENT '1 = sample processed, 0 = unprocessed',
    `updatetime` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'Update time',
    PRIMARY KEY (`item_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


In this way, a scheduled script can periodically process the product titles in the library (removing stop words, segmenting) and sync them into the sample library. You can also build an admin backend and process the annotations manually there.

Preparing the training text

After the text is annotated, we can use SQL to export the annotated data in the desired format into a data.txt file that looks like this:

__label__ Traditional nourishing nutrients Fu Yi Zhi Wet cream Fu Yi Wet tea Zhi Wet tea
__label__ Men's autumn pants men's casual pants Korean fashion students loose leg pants versatile sports pants men's nine minutes pants
__label__ Residential furniture marble surface kung fu tea tea table simple modern Chinese style multi-functional tea making one table living room office tea
__label__ Dress accessories/belt/hat/scarf bow tie basin hat summer sun protection fisherman's hat go on a holiday at the seaside day series literary women's straw hat
__label__ Women's shoes autumn and winter leather warm bean shoes flat bottom ladies' shoes large size mother's shoes pregnant women's white nurse's cotton-padded shoes women's
__label__ Household use ultrasonic mosquito repellent household mosquito repellent intelligent electronic mosquito repellent indoor rodent repellent cockroach
__label__ Women's underwear/men's underwear/home wear men's boxer underwear pure cotton middle-aged and old dad boxers full cotton shorts loose old man fat increase pants

Format description — one sample per line:

__label__<category name> <processed title string>

(In fastText's native format the category name is joined directly to the __label__ prefix, with no space.)
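Building such a line from a labeled sample can be sketched as follows; to_fasttext_line is our own hypothetical helper. Since fastText treats everything up to the first whitespace as the label, any spaces inside the category name must be replaced:

```python
def to_fasttext_line(category, segmented_title):
    # fastText reads the label as a single whitespace-delimited token,
    # so spaces in the category are replaced with underscores.
    label = category.replace(" ", "_")
    return f"__label__{label} {segmented_title}"

line = to_fasttext_line("sportswear", "men hoodie Li Ning")
print(line)  # -> __label__sportswear men hoodie Li Ning
```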

Then we split the whole data set into two parts, 90% as the training set and 10% as the test set. The code is as follows:

import pandas as pd
import numpy as np

# Storage path of the corpus data set
data_path = "/usr/local/webdata/fastText/"

# Read the corpus data set text file
train = pd.read_csv(data_path + "data.txt", header=0, sep='\r\n', engine='python')
ts = train.shape

# Shuffle the rows before splitting into train.txt and test.txt
df = pd.DataFrame(train)
new_train = df.reindex(np.random.permutation(df.index))

# The ratio of training set to test set is 9:1
indice_90_percent = int((ts[0] / 100.0) * 90)

# Split into 2 files
new_train[:indice_90_percent].to_csv(data_path + 'train.txt', index=False)
new_train[indice_90_percent:].to_csv(data_path + 'test.txt', index=False)

Use fastText for training

1. Install fastText:

fastText installation is very simple:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
$ pip install .

2. fastText training:

Run the training from the command line:

$ ./fasttext supervised -input train.txt -output model -label __label__ -epoch 50 -wordNgrams 3 -dim 100 -lr 0.5 -loss hs

Or write a Python program calling the fasttext module:

import fasttext
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=100, wordNgrams=3, dim=100, loss='hs')

After training, the model file model.bin and the word vector file model.vec will be generated.

3. Validate against the test set

Training is done; now let's check it against the test set — time to see whether this mule can actually run.

$ ./fasttext test model.bin test.txt

The following results occur:

N   1892209
P@1 0.982
R@1 0.982


Well, it looks good.
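For clarity, P@1 is precision at one: the share of test samples whose top prediction equals the true label (with exactly one label per sample, R@1 works out to the same number, as in the output above). A toy illustration with made-up labels:

```python
def precision_at_1(true_labels, predicted_top1):
    """Fraction of samples whose top-1 prediction matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_top1))
    return hits / len(true_labels)

# Toy data, not real test-set output:
truth = ["sportswear", "lingerie", "furniture", "sportswear"]
top1  = ["sportswear", "lingerie", "sportswear", "sportswear"]
print(precision_at_1(truth, top1))  # -> 0.75
```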

Let's try it with an actual user search term, such as "thermal underwear". After segmentation it becomes the tokens "thermal underwear", which we submit to fastText for prediction like this:

$ echo 'Thermal underwear' | ./fasttext predict model.bin -


Wait a moment and you get something like this:

__label__ Women's underwear/men's underwear/home wear

Bingo, the prediction was perfect!

The model.bin file generated by default is large, and you can use the quantize command to compress the model file:

$ ./fasttext quantize -output model

The model.ftz file is significantly smaller, and the result is quite impressive: in my case it shrank from 4.7G to 616M.

4.7G    model.bin
616M    model.ftz
60K     model.o
6.1G    model.vec

Usage is the same as with model.bin, for example:

$ ./fasttext test model.ftz test.txt
$ echo 'Thermal underwear' | ./fasttext predict model.ftz -

4. Provide Web services

To make the prediction service easier to consume, let's wrap the prediction function in an HTTP interface as well, again using Node.js.

$ npm install fasttext.js --save
$ vim predict.js


predict.js is as follows:



"use strict";

(function () {

    const queryString = require('querystring'),
        FastText = require('fasttext.js'),
        http = require('http')

    const port = process.env.PORT || 8801
    const host = process.env.HOST || '0.0.0.0'

    const fastText = new FastText({
        loadModel: '/usr/local/webdata/fastText/model.ftz'
    })

    const requestHandler = (request, response) => {
        let query = request.url.split('?')[1]
        let queryObj = queryString.parse(query)

        fastText.predict(queryObj.text)
            .then(labels => {
                let res = {
                    predict: labels
                }
                response.setHeader('Content-Type', 'application/json')
                response.end(JSON.stringify(res, null, 2))
            })
            .catch(error => {
                console.error("predict error", error)
            })
    }

    const server = http.createServer(requestHandler)

    fastText.load()
    .then(done => {
        console.log("model loaded")
        server.listen(port, host, (error) => {
            if (error) {
                console.error(error)
            }
            console.log(`server is listening on ${port}`)
        })
    })
    .catch(error => {
        console.error("load error", error)
    });

}).call(this)


Again, we use PM2 to start the service:

$ pm2 start predict.js

Verify the effect (note: run the text through the word segmentation interface above first, then URL-encode the result):

$ curl 'http://127.0.0.1:8801/?text=%E4%BF%9D%E6%9A%96%20%E5%86%85%E8%A1%A3'

The result looks like this:

{
  "predict": [
    {
      "label": "Women's underwear/Men's underwear/Home wear",
      "score": "1.00005"
    },
    {
      "label": "Children's/baby/parent-child wear",
      "score": "0.0543005"
    }
  ]
}

The results look good: picking the label with the highest score basically meets our needs.
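A small sketch of that selection logic, using the response shape returned by our predict service (the threshold value is our own choice, not something fastText mandates):

```python
def top_category(predict_response, threshold=0.5):
    """Return the highest-scoring label, or None if nothing clears the threshold."""
    labels = predict_response.get("predict", [])
    if not labels:
        return None
    best = max(labels, key=lambda x: float(x["score"]))
    return best["label"] if float(best["score"]) >= threshold else None

response = {
    "predict": [
        {"label": "Women's underwear/Men's underwear/Home wear", "score": "1.00005"},
        {"label": "Children's/baby/parent-child wear", "score": "0.0543005"},
    ]
}
print(top_category(response))  # -> Women's underwear/Men's underwear/Home wear
```

Returning None below the threshold lets the search engine fall back to an unfiltered query rather than trust a low-confidence category.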

Conclusion

With the simple steps above, we have successfully built a Taobao commodity category prediction pipeline, including a word segmentation service and a classification/prediction service. Integrating these services into our search will make results more accurate and our users happier.

See my GitHub for the code.

References

fastText
fasttext.js
jieba Chinese word segmentation