User profiles are the foundation for optimizing ad delivery and achieving precision marketing. Demographic labels such as gender and age are the most basic information in a user profile. So how can we assign these labels as accurately as possible?

This is where machine learning comes in. Using the gender label as an example, this article describes how we built and optimized a machine learning model for predicting demographic attribute labels.

Gender label prediction process

In general, unsupervised learning not only struggles to learn useful information, but its results are also hard to evaluate. So whenever possible, we try to turn the problem into a supervised learning problem.

The same is true for gender labeling. We can combine trusted gender sample data with useful information extracted from the raw data collected by TalkingData to turn the production of gender labels into a supervised machine learning task. More specifically, we label male and female as 1 and 0 (the label, often called the Y value), so gender labeling becomes a binary classification task.

The production flow for the gender label is as follows:

  • The input is a sample of data with reliable gender information, together with useful features extracted from recently active raw data.
  • Joining the two yields a data set that can be used directly for modeling.
  • The gender prediction model is built on this data set.
  • The model is then used to score the entire population, giving every sample a gender score. At this point, the modeling work is essentially complete.
  • The final step is to determine the threshold and output the male/female label. Here we do not rely on the model to determine the threshold. Instead, we rely on a reliable third-party tool, so that as many samples as possible are recalled at the expected precision.
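The last step, choosing a threshold that recalls as many samples as possible at an expected precision, can be sketched as follows. This is only an illustration: the article relies on a third-party tool for this step, whereas the sketch assumes a small labeled validation set is available as a stand-in. The function name `pick_threshold` and the toy scores are hypothetical.

```python
def pick_threshold(scores, labels, min_precision):
    """Return (threshold, precision, recall) for the lowest score cut-off whose
    precision is at least min_precision, or None if no cut-off qualifies.
    A sample is predicted male (1) when its score >= threshold."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    # Sort by score, highest first, then sweep the cut-off downwards.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Only cut between distinct scores so tied samples stay on the same side.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        precision = tp / i
        if precision >= min_precision:
            # Recall never decreases as the cut-off drops, so the last
            # qualifying cut-off is the one with the highest recall.
            best = (score, precision, tp / total_pos)
    return best


# Toy validation data (assumed): model scores and trusted labels (1 = male).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
result = pick_threshold(scores, labels, min_precision=0.8)
```

The female label could be handled the same way by running the sweep on `1 - score` with the labels flipped, which gives a second threshold at the low end of the score range.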

In addition, since TalkingData’s data volume exceeds one billion, we prefer distributed computation on Spark to speed up label production, and fall back to a single machine only when we have to.

Version iteration of features and model methods

To improve model performance, we iterated on the gender label prediction model several times.

01 Gender prediction model V1

The features initially used in the model cover four dimensions: device application information, package names of applications embedding the SDK, in-app custom event logs from applications embedding the SDK, and device model information.

We used XGBoost (version 0.5) to train a separate model on the features of each dimension, which gave four submodels. Each submodel outputs a score between 0 and 1 for the device’s male/female tendency based on its own feature dimension. A high score indicates a male tendency, and a low score indicates a female tendency. A code example of the model follows:

```scala
import com.talkingdata.utils.LibSVM
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark.XGBoost // version 0.5

// train stage
// NOTE: the arguments after `sc` are truncated in the source; `trainPath` is an assumed name
val trainRDD = LibSVM.loadLibSVMFile(sc, trainPath) // sc is the SparkContext
val model = XGBoost.train(trainRDD, paramMap, numRound, nWorkers = workers)

// predict stage
val testSet = LibSVM.loadLibSVMFilePred(sc, testPath, -1, sc.defaultMinPartitions)
val pred = testSet.map(_._2).mapPartitions { iter =>
  // score one partition at a time and keep the first output value per row
  model.value.predict(new DMatrix(iter)).map(_.head).toIterator
}.zip(testSet).map { case (pred, (tdid, feature)) =>
  s"$tdid\t$pred" // device id and its score, tab-separated
}
```
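The article does not say how the four submodel scores are combined into one score. As a purely illustrative sketch, one simple option is to average the scores of the submodels that actually cover a device. The function `fuse_scores` and the dimension keys below are hypothetical, and the averaging rule is an assumption, not the authors' method.

```python
def fuse_scores(sub_scores):
    """Average the submodel scores that are present.
    A value of None means the device has no features in that dimension."""
    present = [s for s in sub_scores.values() if s is not None]
    if not present:
        return None  # no submodel covers this device
    return sum(present) / len(present)


# Hypothetical device: three of the four dimensions have coverage.
device_scores = {"apps": 0.75, "package": 0.5, "event_log": None, "model": 0.25}
fused = fuse_scores(device_scores)
```

Even in this simple form, every prediction needs four models to be trained, stored and run, which illustrates why a single model is easier to operate.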

Disadvantages and optimization direction:

  • The model is a fusion of four submodels, so its structure is complex and it runs inefficiently. We therefore consider replacing it with a single model.
  • The custom event log features have low coverage and their ETL processing consumes a lot of resources, so the contribution of this field to the model needs to be reevaluated.
  • The device name field appears to carry a male/female signal, because some users name their device after their own name or nickname. For example, device names containing “brother” or “jun” tend to be male, and those containing “sister” or “lan” tend to be female. We need to verify the effect and decide whether to add this field.
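One quick way to check whether device-name tokens carry a gender signal is to compute, for each token, the share of male devices among the labeled devices that contain it. The sketch below assumes device names have already been split into tokens. The function name, the `min_count` cut-off and the toy samples are all hypothetical.

```python
from collections import Counter


def token_male_ratio(samples, min_count=2):
    """samples: iterable of (tokens, label), with label 1 = male and 0 = female.
    Returns {token: share of male devices among devices containing the token},
    keeping only tokens seen on at least min_count devices."""
    seen = Counter()
    male = Counter()
    for tokens, label in samples:
        for tok in set(tokens):  # count each token once per device
            seen[tok] += 1
            male[tok] += label
    return {tok: male[tok] / n for tok, n in seen.items() if n >= min_count}


# Toy labeled samples (assumed), already tokenized.
samples = [
    (["jun", "iphone"], 1),
    (["brother", "phone"], 1),
    (["jun", "phone"], 1),
    (["sister", "iphone"], 0),
    (["lan", "iphone"], 0),
    (["sister", "phone"], 0),
]
ratios = token_male_ratio(samples)
```

Tokens whose ratio is far from the overall male share are candidates for useful features, while tokens with a ratio close to it add little.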

02 Gender prediction model V2

This version uses four feature dimensions, adjusted as follows: package names of applications embedding the SDK, AppKeys of applications embedding the SDK, device model information, and device name.

Of these, the package names and device names are first split into words (word segmentation). CountVectorizer then turns all four kinds of features into sparse vectors, and ChiSqSelector is used for feature selection.
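To make the two preprocessing steps concrete, here is a small single-machine sketch of count vectorization followed by chi-square feature selection. It is not Spark's implementation. As a simplifying assumption, the chi-square test is computed on token presence versus the label, and all function names and toy data are hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def count_vectorize(doc, vocab):
    """Sparse count vector as {column index: count}. Unknown tokens are dropped."""
    counts = Counter(t for t in doc if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}


def chi_square(vectors, labels, col):
    """Chi-square statistic of the 2x2 table (token present?) x (label)."""
    n = len(labels)
    a = sum(1 for v, y in zip(vectors, labels) if col in v and y == 1)
    b = sum(1 for v, y in zip(vectors, labels) if col in v and y == 0)
    c = sum(1 for v, y in zip(vectors, labels) if col not in v and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(vectors, labels, vocab, k):
    """Keep the k tokens with the highest chi-square statistic."""
    ranked = sorted(vocab, key=lambda t: chi_square(vectors, labels, vocab[t]),
                    reverse=True)
    return ranked[:k]


# Toy tokenized package names (assumed) with labels 1 = male, 0 = female.
docs = [["com", "game", "shooter"], ["com", "game"],
        ["com", "beauty"], ["com", "beauty", "camera"]]
labels = [1, 1, 0, 0]
vocab = fit_vocabulary(docs)
vectors = [count_vectorize(d, vocab) for d in docs]
top_tokens = select_top_k(vectors, labels, vocab, k=2)
```

In this toy data the token `com` appears in every sample, so it scores zero and is filtered out, while `game` and `beauty` separate the two labels perfectly and are kept.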

The model adopts LR (Logistic Regression), and the code example is as follows:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

// train stage
// The input columns are already vector columns, produced by word segmentation,
// CountVectorizer and ChiSqSelector
val transformedDF = spark.read.parquet("/traindata/path")
val featureCols = Array("packageName", "appKey", "model", "deviceName")
val vectorizer = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(vectorizer, lr))
val model = pipeline.fit(transformedDF)

// predict stage
// Same preprocessing as in training: vector columns produced by word segmentation,
// CountVectorizer and ChiSqSelector
val transformedPredictionDF = spark.read.parquet("/predictData/path")
val predictionDF = model.transform(transformedPredictionDF)
```

Advantages and improvements:

  • With a single model, we can evaluate it using common model evaluation metrics (such as ROC-AUC and Precision-Recall) and use it as the baseline for later iterations, which makes it easy to compare improvements between versions from a model perspective.
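For reference, ROC-AUC can be computed directly from scores and labels as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The pairwise version below is a minimal sketch for small data sets, not a production implementation. The function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the positive
    sample has the higher score. Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy model scores and trusted labels (assumed), 1 = male.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

Because the metric depends only on the ranking of scores, it can be compared across model versions without first fixing a threshold.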

Disadvantages and optimization direction:

  • The LR model is relatively simple and has limited learning capacity. It should be replaced with a more powerful model such as XGBoost.

03 Gender prediction model V3

In addition to the four dimensions of the previous version (package names of applications embedding the SDK, AppKeys of applications embedding the SDK, device model information, and device name), this version adds recently aggregated device application information as a feature. The processing is similar to the previous version, so it is not repeated here.

The model is changed from LR to XGBoost (version 0.82). A code example follows:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier // version 0.82

// The input columns are already vector columns, produced by word segmentation
// and CountVectorizer
val transformedDF = spark.read.parquet("/trainData/path")
val featureCols = Array("packageName", "appKey", "model", "deviceName")
val vectorizer = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val assembledDF = vectorizer.transform(transformedDF)

// train stage
// xgboost parameters setting (xxx marks values not given in the source)
val xgbParam = Map(
  "eta" -> xxx,
  "max_depth" -> xxx,
  "objective" -> "binary:logistic",
  "num_round" -> xxx,
  "num_workers" -> xxx)
val xgbClassifier = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("labelColname")
val model = xgbClassifier.fit(assembledDF)

// predict stage
// Same preprocessing as in training: vector columns produced by word segmentation
// and CountVectorizer
val transformedPredictionDF = spark.read.parquet("/predictData/path")
val assembledPredictionDF = vectorizer.transform(transformedPredictionDF)
val predictions = model.transform(assembledPredictionDF)
```

Advantages and improvements:

  • The AUC increased by 6.5% compared with the previous version, and the recall in the final production of gender labels increased by 26%. Given TalkingData’s data volume of more than one billion, this is a substantial gain.

04 Gender prediction model V4

In addition to the five feature dimensions of the previous version, this version adds three ad category dimensions of TalkingData’s own. Although the coverage of the ad category features is only 20%, they contribute significantly to the recall of the final label.

The model was changed from XGBoost to a DNN, with the maximum number of training epochs set to 40 and early stopping enabled. Since neural networks benefit from large amounts of data, we doubled the training sample size to make sure the network could learn well.
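The combination of a maximum of 40 epochs and early stopping can be sketched as a plain training loop. The article does not give the stopping rule, so the `patience` value and the use of validation loss as the monitored metric are assumptions here. `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation loss.

```python
def train_with_early_stopping(run_epoch, max_epochs=40, patience=3):
    """Call run_epoch(epoch) for up to max_epochs epochs. Stop early once the
    validation loss has not improved for `patience` consecutive epochs.
    Returns (best_epoch, best_loss, last_epoch_run)."""
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch in range(1, max_epochs + 1):
        last_epoch = epoch
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, last_epoch


# Toy validation losses (assumed): improvement stops after epoch 3.
toy_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stopped = train_with_early_stopping(lambda epoch: toy_losses[epoch - 1])
```

In practice the model weights from the best epoch would be saved and restored, so that the epochs run after the best one do not affect the final model.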

The structure of the DNN is as follows:

```python
GenderNet_VLen(
  (embeddings_appKey): Embedding(xxx, 64, padding_idx=0)
  (embeddings_packageName): Embedding(xxx, 32, padding_idx=0)
  (embeddings_model): Embedding(xxx, 32, padding_idx=0)
  (embeddings_app): Embedding(xxx, 512, padding_idx=0)
  (embeddings_deviceName): Embedding(xxx, 32, padding_idx=0)
  (embeddings_adt1): Embedding(xxx, 16, padding_idx=0)
  (embeddings_adt2): Embedding(xxx, 16, padding_idx=0)
  (embeddings_adt3): Embedding(xxx, 16, padding_idx=0)
  (fc): Sequential(
    (0): Linear(in_features=720, out_features=64, bias=True)
    (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.6)
    (4): Linear(in_features=64, out_features=32, bias=True)
    (5): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.6)
    (8): Linear(in_features=32, out_features=16, bias=True)
    (9): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.6)
    (12): Linear(in_features=16, out_features=2, bias=True)
  )
)
```
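The input width of the first fully connected layer (720) equals the sum of the eight embedding sizes, which suggests that each variable-length field is reduced to one vector and the eight vectors are concatenated. The sketch below checks that arithmetic and shows one possible pooling step. The sum pooling that skips the padding index is an assumption, since the article does not say how the fields are pooled, and `pool_field` and the toy table are hypothetical.

```python
# Embedding sizes taken from the structure printed above.
embedding_dims = {
    "appKey": 64, "packageName": 32, "model": 32, "app": 512,
    "deviceName": 32, "adt1": 16, "adt2": 16, "adt3": 16,
}
fc_in_features = sum(embedding_dims.values())  # width after concatenation


def pool_field(token_ids, table, pad_id=0):
    """Sum the embedding rows of the non-padding ids of one field.
    Assumed pooling: the source does not state which pooling is used."""
    pooled = [0.0] * len(table[0])
    for tid in token_ids:
        if tid == pad_id:
            continue  # padding positions contribute nothing
        for j, value in enumerate(table[tid]):
            pooled[j] += value
    return pooled


# Toy 2-dimensional embedding table; row 0 is the padding row.
toy_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
pooled = pool_field([1, 2, 0, 0], toy_table)  # ids padded to length 4
```

Whatever pooling is used, each field ends up as a fixed-size vector, so devices with different numbers of apps or name tokens can share the same fully connected layers.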

Advantages and improvements:

  • The AUC improved by only 1.5% compared with the previous version, but the recall in the final production of gender labels increased by 13%, which is a good result given the data volume and the number of existing labels. This shows that, when verifying the effect of a new version, we should not rely on the model’s AUC alone, because it does not measure the improvement accurately enough. We should verify the improvement of the final, real metric, which for gender label prediction is the number of samples recalled at the expected precision. However, AUC and other model-level metrics are easy to compute, so we can still use them during optimization to quickly check the effect of controlled-variable experiments.

Model exploration tips

Extracting fields from the raw logs and aggregating them requires many ETL steps and involves many optimizations. This work is handled by a dedicated ETL team and is not covered here.

The modeling team can directly use the fields that have already been aggregated over time for modeling tasks. Even so, ETL and feature generation still account for most of the time spent on model optimization and iteration.

Below are two pitfalls we ran into during optimization and how we solved them. We hope they are a useful reference.

1. For gender label prediction, most of the input features are of Array type, such as the recently collected device application information. For fields of this type, we normally call CountVectorizer before training to convert the Array into a Vector and then use it as model input. However, CountVectorizer is a time-consuming step, which makes it hard to experiment quickly during version iteration.

To solve this problem, we can do the conversion in advance and store the generated Vector columns as well, which saves the CountVectorizer time on every experiment.

In actual production, many labels use the same fields, so converting Array into Vector and storing it in advance saves a lot of time. Later, different tasks can read the Vector columns directly.
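The idea of converting once and reusing the stored vectors can be sketched on a single machine as a simple file cache. In Spark, the equivalent would be to write the vector column to storage once and read it back in later jobs. The JSON cache file, the `VOCAB` mapping and the function names below are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical vocabulary: token -> column index.
VOCAB = {"com.example.game": 0, "com.example.camera": 1}


def vectorize(apps):
    """Stand-in for the expensive step: sparse counts as [index, count] pairs."""
    counts = Counter(a for a in apps if a in VOCAB)
    return sorted([VOCAB[a], c] for a, c in counts.items())


def load_or_build_vectors(cache_path, raw_rows):
    """Return (vectors, from_cache). Vectorize only when no cached copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh), True
    vectors = [vectorize(row) for row in raw_rows]
    with open(cache_path, "w") as fh:
        json.dump(vectors, fh)
    return vectors, False


rows = [["com.example.game", "com.example.game"],
        ["com.example.camera", "com.other"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app_vectors.json")
    first, cached_first = load_or_build_vectors(path, rows)    # builds and stores
    second, cached_second = load_or_build_vectors(path, rows)  # reads the cache
```

The stored vectors are only valid for the vocabulary they were built with, so the vocabulary needs to be stored and versioned together with them.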

2. Although the first tip saves a lot of time, Spark is still better suited to production than to exploration. In the early exploration of a model, we can use Spark to generate the training set first. Because samples with real labels are usually scarce, the generated training set is usually not very large, so we can then run rapid experiments on a single machine.

On a single machine, we can use Python to plot the data more easily and understand it more intuitively, filter features faster, and validate ideas faster. Once we understand the data and the model well, we can quickly apply the conclusions from these experiments to production.

About the author: Xiaoyan Zhang is a data scientist at TalkingData, currently responsible for building the enterprise-level user profile platform and for the research and development of efficient marketing and delivery algorithms. She has long followed Internet advertising, user profiling, fraud detection and related fields.