

I. Definition of user portrait


The concept of the persona was first proposed by Alan Cooper, the father of interaction design: “Personas are a concrete representation of target users.” A persona is a virtual representation of real users, that is, a target user model built from a series of attribute data. With the development of the Internet, the user portrait has taken on new content and meaning. Today a user portrait usually means a labeled user model abstracted from the user’s demographic characteristics, browsing content, online social activity, and consumption behavior. The core work of building a user portrait is to analyze and mine the massive logs stored on servers and the massive data held in databases, and to attach “labels” to users. A label is an identifier that represents a user’s characteristics along one dimension. The figure below shows the concrete form labels take: the labels one website assigned to one of its users.



II. The role of user portrait


Extracting user portraits takes a great deal of time and manpower, because massive logs have to be processed. Despite the high cost, most companies still want an accurate portrait of their users. So what does a user portrait do, and what does it help us achieve? Its uses can be summarized as follows:


1. Precision marketing: targeted direct mail, SMS, App push messages, personalized advertising, etc.


2. User research: guiding product optimization, and even customizing product functions for individual users.


3. Personalized services: personalized recommendation, personalized search, etc.


4. Business decisions: ranking statistics, regional analysis, industry trends, competitor analysis, etc.


III. Content of user portrait


The content of a user portrait is not fixed; the features that matter vary by industry and product. For most Internet companies, a user portrait includes demographic attributes and behavioral characteristics. Demographic attributes mainly refer to the user’s age, gender, province and city, education level, marital status, parental status, industry, occupation, etc. Behavioral characteristics mainly include indicators such as activity and loyalty.


Beyond these general features, different types of websites extract user portraits with different emphases. Content-oriented media and reading websites, as well as search engines and general navigation sites, tend to extract the user’s interest in the content browsed, such as sports, entertainment, food, finance, travel, real estate, and cars. Social networking sites can also extract the user’s social network, from which tightly connected user groups and the star nodes that act as opinion leaders in a community can be found. E-commerce websites generally extract indicators such as online shopping interests and consumption power. Online shopping interests mainly refer to the categories a user prefers when shopping online, such as clothing, bags, home goods, mother and baby, personal care, and food and beverage. Consumption power refers to the user’s purchasing power. With careful enough work, the user’s actual consumption level can be separated from their psychological consumption level in each category, and a feature dimension built for each.


In addition, the user’s environmental attributes can be added, such as the current time, LBS (location-based service) features of the place of access, local weather, and holidays. Of course, any specific website or App will have user dimensions it cares about in particular; those dimensions need to be broken down in more detail so that users can be given more accurate personalized services and content.
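As a rough illustration of how the feature groups described in this section might sit together in one record, the sketch below uses a hypothetical `UserPortrait` dataclass. The class, its field names, and the sample values are assumptions made for illustration, not the schema of any real system.

```python
from dataclasses import dataclass, field


@dataclass
class UserPortrait:
    """Hypothetical container for one user's labels, grouped by feature type."""
    user_id: str
    demographics: dict = field(default_factory=dict)  # e.g. age, gender, city
    behavior: dict = field(default_factory=dict)      # e.g. activity, loyalty
    interests: dict = field(default_factory=dict)     # category -> preference score
    environment: dict = field(default_factory=dict)   # e.g. time, location, weather

    def top_interests(self, k=3):
        """Return the k interest categories with the highest scores."""
        ranked = sorted(self.interests.items(), key=lambda kv: kv[1], reverse=True)
        return [name for name, _ in ranked[:k]]


# Sample values are invented for the example.
portrait = UserPortrait(
    user_id="u001",
    demographics={"gender": "female", "city": "Hangzhou"},
    behavior={"activity": 0.8, "loyalty": 0.6},
    interests={"sports": 0.2, "food": 0.9, "travel": 0.7},
)
print(portrait.top_interests(2))
```

Keeping each feature group in its own mapping lets a product add the dimensions it cares about without changing the overall record shape.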






IV. The production process of user portrait


The extraction of user features is the production process of user portraits, which can be roughly divided into the following steps:


1. User modeling: decide which user feature dimensions to extract and which data sources to use;


2. Data collection: use data collection tools such as Flume, or self-written scripts, to store all the required data in the Hadoop cluster;


3. Data cleaning: cleaning usually runs on the Hadoop cluster, though it may also be done at the same time as data collection;


4. Model training: some features cannot be obtained directly from the cleaned data, such as the content a user is interested in or the user’s consumption level, so a model is trained on the known features that have been collected;


5. Attribute prediction: use the trained model and the user’s known features to predict the user’s unknown features;


6. Data merging: combine the features extracted for each user from the various data sources, and assign each a degree of reliability;


7. Data distribution: distribute the merged result data to platforms such as precision marketing, personalized recommendation, and CRM to provide data support.




The following takes user gender as an example to introduce the process of feature extraction in detail:


1. Extract the data filled in by users themselves, such as the gender data filled in during registration or activities, which are generally highly accurate.


2. Extract the forms of address used for the user, such as how the user is addressed as a recipient or sender, for example “Mr./Ms. XXX”. This data is also fairly accurate.


3. Predict the user’s gender from the user’s name. This is a binary classification problem: strip the surname (family names have no correlation with gender), keep the given name, and train a naive Bayes classifier on it. In the process we ran into the problem of rare characters. A name such as “Zhen Huan” (甄嬛) could not be classified correctly because the character appears in very few names. Chinese characters are built from components and radicals, and these often carry meaning of their own, frequently associated with gender: for example, characters with the grass radical tend to be female, and characters with the king (jade) radical tend to be male. We therefore decompose each character with the Wubi input method and train an LR classifier on the characters of the name together with their Wubi codes. For example, the Wubi code of “huan” (嬛) is 女 (V) + L + G + 衣 (E) = VLGE, and the 女 (“woman”) component is strongly feminine.


4. Other features can be used as well. For example, a user who frequently visits beauty, makeup, or women’s clothing websites is more likely to be female, while a user who visits sports and military websites is more likely to be male. Time of day matters too: late-night users are more likely to be male. Adding these features to the LR classifier can also improve data coverage to some extent.
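The name-based step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a small hand-written multinomial naive Bayes for both the character features and the Wubi features (the text describes LR for the combined features; naive Bayes is swapped in here to keep the sketch dependency-free). The training names and labels are invented toy data, and the `WUBI` table is a tiny illustrative lookup that should be checked against a real Wubi dictionary; only VLGE for 嬛 comes from the text.

```python
import math
from collections import Counter, defaultdict

# Illustrative Wubi lookup (assumption: verify against a real Wubi table).
WUBI = {"嬛": "VLGE", "婷": "VYPS", "娟": "VKEG", "伟": "WFNH", "强": "XKJY"}


def name_features(given_name):
    """Characters of the given name plus the Wubi keys of each known character."""
    feats = []
    for ch in given_name:
        feats.append("char=" + ch)
        for key in WUBI.get(ch, ""):
            feats.append("wubi=" + key)
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in feats:
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: given names only, surnames already stripped.
train_names = ["婷", "娟", "伟", "强"]
train_labels = ["female", "female", "male", "male"]
model = NaiveBayes().fit([name_features(n) for n in train_names], train_labels)

# 嬛 never appears in training, but its Wubi key V (女) does.
print(model.predict(name_features("嬛")))
```

The rare character itself contributes nothing, since it was never seen in training; the prediction comes entirely from the Wubi keys it shares with known names, which is the effect the decomposition is meant to achieve.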


V. Data management system


User portrait work involves a large amount of data processing and feature extraction, usually across multiple data sources, with several people processing data and generating features in parallel. A data management system is therefore needed to store and distribute data in a unified way. Our system organizes data by a directory convention, with a basic hierarchy of /user_tag/attribute/date/source_author/. For example, the gender data that dev1 extracts from user names is stored in /user_tag/gender/20170101/name_dev1, and the gender data that dev2 extracts from user information is stored in /user_tag/gender/20170102/raw_dev2.
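A small helper can make the directory convention explicit. The sketch below is an assumption about how such paths might be built and parsed; the function names `tag_path` and `parse_tag_path` are hypothetical, and the parser assumes author names contain no underscore.

```python
from datetime import date

ROOT = "/user_tag"


def tag_path(attribute, day, source, author):
    """Build /user_tag/<attribute>/<yyyymmdd>/<source>_<author>."""
    return f"{ROOT}/{attribute}/{day:%Y%m%d}/{source}_{author}"


def parse_tag_path(path):
    """Split a path in the convention above back into its four parts."""
    root, attribute, day, leaf = path.strip("/").split("/")
    if "/" + root != ROOT:
        raise ValueError(f"unexpected root in {path!r}")
    # Assumes the author part has no underscore, so split on the last one.
    source, author = leaf.rsplit("_", 1)
    return {"attribute": attribute, "date": day, "source": source, "author": author}


path = tag_path("gender", date(2017, 1, 1), "name", "dev1")
print(path)
print(parse_tag_path("/user_tag/gender/20170102/raw_dev2"))
```

Encoding the attribute, date, source, and author in the path lets several people write results in parallel without overwriting each other, and lets a merge job discover every source for an attribute by listing one directory.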


The credibility of the data extracted from each source differs, so each source’s output must be given a weight, by convention a probability value between 0 and 1. The system can then merge the data automatically: it only needs to compute a simple weighted sum, normalize the result, and write it to the cluster in a Hive table defined in advance. From there, the data is incrementally updated to further application service clusters such as HBase, ES, and Spark.
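The weighted merge can be sketched in a few lines. The function name `merge_sources` and the weights in the example are assumptions for illustration; the only part taken from the text is the idea of a weighted sum followed by normalization.

```python
def merge_sources(predictions):
    """Merge per-source label predictions by weighted sum, then normalize.

    predictions: list of (label, weight) pairs, where weight in [0, 1]
    is the credibility assigned to the source that produced the label.
    """
    totals = {}
    for label, weight in predictions:
        totals[label] = totals.get(label, 0.0) + weight
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {label: score / norm for label, score in totals.items()}


# Hypothetical weights: self-reported data trusted most, behavior least.
merged = merge_sources([
    ("female", 0.9),  # filled in at registration
    ("female", 0.6),  # predicted from the given name
    ("male", 0.5),    # predicted from browsing behavior
])
print(merged)
```

After normalization the scores for one attribute sum to 1, so downstream consumers can treat them as a confidence for each candidate label.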







VI. Application examples of user portrait


Take the personalized recommendations on one page of an e-commerce website as an example. Many online recommendation systems train an LR (logistic regression) model, chosen for its interpretability, extensibility, and computational performance, so LR is used as the example here. Many recommendation scenarios use item-based collaborative filtering, whose core is an item correlation matrix W. Assuming there are n items, W is an n × n matrix whose element w_ij is the correlation coefficient between items I_i and I_j. From the user’s browsing and purchase behavior, the user can be represented as an n-dimensional feature vector U = [i_1, i_2, …, i_n]. U*W can then be read as the user’s interest in each item, V = [v_1, v_2, …, v_n], where v_1 is the user’s interest in item I_1: v_1 = i_1*w_11 + i_2*w_12 + … + i_n*w_1n (taking W to be symmetric, so that w_1k = w_k1). If the correlation coefficients w_11, w_12, …, w_1n are treated as the unknowns, an LR model can solve for them by fitting on the behavior vectors U of the users in the training set. A preliminary LR model trained this way performs about as well as item-based collaborative filtering.
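The product U*W can be worked through on a toy case. The 3 × 3 matrix and the behavior vector below are invented values, and `interest_scores` is a hypothetical helper name; the computation itself is just the vector-matrix product described above.

```python
def interest_scores(u, w):
    """Compute V = U * W, i.e. v_j = sum over k of u[k] * w[k][j]."""
    n = len(u)
    return [sum(u[k] * w[k][j] for k in range(n)) for j in range(n)]


# Hypothetical symmetric correlation matrix for three items.
W = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
U = [1, 0, 1]  # the user interacted with the first and third items

V = interest_scores(U, W)
print(V)
```

The second item gets a nonzero score even though the user never touched it, because it is correlated with both items the user did interact with. That is the signal a recommender ranks on.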


At this point only the user’s behavioral characteristics have been used; other context, such as demographic attributes, online shopping preferences, content preferences, consumption power, and environmental features, has not. These features are added to the LR model, along with attributes of the target item itself such as text labels, category, and sales volume, as shown in the figure below, and the original LR model is further optimized and retrained. This makes the fullest use of the extracted user portrait data and yields more accurate personalized recommendations.
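A minimal sketch of this step, under stated assumptions: behavior, portrait, and item features are concatenated into one vector, and a logistic regression is fitted by plain batch gradient descent. The feature names (viewed a similar item, category preference, consumption power, normalized sales), the toy rows, and the hyperparameters are all invented for illustration; a production system would use a proper training library and far more features.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_features(behavior, portrait, item):
    """Concatenate behavior, portrait, and target-item features into one vector."""
    return list(behavior) + list(portrait) + list(item)


def predict(weights, bias, x):
    """Predicted click probability for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_lr(rows, labels, lr=0.5, epochs=5000):
    """Fit logistic regression by batch gradient descent on the mean log loss."""
    n_feat = len(rows[0])
    weights = [0.0] * n_feat
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical samples: ([viewed_similar], [category_pref, consumption_power],
#                        [normalized_sales], clicked)
samples = [
    ([1], [0.9, 0.8], [0.7], 1),
    ([1], [0.8, 0.5], [0.2], 1),
    ([0], [0.9, 0.6], [0.5], 1),
    ([0], [0.1, 0.7], [0.9], 0),
    ([0], [0.2, 0.3], [0.4], 0),
    ([1], [0.1, 0.2], [0.3], 0),
]
rows = [build_features(b, p, i) for b, p, i, _ in samples]
labels = [s[3] for s in samples]

weights, bias = train_lr(rows, labels)
candidate = build_features([0], [0.85, 0.4], [0.6])
print(round(predict(weights, bias, candidate), 3))
```

Because every feature group ends up in the same vector, adding a new portrait dimension only means appending a column and retraining, which is one reason LR is convenient here.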








This article comes from the NetEase practitioner community and is published on NetEase Cloud Community with the authorization of the author, Yang Jie.
