User experience work is, in essence, the analysis of user needs and user cognition. The voice of the customer is an important part of this process: it contains users' opinions of a product, good and bad, which guide product improvement and iteration. Since everything carries a cost in money and labor, I hope to use machine learning to assist the analysis and extract insights from user comment data.

1. Data acquisition and cleaning

Crawlers are everywhere now, so obtaining public data from the web is no longer a problem. For simple jobs you can use a web-crawling service (such as Shenjianshou or Octoparse); for complex ones you can write your own crawler. Here we use a crawler to collect review data from JD.com. Compared with Amazon, JD.com is trickier, with two pitfalls. The first is that JD.com's anti-crawling measures are good: entering the review list through the normal product URL yields almost no data, and most web-crawling services stop right there. The second is that once a product accumulates more than 10,000 reviews, JD.com only exposes the first 1,000; the rest are simply not public. No crawling technique gets around that, unless you run the crawler periodically and collect the data incrementally.

The good thing about writing your own crawler is that you avoid the first pitfall, but you still cannot escape the second. Here I crawled the review data for the Mi MIX and Mi MIX2 (I have also crawled several newer phones; leave a message in the backend if you need the data): 1,578 reviews for the Mi MIX and 3,292 for the Mi MIX2.

Through analysis of these data, this article aims to accomplish the following:

1. The positive review rate after data cleaning

2. An overview of positive/neutral/negative reviews

3. Analysis of typical opinions

First, an overall look at the MIX2:

There were 3,497 comments in total, some of them exact duplicates. Reviews were posted about 9 days after purchase on average (possibly related to delivery dates), the average rating was 4.87, the Mi MIX2 comes in only one color, and so on.

So let's start with the first objective.

JD.com uses a 5-point scale: 4-5 stars count as positive, 2-3 as neutral, and 1 as negative. The MIX2's positive rate works out to 96.63%, the same figure shown on JD.com's product page.
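A minimal sketch of this bucketing, assuming the crawled reviews sit in a pandas DataFrame with a 1-5 star `score` column (the column name and toy data are illustrative):

```python
import pandas as pd

# Toy data; the real frame would hold the crawled reviews.
df = pd.DataFrame({'score': [5, 5, 4, 3, 1, 5]})

def label(score):
    """Map JD.com's 5-point scale onto review categories."""
    if score >= 4:
        return 'positive'   # 4-5 stars
    if score >= 2:
        return 'neutral'    # 2-3 stars
    return 'negative'       # 1 star

df['label'] = df['score'].apply(label)
print('positive rate: {:.2%}'.format((df['label'] == 'positive').mean()))
```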

A cursory glance at the comments below reveals several types of invalid comments.

The first type is comments that are all punctuation or contain only one or two words:

This type can be removed with regular expressions. The second type is more troublesome, for example:

This kind of review is pure filler written to pad the word count and contains no product characteristics. One idea is to check whether the nouns in a comment belong to the mobile phone domain, but real cases can be complicated, such as:

“Very good”, “very bad”…

These have no subject, so we cannot tell what is being evaluated. Here we reverse the approach and assume that each type of invalid comment shares similar keywords: if a comment contains one of these spam keywords, we judge it invalid. Of course, you do not need to list every invalid-comment word yourself; starting from one word, TF-IDF can find other similar words (a text-similarity algorithm would also work).
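A minimal sketch of both filters, assuming a hand-seeded spam keyword list (in practice the list would be grown from a few seed words via TF-IDF similarity):

```python
import re

# Seed keywords typical of filler reviews; illustrative, not exhaustive.
SPAM_KEYWORDS = {"凑字数", "默认好评"}   # "padding the word count", "default praise"

PUNCT_ONLY = re.compile(r"^[\W_]+$")     # nothing but punctuation/symbols

def is_invalid(comment: str) -> bool:
    text = comment.strip()
    # Type 1: all punctuation, or only one or two characters
    if not text or PUNCT_ONLY.match(text) or len(text) <= 2:
        return True
    # Type 2: contains a known spam/filler keyword
    return any(kw in text for kw in SPAM_KEYWORDS)

reviews = ["！！！", "好", "全面屏很惊艳，手感不错"]
print([r for r in reviews if not is_invalid(r)])
```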

In addition, there is another case that, while not an invalid comment, skews the positive review rate.

This case shows up fairly often in the reviews, partly because of JD.com's "default positive" mechanism: the text is a negative review, yet the rating is 5 stars. In theory, most of these can be found algorithmically. NLP has a topic called sentiment analysis, which judges whether a sentence's emotional polarity is positive or negative (as a probability between 0 and 1). If a review's sentiment diverges too far from its rating, it is reasonable to believe the rating is wrong. There is one precondition, of course: the sentiment analysis must be very accurate.

Let's take a look at SnowNLP, an open-source sentiment analysis package whose model was trained on e-commerce review corpora.
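A minimal usage sketch (the example sentence is illustrative):

```python
from snownlp import SnowNLP

s = SnowNLP("手机很好，全面屏很惊艳")   # "The phone is great; the full screen is stunning"
print(s.sentiments)   # P(positive) in [0, 1]; values above 0.5 read as positive
```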

Well, 92.63% accuracy seems decent, but... if I had simply predicted every review as positive, I would have been right 96.54% of the time. Now look at the ROC curve in the chart above: it is awful. The larger the area under the curve (the AUC), the better the model's discriminative power. Normally the curve lies above the diagonal (the diagonal corresponds to random guessing), yet here the AUC is only 0.157, far worse than random.
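For reference, this evaluation can be reproduced with scikit-learn, assuming `y_true` marks reviews rated 4-5 stars as 1 and `y_score` holds the SnowNLP probabilities (toy arrays below):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 0, 1, 0])               # 1 = rated positive (4-5 stars)
y_score = np.array([0.9, 0.4, 0.6, 0.8, 0.3])    # SnowNLP P(positive), toy values

fpr, tpr, _ = roc_curve(y_true, y_score)         # points along the ROC curve
print('AUC =', roc_auc_score(y_true, y_score))   # below 0.5 is worse than random
```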

A better sentiment model would probably require retraining on a large corpus from the mobile phone domain, but we will set that aside for now.

2. Semantic understanding of positive/neutral/negative reviews

Semantic understanding is a very hard problem. This article does not pursue absolute accuracy; it only aims for a quick read of the product reviews. We will characterize the semantics of each review category from three angles (sketched in code after the list):

1. Word clouds. A word cloud counts how often each word occurs in a text (its frequency); the higher the frequency, the larger the font. A glance at a word cloud shows what a text is mainly about.

2. TextRank. TextRank is a graph-based ranking algorithm for text that extracts its keywords. The basic idea comes from Google's PageRank: split the text into units (words or sentences), build a graph over them, and rank the important components by a voting mechanism. It performs keyword extraction and summarization using only the information in the document itself. Unlike models such as LDA and HMM, TextRank needs no advance training over multiple documents, and its simplicity and effectiveness have made it widely used.

3. Topic decomposition. Assume each piece of text has a topic, such as sports, current affairs, or gossip. Decomposing a collection of texts into topics (this article uses LDA) tells us which topics the corpus involves. (The LDA results in this article are not very good and are included mostly for fun; a better method may follow in a later update.)
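Sketches of all three steps, assuming jieba for segmentation and TextRank and gensim for LDA (the article does not name its libraries, and the two-review corpus is a toy):

```python
from collections import Counter

import jieba
import jieba.analyse
from gensim import corpora, models

reviews = ["手机手感很好，全面屏很惊艳", "客服态度差，售后很失望"]   # toy corpus

# 1. Word frequencies: the counts a word cloud visualizes
freq = Counter(w for r in reviews for w in jieba.cut(r) if len(w) > 1)

# 2. TextRank keywords, extracted from the joined text alone
keywords = jieba.analyse.textrank("。".join(reviews), topK=5)

# 3. LDA topic decomposition over the tokenized corpus
docs = [[w for w in jieba.cut(r) if len(w) > 1] for r in reviews]
dictionary = corpora.Dictionary(docs)
lda = models.LdaModel([dictionary.doc2bow(d) for d in docs],
                      num_topics=2, id2word=dictionary, passes=5)

print(freq.most_common(5), keywords, lda.print_topics())
```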

From the word clouds, keywords, and topics, it is easy to see that:

1. Positive reviews focus on: screen, surprise, feel, full screen, bezel; roughly, "the Xiaomi phone is good", "it feels great", "the full screen is stunning", and the like;

2. Neutral reviews focus on: screen, just okay, disappointed, bezel, etc.;

3. Negative reviews focus on: customer service, malfunctions, after-sales service, disappointment, model, WeChat, etc.; roughly: the phone malfunctioned; something about WeChat and the screen? some after-sales and customer-service problems related to the version?

That is about all one can say, and only vaguely and in fragments, because these methods give only the words, without the accompanying sentiment.

3. Extraction and mining of typical opinions

E-commerce reviews differ from general web text in that the corpus mainly consists of evaluations of certain product features. What we want to do in this section is find those features algorithmically.

Think about it: the corpus mainly evaluates features; features are usually nouns, and evaluations are usually adjectives. Relatively speaking, the adjectives applied to a product are few, such as "good", "smooth", "very good", and so on, so we can find initial feature-adjective pairs through association analysis, such as ("phone", "good"), ("phone", "smooth"), and so on.

The candidate feature-adjective pairs found by association analysis still need screening, mainly for two reasons:

1. The pairs are not only noun-adjective pairs; there are also noun-noun pairs, adjective-verb pairs, and so on;

2. Association analysis ignores the distance between the two words in the text; the noun may be in the first sentence and the adjective in the last.

Even this screening is not enough. Association analysis only mines features whose support exceeds a certain threshold, which we call "common features". What about the uncommon ones? How do we dig those out? Note that many adjectives have already been mined; these are the words most commonly used in product reviews, and we can use them to reverse-mine the "uncommon features". A sketch of the forward step follows.
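A minimal sketch of the forward mining step, assuming jieba's part-of-speech tagger; the co-occurrence window and support threshold are illustrative parameters:

```python
from collections import Counter

import jieba.posseg as pseg

def mine_pairs(comments, max_gap=3, min_support=5):
    """Count noun-adjective pairs co-occurring within max_gap tokens."""
    counts = Counter()
    for comment in comments:
        tokens = list(pseg.cut(comment))
        for i, tok in enumerate(tokens):
            if not tok.flag.startswith('n'):           # feature candidates: nouns
                continue
            for nxt in tokens[i + 1:i + 1 + max_gap]:
                if nxt.flag.startswith('a'):           # opinion candidates: adjectives
                    counts[(tok.word, nxt.word)] += 1
    # keep only the "common features": pairs above the support threshold
    return {p: c for p, c in counts.items() if c >= min_support}
```

The reverse step would then start from the adjectives mined here and look for the nouns they attach to, surfacing the "uncommon features".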

As you can see, most phone-related features have been found, though some relate to JD.com itself, such as "speed", "JD", and "delivery", and some are not features at all, such as "kind of" and "imagination".

Search the corpus for statements related to “appearance” and see what people are talking about when they talk about appearance.

It seems the Mi MIX2's appearance is very well received; many people bought it for its looks. Next, we quantify the share of positive and negative evaluations for each feature.

The original idea was to use the SnowNLP sentiment package, since it gives a specific probability that a review is positive. Given the accuracy of sentiment analysis seen above, however, we quantify using the original star ratings instead. Take the keyword "appearance" as an example.
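A sketch of this rating-based quantification, assuming the reviews sit in a pandas DataFrame with `comment` and `score` columns (the names and toy data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'comment': ["外观很漂亮", "外观一般", "手感很好"],   # toy reviews
                   'score':   [5, 2, 5]})

def feature_sentiment(df, feature):
    """Mentions of a feature, and the positive/negative share among them."""
    hits = df[df['comment'].str.contains(feature, na=False, regex=False)]
    return len(hits), (hits['score'] >= 4).mean(), (hits['score'] == 1).mean()

print(feature_sentiment(df, "外观"))   # (mentions, positive share, negative share)
```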

Applying this method to all of the features above, we obtain:

The most-mentioned features, in order, are: feel, screen, speed, feeling, system, bezel, camera, full screen, photography, experience, 256GB, appearance, quality, and cost performance.

The best-rated are: cost performance, quality, feel, speed, appearance, and feeling.

The worst-rated are: 256GB, screen, bezel, camera, system, experience, and full screen.

Finally, let’s look at the corpus corresponding to these features.

To sum up, the negative comments mainly concern:

No1. Delivery problems with the 256GB version

No2. Bezel width problems

No3. The MIX2's photo quality needs improvement

No4. The front camera's placement at the bottom is inconvenient

No5. System issues: MIUI carries many ads

4. Report output

Here I will plug a wheel I built myself: ReportGen, which, combined with the DataFrame format, can generate PPTX reports automatically. The GitHub repo currently has 20-odd stars.

In ReportGen, each slide is simplified into four parts: title, subtitle, body (a chart, table, text box, or image), and footnote. Given the data for each page, ReportGen generates the PPTX for you, typically in four lines of code. For example:

Of course, the PPTX in this article is a bit more complex; the corresponding code and generated report are as follows:


About the author:

JSong

Columnist for the Python Chinese Community, master's graduate of East China Normal University, focused on data analysis and mining.
