
The following article is from the author of Tencent Cloud: The Tao of Python


The purpose of this short article is to share my experience of learning Python crawlers from scratch over the past few days, and to show some results of sentiment analysis (text classification) on the crawled data. Unlike other introductions that focus on crawler technology, this article first explains the motivation for crawling web data, then introduces text crawling using Douban film reviews as an example, and finally applies text classification to perform sentiment analysis with machine learning. Since the topic is too broad to cover in detail, the article aims to open a cognitive window for readers with little or no exposure to the field, and hopefully to spark their interest in exploring further on their own.

The following sample code is written in Python, using the sklearn and Scrapy libraries.

So, what is Sentiment Analysis?

Sentiment analysis, also called opinion mining, studies people’s opinions, emotions and attitudes toward things such as products, topics and policies. With the explosion of opinion data on the Internet, sentiment analysis has been widely studied and applied. Bing Liu’s Sentiment Analysis and Opinion Mining gives a comprehensive summary and case studies of the problem; another classic, Bo Pang’s Opinion Mining and Sentiment Analysis, focuses on sentiment classification.

A simple application example: a company that wants to gauge how well its product is received on Taobao can train a classifier on the product reviews to judge whether each reviewer’s attitude is positive or negative, and thereby fully mine the text content.

Python crawler

Of course, the first step of sentiment analysis is obtaining data, and the Internet, especially social networks, offers abundant and easily accessible opinion data. Python’s open-source crawler framework Scrapy works well and is a good first tool for beginners. The Scrapy wiki provides comprehensive learning resources, including materials and documentation in Chinese. As I’ve always emphasized, data science is a job that requires a wide range of skills, and learning through practice is a good way to acquire them. I encourage readers not to worry about lacking background knowledge: there is no barrier to entry for writing a Python crawler.

By the time you have finished the introduction above, you should know how a Scrapy project is organized, how the scraping process works, how each Spider operates, and the syntax of XPath. You will then see that writing a simple crawler takes only four steps:

  1. scrapy startproject PROJECT_NAME

  2. Define a spider:

  • The choice of Spider class (e.g. CrawlSpider) depends on the target site and the crawler’s use case

  • It has an initial URL, or a method that generates the initial URLs

  • It has a parse method that generates items and follow-up requests

  3. Define the fields of the Item class you want to scrape

  4. scrapy crawl SPIDER_NAME

The first example I tried was collecting film review data from Douban. I chose Douban, first, because of its rich corpus and built-in rating system, which makes it easy to obtain labels for a classification problem; second, because no account login is required, so there are fewer restrictions. The idea is to pick a particular movie and crawl all of its reviews and ratings, so that the review text can later serve as classification features and the rating as the classification label. I chose Brotherhood of Blades (绣春刀, my favorite film of 2014) as the target and defined a spider for it. A few dozen lines of code are enough to start grabbing all the reviews and ratings for a movie. Before you do, add DOWNLOAD_DELAY = 2 to your settings.py, or Douban will ban you before you get halfway.
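The spider’s parse step boils down to pulling the rating and the review text out of each comment node. Since the original spider code did not survive, here is a minimal stand-alone sketch of that extraction step; the HTML structure and class names (`allstar40`, `short`) are assumptions modeled on Douban’s comment markup and may have changed:

```python
import re

# One comment block roughly as Douban renders it (the markup is an assumption).
SAMPLE = (
    '<div class="comment">'
    '<span class="allstar40 rating" title="recommend"></span>'
    '<span class="short">Tight plot, beautiful fight scenes.</span>'
    '</div>'
)

def parse_comment(html):
    """Extract (stars, text) from a single comment block."""
    stars = re.search(r'allstar(\d)0', html)   # e.g. "allstar40" -> 4 stars
    text = re.search(r'<span class="short">(.*?)</span>', html, re.S)
    return (int(stars.group(1)) if stars else None,
            text.group(1).strip() if text else '')

print(parse_comment(SAMPLE))  # (4, 'Tight plot, beautiful fight scenes.')
```

In the real spider the same extraction would be expressed with XPath selectors on the `response` object; the regexes above just make the logic visible without a running Scrapy project.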

Sentiment classification

Feature transformation

When we have a movie review, can an algorithm automatically predict whether it is positive or negative? Computers cannot read human language directly, so is there a way to turn words into something a machine can work with? Take simple linear regression as an analogy: in regression analysis, the variables that help prediction are used as features. Here the text itself is the feature carrying the usable information. Most classification algorithms, however, require fixed-length numeric feature vectors, so the raw text can only be “fed” to them after a transformation step. This is what distinguishes sentiment classification from a generic classification problem.

One of the most straightforward and general transformations is to count the frequency of single words in the text:

  1. Split the text into words. In English this can be done with spaces and punctuation; in Chinese, a word-segmentation dictionary or tool is needed.

  2. Count the number of times each word appears in each text.

Each individual word’s frequency then becomes one variable (feature), and each short review becomes one sample. After converting text into a feature matrix this way, you might notice that very common words such as “of” and “I” do not actually help in judging preference, yet their high frequency can drown out the words that really matter and weaken the predictive power of the features. Tf-idf is a common re-weighting method. Its main idea: if a term appears frequently in one document (high term frequency) but rarely in the rest of the corpus (high inverse document frequency), it is considered to have good discriminating power and is given more weight.
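Both steps are available in scikit-learn, the library this article uses. A minimal sketch, with three made-up example sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (invented for illustration).
docs = [
    "the movie was great",
    "the movie was terrible",
    "great fight scenes",
]

# Steps 1-2: split into words and count frequencies.
counter = CountVectorizer()
counts = counter.fit_transform(docs)       # sparse matrix: samples x vocabulary

# Re-weight the raw counts with tf-idf.
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape)                        # (3, 7)
print(sorted(counter.vocabulary_))
# ['fight', 'great', 'movie', 'scenes', 'terrible', 'the', 'was']
```

`TfidfVectorizer` combines both objects into one if you prefer a single step.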

N-grams

Even with re-weighting, single-word features do a poor job of conveying meaning. The meaning of a text depends not only on which words it contains, but also on their order and dependencies. For example, 我爱她 (“I love her”) and 她爱我 (“she loves me”) are made up of the same three characters, yet in a different order they mean different things.

One remedy is the N-gram, where N is the number of words in a group: instead of a single word, two or more consecutive words are joined to form one feature. The idea behind it is simple: the more subsequences two texts share, the more similar they are. In the example above, besides the unigrams 我, 爱 and 她, a 2-gram model also puts 我爱 and 爱她 (for the first sentence) and 她爱 and 爱我 (for the second) into the feature matrix, so the two sentences are no longer indistinguishable.
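A word-level n-gram extractor fits in a few lines. A sketch, run on the already-segmented sentences from the example above:

```python
def ngrams(tokens, n):
    """All consecutive runs of n tokens, each joined into one feature."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ["我", "爱", "她"]   # "I love her", already word-segmented
b = ["她", "爱", "我"]   # "she loves me"

print(ngrams(a, 2))  # ['我爱', '爱她']
print(ngrams(b, 2))  # ['她爱', '爱我']
```

The two sentences share all three unigrams but no 2-gram, which is exactly the word-order information the unigram model throws away. In scikit-learn the same effect comes from `CountVectorizer(ngram_range=(1, 2))`.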

Back to Brotherhood of Blades…

I collected 16,587 comments in total, fewer than the film’s full count: the crawl was interrupted roughly halfway through, and, more importantly, comments without a rating were discarded. Based on the star rating, three stars or fewer are labeled negative, and four or more positive.
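The labeling rule above is a one-liner; a sketch:

```python
def rating_to_label(stars):
    """Three stars or fewer -> negative (0); four or more -> positive (1)."""
    return 1 if stars >= 4 else 0

print([rating_to_label(s) for s in [1, 2, 3, 4, 5]])  # [0, 0, 0, 1, 1]
```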

The basic workflow is as follows:

Four classifiers that work well for text classification were tried: Naive Bayes, Stochastic Gradient Descent, Support Vector Machine and Random Forest. The best was Bernoulli Naive Bayes, with a cross-validated prediction accuracy of 0.67.
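Since the original workflow figure did not survive, here is a minimal sketch of the winning pipeline: binary bag-of-words features fed to Bernoulli Naive Bayes, scored by cross-validation. The eight toy reviews and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy corpus (invented): 1 = positive review, 0 = negative.
reviews = [
    "great movie loved the fight scenes",
    "wonderful plot and great acting",
    "loved it wonderful",
    "great acting loved the plot",
    "terrible movie boring plot",
    "boring and terrible acting",
    "awful boring waste",
    "terrible awful waste of time",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# BernoulliNB models binary presence/absence, hence binary=True.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())

scores = cross_val_score(model, reviews, labels, cv=4)
print(scores.mean())
```

With the real 16,587-comment dataset, `reviews` and `labels` would come from the crawled text and the star-based labels; everything else stays the same.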

Comparing classifiers, selecting features and tuning parameters are beyond the scope of this article.

Conclusion

  1. Introduced the motivation for and definition of sentiment analysis

  2. Sentiment analysis depends on opinion data, and a crawler can collect large amounts of review text, so we introduced the popular Python crawler framework Scrapy and wrote a simple crawler from scratch

  3. A key difficulty in text classification is converting text into a feature matrix that can be “fed” to a classifier; the most direct way is to split the text into words and count word frequencies

  4. N-grams try to capture word order and dependencies to reduce the loss of meaning as much as possible

Via: zhihu