Written up front

I saw an introduction to Python crawlers on the Jianshu front page and wanted to crawl Bilibili's danmaku (bullet-screen comments) and draw a word cloud from them. So this was a simple attempt, from setting up the environment to getting a demo running. I don't know the syntax or what everything means; install the environment, look up the API, and get the demo to run — that was the goal! A complete zero-basis newbie! Demo address (only the Python demo; the R part is not uploaded).

Problems encountered during environment installation and debugging are covered in the next section.

Crawling Bilibili danmaku with Python

Environment notes

GitHub · the Scrapy documentation · Introduction to the Scrapy crawler framework

Step-by-step instructions

  • Install Python 3.6
  • Install Scrapy 1.4
  • Create a Scrapy demo project
  • Get the demo running and fix any problems
  • Change the demo into a Bilibili danmaku crawler

    The demo here follows the reference document "Introduction to the Scrapy crawler framework". That article's introduction, both to Scrapy and to crawling MOOC (imooc.com), is very detailed, and I suggest following it to get started. However, because the structure and style of the MOOC pages have changed, its demo no longer runs, so I changed it into a demo that crawls Bilibili danmaku. As of September 2, 2017, the test still runs.
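For reference, assuming pip is available on the PATH, installing the pinned Scrapy version listed above comes down to a single command:

pip install scrapy==1.4.0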

Demo explanation

1. After installing Scrapy, create the project scrapytest

scrapy startproject scrapytest

2. Demo directory. The demo directory only keeps the files the demo actually uses. Some file names differ from the ones Scrapy generates automatically, and files that are not involved have been deleted.

│  scrapy.cfg            // project configuration file
└─ scrapytest
   │  CourseItems.py     // defines a container to hold the crawled data
   │  MyPipelines.py     // the project's pipelines file
   │  settings.py        // the project's settings file
   └─ spiders
      │  MySpider.py     // the spider that does the crawling

3. The demo code

Create the CourseItems.py file to define a container to hold the data to crawl. To define common output data, Scrapy provides the Item class. An Item object is a simple container that holds the scraped data; it provides a dictionary-like API and a simple syntax for declaring the available fields. Since only the danmaku content is output at the end, only the content field is defined in the container.

import scrapy

class CourseItem(scrapy.Item):
    # the danmaku content is the only field we output
    content = scrapy.Field()

Next, write the crawler code, MySpider.py.

  • Bilibili's danmaku are stored in an XML file. Each video has a cid and an aid; take the number from the cid and substitute it into http://comment.bilibili.com/ + cid + .xml to get the danmaku XML for that video.

    How to find the cid: the cid is not in the page source. My current approach is to press F12 on the video page and search for "cid"; the cid identifies the danmaku page. If there is a way to get it programmatically, please let me know. The cid in the current example has more than 1000 danmaku; it is recommended to test with a smaller one.





Cid lookup method

  • The XML structure of the danmaku file is very simple, so it can easily be parsed with XPath (a quick verification sketch follows the spider code below).





XML file structure of the danmaku

import scrapy
from scrapytest.CourseItems import CourseItem

class MySpider(scrapy.Spider):
    # name of the spider, used by "scrapy crawl"
    name = "MySpider"
    # allowed domains
    allowed_domains = ["bilibili.com"]
    # fill in the crawl address
    start_urls = ["http://comment.bilibili.com/2015358.xml"]

    # the crawl method
    def parse(self, response):
        item = CourseItem()
        str0 = ''
        for box in response.xpath('/i/d/text()'):
            # join each danmaku with a comma
            str0 += box.extract() + ','
        # the final output is a single key: long-string pair, as shown in the output figure
        item['content'] = str0
        yield item
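Before wiring everything into Scrapy, it can help to confirm the URL pattern and the XPath outside the framework. A minimal sketch, assuming the requests and lxml packages are installed (this is not part of the original demo):

import requests
from lxml import etree

# cid of the demo video above; any valid cid should work the same way
cid = 2015358
url = "http://comment.bilibili.com/{}.xml".format(cid)

resp = requests.get(url)
# parse the raw bytes so the XML encoding declaration is respected
root = etree.fromstring(resp.content)
# every <d> element under the root <i> holds one danmaku
for text in root.xpath("/i/d/text()")[:10]:
    print(text)

If this prints the first few danmaku, the spider's start_urls and XPath above should behave the same way.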

After obtaining the information, verify and store it. In this case, the data is simply stored in JSON.

from scrapy.exceptions import DropItem
import json

class MyPipeline(object):
    def __init__(self):
        # open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize the item as one JSON line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # write to file
        self.file.write(line)
        # return the item
        return item

    # this method is called when the spider is opened
    def open_spider(self, spider):
        pass

    # this method is called when the spider is closed
    def close_spider(self, spider):
        pass
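One small refinement worth considering (my own suggestion, not in the original demo): close the output file in close_spider so everything is flushed to disk when the crawl finishes:

    # this method is called when the spider is closed
    def close_spider(self, spider):
        # close the output file so all buffered JSON lines reach disk
        self.file.close()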

Register the Pipeline: find settings.py, the crawler's configuration file, and add the following to it:

ITEM_PIPELINES = {
   'scrapytest.MyPipelines.MyPipeline': 300,
}

The code above registers the Pipeline: scrapytest.MyPipelines.MyPipeline is the class being registered, and the 300 on the right is the Pipeline's priority, in the range 1-1000; the smaller the value, the earlier it runs.

4. Run the demo. Open a CMD console at the level of MySpider.py and run the command:

scrapy crawl MySpider

The output is a JSON file. The structure of the JSON output is determined in MySpider.py, and you can also change the code to present it in a different way.





Convenient display of word segmentation




Another way
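As one possible alternative presentation (a sketch of my own, not from the original post), the parse method could yield one item per danmaku so that each JSON line holds a single comment, reusing the same CourseItem:

    def parse(self, response):
        # yield one item per danmaku instead of one long comma-joined string
        for box in response.xpath('/i/d/text()'):
            item = CourseItem()
            item['content'] = box.extract()
            yield item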

End of the Python crawl

This concludes the Python demo for crawling Bilibili danmaku; next we take the JSON file into R for word segmentation.
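Before switching to R, a quick sanity check in Python (my own sketch) that data.json contains the expected comma-joined content field:

import json

# MyPipeline writes one JSON object per line
with open("data.json", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        # content is one long comma-joined string of danmaku
        print(len(item["content"].split(",")), "danmaku entries")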

R word-segmentation example

Environment notes

The R official site · RStudio download page · The R Chinese word-segmentation package jiebaR · R and JSON · The R language w3cschool tutorial · jiebaR Chinese word-segmentation documentation · wordcloud2 on GitHub · R text mining: word clouds with the wordcloud and wordcloud2 packages

Step-by-step instructions

  • Install R, RStudio, jiebaR, and rjson
  • Import the JSON file
  • Word segmentation
  • Stop-word handling
  • Filter out numbers and letters
  • Generate the data
  • Call wordcloud2 to draw the word cloud

    The jiebaR segmentation basically follows the demo in the blog post "The R Chinese word-segmentation package jiebaR". That post explains jiebaR in detail, so the demo below will not go through the code line by line; just change the paths in the demo.

Demo explanation

A single jiebaR.R file completes both word segmentation and word-cloud drawing; the code is as follows:

# the value of the 'content' key is one long string that the crawler joined
# with commas, so segmentation also effectively splits on those commas
library(jiebaR)
library(rjson)

# read the crawled JSON and take the content field
myfile <- fromJSON(file = "F:/gitlab/py/scrapytest/scrapytest/spiders/data.json")$content
# preprocessing: drop empty entries
myfile.res <- myfile[myfile != ""]
# build a segmentation worker with a stop-word list
wk <- worker(stop_word = 'F:/R/stopw.txt')
# segment the text
segment <- wk[myfile.res]
# regex-filter the segmentation result to remove numbers and letters
segment <- gsub("[a-zA-Z\\/\\.0-9]+", "", segment)
# word frequency calculation
data <- freq(segment)
# install.packages("wordcloud2") if it is not installed yet, then load it
library(wordcloud2)
# draw the word cloud
wordcloud2(data, size = 1, fontFamily = "微软雅黑", color = "random-light")





The result after calculating the word frequency




I never thought at first that I would be so ugly

Notes on problems

1. Word frequency calculation. Because the danmaku contain so many words, there are still a great many entries after segmentation and filtering, and I have not looked into how to further sort and filter them, so the word-cloud result is not great. 2. Keyword extraction. Personally I think a word cloud built from extracted keywords would look better, and jiebaR provides a keyword extraction method; the extracted result reflects how often the words occur.

# keyword extraction worker, keeping the top 150 keywords
keys <- worker("keywords", topn = 150)
# keyword extraction result
re <- vector_keywords(segment, keys)





Extracted keyword results



The keyword extraction returns a vector, while wordcloud draws from a data.frame, so I converted the vector to a data.frame and also wrote it out as a csv, but without word frequencies it cannot get through wordcloud's drawing!!
Please advise on how to feed keywords into wordcloud for drawing!!

# re is the result returned by vector_keywords
data.frame(re)
write.csv(data.frame(re), "F:/R/2345.csv", row.names = T)





The result of calling the keyword function vector_keywords

3. How to get more of the desired words into the dictionary. I replaced jiebaR's original dictionary with the Sogou dictionary, which is the only way four-character terms like "神罗天征" can appear at all; with the original dictionary, even "宇智波" gets split apart! But I don't know how to extract very short sentences: from the beginning of the original danmaku file you can see there are many repeated sentences, and apart from setting fixed words and phrases in my own Sogou word package, I don't know whether there is another way. Guidance is welcome.

End of the R word segmentation

The final picture turned out so ugly it is hard to look at. If I had expected it to be this ugly at the beginning… I would never have bothered with the segmentation, and word-cloud websites nowadays seem to come with their own segmentation anyway, so… so I don't really know what I was doing… If there are mistakes, please point them out!