
Geeky Monkey is a Python enthusiast who currently specializes in writing web crawlers and developing with the Django framework in Python.

Estimated reading time: 11 minutes.

In my childhood memories, most of the cartoons broadcast on TV were imported from Japan and the United States. Many of them are classics of the small screen, such as the Transformers series, Beast of Prey, Spider-Man, Dragon Ball, Detective Conan, Slam Dunk, and Digimon.

By contrast, there were very few fine domestic animations, perhaps because China's animation industry was still in its infancy at the time. A few decades later, domestic animation has risen strongly, producing excellent works such as "Fight to Break the Sky", "The Moon of Qin", and "Nine Songs".

On January 11, 2019, "White Snake", a Chinese animated film, hit theaters across the country and received rave reviews. With its stunning visuals and excellent voice acting, the film scored 9.4 on Maoyan and 8.0 on Douban.

Since productions of this quality are rare, I went to Maoyan, crawled the short comments, and had a look at what viewers thought.

01 Analyzing the Page

Presumably because so many people crawl Maoyan, its anti-crawling mechanisms have grown stricter and more varied. Starting from the PC front end probably wouldn't pay off much. Moreover, the PC page only shows the featured short comments, not all of the comment data.

So I shifted the battlefield and started with the mobile page to see whether anything would come of it. After switching the browser to mobile mode, I found that the mobile page has all of the short-comment data. Click "View All Comments" and continue with packet-capture analysis.

After scrolling through a few pages of data myself, I finally found a pattern.

The address requested by the page is:


m.maoyan.com/review/v2/c…

It carries the following parameters:


movieId=1235560   # movie ID
userId=-1         # default user ID
offset=0          # paging offset
limit=15          # number of comments per page
ts=0              # current timestamp
type=3

The value of offset is then incremented at intervals of 15.
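As a quick sketch of that paging logic (the endpoint path above is truncated, so the full URL below is a hypothetical placeholder, not the real interface), each request simply advances offset by the page size:

import requests

# Hypothetical placeholder -- the full '/review/v2/...' path is truncated above.
REVIEW_API = 'https://m.maoyan.com/review/v2/comments.json'

params = {
    'movieId': 1235560,  # movie ID
    'userId': -1,        # default user ID
    'offset': 0,         # paging offset, advanced by `limit` each request
    'limit': 15,         # comments per page
    'ts': 0,             # current timestamp
    'type': 3,
}

for page in range(3):  # fetch the first three pages as a demo
    params['offset'] = page * params['limit']
    resp = requests.get(REVIEW_API, params=params, timeout=10)
    print('offset', params['offset'], '->', resp.status_code)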

I formatted the response and confirmed that it contains the comment text, the number of likes, and so on. However, this doesn't meet my needs: I want city information so that I can later draw a geographic heat map, and this interface returns no city data.

So I browsed the major search engines to try my luck. In the end, luck came through: I found a Maoyan interface that someone else had already dug up.


m.maoyan.com/mmdb/commen…

In this URL, 1235560 is the movie ID and offset controls the paging.
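A minimal sketch of calling this interface: the URL above is truncated, so the full form used below (ending in '.json?_v_=yes&offset={}') is an assumption based on commonly documented versions of it. The sketch pulls out the cityName field needed later for the heat map:

import requests

# Assumed full form of the truncated URL above; '_v_=yes' is a guess.
COMMENT_API = 'http://m.maoyan.com/mmdb/comments/movie/1235560.json?_v_=yes&offset={}'

resp = requests.get(COMMENT_API.format(0), timeout=10)
for cmt in resp.json().get('cmts', []):
    # 'cmts' and 'cityName' are the same fields the crawler parses below
    print(cmt.get('nickName'), cmt.get('cityName'), cmt.get('score'))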

02 Writing the Crawler

Because the volume of short-comment data may be quite large, I chose to store it in a database, which makes exporting and deduplicating the data convenient.

First decide which fields you want to extract from the JSON result, then design the data table and create it.
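For reference, a single record in the returned 'cmts' array looks roughly like the following (the values are invented for illustration; the field names are exactly the ones stored below):

# Illustrative only: values are made up, field names match the crawler below
comment = {
    'id': 1090000001,
    'nickName': 'MovieFan',
    'userId': 123456789,
    'userLevel': 2,
    'cityName': '北京',
    'gender': 1,
    'score': 5.0,
    'startTime': '2019-01-12 09:30:00',
    'filmView': False,
    'supportComment': True,
    'supportLike': True,
    'sureViewed': 1,
    'avatarurl': 'https://example.com/avatar.jpg',
    'content': 'Stunning visuals!',
}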


def create_database(self):
    # Create the table if it does not exist; utf8mb4 handles emoji in comments
    create_table_sql = (
        "CREATE TABLE IF NOT EXISTS {} ("
        "`id` VARCHAR(12) NOT NULL,"
        "`nickName` VARCHAR(30),"
        "`userId` VARCHAR(12),"
        "`userLevel` INT(3),"
        "`cityName` VARCHAR(10),"
        "`gender` TINYINT(1),"
        "`score` FLOAT(2,1),"
        "`startTime` VARCHAR(30),"
        "`filmView` BOOLEAN,"
        "`supportComment` BOOLEAN,"
        "`supportLike` BOOLEAN,"
        "`sureViewed` INT(2),"
        "`avatarurl` VARCHAR(200),"
        "`content` TEXT"
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4".format(self.__table)
    )
    try:
        self.cursor.execute(create_table_sql)
        self.conn.commit()
    except Exception as e:
        self.close_connection()
        print(e)
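The method above assumes self.conn, self.cursor, and self.__table already exist. A minimal sketch of that setup using pymysql follows; the class name and credentials are placeholders, not the author's code:

import pymysql

class CommentStore:
    """Placeholder class holding the connection the snippets assume."""

    def __init__(self):
        self.__table = 'comments'  # table used by create_database / insert_comments
        # Placeholder credentials -- replace with your own
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='your_password', database='maoyan',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def close_connection(self):
        self.cursor.close()
        self.conn.close()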

Next, construct the requests Session, the request headers, and the URL.


import json
import random
import time

import requests

session = requests.Session()
# URL truncated in the original post; format() fills in the offset
movie_url = 'm.maoyan.com/mmdb/commen…'
headers = {
    'User-Agent': ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/55.0.2883.87 Mobile Safari/537.36'),
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Host': 'm.maoyan.com',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

Then request the URL and parse the returned JSON data.


offset = 1
while True:
    print('============ Crawling short comments, page', offset, '============')
    print('============ >>>', movie_url.format(offset))
    response = session.get(movie_url.format(offset), headers=headers)
    if response.status_code == 200:
        data_list = []
        data = {}
        for comment in json.loads(response.text)['cmts']:
            data['id'] = comment.get('id')
            data['nickName'] = comment.get('nickName')
            data['userId'] = comment.get('userId')
            data['userLevel'] = comment.get('userLevel')
            data['cityName'] = comment.get('cityName')
            data['gender'] = comment.get('gender')
            data['score'] = comment.get('score')
            data['startTime'] = comment.get('startTime')
            data['filmView'] = comment.get('filmView')
            data['supportComment'] = comment.get('supportComment')
            data['supportLike'] = comment.get('supportLike')
            data['sureViewed'] = comment.get('sureViewed')
            data['avatarurl'] = comment.get('avatarurl')
            data['content'] = comment.get('content')
            print(data)
            data_list.append(data)
            data = {}
        print('============ Parsed', len(data_list), 'comment records ============')
        self.insert_comments(data_list)
    else:
        print('============ Page', offset, 'failed, error code:', response.status_code)
        break
    offset += 1
    # Throttle the crawl to avoid triggering the anti-crawling mechanisms
    time.sleep(random.randint(10, 20))

After parsing the data, the last step is to insert the data into the database.


def insert_comments(self, datalist):
    """Insert a batch of parsed comments into the database."""
    insert_sql = (
        "INSERT INTO {} (id, nickName, userId, userLevel, cityName, gender, score, "
        "startTime, filmView, supportComment, supportLike, sureViewed, avatarurl, content) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)".format(self.__table)
    )
    try:
        templist = []
        for comment in datalist:
            if comment.get('gender') is None:
                comment['gender'] = -1  # normalize missing gender
            data = (comment.get('id'), comment.get('nickName'), comment.get('userId'),
                    comment.get('userLevel'), comment.get('cityName'), comment.get('gender'),
                    comment.get('score'), comment.get('startTime'), comment.get('filmView'),
                    comment.get('supportComment'), comment.get('supportLike'),
                    comment.get('sureViewed'), comment.get('avatarurl'), comment.get('content'))
            templist.append(data)
        self.cursor.executemany(insert_sql, templist)
        self.conn.commit()
    except Exception as e:
        print('===== insert exception -->>> %s' % e)
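One way to get the deduplication mentioned earlier (a variant sketch, not the author's exact code) is to declare id as the table's PRIMARY KEY and switch the statement to INSERT IGNORE, so rows whose id is already stored are silently skipped:

# Variant of the statement above: with `id` declared PRIMARY KEY in the table,
# INSERT IGNORE makes MySQL skip rows whose id already exists.
insert_sql = (
    "INSERT IGNORE INTO {} (id, nickName, userId, userLevel, cityName, gender, score, "
    "startTime, filmView, supportComment, supportLike, sureViewed, avatarurl, content) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)".format(self.__table)
)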

The crawl hasn't finished yet, because I throttle the request rate. For the results, see the data analysis of the film's reviews in the next article.
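As a small preview of that analysis (a sketch that assumes the same class and table as above), the per-city counts needed for the geographic heat map are a single GROUP BY away:

# Sketch: aggregate comment counts per city for the geographic heat map
self.cursor.execute(
    "SELECT cityName, COUNT(*) FROM {} "
    "WHERE cityName IS NOT NULL AND cityName != '' "
    "GROUP BY cityName ORDER BY COUNT(*) DESC".format(self.__table)
)
city_counts = self.cursor.fetchall()  # a list of (city, count) tuples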


The Python Chinese community is a decentralized global technical community with the vision of becoming a spiritual tribe for the world's 200,000 Python developers. It currently covers the major mainstream media and collaboration platforms, and has established broad ties with well-known companies and technical communities in the industry such as Alibaba, Tencent, Baidu, Microsoft, Amazon, Open Source China, and CSDN. The community has tens of thousands of registered members from more than 10 countries and regions. Members come from government departments, research institutions, financial institutions, and well-known companies at home and abroad, including the Ministry of Public Security, the Ministry of Industry and Information Technology, Tsinghua University, Peking University, Beijing University of Posts and Telecommunications, the People's Bank of China, the Chinese Academy of Sciences, CICC, Huawei, BAT, Google, and Microsoft, and nearly 200,000 developers follow the platform.


More recommended

What is Peppa Pig? Draw it for you in Python!

Hidden Markov models (HMM) and the Viterbi algorithm

Understanding the Python "garbled text" problem

Use Python to crawl financial market data

Build a CNN model to crack website captchas

Image recognition with Python (OCR)

Email: [email protected]


Free membership in the Data Science Club