In recent days, a domestic film has been riding praise and word of mouth to a dark-horse comeback, and content about it keeps popping up in WeChat Moments and on Weibo: The Unknown (无名之辈), starring Chen Jianbin, Ren Suxi, and others. A domestic film with no big names or streaming stars, and not even a distinctive title or poster, has drawn a great deal of attention, and its ratings have overtaken other films now in theaters such as Venom and Fantastic Beasts: The Crimes of Grindelwald. It holds 8.3 out of 10 on Douban, where 5-star reviews account for 34.8 percent, while on Maoyan the 5-star share is above 50 percent. By those numbers, it looks like a good domestic film. So, on an otherwise ordinary day, I pulled the Maoyan review data on my aging Mac to see whether this low-budget comedy is worth watching.


Why use Maoyan's data rather than Douban's? Mainly because Douban serves fully rendered HTML pages, while Maoyan exposes its comments as JSON, which is far more convenient to process.
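A minimal sketch of the difference (the Douban selector in the comment below is purely illustrative, not Douban's actual markup):

import json
import requests

# Maoyan: the response body is JSON, so one call yields Python dicts/lists.
resp = requests.get(
    "http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=0")
comments = json.loads(resp.text)["cmts"]

# Douban-style HTML would instead need a parser plus hand-written selectors
# for every field, e.g. with BeautifulSoup (selector is illustrative):
#   soup = BeautifulSoup(resp.text, "html.parser")
#   texts = [p.get_text() for p in soup.select(".comment p")]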


Obtaining the Maoyan interface data


As a programmer who spends a lot of time at home, I am no stranger to packet capture. Open the mobile version of the page in Chrome DevTools and watch the network requests, and the interface is easy to spot. The endpoint is:

http://m.maoyan.com/mmdb/comments/movie/1208282.json?_v_=yes&offset=15

In Python, we can easily send the request with the Requests library and get the result back:

import requests

def getMoveinfo(url):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
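The offset parameter pages through the comments 15 at a time, as in the URL above. A minimal crawl loop under that assumption, stopping at the first empty response (the page limit and sleep interval are my own choices, not from the original article):

import time

def crawlComments(movie_id, max_pages=100):
    for page in range(max_pages):
        url = ("http://m.maoyan.com/mmdb/comments/movie/%d.json?_v_=yes&offset=%d"
               % (movie_id, page * 15))
        html = getMoveinfo(url)
        if not html:
            break
        yield html
        time.sleep(1)  # throttle requests so we don't hammer the interface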

Sending the request above gives us this interface's response. The payload carries a lot of information, much of which we do not need. Let's first take a quick look at the returned data:

{ "cmts":[ { "approve":0, "approved":false, "assistAwardInfo":{ "avatar":"", "celebrityId":0, "celebrityName":"", "Rank" : 0, "title" : ""}," authInfo ":" ", "cityName" : "guiyang", "content" : "I have to be very, borrow money to see a movie. ", "filmView":false, "id":1045570589, "isMajor":false, "juryLevel":0, "majorType":0, "movieId":1208282, "nick":"nick", "nickName":"nickName", "oppose":0, "pro":false, "reply":0, "score":5, "spoiler":0, "startTime":"2018-11-22 23:52:58", "SupportComment" : true, "supportLike" : true, "sureViewed" : 1, "tagList" : {" fixed ": [{" id" : 1, "name" : "praise"}, {" id ": 4, "Name" : "ticket"}}], "time", "the 2018-11-22 23:52", "userId" : 1871534544, "a userLevel" : 2, "videoDuration" : 0, "vipInfo" : "", "vipType":0 } ] }Copy the code

With so much data, we are only interested in the following fields:

nickName, cityName, content, startTime, score

Next, we parse the fields we need out of the JSON data:

import json

def parseInfo(data):
    data = json.loads(data)['cmts']
    for item in data:
        yield {
            'date': item['startTime'],
            'nickname': item['nickName'],
            'city': item['cityName'],
            'rate': item['score'],
            'comment': item['content']
        }
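Wiring the two functions together, using the hypothetical crawlComments loop sketched earlier:

for html in crawlComments(1208282):
    for item in parseInfo(html):
        print(item['city'], item['rate'], item['comment'])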

Once we have the data, we can start analyzing it. To avoid requesting data from Maoyan over and over, though, the data should be stored first. Here I use SQLite3 and put everything into a database, which makes the subsequent processing more convenient. The code for storing the data is as follows:

import sqlite3

def saveCommentInfo(moveId, nikename, comment, rate, city, start_time):
    conn = sqlite3.connect('unknow_name.db')
    conn.text_factory = str
    cursor = conn.cursor()
    ins = "insert into comments values (?,?,?,?,?,?)"
    v = (moveId, nikename, comment, rate, city, start_time)
    cursor.execute(ins, v)
    cursor.close()
    conn.commit()
    conn.close()
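The insert statement assumes a comments table already exists. The original article never shows its schema, but judging from the six bound values, something like this sketch would do:

import sqlite3

conn = sqlite3.connect('unknow_name.db')
conn.execute("""
CREATE TABLE IF NOT EXISTS comments (
    movie_id   INTEGER,
    nickname   TEXT,
    comment    TEXT,
    rate       INTEGER,
    city       TEXT,
    start_time TEXT
)
""")
conn.commit()
conn.close()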


Data processing

Since the data is stored in a database, we can query for the results we want directly with SQL, for example the top five cities by number of comments:

SELECT city, COUNT(*) AS rate_count
FROM comments
GROUP BY city
ORDER BY rate_count DESC
LIMIT 5
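To run it from Python, the same sqlite3 module works, for example:

import sqlite3

conn = sqlite3.connect('unknow_name.db')
cursor = conn.cursor()
cursor.execute(
    "SELECT city, COUNT(*) AS rate_count FROM comments "
    "GROUP BY city ORDER BY rate_count DESC LIMIT 5")
for city, rate_count in cursor.fetchall():
    print(city, rate_count)
conn.close()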

The results are as follows:


From the data above, we can see that the most comments come from Beijing.

Not only that: you can write more SQL to query whatever results you want, such as how many people gave each score and what proportion each score represents. Interested readers can try these queries themselves; it really is that simple.
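For instance, a sketch of the per-score query just described, counting each score and its share of all comments:

SELECT rate,
       COUNT(*) AS num,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM comments), 1) AS percent
FROM comments
GROUP BY rate
ORDER BY rate DESC;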

In order to better display the data, we use the Pyecharts library for data visualization.

Based on the data obtained from Maoyan, we can plot the comments directly onto a map of China by geographic location with Pyecharts:

import pandas as pd
from pyecharts import Geo

# f: path to the CSV dump of the comments (defined elsewhere)
data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)
data_map = [(city_com['city'][i], city_com['count'][i])
            for i in range(0, city_com.shape[0])]

geo = Geo("Geo location analysis", title_pos="center", width=1200, height=800)
while True:
    try:
        attr, val = geo.cast(data_map)
        geo.add("", attr, val, visual_range=[0, 300],
                visual_text_color="#fff", symbol_size=10,
                is_visualmap=True, maptype='china')
    except ValueError as e:
        # Drop the city Pyecharts has no coordinates for, then retry (Python 2 style)
        e = e.message.split("No coordinate is specified for ")[1]
        data_map = filter(lambda item: item[0] != e, data_map)
    else:
        break
geo.render('geo_city_location.html')

Note: Pyecharts' built-in map data has no coordinates for some of the city names in the Maoyan data. So in the code above, whenever geo.add raises an error for such a city, we simply delete that entry and retry, which filters out a fair amount of data.

With Python it is that simple to generate the following map:


From the visualized data we can see that the people who both watch and review the film are concentrated in eastern China, with Beijing, Shanghai, Chengdu, and Shenzhen contributing the most. The chart shows a lot of data, but it is not very intuitive: if we want the distribution per province/municipality, the data needs further processing.

However, in the data obtained from Maoyan, the city field can also contain county-level names, so these need to be mapped to their corresponding provinces or municipalities first, and then the comment counts within the same province summed up to get the final result.

import json
import pandas as pd

def getRealName(name, jsonObj):
    # citys.json is assumed to map full place names to their province/municipality
    for item in jsonObj:
        if item.startswith(name):
            return jsonObj[item]
    return name

def realKeys(name):
    # Strip administrative suffixes so the names match the map's labels
    return name.replace(u"省", "").replace(u"市", "") \
               .replace(u"回族自治区", "").replace(u"维吾尔自治区", "") \
               .replace(u"壮族自治区", "").replace(u"自治区", "")

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
city = data.groupby(['city'])
city_com = city['rate'].agg(['mean', 'count'])
city_com.reset_index(inplace=True)

fo = open("citys.json", 'r')
citys_info = fo.readlines()
citysJson = json.loads(str(citys_info[0]))

data_map_all = [(getRealName(city_com['city'][i], citysJson), city_com['count'][i])
                for i in range(0, city_com.shape[0])]
data_map_list = {}
for item in data_map_all:
    if item[0] in data_map_list:
        data_map_list[item[0]] += item[1]
    else:
        data_map_list[item[0]] = item[1]
data_map = [(realKeys(key), data_map_list[key]) for key in data_map_list.keys()]
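A quick sanity check of the suffix stripping, given the suffix list reconstructed above (outputs shown in the comments):

print(realKeys(u"北京市"))          # -> 北京
print(realKeys(u"广西壮族自治区"))  # -> 广西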

After the above data processing, we use the Map type provided by Pyecharts to generate a map broken down by province/municipality:

from pyecharts import Map

def generateMap(data_map):
    map = Map(width=1200, height=800, title_pos="center")
    while True:
        try:
            attr, val = map.cast(data_map)
            map.add("", attr, val, visual_range=[0, 800],
                    visual_text_color="#fff", symbol_size=5,
                    is_visualmap=True, maptype='china',
                    is_map_symbol_show=False, is_label_show=True,
                    is_roam=False)
        except ValueError as e:
            e = e.message.split("No coordinate is specified for ")[1]
            data_map = filter(lambda item: item[0] != e, data_map)
        else:
            break
    map.render('city_rate_count.html')


Of course, we can also visualize the number of people behind each score, which suits a bar chart:

import pandas as pd
from pyecharts import Bar

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
# Group by score
rateData = data.groupby(['rate'])
rateDataCount = rateData["date"].agg(["count"])
rateDataCount.reset_index(inplace=True)
count = rateDataCount.shape[0] - 1
attr = [rateDataCount["rate"][count - i] for i in range(0, rateDataCount.shape[0])]
v1 = [rateDataCount["count"][count - i] for i in range(0, rateDataCount.shape[0])]
bar = Bar("Score distribution")
bar.add("Number of ratings", attr, v1, is_stack=True, xaxis_rotate=30,
        yaxis_min=4.2, xaxis_interval=0, is_splitline_show=True)
bar.render("html/rate_count.html")
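The five-star share quoted below can be checked against the same DataFrame with a small sketch:

five_star = (data['rate'] == 5).sum()
print("5-star share: %.1f%%" % (five_star * 100.0 / len(data)))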

As the chart below shows, in Maoyan's data five-star reviews account for more than 50 percent, far above the 34.8 percent five-star share on Douban.


From the audience distribution and score data above, we can see that the film is very well liked. Earlier we pulled the audience comments themselves from Maoyan; now let's segment those comments with jieba and build a word cloud with Wordcloud, to see what the audience thinks of The Unknown:

import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

data = pd.read_csv(f, sep='{', header=None, encoding='utf-8',
                   names=['date', 'nickname', 'city', 'rate', 'comment'])
comment = jieba.cut(str(data['comment']), cut_all=False)
wl_space_split = " ".join(comment)
backgroudImage = np.array(Image.open(r"./unknow_3.png"))
stopword = STOPWORDS.copy()
wc = WordCloud(width=1920, height=1080, background_color='white',
    mask=backgroudImage,
    font_path="./Deng.ttf",
    stopwords=stopword, max_font_size=400,
    random_state=50)
wc.generate_from_text(wl_space_split)
plt.imshow(wc)
plt.axis("off")
wc.to_file('unknow_word_cloud.png')

Output:



Besides, in the word cloud we can clearly see that "little people", "good", "comedy", and "acting" stand out, and these four keywords hint at why a low-budget absurdist comedy about nobodies earned the dark-horse label and so much praise. As one Douban comment put it: "It's not love, but it's better than love. Just the right amount of loss, just the right amount of darkness, just the right amount of warmth. People say China has no 'healing' films; now it does. Watching it, we laughed and we cried. Realistic films about people at the bottom are not rare, but this is the most sincere, most honest, 'sad but not wounding' one I have seen in recent years. You will truly feel what 'real acting' means, and see how Chen Jianbin, Ren Suxi, Zhang Yu, Wang Yanhui and the rest of this top-tier cast perform. I sincerely hope this is the beginning of a spring for good actors and good films."

Author | Luo Zhaocheng


Coordinating editor | Tang Xiqu


Original article: Why Nobodies Are a Dark Horse