Author: Suke, learning Python crawling and data analysis from scratch

Blog: www.makcyun.top

Abstract: Use Scrapy to crawl 70,000+ apps from the Wandoujia ("Pea Pod") app store and run an exploratory analysis of the data.

A note up front: if you are not interested in the data-scraping part, you can skip straight down to the data analysis section.

1 Analysis Background

Why another article about scraping apps?

Because I love messing around with apps, haha. Of course, mainly because of the following points:

First, the pages crawled before were too simple

When scraping the Kuan site, a simple for loop over a few hundred pages was enough to grab everything. Reality is rarely that easy: sometimes the content to be scraped is huge, for example an entire app store. So, to sharpen the crawling skills, this article takes on the Wandoujia ("Pea Pod") website.

The goal: crawl the apps in every category on the site and download their icons, about 70,000 apps in total, an order of magnitude more than Kuan.

Second, more practice with the powerful Scrapy framework

In this article, we will use Scrapy again, this time adding a random User-Agent, proxy IPs, and image download settings.

Third, compare the Kuan and Wandoujia sites

Many people download apps from Wandoujia, while I use Kuan more, so I would like to compare the similarities and differences between the two sites.

Without further ado, let's walk through the crawling workflow.

▌ Analyzing objectives

First, let's see what the target pages look like.

Apps on the site are divided into 14 main categories, such as "Video playing" and "System tools", and each category is further divided into several subcategories. For example, "Video playing" includes "Video", "Live streaming", and so on.

Clicking "Video" opens the second-level subcategory page, which shows part of the information for each App: icon, name, number of installs, size, comments, and so on.

This kind of page was analyzed in the previous article: it loads via Ajax GET requests whose parameters are easy to construct, but the total number of pages is not known in advance, so for and while loops were used there to grab the data page by page.

From there, we could go one level deeper to each App's detail page, which shows downloads, praise rate, number of comments and other parameters. Scraping it would work much like the second-level page, but to reduce the load on the site, the App detail pages are not crawled here.

So this is a multi-level, categorized crawl: for each main category, grab the data of every subcategory under it in turn.

Once this idea is clear, many other sites can be crawled the same way; the much-scraped "Douban Movies", for example, has a similar structure.

▌ Analyzing content

After the data is scraped, this article performs a simple exploratory analysis of the categorized data, covering the following aspects:

  • Total ranking of most/least downloaded apps
  • Ranking of most/least downloaded App categories/subcategories
  • App download interval distribution
  • How many apps have the same name
  • Compare it with the Kuan App

▌ Analysis Tools

  • Python
  • Scrapy
  • MongoDB
  • Pyecharts
  • Matplotlib

2 Data Capture

▌ Website Analysis

We have already taken a first look at the site, and the general plan has two steps: first extract the URLs of all subcategories, then grab the App information under each of those URLs.

As shown, a subcategory URL is made up of two numbers: the first is the main-category code, the second is the subcategory code. Once we have these two numbers, all the App information under that subcategory can be grabbed. So how do we obtain the two codes?

Going back to the category page and inspecting the elements, we can see that each category is wrapped in an li node, and the subcategory URL sits in the href attribute of a child a node. There are 14 main categories and 88 subcategories in total.

The idea is now clear: use CSS selectors to extract all the subcategory URLs, then grab the required information from each.

Also note that the first page of each subcategory is loaded statically, while page 2 onward is loaded dynamically via Ajax, so the two kinds of URLs have to be constructed and parsed separately.
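To make the two URL forms concrete, here is a small illustrative sketch; the codes 5029 and 716 are just format examples (they happen to match the "Video play" category and "video" subcategory extracted later):

# Illustration only: the two URL forms described above
first_page_url = 'https://www.wandoujia.com/category/5029_716'      # static first page of a subcategory
ajax_page_url = ('https://www.wandoujia.com/wdjweb/api/category/more?'
                 'catId=5029&subCatId=716&page=2')                  # Ajax request for page 2 and beyond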

▌ Scrapy grab

There are two things to crawl: the App data (name, installs, size, comments, and so on, as mentioned above), and the icon of each App, which will be downloaded into categorized folders.

Since the site has some anti-crawling measures, we need to add random UA and proxy IP.

The random UA is handled by the scrapy-fake-useragent library, and the proxy IPs come from the Abuyun dynamic proxy service.

Now, let’s go straight to the code.

items.py

import scrapy

class WandoujiaItem(scrapy.Item):
    cate_name = scrapy.Field()        # main category name
    child_cate_name = scrapy.Field()  # subcategory name
    app_name = scrapy.Field()         # App name
    install = scrapy.Field()          # number of installs
    volume = scrapy.Field()           # size
    comment = scrapy.Field()          # number of comments
    icon_url = scrapy.Field()         # icon URL

middlewares.py

The middleware is used to set the proxy IP address.

import base64
import logging

proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "your username"
proxyPass = "your password"

proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class AbuyunProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
        logging.debug('Using Proxy:%s' % proxyServer)

pipelines.py

This file is used to store data to MongoDB and download ICONS to the categorized folder.

Store to MongoDB:

import pymongo

# MongoDB storage
class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__  # use the item class name as the collection name
        # self.db[name].insert(dict(item))
        self.db[name].update_one(dict(item), {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()

Download icons into category folders:

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

# Download icons into folders by category
class ImagedownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if item['icon_url']:
            yield scrapy.Request(item['icon_url'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        name = request.meta['item']['app_name']
        cate_name = request.meta['item']['cate_name']
        child_cate_name = request.meta['item']['child_cate_name']

        path1 = r'/wandoujia/%s/%s/' % (cate_name, child_cate_name)
        path = r'{}{}.{}'.format(path1, name, 'jpg')
        return path

    def item_completed(self, results, item, info):
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            raise DropItem('Item contains no images')
        return item

settings.py

BOT_NAME = 'wandoujia'
SPIDER_MODULES = ['wandoujia.spiders']
NEWSPIDER_MODULE = 'wandoujia.spiders'

MONGO_URL = 'localhost'
MONGO_DB = 'wandoujia'

# Do not follow robots.txt rules
ROBOTSTXT_OBEY = False

# The Abuyun proxy plan allows only 5 requests per second, so set a 0.2 s delay per request
DOWNLOAD_DELAY = 0.2

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 100,  # random UA
    'wandoujia.middlewares.AbuyunProxyMiddleware': 200,  # Abuyun proxy
}

ITEM_PIPELINES = {
    'wandoujia.pipelines.MongoPipeline': 300,
    'wandoujia.pipelines.ImagedownloadPipeline': 400,
}

# Do not deduplicate URLs
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'

wandou.py

The key parts are listed here:

def __init__(self):
    # category overview page URL
    self.cate_url = 'https://www.wandoujia.com/category/app'
    # subcategory first-page URL
    self.url = 'https://www.wandoujia.com/category/'
    # subcategory Ajax request URL (page 2 onward)
    self.ajax_url = 'https://www.wandoujia.com/wdjweb/api/category/more?'
    # instantiate the category extraction class
    self.wandou_category = Get_category()

def start_requests(self):
    yield scrapy.Request(self.cate_url, callback=self.get_category)

def get_category(self, response):
    cate_content = self.wandou_category.parse_category(response)
    # ...

Here we define several URLs: the category overview page, the subcategory first page, and the subcategory Ajax page used from page 2 on. We also instantiate a class, Get_category(), dedicated to extracting all subcategory URLs; it is expanded below.

The program starts at start_requests(), which requests the category page; the response is handled by get_category(), which extracts all the URLs with the parse_category() method of the Get_category() class, as follows:

class Get_category():
    def parse_category(self, response):
        category = response.css('.parent-cate')
        data = [{
            'cate_name': item.css('.cate-link::text').extract_first(),
            'cate_code': self.get_category_code(item),
            'child_cate_codes': self.get_child_category(item),
        } for item in category]
        return data

    # Get the numeric code of a main category
    def get_category_code(self, item):
        cate_url = item.css('.cate-link::attr("href")').extract_first()
        pattern = re.compile(r'.*/(\d+)')  # extract the main category code
        cate_code = re.search(pattern, cate_url)
        return cate_code.group(1)

    # Get all subcategory names and codes
    def get_child_category(self, item):
        child_cate = item.css('.child-cate a')
        child_cate_url = [{
            'child_cate_name': child.css('::text').extract_first(),
            'child_cate_code': self.get_child_category_code(child)
        } for child in child_cate]
        return child_cate_url

    # Regex to extract the subcategory code
    def get_child_category_code(self, child):
        child_cate_url = child.css('::attr("href")').extract_first()
        pattern = re.compile(r'.*_(\d+)')  # extract the subcategory code
        child_cate_code = re.search(pattern, child_cate_url)
        return child_cate_code.group(1)

Apart from the category name cate_name, which can be extracted directly, get_category_code() and get_child_category() extract the main-category codes and the subcategory names and codes. The extraction uses CSS selectors plus regular expressions and is fairly simple.
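As a standalone illustration of the regex step, here is a tiny sketch; the href below is hypothetical but follows the same format as the subcategory links on the page:

import re

child_cate_url = '/category/5029_716'               # hypothetical subcategory href
pattern = re.compile(r'.*_(\d+)')                   # same pattern as get_child_category_code()
print(re.search(pattern, child_cate_url).group(1))  # -> '716'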

The extracted category names and codes look like this. With these codes we can construct the request URLs and start extracting the App information under each subcategory.

{'cate_name': 'Video play', 'cate_code': '5029', 'child_cate_codes': [
    {'child_cate_name': 'video', 'child_cate_code': '716'},
    {'child_cate_name': 'live', 'child_cate_code': '1006'},
    ...
]},
{'cate_name': 'System Tools', 'cate_code': '5018', 'child_cate_codes': [
    {'child_cate_name': 'WiFi', 'child_cate_code': '895'},
    {'child_cate_name': 'browser', 'child_cate_code': '599'},
    ...
]},
...

Next, continuing in get_category(), we construct the requests that will extract each App's information:

def get_category(self, response):
    cate_content = self.wandou_category.parse_category(response)
    # ...
    for item in cate_content:
        child_cate = item['child_cate_codes']
        for cate in child_cate:
            cate_code = item['cate_code']
            cate_name = item['cate_name']
            child_cate_code = cate['child_cate_code']
            child_cate_name = cate['child_cate_name']

            page = 1  # starting page number
            if page == 1:
                # construct the first-page URL
                category_url = '{}{}_{}'.format(self.url, cate_code, child_cate_code)
            else:
                # construct the Ajax URL for page 2 onward
                params = {
                    'catId': cate_code,           # main category
                    'subCatId': child_cate_code,  # subcategory
                    'page': page,
                }
                category_url = self.ajax_url + urlencode(params)
            dict = {'page': page, 'cate_name': cate_name, 'cate_code': cate_code,
                    'child_cate_name': child_cate_name, 'child_cate_code': child_cate_code}
            yield scrapy.Request(category_url, callback=self.parse, meta=dict)

Here, every category name and code is taken in turn to construct the request URL.

Since the first-page URL differs from the URLs used from page 2 on, the if statement constructs them separately. The URL is then requested and handed to self.parse() for parsing, with the meta parameter passing the related values along. The parse() method itself follows the short example below.
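As a quick reference, here is a minimal standalone example of how urlencode() assembles the Ajax URL; the parameter values are placeholders in the same format as the real codes:

from urllib.parse import urlencode

ajax_base = 'https://www.wandoujia.com/wdjweb/api/category/more?'
params = {'catId': '5029', 'subCatId': '716', 'page': 2}
print(ajax_base + urlencode(params))
# https://www.wandoujia.com/wdjweb/api/category/more?catId=5029&subCatId=716&page=2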

def parse(self, response):
    if len(response.body) >= 100:  # the page body length is 87 when there is no content
        page = response.meta['page']
        cate_name = response.meta['cate_name']
        cate_code = response.meta['cate_code']
        child_cate_name = response.meta['child_cate_name']
        child_cate_code = response.meta['child_cate_code']

        if page == 1:
            contents = response
        else:
            jsonresponse = json.loads(response.body_as_unicode())
            contents = jsonresponse['data']['content']
            # the response is JSON whose content is an HTML string; .css() cannot parse
            # plain text directly, so convert it to a Selector first
            contents = scrapy.Selector(text=contents, type="html")

        contents = contents.css('.card')
        for content in contents:
            # num += 1
            item = WandoujiaItem()
            item['cate_name'] = cate_name
            item['child_cate_name'] = child_cate_name
            item['app_name'] = self.clean_name(content.css('.name::text').extract_first())
            item['install'] = content.css('.install-count::text').extract_first()
            item['volume'] = content.css('.meta span:last-child::text').extract_first()
            item['comment'] = content.css('.comment::text').extract_first().strip()
            item['icon_url'] = self.get_icon_url(content.css('.icon-wrap a img'), page)
            yield item

        # request the next page recursively
        page += 1
        params = {
            'catId': cate_code,           # main category
            'subCatId': child_cate_code,  # subcategory
            'page': page,
        }
        ajax_url = self.ajax_url + urlencode(params)
        dict = {'page': page, 'cate_name': cate_name, 'cate_code': cate_code,
                'child_cate_name': child_cate_name, 'child_cate_code': child_cate_code}
        yield scrapy.Request(ajax_url, callback=self.parse, meta=dict)

Finally, parse() extracts the required fields such as the App name and install count. After each page is parsed, the page number is incremented and parse() is called again, until the empty-page check at the top (a body shorter than 100 bytes) ends the recursion for that subcategory.

After a few hours, all the App information was grabbed: 73,755 records and 72,150 icons. The two numbers differ because some apps have information but no icon.

Icon Download:

A simple exploratory analysis of the extracted information will be performed below.

3 Data Analysis

▌ General situation

First, let's look at install counts. With more than 70,000 apps, it is natural to ask which apps are installed the most and which the least.

The code implementation is as follows:

import matplotlib.pyplot as plt

plt.style.use('ggplot')
colors = '#6D6D6D'      # font color
colorline = '#63AB47'   # red: #CC2824, Wandoujia green: #63AB47
fontsize_title = 20
fontsize_text = 10

# Top 10 by total installs
def analysis_maxmin(data):
    data_max = (data[:10]).sort_values(by='install_count')
    data_max['install_count'] = (data_max['install_count'] / 100000000).round(1)
    data_max.plot.barh(x='app_name', y='install_count', color=colorline)
    for y, x in enumerate(list(data_max['install_count'])):
        plt.text(x + 0.1, y - 0.08, '%s' % round(x, 1), ha='center', color=colors)

    plt.title('Top 10 most installed apps', color=colors)
    plt.xlabel('Installs (hundreds of millions)')
    plt.ylabel('App')
    plt.tight_layout()
    # plt.savefig('most installed apps.png', dpi=200)
    plt.show()

There are two "surprises":

  • Top of the list is a mobile management app

    The number one App on Wandoujia surprised me. First, I wondered whether everyone really loves cleaning their phone, or is just afraid of viruses; my own phone has been running "naked" for years. Second, the top spot is not taken by one of Tencent's other products, such as WeChat or QQ.

  • Looking down the list, the names you expect are missing, and names you don't expect are there

    The top 10 includes apps many people have barely heard of, such as Shuqi Novels and Inke, while the "national apps" WeChat and Alipay are nowhere on the list.

Curious, I opened the detail pages of "Tencent Mobile Manager" and "WeChat":

Tencent Mobile Manager downloads and installs:

WeChat downloads and installs:

What is going on here??

Tencent Mobile Manager shows 300 million downloads and the same number of installs, while WeChat shows 2 billion downloads but only about 10 million installs. Comparing the two sets of figures suggests one of two problems:

  • Either Tencent Mobile Manager does not really have that many installs,
  • or WeChat's install count is badly understated.

Either way, it points to the same thing: the site is not very rigorous about these numbers.

To verify this, I compared the installs and downloads of the top 10 apps: many of them report install counts equal to their download counts, which suggests the real install numbers are nowhere near that high, and that the top of the list is heavily inflated.

Is this really what all that crawling effort was for?

Taking a look at the fewest installed apps, here are the 10 fewest:

After a quick glance, I was even more surprised:

“QQ Music” is actually the last one, only 3 installs!

Is this the same product as QQ Music, which has just been listed with a market value of 100 billion yuan?

Double check:

Yes, it really does say 3 installs!

How low can this go? With install numbers like this, can Tencent still claim to be "doing music with heart"?

To be honest, at this point I hardly wanted to continue the analysis, afraid of turning up more surprises, but after crawling for so long, let's press on.

Having looked at both ends of the list, let's look at the overall distribution of install counts, removing the heavily inflated top 10 apps first.

Surprisingly, 67,195 apps, about 94% of the total, have fewer than 10,000 installs!

If the site's data were all true, then the Mobile Manager at the top of the list alone would have nearly as many installs as those 60,000-plus apps combined!

For most App developers there is only one thing to say: reality is cruel. An App built with painstaking effort has as much as a 95% chance of ending up with no more than 10,000 users.

The code implementation is as follows:

import pandas as pd
from pyecharts import Bar

def analysis_distribution(data):
    data = data.loc[10:, :]
    data['install_count'] = data['install_count'].apply(lambda x: x / 10000)
    bins = [0, 1, 10, 100, 1000, 10000]
    group_names = ['under 10 thousand', '10-100 thousand', '100 thousand-1 million',
                   '1-10 million', '10-100 million']
    cats = pd.cut(data['install_count'], bins, labels=group_names)
    cats = pd.value_counts(cats)
    bar = Bar('App install distribution', 'Up to 94% of apps have fewer than 10,000 installs')
    bar.use_theme('macarons')
    bar.add(
        'Number of apps',
        list(cats.index),
        list(cats.values),
        is_label_show=True,
        xaxis_interval=0,
        is_splitline_show=False,
    )
    bar.render(path='App install distribution.png', pixel_ratio=1)

▌ Classification

Now let's look at the apps by category, ranking categories by the number of apps rather than by installs, to avoid the noise in the install figures.

Among the 14 main categories, the number of apps does not vary dramatically: the largest, "Life & Leisure", has a little more than twice as many apps as "Photography & Imaging".
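For reference, the per-category counts can be computed with a simple groupby. This is a minimal sketch, assuming the scraped records were stored by the MongoPipeline above (the database name comes from settings.py and the collection name is the item class name):

import pandas as pd
import pymongo

# Sketch: load the scraped records and count apps per main category
client = pymongo.MongoClient('localhost')
df = pd.DataFrame(list(client['wandoujia']['WandoujiaItem'].find()))
cate_counts = df.groupby('cate_name')['app_name'].count().sort_values(ascending=False)
print(cate_counts)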

Next, drilling down into the 88 subcategories, we pick out the 10 with the most apps and the 10 with the fewest:

Two interesting things can be observed:

  • The ‘radio’ category had the most apps, with more than 1,300 available

    This is surprising: radio seems like a thoroughly outdated medium, yet so many people are still building apps for it.

  • The number of apps varies hugely across subcategories

    The largest subcategory, "Radio", has nearly 20 times as many apps as the smallest, "Live wallpaper". If I were an App developer, I would try a less crowded niche, such as vocabulary apps or children's encyclopedias.

After reading the overall situation and classification, I suddenly thought of a question: Are there any apps with the same name?

To my surprise, there are as many as 40 apps all named "One-click lock screen". Is it really so hard to think of another name? Many phones can now be locked by tapping the screen, which is more convenient than a one-click lock app anyway.
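The same-name check itself is essentially a value_counts() one-liner; a small sketch, assuming the DataFrame df from the category-count example above:

# Sketch: count duplicated App names (df as loaded in the earlier example)
name_counts = df['app_name'].value_counts()
print(name_counts[name_counts > 1].head(10))  # names shared by more than one App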

Next, let’s simply compare the apps of Wandoujia and Kuan.

▌ Compare Kuan

The most obvious difference between the two is that Wandoujia has an overwhelming lead in app count, roughly ten times that of Kuan, so a natural question is:

Does Wandoujia include all apps on Kuan?

If the answer were "whatever you have, I have too, plus more", Kuan would have no advantage left. A count shows that Wandoujia includes only 3,018 of Kuan's apps, roughly half, and is missing the other half.

Part of this is no doubt because the same App can appear under different names on the two platforms, but there is more reason to believe that many of Kuan's niche, high-quality apps are exclusive to it and simply absent from Wandoujia.
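For reference, the overlap can be computed with isin(). This is a rough sketch, not the exact code used; the names data (Wandoujia apps), data2 (Kuan apps) and data3 (the overlap) are assumptions chosen to line up with the pie-chart snippet below:

# Sketch: data holds the Wandoujia apps, data2 the Kuan apps (both with an 'app_name' column)
data3 = data2[data2['app_name'].isin(data['app_name'])]
print(data3.shape[0])  # number of Kuan apps also found on Wandoujia (3,018 in this analysis)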

The pie chart is drawn as follows:

include = data3.shape[0]
notinclude = data2.shape[0] - data3.shape[0]
sizes = [include, notinclude]
labels = [u'included', u'not included']
explode = [0, 0.05]
plt.pie(
    sizes,
    autopct='%.1f%%',
    labels=labels,
    colors=[colorline, '#7FC161'],  # Wandoujia green
    shadow=False,
    startangle=90,
    explode=explode,
    textprops={'fontsize': 14, 'color': colors}
)
plt.title('Wandoujia includes only about half of the apps on Kuan', color=colorline, fontsize=16)
plt.axis('equal')
plt.axis('off')
plt.tight_layout()
plt.savefig('included vs not included.png', dpi=200)
plt.show()

Next, let’s take a look at the number of downloads on both platforms for the included apps:

As you can see, there is still a significant gap between the number of App downloads on the two platforms.

Finally, I looked at which apps Wandoujia does not carry:

Many power-user favourites are missing, such as RE File Manager, Greenify and others. That concludes the comparison between Wandoujia and Kuan, and this walkthrough of using Scrapy to crawl and analyze multi-level pages.
