

Qiu He, born in the 1980s, lives in Shanghai and works on operations and maintenance development at an Internet company. He is interested in technology, enjoys tinkering, and likes to keep up with new technologies.

Blog: bestscreenshot.com

The author of this article has joined the Python Chinese Community Columnist Program

Preface

Anyone who uses WordPress has installed at least a few plugins to enrich and extend its functionality. Plugins and themes for the WordPress platform have created a unique economic ecosystem and developer community that supports numerous WordPress-related companies and developers. Powerful WordPress plugins keep emerging, and some can even turn WordPress into a fully functional site on their own: job boards, classified-ad sites, e-commerce stores, review sites, training sites, and so on. It is genuinely impressive.

Recently I have become obsessed with WordPress again, as if reunited with a first love after many years. These days I took a fancy to crawling all those dazzling plugins on WordPress.org and seeing whether I could dig out something interesting from the data.

The overall approach

A total of 54,520 plugins are listed at wordpress.org/plugins/. The official site used to be browsable by a variety of categories, but now it only displays a few of them (featured, favorites, popular); everything else can only be reached through search.

So the first step is finding a list of all WordPress plugins. After searching around, I found that the WordPress SVN repository has the full list at plugins.svn.wordpress.org/ (the page is fairly large, over 5 MB). It is more complete than the official site, with more than 70,000 plugins in total. Getting a complete list is therefore easy.

The next step is to get all kinds of information about each plugin: author, download count, ratings, and so on. Where can this be obtained? The crudest way would be to crawl every plugin's page from the list above and extract the data, which is what a scraper usually does. However, the WordPress.org API provides a powerful and convenient interface through which developers can access almost all of the information about themes, plugins, news, and more on WordPress.org, with support for various parameters and queries. Note that this is different from WordPress's REST API; the difference is roughly that between an Apple.com API and the iOS API (although Apple.com doesn't actually have any APIs...).

Since we need per-plugin data, we can use the plugin information API: https://api.wordpress.org/plugins/info/1.0/{slug}.json, where slug is each plugin's unique identifier, already available from the SVN list. This API returns the plugin's information in JSON format, along the following lines.
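For example, here is a rough sketch of peeking at what the API returns for a single plugin, using requests; the fields printed are only a few of the many in the response, and the exact field names should be checked against a real response:

import requests

# Fetch the info for one plugin by its slug, e.g. "akismet"
url = "https://api.wordpress.org/plugins/info/1.0/akismet.json"
info = requests.get(url).json()

# A few of the returned fields; the full JSON also includes ratings,
# sections, versions, download_link and more
print(info["name"], info["version"])
print("author:", info["author"])
print("downloads:", info.get("downloaded"))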

Once you have the list and know the return format, the next step is simply to crawl it all: either iterate through the URLs with Python's famous Requests library, or use Python's crawler framework, Scrapy. As for storing the crawled data, I originally intended to put it in MongoDB, but hit a pitfall: the "versions" object in the JSON returned by the API has keys containing a dot, such as "0.1", which MongoDB cannot store directly. Another Python library, jsonlines, can store a stream of JSON objects as a .jsonl file (one object per line) and is very fast to read and write. The file with all the crawled data ended up being 341 MB...
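A minimal sketch of that fetch-and-store loop, assuming the API URLs have already been saved one per line to all_plugins_urls.txt (as done later in this article); the output file name is my own choice:

import requests
import jsonlines

# Read the prepared list of API URLs, one per line
with open('all_plugins_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Append each plugin's raw JSON as one line of a .jsonl file;
# one object per line means keys like "0.1" inside "versions"
# cause no trouble here
with jsonlines.open('all_plugins.jsonl', mode='a') as writer:
    for url in urls:
        resp = requests.get(url)
        if resp.ok:
            writer.write(resp.json())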

In the end, you can do some interesting analysis with Python's data analysis tools: pandas, NumPy, seaborn, and so on. As the returned information shows, there are many dimensions to analyze: which authors have developed the most plugins, which plugins are downloaded the most, which categories are most common, which countries have the most developers, how the number of plugins has grown year by year, and so on. You could even go further, download the ZIP files of all the plugins, and do some deeper code analysis with AI. It is quite fun to think about. The goal of this article is only to offer an approach and a method, in the hope of sparking better ideas from others.
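As a small taste of that analysis step, a sketch with pandas; the .jsonl file name is my own, and "author" and "downloaded" are fields from the API response described above:

import pandas as pd

# Load the crawled plugin records, one JSON object per line
df = pd.read_json('all_plugins.jsonl', lines=True)

# Which authors have published the most plugins?
print(df['author'].value_counts().head(10))

# Which plugins have been downloaded the most?
print(df.sort_values('downloaded', ascending=False)[['name', 'downloaded']].head(10))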

Let’s get into the world of code

Crawl data

Preparation

To crawl data, the first step is to decide on the crawler's entry page, that is, where to start crawling; from the entry page the crawler finds the next URLs, and the search-crawl-search cycle repeats until everything is done. Analyzing the entry pages can be done inside Scrapy, but if you already know every URL you need in advance, you can simply hand the whole list to Scrapy and let it crawl through it.

To keep things clear, the preparation is done as a separate step: first get all the URLs to crawl ready, so they can be used directly later. As mentioned above, the list of all WordPress plugin names can be found at plugins.svn.wordpress.org/. It is a very simple static web page: one huge ul list in which each li is a plugin name:

 <ul>
  <li><a href="0-delay-late-caching-for-feeds/">0-delay-late-caching-for-feeds/</a></li>
  <li><a href="0-errors/">0-errors/</a></li>
  <li><a href="001-prime-strategy-translate-accelerator/">001-prime-strategy-translate-accelerator/</a></li>
  <li><a href="002-ps-custom-post-type/">002-ps-custom-post-type/</a></li>
  <li><a href="011-ps-custom-taxonomy/">011-ps-custom-taxonomy/</a></li>
  <li><a href="012-ps-multi-languages/">012-ps-multi-languages/</a></li>

The href here is the plugin's slug, the unique identifier wordpress.org uses for each plugin. Parsing this HTML is a breeze for Python, for example with the common BeautifulSoup or lxml, but I decided to try a newer library: requests-html, "HTML Parsing for Humans", another project by Kenneth Reitz, the author of the Requests library, which keeps HTML parsing just as effortless.

Once the slugs are extracted, combine them with the API URL format and write everything to a file:

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('http://plugins.svn.wordpress.org/')

# Every link on the page is a relative href like "0-errors/"
links = r.html.links

# Strip the trailing slash to get the plugin slug
slugs = [link.replace('/', '') for link in links]

plugins_urls = ["https://api.wordpress.org/plugins/info/1.0/{}.json".format(slug) for slug in slugs]

with open('all_plugins_urls.txt', 'a') as out:
    out.write("\n".join(plugins_urls))

For comparison, here’s how BeautifulSoup works:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://plugins.svn.wordpress.org/").text

soup = BeautifulSoup(html, features="lxml")
baseurl = "https://api.wordpress.org/plugins/info/1.0/"

with open('all_plugins_urls.txt', 'a') as out:
    for a in soup.find_all('a', href=True):
        # Strip the trailing slash from the href to get the slug
        out.write(baseurl + a['href'].replace('/', '') + ".json" + "\n")

The comparison makes the difference fairly obvious: the requests-html version is simpler and clearer. The end result of this step is the file all_plugins_urls.txt, containing the URLs of 79,223 plugins.

With this list in hand, the Scrapy steps below could in fact be skipped entirely: wget alone can happily pull down the 70,000-plus JSON files:

wget -i all_plugins_urls.txt

Or simply loop over the list with Requests to fetch all the plugin data, much like the sketch earlier, and move straight on to the data analysis phase. For demonstration purposes, however, the rest of this article doubles as a very basic introduction to Scrapy for anyone who has never used it.

Install Scrapy

The easiest way is to install it with pip:

pip install Scrapy
scrapy -V  # verify the installation

Create a new crawler project

Scrapy provides a comprehensive command-line tool for all kinds of crawler-related operations. Generally, the first thing to do with Scrapy is create a project. My habit is to first create a folder (named after the site to be crawled, which makes it easy to tell crawlers for different sites apart) as the overall workspace, then go into that folder and create a new Scrapy project called scrap_wp_plugins (you can change the name to whatever you like):

mkdir ~/workplace/wordpress.org-spider
cd ~/workplace/wordpress.org-spider
scrapy startproject  scrap_wp_plugins

This automatically creates a file structure similar to the following:

scrap_wp_plugins
├── scrap_wp_plugins
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

For our purposes, apart from settings.py, which needs a small modification, the rest of these files can be left alone; they won't be touched in this simple project.

At the moment it is just a skeleton that does nothing, because there is no crawler yet. You could write one entirely by hand, or generate one from a template. We'll use the scrapy command-line tool to generate a crawler automatically, with the following syntax:

Syntax: scrapy genspider [-t template] <name> <domain>

template is the template the crawler is generated from; by default the basic one is used.

name is the crawler's name; you can pick it freely, and you will use it later to start the crawl. For example, if you name your spider "Spring Thirteen", you can summon it by shouting, "Go! My Spring Thirteen!"

domain is the domain the crawler is allowed to crawl, as in, "Go! My Spring Thirteen! Only along this route!"

Therefore, run the following command:

cd scrap_wp_plugins
scrapy genspider plugins_spider wordpress.org

This generates a crawler file called plugins_spider.py under the spiders folder, which we can now fill in with the crawling logic and content handling.

Create a Spider to crawl the pages

Open scrap_wp_plugins/spiders/plugins_spider.py:

# -*- coding: utf-8 -*-
import scrapy


class PluginsSpiderSpider(scrapy.Spider):
    name = 'plugins_spider'
    allowed_domains = ['wordpress.org']
    start_urls = ['http://wordpress.org/']

    def parse(self, response):
        pass

As you can see, this is the simplest possible Spider class, with the parameters from the genspider command in the previous step filled in automatically.

name: the crawler's identifier, which must be unique; different crawlers must be given different names. This is the plugins_spider from the command in the previous step.

start_urls: the list of URLs the crawler starts from. The first data downloaded will come from these URLs, and further URLs are derived from them. In our case, the preparation step already produced the complete list of URLs in all_plugins_urls.txt, so all we need to do is read that file in.

parse(): the crawler's callback method, invoked with the Response object returned for each URL as its only parameter. It parses the returned data, extracts what we want to capture (into items), and follows further URLs. In this project the API returns JSON, so there is no HTML to parse at all. To save effort, I store the entire JSON object and pick out the fields I need later during analysis; you could of course filter out unwanted fields here instead.

So our first crawler is nearly done. First the imports: besides scrapy itself, we bring in json and jsonlines to handle and store the API responses.

# -*- coding: utf-8 -*-
import scrapy
import json
import jsonlines
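The spider class then only has to feed Scrapy the prepared URL list and dump each JSON response. A minimal sketch of what that looks like (the output file name is my own choice, and the exact code may differ in detail):

class PluginsSpiderSpider(scrapy.Spider):
    name = 'plugins_spider'
    allowed_domains = ['wordpress.org']

    def start_requests(self):
        # Read the prepared URL list instead of using a fixed start_urls
        with open('all_plugins_urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # The API already returns JSON, so no HTML parsing is needed;
        # store the whole object for later analysis
        item = json.loads(response.text)
        with jsonlines.open('all_plugins.jsonl', mode='a') as writer:
            writer.write(item)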

Run the crawler

With the crawler code modified as above, it is time to let it run. "Go, Pikachu!"

scrapy crawl plugins_spider

Oh ho…


Forbidden by robots.txt

An accident... nothing was crawled at all?? Looking at the error message, it turns out that api.wordpress.org/robots.txt disallows crawlers, and Scrapy obeys the robots protocol by default. This is easy to get around: open settings.py, find the line below, and change True to False, which means: "never mind your robots.txt, I'm ignoring it."

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Now the crawl can run happily. One more touch: add a Scrapy job directory and a log file so the crawler's state is saved. If the run is interrupted by a network failure or some other accident, it can resume from where it stopped instead of starting all over again:

scrapy crawl plugins_spider -s JOBDIR=spiderlog   --logfile  log.out  &

So you can go to sleep and wake up and see all the hot and fresh WordPress plugins.

I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain.

— Blade Runner, 1982


