Today I’m going to talk about a way to capture data without writing any code, which is enough to cover what a good number of people actually need.

Scraping data usually means writing network-request code to fetch the pages, and when a page is loaded asynchronously or obfuscated with JS, you also have to spend time analyzing it. Many people who need to scrape are not professional developers, so writing that code is a real hurdle. In my experience, most companies' scraping needs are one-off and small in scale, say tens of thousands or hundreds of thousands of records. For that, you don't need to develop a program at all; a tool like Web Scraper will do.
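For contrast, here is a minimal sketch of the code path Web Scraper saves you from writing; the URL and CSS selector are placeholders, not a real target site:

```python
# A rough sketch of the hand-written route this article lets you skip.
# The URL and CSS selector below are placeholders, not a real target site.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/list",
                    headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")

# Content rendered by JavaScript after the initial load will NOT appear here,
# which is why async-loaded pages need extra analysis or a headless browser.
titles = [tag.get_text(strip=True) for tag in soup.select("h2.title")]
print(titles)
```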

Web Scraper

Web Scraper is a web-scraping tool that runs in the Chrome browser as an extension, so there is no complicated installation or configuration. You don't need to care whether the page is loaded asynchronously or obfuscated with JS; scraping is WYSIWYG, and once you're familiar with the tool it takes only 10-20 minutes to configure a crawl and start scraping (writing code could take hours or even days). It's ideally suited to one-off or short-term needs and to people who don't scrape for a living.

As a demonstration, suppose we want to grab the shop names and user reviews under a particular URL on this site.

How do I install and configure Web Scraper?

1. Web Scraper is available in the Chrome Web Store. If you can’t access the store, you can install the extension locally instead.

2. After the Web Scraper extension is installed, open Chrome’s developer tools. If a Web Scraper tab appears there, the installation was successful.

3. Configure the scraping rules

Configuration is simple too. A bit of background first: when we write a program to crawl the web, we usually start from an entry page (typically a channel page or list page), have the program extract the URLs on that entry page, then visit those URLs and extract the detailed information we need from them.

For example, say we want to extract the names of the restaurants in Dianping’s private-kitchen category, plus the reviews in each shop. We first need an entry page (the URL of the private-kitchen channel), extract the shop URLs from that page, have the program visit each shop URL, and then extract the shop name, reviews, and other information from it.

The same applies to Web Scraper: it needs a Start URL, a rule to extract the URLs from the entry page, and then a rule to extract the details. For a more detailed walkthrough, I wrote a simple Web Scraper configuration tutorial on ape Homology.
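If you coded this pattern by hand, it would look roughly like the sketch below. The entry URL and CSS selectors are illustrative assumptions, not Dianping’s real markup; Web Scraper expresses the same Start URL → link rule → detail rule chain through clicks instead:

```python
# Sketch of the entry page -> detail page pattern described above.
# The entry URL and CSS selectors are illustrative assumptions only.
import requests
from bs4 import BeautifulSoup

ENTRY_URL = "https://example.com/food/private-kitchen"   # the "Start URL"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Rule 1: extract the shop URLs from the entry (list) page.
entry = BeautifulSoup(requests.get(ENTRY_URL, headers=HEADERS).text, "html.parser")
shop_urls = [a["href"] for a in entry.select("a.shop-link")]

# Rule 2: visit each shop page and extract the details we need.
for url in shop_urls:
    shop = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
    name = shop.select_one("h1.shop-name").get_text(strip=True)
    reviews = [p.get_text(strip=True) for p in shop.select("p.review-text")]
    print(name, reviews[:3])
```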

Take grabbing the restaurants and reviews in Dianping’s private-kitchen channel as an example.

The first step is to set the URL of the private-kitchen channel as the Start URL.

If you also want to page through the results, take a look at Dianping’s pagination pattern, which looks like this:

Page 2:

www.dianping.com/shanghai/ch…

Page 3:

www.dianping.com/shanghai/ch…

So the paging rule can be written like this:

www.dianping.com/shanghai/ch…

Pages 1 through 5
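In code, that rule would simply mean substituting the page number into the URL. A tiny sketch, using a placeholder path since the real Dianping path is truncated above:

```python
# Equivalent of the "pages 1 through 5" rule: one list-page URL per page number.
# The base pattern is a placeholder; the real Dianping path is truncated above.
BASE = "https://example.com/shanghai/ch10/p{page}"
page_urls = [BASE.format(page=n) for n in range(1, 6)]
for url in page_urls:
    print(url)
```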

The second step is to create a rule that extracts the URLs from the entry page, i.e. the shop URLs. The GIF shows it most intuitively:

The operation is entirely visual: set Type to Link, click Select next to Selector, then click a few of the shops on the page, and the selector rule for extracting the shop URLs is generated automatically. Click Data Preview to check whether the extraction rule actually works.

The third step is to set up the usual rules for extracting details from each shop page, for example the reviews:

Once you’re comfortable with the tool, a scraping rule takes about 10-20 minutes to configure; for more complex extraction rules, see the documentation on its website:

www.webscraper.io/documentati…

Web Scraper handles a few thousand records in a single run without any trouble, so it works well for small crawls used in data analysis or to supplement an existing dataset. Combined with proxy-IP switching software, it can also run longer, larger crawls, though not as efficiently as a dedicated crawler.
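For comparison, "switching proxy IPs" in the hand-coded world just means routing each request through a different proxy. A minimal sketch, assuming you already have a list of working proxies (the addresses and URL below are placeholders):

```python
# Minimal sketch of rotating through proxy IPs between requests.
# The proxy addresses and target URL are placeholders.
import itertools
import requests

proxies_pool = itertools.cycle([
    "http://10.0.0.1:8080",   # placeholder proxy addresses
    "http://10.0.0.2:8080",
])

for page in range(1, 6):
    proxy = next(proxies_pool)
    resp = requests.get(f"https://example.com/list/p{page}",
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)
    print(page, resp.status_code)
```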

To summarize the benefits of the Web Scraper:

1. It can capture dynamically loaded data, such as paginated content loaded through Ajax;

2. Captured data can be exported locally in CSV format;

3. Because the extension runs in the browser, it’s convenient to scrape data that requires a login;

4. No need to worry about JS/CSS obfuscation;

5. Configuration is simple, and extraction rules are set up visually.