This article takes about 8 minutes to read.


Today's article is a demo the author wrote after learning web crawling. The author is a basketball fan who mostly browses Hupu, so grabbing some NBA news seemed like a natural project. If you have ever scraped Toutiao, you know that its news content is loaded via Ajax, unlike a static page. This article is a write-up of the author's experience crawling this kind of site. Without further ado, let's get down to business.


1. Preparation


Here the author uses Selenium to drive a real browser. The benefit of crawling pages with Selenium can be summed up in one phrase: what you can see, you can crawl. In other words, we don't have to worry about how the page is rendered internally. We need to install the Selenium library, the Chrome browser, and ChromeDriver, plus BeautifulSoup as the parsing library. Once the data is captured, we can save it to a database or to a file.
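
A minimal sketch of the setup, assuming the Python packages were installed with `pip install selenium beautifulsoup4` and that Chrome and a matching ChromeDriver are available on the machine. The helper names `chrome_arguments` and `make_driver`, the window size, and the optional headless switch are illustrative choices, not taken from the author's code.

```python
from typing import List


def chrome_arguments(headless: bool = False) -> List[str]:
    """Return the command-line switches passed to Chrome (illustrative choice)."""
    args = ["--window-size=1280,900"]
    if headless:
        # Run without a visible window; leave this off to watch the browser work.
        args.append("--headless=new")
    return args


def make_driver(headless: bool = False):
    """Start Chrome through Selenium and return the driver object."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


# Usage (needs Chrome and ChromeDriver):
#   driver = make_driver()
#   driver.get("https://www.toutiao.com/ch/nba/")
#   html = driver.page_source
#   driver.quit()
```

Keeping the browser visible while developing makes it easier to confirm that the page the program sees is the same one you inspected by hand.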


2. Crawl analysis


Before crawling, we should first analyze the crawling logic. We open the NBA section of Toutiao at https://www.toutiao.com/ch/nba/, as shown in the figure below.

Hover over any news title, right-click, and choose Inspect to open the developer tools. You can see that the news title and its link sit in a tag with the class "link title", as shown in the figure below.

Click the link to jump to the article's detail page, then open the developer tools again. From there we can extract the title, author, source, content, and other information we want, as shown in the figure below.


3. Hands-on practice

We have just analyzed the logic of the page, so now let's write a program to crawl Toutiao's NBA news.

First we need to get a link to each story on the current page.
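
A minimal sketch of this step, assuming the headline tags still carry the class "link title" seen during inspection and that the rendered HTML comes from Selenium's `driver.page_source`. The function name `extract_links` and the sample markup are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.toutiao.com/ch/nba/"


def extract_links(page_source, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every headline in the page HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    results = []
    seen = set()
    # "a.link.title" matches <a> tags carrying both classes, i.e. class="link title".
    for anchor in soup.select("a.link.title"):
        href = anchor.get("href")
        if not href:
            continue
        url = urljoin(base_url, href)  # hrefs may be relative, e.g. /group/111/
        if url in seen:
            continue  # skip headlines that appear more than once
        seen.add(url)
        results.append((anchor.get_text(strip=True), url))
    return results


# Hypothetical markup shaped like the headline tag described in the analysis.
sample_list_html = '''
<div><a class="link title" href="/group/111/">Game recap</a></div>
<div><a class="link title" href="/group/222/">Trade rumors</a></div>
<div><a class="link title" href="/group/111/">Game recap</a></div>
<div><a class="other" href="/ad/">Ad</a></div>
'''
print(extract_links(sample_list_html))
```

With a live browser, the call would be `extract_links(driver.page_source)` after the page has finished loading.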

Once we have the link to each article, we can visit it and extract the information we want.
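
A minimal sketch of parsing a detail page with BeautifulSoup. The CSS selectors below are hypothetical placeholders, since the article does not list the real ones; replace them with whatever the developer tools show for the title, author, source, and content. The function name `parse_article` is likewise an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: replace them with the ones seen in the developer tools.
SELECTORS = {
    "title": "h1.article-title",
    "author": "span.author",
    "source": "span.source",
    "content": "div.article-content",
}


def parse_article(page_source, selectors=SELECTORS):
    """Extract the fields named in `selectors`; fields not found become None."""
    soup = BeautifulSoup(page_source, "html.parser")
    article = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        # Join nested text with spaces so paragraphs do not run together.
        article[field] = node.get_text(" ", strip=True) if node else None
    return article


# Hypothetical detail-page markup used only to exercise the function.
sample_article_html = """
<h1 class="article-title">Game recap</h1>
<span class="author">Some Writer</span>
<div class="article-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""
print(parse_article(sample_article_html))
```

Returning `None` for a missing field, rather than raising, keeps one oddly formatted article from stopping the whole crawl.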

Finally, the captured news is saved to the database.
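
The article does not say which database is used, so the sketch below swaps in SQLite from the Python standard library purely as a stand-in. The table name `news`, its columns, and the helper names `open_db` and `save_article` are assumptions for illustration.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the news table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS news (
               url TEXT PRIMARY KEY,
               title TEXT,
               author TEXT,
               source TEXT,
               content TEXT)"""
    )
    return conn


def save_article(conn, url, article):
    """Insert one article; return True if stored, False if the URL already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO news (url, title, author, source, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            url,
            article.get("title"),
            article.get("author"),
            article.get("source"),
            article.get("content"),
        ),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_db()  # in-memory database for the demo; pass a file path to persist
item = {"title": "Game recap", "author": "Some Writer", "content": "First paragraph."}
print(save_article(conn, "https://www.toutiao.com/group/111/", item))
print(save_article(conn, "https://www.toutiao.com/group/111/", item))
```

Using the article URL as the primary key means that running the crawler again does not store the same story twice.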

We now have the news we wanted, but one problem remains. Because Toutiao loads its pages via Ajax, we can't request more news by changing a page number; we only get what is currently displayed on the page we visit. So how do we get more? Here the author simulates dragging the page down with the mouse, which triggers the loading of more content.
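
A minimal sketch of the idea. Instead of a literal mouse drag, it scrolls to the bottom of the page with JavaScript through Selenium's `driver.execute_script`, which is a common way to trigger the same lazy loading; whether the author's code does exactly this is an assumption. The function name `scroll_to_load`, the round limit, and the pause length are illustrative.

```python
import time


def scroll_to_load(driver, max_rounds=5, pause=2.0):
    """Scroll to the bottom repeatedly so the page's Ajax requests load more items.

    Stops early once the page height stops growing. Returns the number of
    rounds that actually loaded new content.
    """
    loaded = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the Ajax request time to finish
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new arrived, so stop scrolling
        last_height = new_height
        loaded += 1
    return loaded
```

After `scroll_to_load(driver)` returns, reading `driver.page_source` again yields the longer page, and the link extraction can be run on it as before.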

At this point, our program is complete. Now let's take a look at the final result saved in the database.


4. Conclusion


Finally, the code for this article is available at: https://github.com/NGUWQ/Python3Spider/tree/master/toutiao


Readers who want to take it a step further should try Scrapy when they’re done with this project.

Reference code:

https://github.com/NGUWQ/Python3Spider/tree/master/scrapytoutiao


If you understand this article, Ajax data crawling should be easy for you.


If it helps you, please star the repository.


If you're interested in crawlers, data analysis, or algorithms, please follow my WeChat official account: TWcoding. Let's learn together.