June and July are examination and graduation season every year, and the event that draws the most attention is the college entrance examination. The gaokao is a hot-search staple, and its related stories regularly dominate the trending lists of the major platforms. Sure enough, the universities have pulled out all the stops to attract applicants, and the "strongest recruitment brochure" in history turns out to be photos of each school's best-looking students. In this post I'll take you through crawling that "strongest recruitment brochure". Let's go!

I'm sure you've all heard of crawlers, but what exactly is a crawler? The Internet we use every day is like a giant spider web: each website is a node of the web, and the links are the spider silk connecting our computers. Data sits at every node, and a crawler, as the name suggests, is a little spider that follows the silk from node to node to fetch the data we want. At the technical level, a crawler is a program that simulates a browser's requests to a website, downloads the various kinds of data the site returns, and then filters out the parts you want for later use.

Tools a crawler needs — request libraries: Requests, Selenium (drives a real browser, so CSS and JS get parsed and rendered)

Parsing libraries: re, BeautifulSoup, PyQuery

Storage: flat files, MySQL, MongoDB, Redis
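The division of labor above can be sketched in a few lines: a request library fetches the page, and a parsing library screens out the data you want. The sketch below uses only the standard library's `re`, and the sample HTML is a hard-coded stand-in so it runs offline; in a real run you would fetch the page with Requests as shown in the comment.

```python
import re

def extract_image_urls(html):
    """Pull the src of every <img> tag out of raw HTML with a regex.

    A real project would usually reach for BeautifulSoup or PyQuery,
    but a regex is enough to show the "filter out what you want" step.
    """
    return re.findall(r'<img[^>]+src="([^"]+)"', html)

# In the real crawler you would first fetch the page, e.g.:
#   import requests
#   html = requests.get("http://www.xiaohuar.com/", timeout=10).text
# Here a tiny hard-coded page keeps the sketch offline.
sample = '<div><img src="/d/file/a.jpg"><img src="/d/file/b.jpg"></div>'
print(extract_image_urls(sample))  # → ['/d/file/a.jpg', '/d/file/b.jpg']
```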

Campus beauty website: www.xiaohuar.com/list-1-0.ht… We take this campus beauty site as the crawl target and scrape its picture information.

Build a crawler:

After the crawler's basic information is defined, a parse function is defined to process what the crawler fetches.
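A spider with basic information plus a parse callback looks roughly like this in Scrapy style. The class and field names here are illustrative, since the article does not show its actual code, and a stub base class stands in for `scrapy.Spider` so the sketch runs without Scrapy installed.

```python
class Spider:
    """Stand-in for scrapy.Spider, so this sketch runs without Scrapy."""

class XiaohuaSpider(Spider):
    # Basic information: the spider's name and where the crawl starts.
    name = "xiaohua"
    start_urls = ["http://www.xiaohuar.com/list-1-0.html"]

    def parse(self, response):
        # Scrapy calls parse() with each downloaded page; XPath pulls
        # every image address off it, and we yield one item per picture.
        for src in response.xpath("//img/@src").getall():
            yield {"img_src": src}
```

With Scrapy actually installed, you would subclass `scrapy.Spider` and start the crawl with `scrapy crawl xiaohua`, Scrapy's standard command for running a spider by its `name`.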

By inspecting the site you can see that what we are crawling is the photos: each picture's info block carries an image address, but it is not the real address of the image. The real address has to be combined with the site's home-page URL. Paste that full address into a browser and you can see the original image; you could right-click and "Save as", but we are not going to save photos one by one by hand.
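Joining the relative path with the home-page URL is exactly what `urllib.parse.urljoin` does. The sketch below (the `BASE` constant, file paths, and folder name are illustrative) also shows the shape of a batch save loop; the actual network fetch is commented out so the example stays offline.

```python
from pathlib import Path
from urllib.parse import urljoin

BASE = "http://www.xiaohuar.com"  # the site's home page, as described above

def to_full_url(src):
    """Join a relative image path from the page with the site root.
    urljoin leaves already-absolute URLs untouched."""
    return urljoin(BASE, src)

def save_all(urls, dest="images"):
    """Batch-download instead of right-clicking each photo by hand."""
    Path(dest).mkdir(exist_ok=True)
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        # The real fetch would use the Requests library named earlier:
        #   data = requests.get(url, timeout=10).content
        #   Path(dest, name).write_bytes(data)
        print("would save", url, "->", Path(dest, name))

print(to_full_url("/d/file/20200101/abc.jpg"))
# → http://www.xiaohuar.com/d/file/20200101/abc.jpg
```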


Python ships with urllib, so you can also write a crawler with nothing but the standard library. Here we define request headers: if no headers are set, the User-Agent declares that the request comes from a Python script, and a site with any anti-crawling measures is bound to reject such connections. Modifying the headers to disguise your crawler script as a normal browser visit avoids this problem.
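A minimal standard-library sketch of that disguise: the User-Agent string below is just one example of a common desktop browser identity, and the actual fetch is commented out so the example stays offline.

```python
import urllib.request

# Without this header, urllib identifies itself as "Python-urllib/3.x",
# which anti-crawler sites tend to reject outright.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/90.0.4430.93 Safari/537.36")
}
req = urllib.request.Request("http://www.xiaohuar.com/", headers=headers)

# The actual fetch needs network access:
#   html = urllib.request.urlopen(req, timeout=10).read()
print(req.get_header("User-agent"))  # prints the disguised browser string
```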

Fire up Python and give it a try! Let's see if your school's goddess is in there!