Easy access to Songhuajiang News [www.shjnet.cn/ms/msxw/]

1. First, analyze the source page to see where the content you want to crawl is located.
2. Parse the HTML to extract that content.


1. View the source code

We can see that the data we want sits under the <h4 class="blank"> tags.

Fetch the page source with Requests:

html = requests.get(url).content

Then use BeautifulSoup to find the tags we want:

links = soup.find_all('h4', class_='blank')

This gives us the entries in the news list.
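As a self-contained sketch of this extraction step (using a hypothetical inline HTML fragment modeled on the list-page markup described above, and Python 3 with the stdlib parser rather than the article's Python 2.7 + lxml setup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the news-list markup described above.
sample_html = """
<ul>
  <li><h4 class="blank"><a href="./c100/t0001.html">Story one</a></h4></li>
  <li><h4 class="blank"><a href="./c100/t0002.html">Story two</a></h4></li>
  <li><h4 class="other"><a href="./c100/t0003.html">Not a list item</a></h4></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only <h4> tags whose class attribute is "blank" are matched.
links = soup.find_all("h4", class_="blank")
for link in links:
    print(link.a.get_text(), link.a.get("href"))
```

The `class_` keyword (with the trailing underscore, since `class` is a Python keyword) is what filters the `<h4>` tags by their CSS class.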

3. Next, follow the URLs collected from the list to fetch each article's details; the method is the same as above.
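One thing to watch when building the detail URL: the hrefs on the list page are relative (they start with ./). Instead of string replacement, the standard library's urljoin can resolve them against the list URL. A small sketch (the href value here is illustrative, and this is Python 3; in the article's Python 2.7, the function lives in urlparse):

```python
from urllib.parse import urljoin  # Python 2.7: from urlparse import urljoin

list_url = "http://www.shjnet.cn/ms/msxw/index.html"

# A relative href like the ones on the list page (illustrative value).
href = "./c100/t0001.html"

# urljoin resolves "./" against the directory of the base URL.
detail_url = urljoin(list_url, href)
print(detail_url)  # http://www.shjnet.cn/ms/msxw/c100/t0001.html
```

This avoids the manual prefix-stripping and concatenation the script does by hand.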


Here is the full source code:

#! /usr/bin/env python
# coding:utf8
import sys

import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

url = 'http://www.shjnet.cn/ms/msxw/index.html'


def getNewsList(url, page=0):
    if page != 0:
        url = 'http://www.shjnet.cn/ms/msxw/index_%s.html' % page
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('h4', class_='blank')
    for link in links:
        detailUrl = "http://www.shjnet.cn/ms/msxw/" + link.a.get('href').replace('./', '')
        print "-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --"
        print "Headline :" + link.a.get_text() + "Address for Details :" + detailUrl
        getNewsDetail(detailUrl)
    page = int(page) + 1
    print soup.select('#pagenav_%s' % page)
    if (soup.select('#pagenav_%s' % page)):
        print u'Start grabbing the next page'
        print 'page %s' % page
        getNewsList(url, page)


def getNewsDetail(detailUrl):
    html = requests.get(detailUrl).content
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('div', class_='col-md-9')
    for link in links:
        # print link.span.get_text()
        # print link.h2.get_text()
        # print link.find('div', class_='cas_content').get_text()
        if (link.find('div', class_='col-md-10').select('img')):
            imgs = link.find('div', class_='col-md-10').find_all('img')
            for img in imgs:
                print "Image: " + detailUrl[:detailUrl.rfind('/')] + "/" + img.get('src').replace('./', '')


if __name__ == '__main__':
    getNewsList(url)
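The recursion in getNewsList stops when the pager has no link for the next page: select('#pagenav_N') returns an empty list, which is falsy. A sketch against a hypothetical pager fragment (the real page's element ids follow the #pagenav_N pattern the code selects on; this fragment is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical pager markup: links for pages 1 and 2 exist, page 3 does not.
pager_html = """
<div class="pager">
  <a id="pagenav_1" href="index_1.html">2</a>
  <a id="pagenav_2" href="index_2.html">3</a>
</div>
"""

soup = BeautifulSoup(pager_html, "html.parser")

# select() returns a list; an empty list means no such element.
print(bool(soup.select("#pagenav_1")))  # True  -> a next page exists, keep crawling
print(bool(soup.select("#pagenav_3")))  # False -> no such link, stop the recursion
```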

Result:


This article uses Python 2.7.
Problems encountered while crawling:
  • print requests.get(url).text produced garbled characters. I asked the students in the team group and got the answer: .text returns Unicode data, while .content returns bytes (binary data). Switching to html = requests.get(url).content and letting BeautifulSoup handle the decoding resolved the garbled characters.
  • When stitching the detail URL together, the redundant ./ prefix had to be removed: link.a.get('href').replace('./', '')
  • Errors when fetching the detail content.


  • This was my first time using BeautifulSoup, so here is a quick summary of its usage:
find_all("tag") returns the collection of all matching tags.
find("tag") returns a single tag. (This method is rarely used.)
select("") can search by tag name, but is more often used to filter elements layer by layer with CSS selectors.
To get the text between tags (>Content<), use .get_text()
To get an attribute such as href, use .get('href')
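Putting the two accessors together on a hypothetical single tag (Python 3, stdlib parser):

```python
from bs4 import BeautifulSoup

# Hypothetical single link, matching the patterns described above.
soup = BeautifulSoup(
    '<h4 class="blank"><a href="./t0001.html">Headline</a></h4>',
    "html.parser",
)

a = soup.find("a")       # find() returns the first matching tag
print(a.get_text())      # text between > and <    -> Headline
print(a.get("href"))     # value of the attribute  -> ./t0001.html
```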

For now the crawler just prints the content to the console.