1. The final result (screenshot)

2. Portal: the target site is Xueqiu (https://xueqiu.com)

3. After opening the browser's developer tools, scroll down the page; a new API request appears in the Network panel each time more articles load (of course, the more you scroll, the more requests appear).

4. Comparing several of these requests shows that only the max_id parameter changes between them, starting from a value such as max_id = 20333000; since count=15 means each request returns 15 articles, max_id grows by 15 from one request to the next.

5. So we can construct the URLs in a loop, incrementing max_id by 15 on every pass. The code is as follows:

import requests

# headers must carry a browser User-Agent and a valid Cookie (see the sketch below)
max_id = 20333000
while True:
	# The requested URL; only max_id changes between pages
	url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id={}&count=15&category=-1'.format(max_id)
	# The returned data is in JSON format
	resp = requests.get(url, headers=headers).json()
	max_id += 15
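The snippet above relies on a headers dict that it does not define. A minimal sketch, assuming you copy the User-Agent and Cookie values from the request shown in the developer tools; both values below are placeholders, and (as noted at the end) the Cookie expires and must be refreshed:

# A minimal headers sketch; both values are placeholders -- copy the real
# User-Agent and Cookie from your browser's developer tools.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Cookie': '<paste your xueqiu.com cookie here>',
}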

6. Next, analyze the returned data so we can extract what we need. From the figure below, we can see that each article is stored under the list key, so we take out the list key first.

The code is as follows:

# The data we need is stored under the list key; fetch that list first
lists = resp.get('list')

7. Look at the information of each article. Each element's data field is a JSON string; copy and paste it into the website json.cn to view it as formatted JSON, from which we can pick out the fields we need.
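To make the structure concrete, here is a hypothetical sketch of one element of lists, with placeholder values; only the key names are taken from the code below:

# Hypothetical shape of one element of lists; every value is a placeholder.
# Note that the data value arrives as a JSON string, not a dict, which is
# why json.loads is needed before its fields can be read.
temp = {
    'column': '...',  # the article category
    'data': ('{"title": "...", "description": "...", '
             '"user": {"screen_name": "..."}, '
             '"created_at": 0, "view_count": 0}'),
}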

import json
import re

# Declare a list to collect every parsed article (used again in step 8)
data_list = []

for temp in lists:
    # In each element, the article itself sits under the data key
    data = temp.get('data')
    # data is a str, so we convert it to a dict before operating on it
    data = json.loads(data)
    # Check whether data exists
    if data:
        # Get the title of the article
        title = data.get('title')
        # Skip entries without a title; from my observation, they are usually ads
        if not title:
            continue
        # Get the summary
        description = data.get('description')
        # Data cleaning, using the regular expression sub method to strip
        # HTML tags and entities from the summary
        description = re.sub(r'<.*?>|&.*?;', '', description)
        # Retrieve the user information, which is in the user key of data
        user_name = data.get('user').get('screen_name')
        # Get the type of the article
        column = temp.get('column')
        # Get the timestamp of the publication
        created_at = data.get('created_at')
        # Get the read count
        view_count = data.get('view_count')

        # Declare a dictionary to store this article's data
        data_dict = {}
        data_dict['title'] = title
        data_dict['description'] = description
        data_dict['user_name'] = user_name
        data_dict['column'] = column
        data_dict['created_at'] = created_at
        data_dict['view_count'] = view_count

        print(data_dict)
        # Collect the record so it can be saved in step 8
        data_list.append(data_dict)
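The created_at field comes back as a raw timestamp. Assuming it is in milliseconds (a common convention for such APIs; verify against the actual data), it can be made readable like this:

from datetime import datetime

# Hypothetical conversion, assuming created_at is a millisecond timestamp
created_at = 1588888888000  # placeholder value
published = datetime.fromtimestamp(created_at / 1000)
print(published.strftime('%Y-%m-%d %H:%M:%S'))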

8. The last step is to save the data to files; data_list here is the list declared before the loop, which collected every data_dict.

# Write the data to a JSON file
with open('data_json.json', 'a+', encoding='utf-8-sig') as f:
    json.dump(data_list, f, ensure_ascii=False, indent=4)
print('JSON file written')

import csv

# Write the data to a CSV file
with open('data_csv.csv', 'w', encoding='utf-8-sig', newline='') as f:
    # The header row comes from the keys of the first record
    title = data_list[0].keys()
    # Declare the writer
    writer = csv.DictWriter(f, title)
    # Write the table header
    writer.writeheader()
    # Write the data rows in one batch
    writer.writerows(data_list)
print('CSV file has been written')

9. Complete code

For the complete code, reply with the keyword 'snowball net' on my public account.

Public account id: pythonislover

Remember to set a delay between requests; we are civilized crawlers ~~~ oh, and I forgot to mention: the cookies expire, so remember to update them in time.
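A minimal sketch of such a delay, assuming a random pause of one to three seconds is acceptable, placed inside the request loop from step 5:

import random
import time

while True:
    # ... request and parse one page, as in the steps above ...
    # Sleep for a random 1-3 seconds between requests to stay polite
    time.sleep(random.uniform(1, 3))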

Contributed by: Crawler