“This is the 17th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

When it comes to Chinese medicinal materials, I can only name a handful — cassia, xanthium, lotus seeds, and a few others I learned from the "Compendium of Materia Medica" — plus goji berries, notoginseng, Huoxiang Zhengqi water, and radix isatidis. To get out of this predicament of not knowing Chinese medicinal materials, I decided to scrape the data on Chinese medicinal materials and store it locally. That is the background of this article.

First of all, here are pictures of the Chinese herbal medicines just mentioned — get to know them (I really do recognize one: as a child, walking through the fields, a xanthium burr would stick to my leg).

Analysis before crawling

The target website this time is www.zhongyaocai.com/. Opening its database of Chinese medicinal materials, we find 752 pages of data in total, about 12 entries per page — nearly 10,000 kinds of medicinal materials. Our goal today is to store all of this data.
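As a quick sketch, the list pages follow the `zyc_p{n}.htm` pattern used later in the crawl code, so the full set of 752 listing URLs can be generated up front:

```python
# listing-page URL pattern observed on the site (752 pages in total)
def page_url(n: int) -> str:
    return f"https://www.zhongyaocai.com/zyc_p{n}.htm"

urls = [page_url(n) for n in range(1, 753)]
print(len(urls), urls[0])  # 752 https://www.zhongyaocai.com/zyc_p1.htm
```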

The regular-expression part can be worked out separately. The HTML source of the section to be matched is as follows:

<div class="poem-head">
  <a class="poem-title" href="https://www.zhongyaocai.com/zyc/gelifen_2542.htm"
    >Clam powder </a ><div class="poem-handler"></div>
</div>

<div class="poem-body">
  <div class="poem-sub">
    <span class="list_span"></span ><span>Four-horn clams, shells slightly quadrangular, solid, shell length 36-48mm, shell......</span>
  </div>
  <div class="poem-sub">
    <span class="list_span">Sexual flavour.</span><span>Taste salty; Sex cold</span>
  </div>
  <div class="poem-sub">
    <span class="list_span"></span ><span>For internal use: Decocted soup, 50-100g; Or into pills, powder, 3-10g. .</span>
  </div>
  <div class="poem-sub">
    <span class="list_span"></span ><span>Heat; Reducing phlegm and dampness; Soft firm. .</span>
  </div>
</div>

The regular expression part is as follows:

    # note: the original label-based patterns were garbled in extraction;
    # the regexes below are reconstructed from the sample HTML above
    pattern = re.compile(
        r'<a class="poem-title" href="(.*?)"\s*>(.*?)</a', re.S)
    title_url = pattern.findall(html)
    # each entry has four "poem-sub" rows; map them positionally
    # onto the original variable names (an assumption)
    subs = re.findall(
        r'<span class="list_span">.*?</span\s*><span>(.*?)</span>',
        html, re.S)
    xing = subs[0::4]
    wei = subs[1::4]
    liang = subs[2::4]
    zhi = subs[3::4]
    items = []
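A quick way to sanity-check regexes like these is to run them against the sample HTML. The patterns below are my reconstruction (the originals were garbled in extraction), applied to a trimmed, hard-coded copy of the fragment shown above:

```python
import re

# trimmed copy of the sample fragment shown above
html = '''<div class="poem-head">
  <a class="poem-title" href="https://www.zhongyaocai.com/zyc/gelifen_2542.htm"
    >Clam powder </a ><div class="poem-handler"></div>
</div>
<div class="poem-body">
  <div class="poem-sub">
    <span class="list_span">Sexual flavour.</span><span>Taste salty; Sex cold</span>
  </div>
</div>'''

# name + detail-page URL from the "poem-head" block
title_url = re.findall(
    r'<a class="poem-title" href="(.*?)"\s*>(.*?)</a', html, re.S)
# the <span> pairs inside each "poem-sub" row
subs = re.findall(
    r'<span class="list_span">.*?</span\s*><span>(.*?)</span>', html, re.S)
print(title_url)  # [('https://www.zhongyaocai.com/zyc/gelifen_2542.htm', 'Clam powder ')]
print(subs)       # ['Taste salty; Sex cold']
```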

After the data is matched, it is stored locally in JSON format, mainly to avoid the garbled layout that delimiter characters cause when the data is opened in Excel. Of course, this problem does not exist when the data is written directly to a database.

Coding time

This case is the ninth lesson of the crawler mini-class, and the content is very simple — by now it should be easy for you: open a few threads and crawl away.

import requests
import re
import json
import threading

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

flag_page = 0
lock = threading.Lock()  # protects the shared page counter across threads


def anay(html):
    # note: the original label-based patterns were garbled in extraction;
    # the regexes below are reconstructed from the sample HTML above
    pattern = re.compile(
        r'<a class="poem-title" href="(.*?)"\s*>(.*?)</a', re.S)
    title_url = pattern.findall(html)
    # each entry has four "poem-sub" rows; map them positionally
    # onto the original variable names (an assumption)
    subs = re.findall(
        r'<span class="list_span">.*?</span\s*><span>(.*?)</span>',
        html, re.S)
    xing = subs[0::4]
    wei = subs[1::4]
    liang = subs[2::4]
    zhi = subs[3::4]
    items = []
    for i in range(len(title_url)):
        dict_item = {
            "name": title_url[i][1].strip(),
            "url": title_url[i][0],
            "xing": xing[i],
            "wei": wei[i],
            "liang": liang[i],
            "zhi": zhi[i]
        }
        items.append(dict_item)
    return items


def save(json_data):
    with open("./data1/one.json", "a+", encoding="utf-8") as f:
        f.write(json_data + "\n")


def get_list():
    global flag_page
    while True:
        with lock:  # flag_page is shared by all threads, so guard it
            if flag_page >= 752:
                break
            flag_page += 1
            page = flag_page
        url = f"https://www.zhongyaocai.com/zyc_p{page}.htm"
        print(url)
        r = requests.get(url=url, headers=headers)
        r.encoding = "utf-8"
        data = anay(r.text)
        json_data = json.dumps({"yaos": data}, ensure_ascii=False)
        save(json_data)


if __name__ == "__main__":
    for i in range(1, 6):
        t = threading.Thread(target=get_list, name=f"t{i}")
        t.start()

The data is stored locally in the following format: one line per page, each line a JSON string that can be parsed and manipulated freely after reading.
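Reading the file back is straightforward. A minimal sketch (assuming the `./data1/one.json` path used above; each line holds one page's data as a `{"yaos": [...]}` object):

```python
import json

def load_items(path="./data1/one.json"):
    """Parse the line-per-page JSON file back into one flat list of items."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                items.extend(json.loads(line)["yaos"])
    return items
```

Because each line is an independent JSON document, a half-written last line (e.g. if the crawl was interrupted) only corrupts that one page, not the whole file.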

Crawler class summary time

This series has covered the basics of the Requests library, and hopefully after nine lessons you have a solid grasp of it. The rest will come naturally as you keep programming — many seniors before you have given the same answer.

The most important thing in the Requests library is sending requests and getting data back. The core methods are GET and POST, plus the two common attributes text and content; everything else is extended knowledge.
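The relationship between the two attributes is worth spelling out: `content` is the raw bytes of the response body, while `text` is those bytes decoded with `r.encoding`. A sketch of the idea with no network call (`raw` stands in for `r.content`):

```python
# r.content is raw bytes; r.text is those bytes decoded with r.encoding
raw = "蛤蜊粉".encode("utf-8")   # stand-in for r.content
print(raw.decode("utf-8"))       # correct, as after setting r.encoding = "utf-8"
print(raw.decode("latin-1"))     # wrong encoding -> mojibake
```

This is exactly why the crawl code sets `r.encoding = "utf-8"` before touching `r.text`: with a wrongly guessed encoding, the Chinese herb names come out as mojibake.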

That concludes the Requests library for our crawler tutorial.


Today is the 100th day of continuous writing. If you have ideas or techniques you’d like to share, feel free to leave them in the comments section.