
“Digimon: The Last Evolution” hit the Chinese mainland on Oct 30. Everyone our age still remembers Digimon, and the most memorable season is the first one, forever!

A few words up front

By the way, have you ever yelled “Archaic evolution” at your dog?

“If I hadn’t met Tailmon.” (Kari)
“If I hadn’t come into the world of Digimon.” (Joe)
“If I hadn’t joined in the adventure.” (Mimi)
“We would never have become who we are now.” (Izzy)
“That’s right, because our Digimon have been with us all along.” (Matt)
“Because we still have so many good companions.” (T.K.)
“So we understand all the more the importance of unity and cooperation. And so we are all the more able to live as our true selves.” (Tai)

I remember Digimon from ’99. What about you?

The last blog post ended by mentioning a plan to crawl all the Digimon from the Digimon animation. This post delivers on that, timed to the release of “Digimon: The Last Evolution” and dedicated to our childhood.

Analysis before crawling

The site to crawl is http://www.digimons.net/digimon/chn.html. The first step is to define the final goal.

The information waiting to be crawled is each Digimon's name, level, debut appearance, and picture (each picture will be saved under its Digimon's name).

This information is available via the Requests module.
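As a quick sanity check (my own two-liner, not from the original post), a plain requests.get shows that the whole list comes back in a single response:

import requests

# Fetch the list page once to confirm it is reachable and complete
r = requests.get("http://www.digimons.net/digimon/chn.html")
r.encoding = "utf-8"     # the page is UTF-8 encoded
print(r.status_code)     # expect 200
print(len(r.text))       # large: the full Digimon list is in this one page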

Time to write the crawler code

By default the page only displays part of the data, but a look at the page source shows that all of the data is already returned to the front end, so we can write a regular expression to match it directly.

The specific regular expression for the data is:

<li class="(.*?)"><span class="date">(?P<date>.*?)</span><span class="(.*?)">(?P<level>.*?)</span><span class="name"><a href="(?P<url>.*?)" target="_blank">(?P<name>.*?)</a></span><span class="debut">(?P<show>.*?)</span></li>

Note that (?P<...>.*?) in the regular expression above creates an alias for the captured group data, which can be retrieved later by that name with the group method.

However, aliases only pay off together with the re.search method (or re.match / re.finditer), which return match objects; they have no effect with re.findall, which simply returns plain tuples.
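As a quick demonstration, here is the pattern applied to a single hand-built <li> string. The field values are taken from the sample item shown further below, and the href group is labeled url here for illustration:

import re

pattern = re.compile(
    r'<li class="(.*?)"><span class="date">(?P<date>.*?)</span>'
    r'<span class="(.*?)">(?P<level>.*?)</span>'
    r'<span class="name"><a href="(?P<url>.*?)" target="_blank">(?P<name>.*?)</a></span>'
    r'<span class="debut">(?P<show>.*?)</span></li>')

sample = ('<li class="c_5"><span class="date">November 2020</span>'
          '<span class="level mark5">完全体</span>'
          '<span class="name"><a href="were_garurumon_sagittarius/index.html"'
          ' target="_blank">Werewolf Garulu: Archer Form</a></span>'
          '<span class="debut">Animated Digimon Adventure:</span></li>')

m = pattern.search(sample)
print(m.group('date'), m.group('name'))  # named access needs a match object
print(pattern.findall(sample))           # findall drops the names, returns tuples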

Next, complete the Python code:

import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}


def get_all():
    # Fetch the full list page; every row is in this single response
    r = requests.get(
        "http://www.digimons.net/digimon/chn.html", headers=headers)
    r.encoding = "utf-8"
    if r.status_code == requests.codes.ok:
        # (?P<...>) names the captured fields: date, level, url, name, show
        pattern = re.compile(
            r'<li class="(.*?)"><span class="date">(?P<date>.*?)</span>'
            r'<span class="(.*?)">(?P<level>.*?)</span>'
            r'<span class="name"><a href="(?P<url>.*?)" target="_blank">'
            r'(?P<name>.*?)</a></span>'
            r'<span class="debut">(?P<show>.*?)</span></li>')
        items = pattern.findall(r.text)
        print(items)

After running the above code, all 1131 Digimon are fetched in about 1 second.

Continuing the analysis, we also need to grab each Digimon's picture. Inspecting a picture address (www.digimons.net/digimon/mam…) reveals the format.

The picture address format is http://www.digimons.net/digimon/{English name}/{English name}.jpg. Before writing the code, look at the specific format of a single item of the data fetched above:

('c_6', 'March 2018', 'level mark6', 'Ultimate', 'bryweludramon/index.html', 'Bryweludramon', '20th Anniversary Edition LCD Toy Digimon Pendulum')

The important piece of data is 'bryweludramon/index.html', which contains the Digimon's English name; it needs to be cut out with a string operation:

item = ('c_5', 'November 2020', 'level mark5', '完全体',
        'were_garurumon_sagittarius/index.html',
        'Werewolf Garulu: Archer Form', 'Animated Digimon Adventure:')

en_name = item[4][0:item[4].find('/')]

find is used instead of index: if the substring is not found, index raises an error (ValueError), while find simply returns -1.
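A quick illustration of the difference:

s = 'were_garurumon_sagittarius/index.html'
print(s.find('/'))    # 26, the position of the first slash
print(s.find('@'))    # -1, not found but no exception
# s.index('@')        # would raise ValueError: substring not found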

After obtaining the Digimon's English name, you can grab the picture. The specific code is as follows:

import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}


def get_img(name, en_name):
    # The picture URL is built from the Digimon's English name
    img_url = f"http://www.digimons.net/digimon/{en_name}/{en_name}.jpg"
    r = requests.get(img_url, headers=headers, timeout=5)
    content = r.content
    # Save the picture locally as {en_name}.jpg
    with open(f"{en_name}.jpg", "wb") as f:
        f.write(content)


if __name__ == "__main__":
    item = ('c_5', 'November 2020', 'level mark5', '完全体',
            'were_garurumon_sagittarius/index.html',
            'Werewolf Garulu: Archer Form', 'Animated Digimon Adventure:')

    en_name = item[4][0:item[4].find('/')]
    get_img('Werewolf Garulu: Archer Form', en_name)

When running the above code, pay attention to the picture response time: digimons.net is hosted on an overseas server, so responses can be slow. Keep the timeout, and show an error prompt when a picture cannot be fetched.
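Here is a hedged sketch of what that error prompt could look like. It reuses the headers defined above and wraps the same download logic in a try/except; the wrapper name get_img_safe is my own, and the exception classes are standard requests ones:

def get_img_safe(en_name):
    img_url = f"http://www.digimons.net/digimon/{en_name}/{en_name}.jpg"
    try:
        # Overseas server: keep the timeout and handle failures explicitly
        r = requests.get(img_url, headers=headers, timeout=5)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"{en_name}: request timed out")
        return
    except requests.exceptions.RequestException as e:
        print(f"{en_name}: download failed ({e})")
        return
    with open(f"{en_name}.jpg", "wb") as f:
        f.write(r.content)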

A few words at the end

The final crawl can be refined on your own. With the data captured locally, you can browse the familiar Digimon photos any time, full of memories.
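For anyone who wants a starting point, here is one way the pieces above could be wired together to save every Digimon's info to a CSV file and download all the pictures. It assumes get_all() is changed to return items instead of printing them; the file name digimon.csv and the half-second pause are my own choices, not from the original post:

import csv
import time

def crawl_everything():
    items = get_all()  # assumes get_all() now ends with `return items`
    with open("digimon.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "level", "url", "name", "debut"])
        for item in items:
            en_name = item[4][0:item[4].find('/')]
            writer.writerow([item[1], item[3], item[4], item[5], item[6]])
            get_img(item[5], en_name)  # picture is saved as {en_name}.jpg
            time.sleep(0.5)            # be gentle with the overseas server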