“This is the second day of my participation in the November More Text Challenge. For details of the event, see: The Last More Text Challenge 2021.”

Preface

Today I'll take you through crawling comics from a comics website. Without further ado, let's get started~

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

shutil module;

jsonpath module;

And some modules that come with Python.

Environment setup

Install Python and add it to your environment variables, then use pip to install the required third-party modules (for example, pip install requests jsonpath).

Approach

Comics are really just images, so let's find out where the image links live. Since the goal of this article is to crawl whichever comic you want, start by searching for any comic (I'll use Throne of Seal as the example here), then click into its detail page and open any chapter. The reading page's HTML source does not contain the data we need, so we open the developer tools and capture packets, and eventually we find the image links.

Once we have found the image links, the next step is to work out how to request the packet that contains them, i.e. the packet's URL. Comparing a few of these requests: chapter_newid changes every time you turn the page, while comic_id is the unique identifier of a comic.

```
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=1006&isWebp=1&quality=middle
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=2003&isWebp=1&quality=middle
https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=5323&chapter_newid=3004&isWebp=1&quality=middle
```

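To see the pattern concretely, here is a minimal sketch of requesting one of these packets with requests; the parameter values are copied straight from the example URLs above, and you can inspect the returned JSON to find the image data:

```python
import requests

# Parameter values copied from the example URLs above
params = {
    'product_id': 1,
    'productname': 'kmh',
    'platformname': 'pc',
    'comic_id': 5323,        # unique identifier of the comic
    'chapter_newid': 1006,   # changes on every page turn
    'isWebp': 1,
    'quality': 'middle',
}
resp = requests.get('https://www.kanman.com/api/getchapterinfov2', params=params)
print(resp.json())    # the chapter packet, including the image links
```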

Next, find out where these two parameters come from. Go to the homepage, search for Throne of Seal, and check the page source: the URL of the comic's detail page does appear there. But when I tried to extract it with regular expressions and XPath, I found that the source was full of identical HTML tags and that it listed more than one comic.

Then I tried searching for other comics and could not find them in that source at all. I had fallen into a trap: the source I was reading was actually the site homepage's source code. Tears! No matter. If it is not in the source, we will capture packets instead.

Open the developer tools, filter by XHR under the Network tab, and search for Throne of Seal. The first search turns up a packet, but it is shown in red (the request failed):

It still contains something we need, though. Because of the red failure flag we cannot view its data inside the developer tools, so we have to open the packet directly:

To capture a packet that is not red, click back into the search input box so that the request is sent again; merely refreshing the page and clicking search again will not capture it.

After we get the packet, we find comic_id, the unique identifier of the comic, and simply extract it from the packet:
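As a minimal sketch, and assuming the search endpoint and the data/comic_id fields used in the final code at the end of this post, extracting comic_id looks like this:

```python
import requests

# Endpoint and field names taken from the final code below
resp = requests.get('https://www.kanman.com/api/getsortlist/',
                    params={'search_key': 'Throne of Seal'})
for item in resp.json()['data']:
    print(item['comic_id'])    # e.g. 5323 for Throne of Seal
```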

After finding comic_id, look for chapter_newid. Its value differs from comic to comic; but if you search Douluo, for example, and page through it, you will see that chapter_newid changes incrementally:

chapter_newid: 1006 → 2003 → 3004 → …

So we know the first chapter_newid is statically loaded in the detail page and can be extracted from the detail page's source, and the detail page URL is simply https://www.kanman.com/ plus the comic_id:

That only gives us the first chapter's chapter_newid, though. Where do the others come from? It turns out the next page's chapter_newid is retrieved from the previous page's packet:
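Concretely, the code later uses the jsonpath module to pull every chapter_newid out of a packet with the recursive expression $..chapter_newid. A tiny sketch (the packet layout here is a simplified assumption, for illustration only):

```python
import jsonpath

# Simplified, assumed packet layout: the real packet nests chapter_newid in
# several places, which is why the code below indexes the result with [1] or [2]
packet = {
    'data': {
        'current_chapter': {'chapter_newid': '1006'},
        'next_chapter': {'chapter_newid': '2003'},
    }
}
# '$..chapter_newid' collects every chapter_newid anywhere in the JSON
print(jsonpath.jsonpath(packet, '$..chapter_newid'))    # ['1006', '2003']
```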

Code implementation
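
The functions below need a few imports, and they all call a get_response helper that the post itself doesn't show. Here is a minimal stand-in, assuming it simply wraps requests.get with a browser User-Agent:

```python
import json
import os
import re

import jsonpath    # pip install jsonpath
import requests    # pip install requests

# Minimal stand-in for the unshown get_response helper (an assumption):
# a thin wrapper around requests.get that sends a browser User-Agent.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_response(url):
    return requests.get(url, headers=headers)
```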

Build the function that extracts comic_id and chapter_id:

```python
def get_comic(url):
    data = get_response(url).json()['data']
    for i in data:
        comic_id = i['comic_id']
        chapter_newid_url = f'https://www.kanman.com/{comic_id}/'
        chapter_newid_html = get_response(chapter_newid_url).text
        # the first chapter_id sits statically in the detail page source
        chapter_id = re.findall('{"chapter_id":"(.*?)"}', chapter_newid_html)
        data_html(comic_id, chapter_id[0])
```

Now the key code. If you have ever crawled Weibo comment data, you will find these two steps similar: the paging value has to be obtained from the previous page:

```python
def data_html(comic_id, chapter_id):
    try:
        a = 1
        while True:    # loop over chapters
            if a == 1:
                comic_url = f'https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id={comic_id}&chapter_newid={chapter_id}&isWebp=1&quality=middle'
            else:
                comic_url = f'https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id={comic_id}&chapter_newid={chapter_newid}&isWebp=1&quality=middle'
            comic_htmls = get_response(comic_url).text
            comic_html_jsons = json.loads(comic_htmls)
            if a == 1:
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[1]
            else:    # from the second URL on, the extraction index shifts by 1
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[2]
            current_chapter = jsonpath.jsonpath(comic_html_jsons, '$..current_chapter')
            for img_and_name in current_chapter:
                image_url = jsonpath.jsonpath(img_and_name, '$..chapter_img_list')[0]    # image URLs
                # chapter_name contains stray whitespace, so strip it
                chapter_name = jsonpath.jsonpath(img_and_name, '$..chapter_name')[0].strip()
                save(image_url, chapter_name)
            a += 1
    except IndexError:    # the last chapter has no next id, which ends the loop
        pass
```

Save data:

```python
def save(image_url, chapter_name):
    for link_url in image_url:
        # image file name
        image_name = ''.join(re.findall(r'/(\d+\.jpg)-kmh', str(link_url)))
        image_path = data_path + chapter_name
        if not os.path.exists(image_path):    # create a folder per chapter title
            os.mkdir(image_path)
        image_content = get_response(link_url).content
        filename = '{}/{}'.format(image_path, image_name)
        with open(filename, mode='wb') as f:
            f.write(image_content)
            print(image_name)
    # get_img(chapter_name)    # optional image-stitching helper, not shown in this post
```

Main program:

```python
if __name__ == '__main__':
    key = input('Please enter the comic you want to download: ')
    data_path = r'D:/data knife/crawler④/manga/{}/'.format(key)
    if not os.path.exists(data_path):    # create a folder named after the comic the user entered
        os.makedirs(data_path)
    url = f'https://www.kanman.com/api/getsortlist/?search_key={key}'    # obtained by stripping the unnecessary query arguments
    get_comic(url)
```

Display of saved data