Recently, my girlfriend and I wanted to switch to matching couple profile pictures. After searching around, I finally found several nice ones on Xiaohongshu (Little Red Book). But when I went to save them, I found there was no download button!

You can share the post, though: copy the share link, open it in a browser, and save the picture from there. The catch is that the saved picture is watermarked.

Can I download the image without the watermark?

I did a quick search online and found plenty of mini programs that remove watermarks, but nearly all of them limit free use: after parsing a few links they stop working and ask for payment. As a crawler programmer, how could I put up with that?

Since they can obtain unwatermarked images directly from a Xiaohongshu share link, then in theory, so can I!

Enough talk, let's get to it.


1. Analyzing the approach

First, tap Share in the Xiaohongshu app and get the share link, for example: www.xiaohongshu.com/discovery/i…

Then open it in a browser (Chrome in my case).

Press F12 or Ctrl + Shift + I to open Developer Tools, switch to the Network tab, and filter by Img, as shown in the picture.

Refresh the page and it is easy to extract the link to the image we want.

You can preview the image in the Preview tab and see its request headers in the Headers tab.

As shown above, you can see the download link of the image.

ci.xiaohongshu.com/0c7a3f7b-92…

A brief analysis shows that the link consists of the following parts: domain name (https://ci.xiaohongshu.com/) + image ID (0c7a3f7b-92c9-4e0b-5408-4154abc82d86) + ? + compression parameters (imageView2/2/w/100/h/100/q/75).

Tip: visiting https://ci.xiaohongshu.com/0c7a3f7b-92c9-4e0b-5408-4154abc82d86?imageView2/2/w/100/h/100/q/75 directly in the browser pops up a download dialog, while dropping the ? and everything after it and visiting https://ci.xiaohongshu.com/0c7a3f7b-92c9-4e0b-5408-4154abc82d86 opens the image in the browser.
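The structure described above can be checked with a few lines of Python. This is only a sketch using the standard library's urllib.parse; the helper name split_image_url is my own invention for illustration and is not part of any Xiaohongshu API.

```python
from urllib.parse import urlsplit

def split_image_url(url):
    """Split an image link into (domain, image ID, compression parameters)."""
    parts = urlsplit(url)
    domain = f"{parts.scheme}://{parts.netloc}/"
    image_id = parts.path.lstrip("/")
    params = parts.query  # empty string when the link has no "?" part
    return domain, image_id, params

url = "https://ci.xiaohongshu.com/0c7a3f7b-92c9-4e0b-5408-4154abc82d86?imageView2/2/w/100/h/100/q/75"
domain, image_id, params = split_image_url(url)
print(domain)
print(image_id)
print(params)
```

The three printed lines correspond to the domain name, the image ID, and the compression parameters.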

Downloading the picture directly through the link above, we find that it is watermarked. So how do we get the picture without the watermark?

Let's keep analyzing.

As we already know, the image link is composed of domain name + ID + compression parameters. The compression parameters only affect the size and quality of the image, not whether it has a watermark; they can even be left out entirely (it seems that removing them gives us the original image before compression).

Therefore, whether the image is watermarked must be determined by the ID. And a programmer's intuition says that the unwatermarked image ID (if there is one) is probably somewhere near the watermarked image ID.

Next, we copy 0c7a3f7b-92c9-4e0b-5408-4154abc82d86 (the watermarked image ID) and search for it in the page source to see if anything turns up.

After some searching, I finally found a suspicious spot. It is a piece of JSON text: imageList contains many elements, each of which holds url, width, height, fieldId, and traceId.

We can see that url is the image link we just found (where \u002F is the Unicode escape for the slash /, not URL encoding) and fieldId is the ID of that image.
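To see why \u002F shows up, here is a small sketch. The sample string is made up for illustration (only the field names come from the page source), and json.loads from the standard library turns the escape back into a slash.

```python
import json

# Made-up sample shaped like one imageList element; the values are placeholders.
raw = r'{"url":"https:\u002F\u002Fci.xiaohongshu.com\u002Fexample-id","fieldId":"example-id","traceId":"example-trace"}'

item = json.loads(raw)
print(item["url"])      # the \u002F escapes are decoded to "/"
print(item["traceId"])
```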

So what is traceId? Could it be the ID of the unwatermarked image?

In the spirit of giving it a try, I replaced the image ID in the URL with the traceId value and opened it in the browser to check:

ci.xiaohongshu.com/5ab4de05-81…

Hey, guess what? The watermark is gone!! Ha ha ha ha

With that, we have worked out how to get around the watermark and successfully extracted the link to the unwatermarked image.
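The swap I did by hand in the browser can be sketched in Python as follows. The function name replace_image_id and the trace ID value are placeholders of my own; a real traceId has to come from the page source.

```python
from urllib.parse import urlsplit, urlunsplit

def replace_image_id(url, trace_id):
    """Return the image link with its ID swapped for trace_id and the ?... part dropped."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/" + trace_id, "", ""))

watermarked = "https://ci.xiaohongshu.com/0c7a3f7b-92c9-4e0b-5408-4154abc82d86?imageView2/2/w/100/h/100/q/75"
# "example-trace-id" is a placeholder, not a real traceId.
print(replace_image_id(watermarked, "example-trace-id"))
```

Dropping the compression parameters here matches the earlier observation that they can be left out.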

Next, we use a Python crawler to do this.

2. Coding

To recap the extraction steps:

  1. Fetch the page source of the Xiaohongshu share link

  2. Extract the imageList field containing the image information and parse it as JSON

  3. Extract the traceId field and use it in place of the fieldId part of the original image URL

  4. Download the image using the newly assembled URL

Next, let's write the code.

2.1 Network request function

import requests

def fetchUrl(url):
    '''Send a web request and return the source code of the page.'''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
        'cookie': 'Your own cookie',
    }

    r = requests.get(url, headers=headers)
    return r.text

We use the requests library to send the web request: url is the address to request, and headers is the request header, used to disguise the crawler as a browser and pass the necessary parameters to the server.

Xiaohongshu's server verifies the cookie, and the request fails if the cookie is missing or expired. So you need to copy your own cookie from the browser and paste it into the code.
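If you would rather not paste the cookie into the source file, one option is to read it from an environment variable. The sketch below is just one way to do that; the variable name XHS_COOKIE and the helper build_headers are hypothetical names I picked for this example.

```python
import os

def build_headers(cookie):
    """Build the request headers; fail early if no cookie was supplied."""
    if not cookie:
        raise ValueError("No cookie provided; copy one from your browser first.")
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "cookie": cookie,
    }

# XHS_COOKIE is a hypothetical environment variable name chosen for this example.
cookie = os.environ.get("XHS_COOKIE", "")
if cookie:
    headers = build_headers(cookie)
    print("Headers ready:", sorted(headers))
else:
    print("Set XHS_COOKIE before running the crawler.")
```

The resulting dictionary can be passed to requests.get(url, headers=headers) in the same way as the hard-coded one.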

2.2 Analyzing image links

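def parsing_link(html):
    '''Parse the HTML text and yield the URLs of the unwatermarked images.'''

    # The image data sits between the strings 'imageList' and ',"cover"';
    # + 11 skips 'imageList' (9 characters) plus the '":' that follows it.
    beginPos = html.find('imageList') + 11
    endPos = html.find(',"cover"')
    # eval() executes whatever it is given, so only run it on pages you trust.
    imageList = eval(html[beginPos: endPos])

    for i in imageList:
        picUrl = f"https://ci.xiaohongshu.com/{i['traceId']}"
        yield picUrl, i['traceId']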


This function parses the page source and extracts the URLs of the unwatermarked images.

Here I did not use a third-party parsing library but took a shortcut instead, which you can use as a reference.

The data we want is the part in the red box, that is, the data between the strings imageList and cover.

So we use .find to locate the start and end positions directly, and html[beginPos: endPos] extracts the part in between.

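In addition, the eval() function here evaluates the string as a Python expression, which turns the JSON text into a list of dictionaries. Since eval() will execute whatever code it is given, json.loads() is the safer choice for text taken from a web page.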
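For reference, here is a sketch of the same extraction using json.loads from the standard library instead of eval(). It assumes, like the code above, that the page contains "imageList": followed directly by a JSON array and then ,"cover". The function name parse_trace_ids and the sample fragment are made up for illustration.

```python
import json

def parse_trace_ids(html):
    """Extract the traceId of every image using json.loads instead of eval()."""
    marker = '"imageList":'
    begin = html.find(marker)
    end = html.find(',"cover"', begin)
    if begin == -1 or end == -1:
        raise ValueError("imageList not found; the page layout or cookie may have changed")
    image_list = json.loads(html[begin + len(marker):end])
    return [item["traceId"] for item in image_list]

# Made-up page fragment for illustration only.
sample = (
    r'{"imageList":[{"url":"https:\u002F\u002Fci.xiaohongshu.com\u002Fid-1","traceId":"trace-1"},'
    r'{"url":"https:\u002F\u002Fci.xiaohongshu.com\u002Fid-2","traceId":"trace-2"}],"cover":{}}'
)
print(parse_trace_ids(sample))
```

Unlike eval(), json.loads also copes with JSON literals such as true, false, and null, and it raises an error instead of running code when the text is not valid JSON.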

2.3 Download and save the image


def download(url, filename):
    '''Download the image at url and save it as <filename>.jpg.'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
    }
    # Open the file in binary mode, since image data is bytes
    with open(f'{filename}.jpg', 'wb') as v:
        try:
            r = requests.get(url, headers=headers)
            v.write(r.content)
        except Exception as e:
            print('Image download error!', e)

This function is used to download and save an image locally from the image link.

The difference between saving an image and saving text is that image data is binary, so:

  1. Use r.content when handling the response (r.text is generally used for plain text)

  2. When saving the file, open it in 'wb' mode ('w' is used for plain text)
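The two points above can be tried out without any network access. In this small sketch the four bytes stand in for r.content, and the file is written to a temporary directory; everything here is illustrative rather than part of the crawler.

```python
import os
import tempfile

data = b"\xff\xd8\xff\xe0"  # a few raw bytes standing in for r.content

path = os.path.join(tempfile.mkdtemp(), "demo.jpg")

# Binary mode accepts bytes as-is.
with open(path, "wb") as f:
    f.write(data)

# Text mode rejects bytes: write() expects a str there.
try:
    with open(path + ".txt", "w") as f:
        f.write(data)
    text_mode_ok = True
except TypeError:
    text_mode_ok = False

print(os.path.getsize(path), text_mode_ok)
```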

2.4 The main function

if __name__ == '__main__':
    original_link = 'https://www.xiaohongshu.com/discovery/item/60a5f16f0000000021034cb4'
    html = fetchUrl(original_link)
    for url, traceId in parsing_link(html):
        print(f"download image {url}")
        download(url, traceId)
        
    print("Finished!")

The main function acts as the crawler's scheduler: it starts the crawler and controls its progress.

original_link is the original request link, namely the Xiaohongshu share link.

In download(url, traceId), the second parameter is the file name for the saved image. Here I use traceId as the file name; you can set your own naming rules as needed.
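As one example of a custom naming rule, the sketch below numbers the images and keeps only the first eight characters of the traceId. The helper make_filename, the prefix "xhs", and the sample ID are my own placeholders.

```python
def make_filename(index, trace_id, prefix="xhs"):
    """Build a file name such as 'xhs_01_abcd1234' (download() adds the .jpg)."""
    return f"{prefix}_{index:02d}_{trace_id[:8]}"

# "abcd1234-example" is a placeholder, not a real traceId.
print(make_filename(1, "abcd1234-example"))
```

In the main loop this could be combined with enumerate(parsing_link(html), start=1) to get the index.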

3. Running the crawler

Start the program and the crawler runs smoothly. The result of the run is shown below.

The unwatermarked images are also saved locally.

4. Full crawler source code

import requests

'''
https://ci.xiaohongshu.com/ is the base link for unwatermarked Xiaohongshu images;
just append the traceId value to it.
'''

def fetchUrl(url):
    '''Send a web request and return the source code of the page.'''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'cookie': 'Your own cookie',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
    }
    
    r = requests.get(url, headers = headers)
    return r.text

def parsing_link(html):
    '''Parse the HTML text and yield the URLs of the unwatermarked images.'''

    beginPos = html.find('imageList') + 11
    endPos = html.find(',"cover"')
    imageList = eval(html[beginPos: endPos])

    for i in imageList:
        picUrl = f"https://ci.xiaohongshu.com/{i['traceId']}"
        yield picUrl, i['traceId']

def download(url, filename):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
    }
    with open(f'{filename}.jpg', 'wb') as v:
        try:
            r = requests.get(url, headers=headers)
            v.write(r.content)
        except Exception as e:
            print('Image download error! ')

if __name__ == '__main__':
    original_link = 'https://www.xiaohongshu.com/discovery/item/60a5f16f0000000021034cb4'
    html = fetchUrl(original_link)
    for url, traceId in parsing_link(html):
        print(f"download image {url}")
        download(url, traceId)
        
    print("Finished!")

Now my girlfriend and I can finally enjoy our matching profile pictures.

If anything in the article is unclear or explained incorrectly, please point it out in the comment section, or scan the QR code below and add us on WeChat. We can learn, exchange ideas, and make progress together.