Table of contents

    • Preface
    • Page analysis
      • Get links to all public blogs
    • Page analysis
      • Test document
      • Three problems loom
        • Solution to Problem 1
        • Solution to Problem 2
        • Solution to Problem 3
          • Idea 1
          • Idea 2
          • Idea 3
          • My choice
      • Results analysis
      • Analysis of the new results
      • Interface Xpath
    • Crawl a single post
      • Regular expression analysis
      • Regular expression testing
      • The state machine
      • Save to file
    • Get all blogs

Preface

Let's do something exciting this time: crawl all of my own blog posts. Of course, only I can do this for my blog; if you want to crawl, crawl your own. I'll post the code and the analysis results later.

A couple of unpleasant things happened to me over the last two weeks, and right now I'm feeling pretty down.


Page analysis

Get links to all public blogs

At first I went looking for the sitemap, to see if I could find my own section in it. Then I realized I was overthinking it: the sitemap exists, but with this many bloggers it can't realistically cover everyone. So that road was blocked.

Next I went to the "Article Management" screen, but I quickly realized it is a dynamic page. The pager at the bottom showed fifteen pages, which was a bit discouraging. Still, this page is simpler than the home page, so I captured its network requests to take a look. I found the request carrying the article IDs, but its URL could not be opened on its own, so I gave that up too.

Finally, I went back to my blog's home page.

The pager at the bottom shows seven pages. Good, let's go with that.

I originally planned to grab each link together with its title; on second thought, the article page itself contains the title, so I can pick it up there when the time comes.

So I started writing code. To be fair, Chrome works really well for copying Xpath: the Xpath copied from Firefox always came back empty, while a relative Xpath copied from Chrome landed in the right place every time.

import requests
import threadpool
from lxml import etree
import pandas as pd

cookie = 'Put your own'
header = {
    'User-Agent': 'Put your own',
    'Connection': 'keep-alive',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'Cookie': cookie,
    'referer': 'Put your own home page'
}

url_list = ['https://lion-wu.blog.csdn.net/article/list/1',
            'https://lion-wu.blog.csdn.net/article/list/2',
            'https://lion-wu.blog.csdn.net/article/list/3',
            'https://lion-wu.blog.csdn.net/article/list/4',
            'https://lion-wu.blog.csdn.net/article/list/5',
            'https://lion-wu.blog.csdn.net/article/list/6',
            'https://lion-wu.blog.csdn.net/article/list/7']	# These links follow an obvious pattern

keep_url_list = []	# Used to collect the scraped article links

def outdata(url):
    try:
        print('succeed' + url)
        res = requests.get(url, headers=header)
        wbdata = res.content.decode('UTF-8')
        tree = etree.HTML(wbdata)
        el_list = tree.xpath('//*[@id="articleMeList-blog"]/div[2]//div/h4/a/@href')
        print(el_list)
        keep_url_list.append(el_list)

    except:
        print('failed' + url)

def Thread_Pool(outdata, datalist=None, Thread_num=5):
    """Create a thread pool, hand it a batch of tasks, and wait until they finish.
    :param outdata: function pointer, the task the thread pool executes
    :param datalist: list of arguments, one per task
    :param Thread_num: number of threads to initialize
    :return: None
    """
    pool = threadpool.ThreadPool(Thread_num)  # create a pool with Thread_num threads

    tasks = threadpool.makeRequests(outdata, datalist)  # build the tasks for the threads
    # outdata is the function name and datalist is a list of arguments; the pool feeds each
    # element of datalist into the function, so the length of datalist is the number of tasks.

    [pool.putRequest(req) for req in tasks]  # put the tasks into the thread pool

    pool.wait()  # block until every child thread has finished


Thread_Pool(outdata, datalist=url_list, Thread_num=7)


#outdata('https://lion-wu.blog.csdn.net/article/list/1')

u2 = []
for i in keep_url_list:
    for j in i:
        print(j)
        u2.append(j)

pd.DataFrame(u2, columns=['url']).to_csv('My_CSDN.csv', index=False)	# name the column 'url' so it can be read back later

Page analysis

Test document

This article works against a test document; if you want to follow along, open the test document and repeat the steps on your own.

Three problems loom

Open the source of any blog post at random and you will see that different components carry different tags. That raises three questions:

1. How many different effects have I actually used?
2. When crawling, how do I keep the data from different tags in its original order when storing it?
3. How do I record which tag (which effect) each piece of text came from?

Solution to Problem 1

The first question is easy: open the editing interface and you can see all the effects at a glance.

Recalling all the effects I have ever used, the list is:

Article title, in-text headings, (table of contents), yellow highlight, bold, italic, unordered list, ordered list, to-do, [quote], [code block], [picture], [table], [hyperlink], [separator]

The ones in round parentheses are not needed; the ones in square brackets are the commonly used ones.

As for how these effects look in the source code: hunting for them in real posts is hopeless, so I wrote a blog post that packs all of these effects into the test document.


Solution to Problem 2

Question two had me stuck for a while, because I didn't know whether Xpath preserves the document order of different tags while crawling. I searched Baidu for a while and, honestly, everything I found was useless.

So I ran a quick demo:

import requests
from lxml import etree

# The header and cookie definitions from above are not repeated here

def outdata(url):
    try:
        print('succeed' + url)
        res = requests.get(url, headers=header)
        wbdata = res.content.decode('UTF-8')
        tree = etree.HTML(wbdata)
        el_list = tree.xpath('//*[@id="articleMeList-blog"]/div[2]//div/h4/a')
        for el in el_list:
            e = el.xpath('./text() | ./@href')
            # I also ran this with the order reversed, to rule out the possibility that
            # the output order depends on the query order rather than document order.
            print(e)

    except:
        print('failed' + url)

outdata('https://lion-wu.blog.csdn.net/article/list/1')

It turned out to be a success; all that's left is a bit of string cleanup to strip the escape characters and the surrounding spaces.
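For reference, that cleanup might look like this (a minimal sketch; exactly what padding CSDN emits around the text nodes is an assumption here):

for el in el_list:
    e = el.xpath('./text() | ./@href')
    e = [s.strip() for s in e if s.strip()]  # drop empty fragments, newlines and padding
    print(e)  # e.g. ['https://lion-wu.blog.csdn.net/article/details/...', 'Some article title']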


Solution to Problem 3

I originally thought this one was the simplest, the kind of question I wouldn't even need to dwell on. It turns out that's not the case.

Idea 1

If you grab the text inside the tags directly, you inevitably lose the tags themselves. My first thought: take the article title first, then take the entire source of the article body, mark every tag in that source with regular expressions, and only then use Xpath to pull out the text and links.

Result: after converting the element to a string, nothing came back.
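For the record, Idea 1 amounts to roughly the following (a sketch of the failed attempt; src and marked are names made up for illustration):

src = etree.tostring(el).decode('UTF-8')      # element -> source string
marked = re.sub(r'<(h[1-6])>', r'[\1]', src)  # mark tags with a regular expression
# ...but a plain str has no .xpath() method, so querying the marked string
# returns nothing unless it is re-parsed with etree.HTML() first.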

Then I had another idea.

Idea 2

First, drop the effects that carry nothing we need: bold, italic, yellow highlight, underline and the like can all go. Unordered, ordered and to-do lists collapse into one category; no finer distinction needed.

With that choice, the only effects that still need attention (i.e. need a separately marked copy) are: quotes, code blocks, pictures, tables and hyperlinks.

For quotes and code blocks, only the beginning and end are marked; for tables, the header row is stripped and only the beginning and end are marked; for hyperlinks and pictures, the links themselves have to be extracted.

A matching pass takes care of the rest.

That is: extract all the text and links first, then go over everything again to pull out the important pieces. A sketch of the marking step follows.

This is a little more complicated, but perfectly implementable.
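Had I gone this way, the marking pass might have looked roughly like this (a sketch of the rejected Idea 2; the marker names are mine):

marked = src
marked = marked.replace('<blockquote>', '[quote-begin]').replace('</blockquote>', '[quote-end]')
marked = marked.replace('<pre>', '[code-begin]').replace('</pre>', '[code-end]')
# ...then a second pass re-extracts the text together with these markers, in order.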

Idea 3

See whether the text can be tagged directly during the Xpath extraction itself. If it can, great.


My choice

I went with Idea 3, and it worked. Remember how Idea 1 converts etree objects to strings? So I can simply select all the tags, skip extracting the text, and convert the elements straight to strings; that gives me every tag and all the text at once. Finally, regular expressions turn the long HTML tags into short markers.

Take a look at the tags extracted from the test document:

def outdata(url):
    try:
        print('succeed' + url)
        res = requests.get(url, headers=header)
        code = res.apparent_encoding  # get the page's encoding
        res.encoding = code

        wbdata = res.text
        tree = etree.HTML(wbdata)
        el_list = tree.xpath('//*[@id="content_views"]')

        for el in el_list:
            # e = etree.tostring(el, encoding=code).decode(code)
            # This step would give the source of the entire article body

            # The interface xpaths are listed individually further below
            es = el.xpath('./h1 | ./h2 | ./h3 | ./h4 | ./h5 | ./h6 | ./p | ./p/mark | ./p/span/span/span/span[2]//span/span[2]'
                ' | ./p/strong | ./p/em | ./ul//li | ./ol//li | ./blockquote/p | ./pre/code | ./p/code'
                ' | ./div/table/thead/tr//th | ./div/table/tbody/tr//td | ./hr | ./p/img | ./p/a')

            for e in es:
                print(etree.tostring(e, encoding=code).decode(code))
                print('-----')	# debug output, makes the results easier to read

    except:
        print('failed' + url)

Results:

succeedhttps://lion-wu.blog.csdn.net/article/details/113402976
<p/>
-----
<p/>
-----
<h1><a id="_2"/>First-level heading</h1>
-----
<h2><a id="_3"/>Second-level heading</h2>
-----
<h3><a id="_4"/>Third-level heading</h3>
-----
<h4><a id="_5"/>Fourth-level heading</h4>
-----
<h5><a id="_6"/>Fifth-level heading</h5>
-----
<h6><a id="_7"/>Sixth-level heading</h6>
-----
<p>This is a test document; don't worry about why it reads so plainly. I am writing a crawler project; you'll understand once the last post of my crawler self-study series is out. If you want to reproduce it then, just come to me directly.</p>
-----
<p><span class="katex--display"><span class="katex-display"><span class="katex"><span class="katex-mathml"> ... (the KaTeX markup for the formula a = b + c: dozens of nested <span> tags, abbreviated here) ... </span></span></span></span></p>
-----
<span class="mord mathdefault">a</span>
-----
<span class="mord mathdefault">b</span>
-----
<span class="mord mathdefault">c</span>
-----
<p><mark>This is highlighted text</mark></p>
-----
<mark>This is highlighted text</mark>
-----
<p><strong>This is bold text</strong></p>
-----
<strong>This is bold text</strong>
-----
<p><em>This is italic text</em></p>
-----
<em>This is italic text</em>
-----
<li>This is unordered</li>
-----
<li>This is unordered</li>
-----
<hr/>
-----
<li>This is ordered</li>
-----
<li>This is ordered</li>
-----
<hr/>
-----
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled"/>This is a to-do</li>
-----
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled"/>This one is still pending</li>
-----
<hr/>
-----
<br/><p>This is a quote, the quote is here<br/>here is the quote<br/>the quote is here<br/>...</p>
-----
<code class="prism language-python">Here is a code block</code>
-----
<hr/>
-----
<p><img src="https://img-blog.csdnimg.cn/20210129182417155.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQzNzYyMTkx,size_16,color_FFFFFF,t_70" alt="Insert picture description here"/></p>
-----
<img src="https://img-blog.csdnimg.cn/20210129182417155.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQzNzYyMTkx,size_16,color_FFFFFF,t_70" alt="Insert picture description here"/>
-----
<p><a href="https://blog.csdn.net/qq_43762191?spm=1001.2101.3001.5343">...</a></p>
-----
<a href="https://blog.csdn.net/qq_43762191?spm=1001.2101.3001.5343">...</a>
-----
<hr/>
-----
<p>Don't forget the underlined <code>This is a line of code</code></p>
-----
<code>This is a line of code</code>
-----

Results analysis

Analyzing these results lets us decide on the next step.

1. The first thing that jumps out is the pile of tags around the formula. I don't remember whether any of my posts actually contain formulas, but I'll keep the handling anyway; it's regular enough to deal with.

2. The second obvious problem is repetition. It didn't appear earlier when extracting text directly, because './' only extracts from the current subpath; but now that we convert to strings, './p' has become the parent of every tag selected by a path that starts with './p/'. At that point repetition is unavoidable. As long as we select tags this way the contradiction can't be reconciled, so the tags have to be deduplicated after extraction. In other words, I'll have to write a deduplication function; a sketch follows this list.

3. Another way out of the same problem: drop all the paths that start with './p/', keep only './p' itself, and treat <p> as the lowest-priority tag.
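A minimal sketch of the order-preserving deduplication I have in mind (the helper name dedup_keep_order is mine, not from the final code):

def dedup_keep_order(items):
    """Drop repeated tag strings while keeping each first occurrence in place."""
    seen = set()
    out = []
    for it in items:
        if it not in seen:  # keep only the first occurrence
            seen.add(it)
            out.append(it)
    return out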

Let's see what the second option gives us.

succeedhttps://lion-wu.blog.csdn.net/article/details/113402976
<p/>
-----
<p/>
-----
<h1><a id="_2"/>First-level heading</h1>
-----
<h2><a id="_3"/>Second-level heading</h2>
-----
<h3><a id="_4"/>Third-level heading</h3>
-----
<h4><a id="_5"/>Fourth-level heading</h4>
-----
<h5><a id="_6"/>Fifth-level heading</h5>
-----
<h6><a id="_7"/>Sixth-level heading</h6>
-----
<p>This is a test document; don't worry about why it reads so plainly. I am writing a crawler project; you'll understand once the last post of my crawler self-study series is out. If you want to reproduce it then, just come to me directly.</p>
-----
<p><span class="katex--display"> ... (the KaTeX markup for a = b + c again, abbreviated) ... </span></p>
-----
<p><mark>This is highlighted text</mark></p>
-----
<p><strong>This is bold text</strong></p>
-----
<p><em>This is italic text</em></p>
-----
<li>This is unordered</li>
-----
<li>This is unordered</li>
-----
<hr/>
-----
<li>This is ordered</li>
-----
<li>This is ordered</li>
-----
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled"/>This is a to-do</li>
-----
<li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled"/>This one is still pending</li>
-----
<hr/>
-----
<br/><p>This is a quote, the quote is here<br/>here is the quote<br/>the quote is here<br/>...</p>
-----
<code class="prism language-python">Here is a code block</code>
-----
<hr/>
-----
<p><img src="https://img-blog.csdnimg.cn/20210129182417155.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQzNzYyMTkx,size_16,color_FFFFFF,t_70" alt="Insert picture description here"/></p>
-----
<p><a href="https://blog.csdn.net/qq_43762191?spm=1001.2101.3001.5343">...</a></p>
-----
<hr/>
-----
<p>Don't forget the underlined <code>This is a line of code</code></p>
-----

Analysis of the new results

The new results are better than the previous ones, both in terms of repetition and in terms of text decoding, so here is a fresh analysis.

1. For formulas: split the captured string, drop the empty parts, and take the penultimate element; a sketch follows this list.

2. For blockquotes, what comes back looks like:

</p> </blockquote>

or

<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>

so stray closing tags and runs of <br/> need to be cleaned up.

3. When grabbing image links, be careful to trim them clean at both ends.

4. Pay attention to extracting inline code.

5. The regex has to extract both the tags and the text, so mind how the two are stored.

Not much else.
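The formula handling from point 1 might look like this (a sketch, assuming res2 is the list of text fragments pulled from the KaTeX markup, as in the regex section below):

parts = [p.strip() for p in res2 if p.strip()]  # drop the empty segments
formula = parts[-2]  # the penultimate element holds the readable form, e.g. 'a=b+c'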


Interface Xpath

Article title: //*[@id="articleContentId"]
Article body: //*[@id="content_views"]


In-text headings:
//*[@id="content_views"]//h1
//*[@id="content_views"]//h2
//*[@id="content_views"]//h3
//*[@id="content_views"]//h4
//*[@id="content_views"]//h5
//*[@id="content_views"]//h6


Paragraph text: //*[@id="content_views"]//p
Yellow highlight: //*[@id="content_views"]//p/mark
Formula: //*[@id="content_views"]//p/span/span/span/span[2]//span/span[2]
Bold: //*[@id="content_views"]//p/strong
Italic: //*[@id="content_views"]//p/em


Unordered list: //*[@id="content_views"]//ul//li/text()
Ordered list: //*[@id="content_views"]//ol//li/text()
Quote: //*[@id="content_views"]//blockquote/p//text()


Code block: //*[@id="content_views"]//pre/code/text()
Inline code: //*[@id="content_views"]//p/code
Hyperlink: //*[@id="content_views"]//p/a


Table cells: //*[@id="content_views"]//div/table//td
Separator: //*[@id="content_views"]//hr


Picture: //*[@id="content_views"]//p/img


Crawl a single post

After all the careful analysis above, the plan is to crawl one blog post and save it to a file in the correct order. Which one to crawl? The test document, of course.

Regular expression analysis

After a stretch of hard work, I came up with regular expressions like these:

import re	# new import needed for this section

res = re.findall('(<.+?>)', string=string)
# res2 = re.findall('(>.*?<)', string=string)
res2 = re.findall('(>[\s\S]*?<)', string=string)
print(res)

# The second expression captures the surrounding '>' and '<', so they have to be removed
for r2 in range(len(res2)):
    res2[r2] = res2[r2].replace('>', '').replace('<', '').replace('\n', '').strip()
    # index-based traversal is required because the list is modified in place
    # (a string, once created, cannot be modified by subscript)

for r3 in res2[:]:  # without the res2[:] copy, removal would make the traversal skip items
    if r3 == '':
        res2.remove(r3)

result = ''.join(res2)	# this should not stay a naive join; the joining logic belongs to the formula handling.
# That will get its own blog post later.

print(result)

Regular expression testing

First, try the simplest one:

string = '<h1><a id="_2"/>First-level heading</h1>'	# the h1 captured earlier

The output:

['<h1>', '<a id="_2"/>', '</h1>']
First-level heading

So the expression is tentatively usable.


Now a longer one:

['<p>', '</p>']
This is a test document; don't worry about why it reads so plainly. I am writing a crawler project; you'll understand once the last post of my crawler self-study series is out. If you want to reproduce it then, just come to me directly.

I think you can tell which paragraph that is.


Now for the formula part we're all tired of looking at. Will it work?

string = '''
(the full KaTeX markup for a = b + c captured in the results above goes here; it is too long to repeat)
'''

Results:

['<p>', '<span class="katex--display">', '<span class="katex-display">', '<span class="katex">', '<span class="katex-mathml">', '</span>', '<span class="katex-html">', '<span class="base">', '<span class="strut" style="height: 0.43056em; vertical-align: 0em;"/>', '<span class="mord mathdefault">', '</span>', '<span class="mspace" style="margin-right: 0.277778em;"/>', '<span class="mrel">', '</span>', '<span class="mspace" style="margin-right: 0.277778em;"/>', '</span>', '<span class="base">', '<span class="strut" style="height: 0.77777em; vertical-align: -0.08333em;"/>', '<span class="mord mathdefault">', '</span>', '<span class="mspace" style="margin-right: 0.222222em;"/>', '<span class="mbin">', '</span>', '<span class="mspace" style="margin-right: 0.222222em;"/>', '</span>', '<span class="base">', '<span class="strut" style="height: 0.43056em; vertical-align: 0em;"/>', '<span class="mord mathdefault">', '</span>', '</span>', '</span>', '</span>', '</span>', '</span>', '</p>']

a=b+c

A success!!

One thing the algorithm must not forget, though, is link handling: the links live inside the tags themselves!!


The state machine

The code up to this point has drifted a little from earlier versions; I can't remember every change. With this state machine, the tags and everything else can already be labeled properly. There are still improvements to make, but at the moment the cost of skipping them is low, so I didn't write them.

def get_div_name(div_list):
    """A state machine used to map a list of tags to a label.
    :param div_list: list of tag strings
    :return: the final label name
    """
    if div_list[0] == '<hr/>':
        return '[separator]'
    elif div_list[0][1] == 'h':
        hn = re.search('[0-9]{1}', div_list[0]).group(0)
        return '[level-' + hn + ' heading]'
    elif div_list[0] == '<li>':
        return '[enumeration]'
    elif div_list[0] == '<li class="task-list-item">':
        return '[to-do]'
    elif div_list[0] == '<blockquote>':
        return '[quote]'
    elif '<code class' in div_list[0]:
        l = re.search('(".+?")', div_list[0])
        language = l.group(0).replace('-', ' ').replace('"', '')
        return '[' + language + ' code block]'
    elif div_list[0] == '<p>':
        if div_list[1] == '</p>':
            return '[plain text]'
        elif 'katex' in div_list[1]:
            return '[formula]'
        elif div_list[1] == '<mark>':
            return '[yellow highlight]'
        elif div_list[1] == '<strong>':
            return '[bold]'
        elif div_list[1] == '<em>':
            return '[italic]'
        elif div_list[1] == '<code>':
            return '[inline code]'
        elif 'img' in div_list[1]:
            h = re.search('(".+?")', div_list[1])
            href = h.group(0).replace('"', '')
            return '[picture]:' + href
        elif 'href' in div_list[1]:
            h = re.search('(".+?")', div_list[1])
            href = h.group(0).replace('"', '')
            return '[hyperlink]:' + href
        else:
            return ''
    else:
        return ''

def outdata(url):
    try:
        print('succeed' + url)
        res = requests.get(url, headers=header)
        code = res.apparent_encoding  # get the page's encoding
        res.encoding = code

        wbdata = res.text
        tree = etree.HTML(wbdata)
        el_list = tree.xpath('//*[@id="content_views"]')

        for el in el_list:

            es = el.xpath('./h1 | ./h2 | ./h3 | ./h4 | ./h5 | ./h6 | ./p | ./ul//li | ./ol//li | ./blockquote | ./pre/code'
                ' | ./div/table/thead/tr//th | ./div/table/tbody/tr//td | ./hr')

            for e in es:
                string = etree.tostring(e, encoding=code).decode(code)
                res = re.findall('(<.+?>)', string=string)
                res2 = re.findall('(>[\s\S]*?<)', string=string)
                div_name = get_div_name(res)
                for r2 in range(len(res2)):
                    res2[r2] = res2[r2].replace('>', '').replace('<', '').replace('\n', '').strip()

                for r3 in res2[:]:  # without the res2[:] copy, removal would make the traversal skip items
                    if r3 == '':
                        res2.remove(r3)

                if div_name != '':
                    res2.insert(0, div_name)

                print('\n'.join(res2))
                print('-----')

    except:
        print('failed' + url)
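A quick sanity check of the state machine on a couple of hand-made tag lists (hypothetical inputs, shaped like what re.findall('(<.+?>)', ...) returns):

print(get_div_name(['<h2>', '<a id="_3"/>', '</h2>']))         # -> [level-2 heading]
print(get_div_name(['<p>', '<strong>', '</strong>', '</p>']))  # -> [bold]
print(get_div_name(['<hr/>']))                                 # -> [separator]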

Save to file

We're getting close to the end. I made a few tweaks and saved the data to a file.

import os

def save_to_file(file_name, contant):
    """Write data to a file.
    :param file_name: file name
    :param contant: file contents
    :return: None
    """
    file_path = r'D:\CSDN blogs'
    if not os.path.exists(file_path):   # create the destination folder if it does not exist
        os.mkdir(file_path)
    w_file_path = file_path + '\\' + file_name + '.txt'   # note: file_name is the variable, not the string 'file_name'
    f = open(w_file_path, 'w', encoding='utf-8')  # UTF-8, since the posts contain non-ASCII text
    for c in contant:
        f.write(c)
    f.close()
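Hypothetical usage, wiring it up to the extractor above: collect one article's tagged lines into a list and hand them to save_to_file along with the article title.

lines = ['[level-1 heading]', 'First-level heading', '-----', '[plain text]', 'This is a test document...']
save_to_file('Test document', lines)   # writes D:\CSDN blogs\Test document.txt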

Get all blogs

Basically, just throw everything into the thread pool. Add the two lines below to start it:

url_list = pd.read_csv('My_CSDN.csv')['url']	# the 'url' column we wrote earlier

Thread_Pool(outdata, datalist=url_list, Thread_num=10)

It's still a little rough, but version 1.0 is out, and what comes next is optimization.

The code in this article is laid out in full above. To get the complete script, scan the QR code on the side and reply "blog" in the backend for the latest version. The first version will be up before 2021.2.1.

As for private posts, I'll cover them when I do the optimization pass.