This article is participating in Python Theme Month. See the link to the event for more details

A reader messaged me the other day to ask whether I could teach him to crawl QQ Zone (Qzone). I asked him what was going on. It turned out that the girl he is pursuing posts on a Qzone confession wall, and he wanted to find out what kind of person she likes. The wall carries far too many posts to search by hand, so he wanted to crawl everything from a certain period and read through it at his own pace.

At my age, I had no idea that QQ still had confession walls. I took a look, and it is much like the Weibo "tree hole" accounts from my college days. If you know what a Weibo tree hole is, you are probably not that young either.

It looks something like the picture above. The guy is so handsome that I blurred him out so nobody gets jealous. In practice, the task is simply crawling Qzone status posts.

How to log in

To crawl Qzone, the first question is how to get past the login: can we bypass it, or do we have to log in? There are several options. Method 1 is to fill in a cookie copied from a logged-in browser session, but cookies change often, so this is not very convenient. Method 2, the one used here, is to drive a real browser and simulate the login.
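For comparison, here is a minimal sketch of what method 1 ("fill in the cookie") could look like, using only the standard library. The cookie string and the URL are placeholders rather than real Qzone values, and the request is only built, not sent.

```python
from urllib.request import Request


def parse_cookie_string(raw):
    """Turn 'a=1; b=2' (as copied from the browser's dev tools) into a dict."""
    cookies = {}
    for part in raw.split(';'):
        if '=' in part:
            name, value = part.split('=', 1)
            cookies[name.strip()] = value.strip()
    return cookies


def build_request(url, cookies):
    """Attach the cookies to a request as a single Cookie header."""
    header = '; '.join(f'{name}={value}' for name, value in cookies.items())
    return Request(url, headers={'Cookie': header})


# Placeholder values, for illustration only.
raw_cookie = 'session_id=abc123; token=xyz'
request = build_request('https://example.com/', parse_cookie_string(raw_cookie))
```

Once the cookie expires, the string has to be copied from the browser again, which is why the article goes with the browser-driver approach instead.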

I recommend ChromeDriver. You can download it yourself; it is widely available online. Pick the build that matches your installed Chrome version.

Selenium is a web automation tool that browses sites and clicks on things for you, with no manual work. Almost anything a person can do in a browser, this library can do too.

Using the driver to find controls

Since WebDriver solves the login problem described above, let's put it to work.

driver = webdriver.Chrome(executable_path='chromedriver')
# Open the login page
driver.get(' web link ')
# Switch to account-and-password login
driver.find_element_by_id('switcher_plogin').click()
# Enter known QQ account
driver.find_element_by_id('u').send_keys(user)
# Enter known QQ password
driver.find_element_by_id('p').send_keys(pw)
# Click the login button
driver.find_element_by_id('login_button').click()

webdriver.Chrome(executable_path='chromedriver') creates the driver, driver.get opens the given URL in the browser, and driver.find_element_by_id looks up a control by its id. The whole sequence simulates the clicks and keystrokes of a person logging in to Qzone by hand.
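Recent Selenium 4 releases no longer provide find_element_by_id; the lookup is written as find_element(By.ID, ...) instead. The sketch below wraps the login steps in a function using that form. It passes the plain string "id", which is the value of Selenium's By.ID constant, so the function itself needs no imports. The function name login is my own, and the element ids are the ones from the article.

```python
def login(driver, user, pw):
    """Log in through the page's form controls.

    driver is any Selenium WebDriver that is already showing the login page.
    The locator string "id" is the value of selenium's By.ID constant.
    """
    # Switch to account-and-password login.
    driver.find_element("id", "switcher_plogin").click()
    # Type the account number and the password into their input boxes.
    driver.find_element("id", "u").send_keys(user)
    driver.find_element("id", "p").send_keys(pw)
    # Submit the form.
    driver.find_element("id", "login_button").click()
```

With a real browser, this would be called right after driver.get(...) has opened the login page. Taking the driver as a parameter also means the function can be exercised with a stand-in object while you are developing.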

Get the content and save it

 qq_name = div.xpath('./div[2]/a/text()')
 qq_content = div.xpath('./div[2]/pre/text()')
 qq_time = div.xpath('./div[4]/div[1]/span/a/text()')
 qq_name = qq_name[0] if len(qq_name) > 0 else ''
 qq_content = qq_content[0] if len(qq_content) > 0 else ''
 qq_time = qq_time[0] if len(qq_time) > 0 else ''
 print(qq_name, qq_time, qq_content)
 f.write(qq_content+"\n")

The XPath expressions used to extract the content are worth a mention. They rely on lxml, imported with: from lxml import etree. As you probably know, a web page is structured as a tree of elements, and an XPath expression is a path through that tree. lxml is not part of the standard library, so install it yourself with pip (pip install lxml).

An element's XPath is easy to obtain: in the browser's developer tools, right-click the element and choose Copy > Copy XPath.
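To see how a path such as ./div[2]/a walks the tree, here is a small self-contained sketch. It swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so that it runs without installing anything; lxml's xpath() additionally accepts the text() step used in the article. The sample markup and the function name extract_post are made up for illustration and do not reflect Qzone's real page structure.

```python
import xml.etree.ElementTree as ET

# Made-up markup with the same shape the article's paths expect.
SAMPLE = """
<li>
  <div>avatar</div>
  <div><a>Alice</a><pre>Hello wall</pre></div>
  <div>media</div>
  <div><div><span><a>10:30</a></span></div></div>
</li>
"""


def extract_post(node):
    """Read name, content and time from one post node, using '' when a path matches nothing."""
    def text_at(path):
        found = node.find(path)
        return found.text if found is not None and found.text else ''

    return {
        'name': text_at('./div[2]/a'),              # second <div>, then its <a>
        'content': text_at('./div[2]/pre'),         # second <div>, then its <pre>
        'time': text_at('./div[4]/div[1]/span/a'),  # fourth <div>, first inner <div>, ...
    }


post = extract_post(ET.fromstring(SAMPLE))
```

Indices in XPath start at 1, so div[2] is the second div child, not the third.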

Finally, f.write(qq_content+"\n") saves the content of each post to a file, one post per line.
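The excerpt does not show where f comes from. A minimal sketch of the saving step, assuming f is an ordinary text file, might look like this; the function name save_posts, the file name and the sample posts are all hypothetical.

```python
import os
import tempfile


def save_posts(contents, path):
    """Write one post per line, skipping empty strings."""
    # An explicit encoding avoids surprises with Chinese text on systems
    # whose default encoding is not UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        for content in contents:
            if content:
                f.write(content + '\n')


# Demo with made-up posts and a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), 'posts.txt')
save_posts(['first post', '', 'second post'], demo_path)
```

Skipping empty strings is a choice made for this sketch; the article's code writes a blank line when a post has no text.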

The complete code is as follows:

# coding: utf-8
import time

from selenium import webdriver
from lxml import etree


def text(friend, user, pw):
    # Part of the code is omitted here; for the complete code, see the author's public account.
    driver = webdriver.Chrome(executable_path='chromedriver')
    # Open the login page
    driver.get(' web link ')
    # Switch to account-and-password login
    driver.find_element_by_id('switcher_plogin').click()
    # Enter known QQ account
    driver.find_element_by_id('u').send_keys(user)
    # Enter known QQ password
    driver.find_element_by_id('p').send_keys(pw)
    # Click the login button
    driver.find_element_by_id('login_button').click()
    # Give the login time to complete
    time.sleep(10)
    driver.switch_to.default_content()
    # Open the wall's post list
    driver.get(' web link ' + friend + '/311')
    next_num = 0  # initial "next page" id
    while True:
        # Scroll down so the browser loads the dynamically loaded content
        for i in range(1, 6):
            # height comes from the omitted part of the code
            strWord = "window.scrollBy(0," + str(height) + ")"
            driver.execute_script(strWord)
            time.sleep(4)
        # divs (the post nodes parsed with lxml) and f (the output file)
        # also come from the omitted part of the code
        for div in divs:
            qq_name = div.xpath('./div[2]/a/text()')
            qq_content = div.xpath('./div[2]/pre/text()')
            qq_time = div.xpath('./div[4]/div[1]/span/a/text()')
            qq_name = qq_name[0] if len(qq_name) > 0 else ''
            qq_content = qq_content[0] if len(qq_content) > 0 else ''
            qq_time = qq_time[0] if len(qq_time) > 0 else ''
            print(qq_name, qq_time, qq_content)
            f.write(qq_content+"\n")
        # Stop when the page has no "next page" button
        if driver.page_source.find('pager_next_' + str(next_num)) == -1:
            break
        # Find the "next page" button; its id is dynamic, so build it from next_num
        driver.find_element_by_id('pager_next_' + str(next_num)).click()
        # Assumed: the id suffix grows by one per page (the increment is not visible in the listing)
        next_num += 1
        driver.switch_to.parent_frame()


if __name__ == '__main__':
    friend = '#########'  # QQ number of the wall
    user = '#########'  # your QQ number
    pw = '#########'  # your QQ password
    text(friend, user, pw)
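The scroll-and-paginate loop is the part most likely to need tuning, so here it is again as a standalone sketch with an upper bound on the number of pages. Several things here are assumptions on my part: the function name crawl_pages, the handle_page callback, the parameter names, scrolling by the full document height, and the id suffix of the "next page" button growing by one per page. Frame switching is left out.

```python
import time


def crawl_pages(driver, handle_page, max_pages=50, scrolls=5, pause=4.0):
    """Scroll each page to trigger lazy loading, then follow the "next page" button.

    handle_page(html) is a hypothetical callback that parses one page's source.
    Returns the number of pages visited.
    """
    next_num = 0  # suffix of the current "next page" button id (assumed to grow by 1 per page)
    visited = 0
    while visited < max_pages:
        # Scroll several times so dynamically loaded posts have a chance to appear.
        for _ in range(scrolls):
            driver.execute_script("window.scrollBy(0, document.body.scrollHeight);")
            time.sleep(pause)
        handle_page(driver.page_source)
        visited += 1
        button_id = 'pager_next_' + str(next_num)
        if button_id not in driver.page_source:
            break  # no "next page" button: this was the last page
        driver.find_element("id", button_id).click()
        next_num += 1
    return visited
```

The max_pages bound keeps a mistake in the stop condition from turning into an endless loop, which matters when every iteration sleeps for several seconds.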

How did it go? Did you get it working? If you found this interesting, leave a like before you go.