This is the 8th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021 “.

1. pyquery

1.1 introduction

If you’re familiar with CSS selectors and Jquery, there’s another parser library for you –Jquery

Website pythonhosted.org/pyquery/

1.2 installation

pip install pyquery

1.3 Usage Mode

1.3.1 Initialization Mode

  • string
    from pyquery import PyQuery as pq
    doc = pq(str)
    print(doc(tagname))
Copy the code
  • url
    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')
    print(doc('title'))
Copy the code
  • file
    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    print(doc(tagname))
Copy the code

1.3.2 Selecting a Node

  • Get the current node
    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    doc('#main #top')
Copy the code
  • Get child nodes
    • Write it in doc layer by layer
    • After the parent tag is obtained, use the children method
    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    doc('#main #top').children()
    
Copy the code
  • Get the parent node
    • After getting the current node, use the parent method
  • Get sibling nodes
    • After getting the current node, use the siblings method

1.3.3 Obtaining Attributes

    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    a = doc('#main #top')
    print(a.attrib['href'])
    print(a.attr('href'))
Copy the code

1.3.4 Obtaining Content

    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    div = doc('#main #top')
    print(a.html())
    print(a.text())
Copy the code

1.3.5 sample

from pyquery import PyQuery as pq # 1. You can load an HTML string, or an HTML file, or a URL, D =pq("< HTML ><title>hello</title></ HTML >") d=pq(filename=path_to_html_file) d=pq(url='http://www.baidu.com') note: Here the URL seems to have to write all # 2.html() and text() -- get the corresponding HTML block or text block, P = pq (" < head > < title > hello < / title > < / head > ") p (" head "). The HTML # () returns < title > hello < / title > p (" head "). The text () return to the hello # 3 #. Get elements based on HTML tags, D = pq (' < div > < p > test 1 < / p > < p > test 2 < / p > < / div > ') d (' p ') # return / < p > < p > print d # (' p ') returns 1 < / p > < p > test < / p > < p > test 2 print D ('p').html()# return test 1 # Note: when more than one element is retrieved, the HTML () method returns only the corresponding content block of the first element # 4.eq(index) -- the specified element is retrieved based on the given index number. Print d('p').eq(1).html() # return test 2 # 5.filter() D = pq (" < div > < p id = '1' > test 1 < / p > < p class = '2' > test 2 < / p > < / div > "), d (" p "). The filter (' # 1 ') # returns [< 1 > p#] d (" p "). The filter (' 2 ') [<p.2>] # 6.find() D = pq (" < div > < p id = '1' > test 1 < / p > < p class = '2' > test 2 < / p > < / div > "), d (' div '). The find (' p ') # returns [< 1 > p#, <p.2>] d('div').find('p').eq(0)# return [<p#1>] #7. Get elements directly from class name, id name, for example: D = pq (" < div > < p id = '1' > test 1 < / p > < p class = '2' > test 2 < / p > < / div > "), d (' # 1 '). The HTML () returns the test # 1 d (' 2 '). The HTML () return to the test # 2 # 8. Get the attribute value, for example: D = pq (" < p id = 'my_id' > < a href = "http://hello.com" > hello < / a > < / p > "), d (' a '). Attr (' href ') # back to http://hello.com Attr ('id')# return my_id # 9. Attr ('href', 'http://baidu.com') to baidu # 10.addClass(value) D =pq('<div></div>') d.addClass('my_class')# return [<div. My_class >] # 11.hasClass(name) # return whether the element contains the given class, e.g. D =pq("<div class='my_class'></div>") d.hasClass('my_class')# return True # 12. Children (selector=None) D = pq (" < span > < p id = '1' > hello < / p > < p id = '2' > world < / p > < / span > "), dc hildren # () returns [< 1 > p#, < 2 > p#], dc hildren (' # 2 ') # returns [< 2 > p#] # 13. Parents (selector = None), access to the parent element, example: D = pq (" < span > < p id = '1' > hello < / p > < p id = '2' > world < / p > < / span > "), d (" p "). Parents () # returns [< span >] D (' # 1 '). Parents (' span ') # return / < span > d (' # 1 '). Parents (" p ") return to [] # # 14. Clone copies of () - returns a node # 15. Empty () - # remove node content 16. NextAll (selector=None) -- return all subsequent elements, e.g. D = pq (" < p id = '1' > hello < / p > < p id = '2' > world < / p > < img SCR = "/ >"), d (' p: first). NextAll # () returns [< 2 > p#, The < img >] d (' p: last). NextAll # () returns [the < img >] # 17. Not_ (selector) - returns the does not match the selector elements, example: D = pq (" < p id = '1' > test 1 < / p > < p id = '2' > test < / p > "2) d (" p"). Not_ (' # ') # returns [< 1 > p#]Copy the code

Use of multithreading

1. The introduction of

All the crawlers we’ve written before are single threaded, right? How is that enough? Once a place gets stuck, isn’t that waiting forever? To do this we can use multi-threading or multi-processing.

It’s not recommended, but you can check it out if you want, so you don’t want to waste time reading it

2. How to use it

Crawlers use multiple threads to process network requests, use threads to process urls in the URL queue, and then store the results returned by urls in another queue, which other threads read and write to files

3. Main components

3.1 URL Queue and result queue

Place the url to be climbed in a Queue, using the standard library Queue. The result after accessing the URL is stored in the result queue

Initialize a URL queue

from queue import Queue
urls_queue = Queue()
out_queue = Queue()
Copy the code

3.2 class wrapping

Use multiple threads to fetch the URL from the URL queue and process it:

import threading

class ThreadCrawl(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.queue.get()
Copy the code

If the queue is empty, the thread blocks until the queue is no longer empty. Once a piece of data in a queue is processed, the queue needs to be notified that it has been processed

3.3 Functional Packaging

from threading import Thread
def func(args)
    pass
if __name__ == '__main__':
    info_html = Queue()
    t1 = Thread(target=func,args=(info_html,)
Copy the code

3.4 the thread pool

Import threading import time import queue class Threadingpool(): def __init__(self,max_num = 10): self.queue = queue.Queue(max_num) for i in range(max_num): self.queue.put(threading.Thread) def getthreading(self): return self.queue.get() def addthreading(self): self.queue.put(threading.Thread) def func(p,i): time.sleep(1) print(i) p.addthreading() if __name__ == "__main__": p = Threadingpool() for i in range(20): thread = p.getthreading() t = thread(target = func, args = (p,i)) t.start()Copy the code

4. Common methods in Queue module:

Python’s Queue module provides synchronous, thread-safe Queue classes, including FIFO (first in, first out) Queue, LIFO (last in, first out) Queue, and PriorityQueue. These queues implement locking primitives that can be used directly in multiple threads. Queues can be used for synchronization between threads

  • Queue.qsize() returns the size of the Queue
  • Queue.empty() returns True if the Queue is empty, False otherwise
  • Queue.full() returns True if the Queue is full, False otherwise
  • Queue. Full corresponds to maxsize
  • Queue.get([block[, timeout]]) Gets the Queue, timeout wait time
  • Queue. Get_nowait () quite a Queue. Get (False)
  • Queue. Put (item) Write to Queue, timeout Wait time
  • Queue. Put_nowait (item) equivalent to Queue. Put (item, False)
  • Queue.task_done() After a task has been completed, the queue.task_done () function sends a signal to the Queue that the task has completed
  • Queue.join() actually means to wait until the Queue is empty before doing anything else