This is the 24th day of my participation in the August Text Challenge.

The Mid-Autumn Festival is coming, which means it is mooncake season again. Let's use Python to help you pick your favorite mooncake flavor.

Target website: Taobao

Tools used

  • Development tool: PyCharm
  • Development environment: Python 3.7, Windows 10
  • Libraries: requests, lxml

Key learning points

  • Sending a GET request
  • Fetching web page data
  • Data extraction methods

Project approach

Taobao requires a login before it returns search results. One option is to work out Taobao's login interface; here I simply log in through the browser, copy the cookie, and send it in the request headers to keep the logged-in state. (Recently Taobao has not always required a login, so you can try without one.)
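As a minimal sketch of the cookie approach described above: the helper name `build_headers` and the placeholder user-agent value are assumptions made for illustration, not part of the original script. It only attaches the cookie when one is supplied, so the same function covers the no-login case.

```python
def build_headers(cookie=''):
    """Build request headers, attaching the browser cookie only if one is given."""
    headers = {
        'referer': 'https://s.taobao.com/',
        # Placeholder value; replace it with your own browser's user-agent string.
        'user-agent': 'Mozilla/5.0',
    }
    if cookie:
        # Cookie string copied from a logged-in browser session.
        headers['cookie'] = cookie
    return headers


# Typical use with the requests library (not executed here):
#   response = requests.get(url, headers=build_headers(my_cookie))
```

Keeping the cookie out of the source file (for example, reading it from an environment variable) avoids publishing a live session by accident.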

Choose the keyword to search for. With the Mid-Autumn Festival approaching, I search for "mooncake" (月饼).

Taobao encodes the page position in the URL, so the URL can be simplified and each page of products fetched just by changing that parameter.

Before simplification:
https://s.taobao.com/search?q=%E6%9C%88%E9%A5%BC&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20210829&ie=utf8&bcoffset=3&ntoffset=3&p4ppushleft=2%2c48&s=44

After simplification:
https://s.taobao.com/search?q={}&s={}

q is the search keyword, and s selects which page of results to fetch; the code below advances s in steps of 44, one step per page. Send the request with the requests library to retrieve the page data.

key = '月饼'  # search keyword: mooncake
for i in range(1, 4):
    # s takes the values 44, 88, 132
    url = 'https://s.taobao.com/search?q={}&s={}'.format(key, str(i * 44))
    get_data(url)
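The loop above interpolates the keyword into the URL as-is. As a hedged alternative, the sketch below builds the same URL with `urllib.parse.urlencode`, which percent-encodes the Chinese keyword explicitly. The names `build_search_url`, `BASE_URL`, and `STEP` are assumptions introduced for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://s.taobao.com/search'
STEP = 44  # increment applied to s for each page, as in the loop above


def build_search_url(keyword, page):
    """Return the simplified search URL for a keyword and page index."""
    query = urlencode({'q': keyword, 's': page * STEP})
    return '{}?{}'.format(BASE_URL, query)


for page in range(1, 4):
    print(build_search_url('月饼', page))
```

The encoded keyword comes out as `%E6%9C%88%E9%A5%BC`, the same value that appears in the unsimplified URL.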

The response is HTML. Data can be extracted from it with XPath, regular expressions, PyQuery, or BeautifulSoup (bs4); choose whichever suits you best.

Extracting the data with a regular expression



Taobao embeds the product data in the page as JSON. Once it has been extracted and parsed, each field can be read by its dictionary key.

Extracted data:

  • Price
  • Number of buyers
  • Title
  • Shop
  • Location

import json
import re

# response is the object returned by requests.get(url, headers=headers)
data = re.findall('"auctions":(.*?),"recommendAuctions', response.text)[0]
for info in json.loads(data):
    item = {}
    item['url'] = info['detail_url']
    item['title'] = info['raw_title']
    item['image address'] = info['pic_url']
    item['price'] = info['view_price']
    item['location'] = info['item_loc']
    item['number of buyers'] = info.get('view_sales')
    item['comments'] = info['comment_count']
    item['shop'] = info['nick']
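The snippet above indexes `[0]` on the result of `re.findall`, which raises `IndexError` when the page does not contain the expected JSON (for example, when a login page is returned instead). The sketch below shows one way to guard against that. The function name `extract_auctions` and the sample HTML string are made up for illustration; real Taobao pages are much larger and may be structured differently.

```python
import json
import re

AUCTIONS_PATTERN = re.compile(r'"auctions":(.*?),"recommendAuctions', re.S)


def extract_auctions(html):
    """Return the list of product dicts embedded in the page, or [] if absent."""
    match = AUCTIONS_PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


# Hypothetical, heavily shortened page content used only to exercise the function.
sample_html = (
    '<script>g_page_config = {"mods":{"itemlist":{"data":{'
    '"auctions":[{"raw_title":"Mooncake gift box","view_price":"19.90"}],'
    '"recommendAuctions":[]}}}};</script>'
)

for info in extract_auctions(sample_html):
    print(info['raw_title'], info.get('view_price'))
```

Returning an empty list lets the calling loop simply do nothing for a page that could not be parsed.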



Finally, save the data to a CSV file.

import csv


def save_data(data):
    # Append one product row; the with block closes the file after writing
    with open('file.csv', 'a', newline='', encoding='utf-8') as f:
        csv_writer = csv.DictWriter(f, fieldnames=['title', 'price', 'number of buyers', 'location', 'url', 'image address', 'comments', 'shop'])
        csv_writer.writerow(data)
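`save_data` reopens the file for every row. As a sketch of an alternative, assuming all rows are collected first, `csv.DictWriter` can write the header and every row in one pass. The function name `save_rows` and the shortened column list are assumptions for this example.

```python
import csv

FIELDNAMES = ['title', 'price', 'shop']  # shortened column set for illustration


def save_rows(path, rows):
    """Write a header line followed by all rows to a CSV file."""
    # utf-8-sig adds a BOM so that Excel detects the encoding of Chinese text.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        # extrasaction='ignore' drops keys that are not listed in FIELDNAMES.
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)
```

Because `writeheader()` takes its columns from `fieldnames`, the header and the rows cannot drift out of sync.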

Full source code

import csv
import json
import re
import time

import requests

headers = {
    'referer': 'https://s.taobao.com/',
    'cookie': '',  # paste the cookie copied from your browser session
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/577.36 (KHTML, like Gecko) Chrome Safari/577.36',
}

# Column names shared by the header row and every data row
fieldnames = ['title', 'price', 'number of buyers', 'location', 'url', 'image address', 'comments', 'shop']


def save_data(data):
    # Append one product row to the CSV file
    with open('file.csv', 'a', newline='', encoding='utf-8') as f:
        csv_writer = csv.DictWriter(f, fieldnames=fieldnames)
        csv_writer.writerow(data)


def get_data(url):
    response = requests.get(url, headers=headers)
    print(response.text)
    # The product list is embedded in the page as JSON
    data = re.findall('"auctions":(.*?),"recommendAuctions', response.text)[0]
    for info in json.loads(data):
        item = {}
        item['url'] = info['detail_url']
        item['title'] = info['raw_title']
        item['image address'] = info['pic_url']
        item['price'] = info['view_price']
        item['location'] = info['item_loc']
        item['number of buyers'] = info.get('view_sales')
        item['comments'] = info['comment_count']
        item['shop'] = info['nick']
        print(item)
        save_data(item)


if __name__ == '__main__':
    # Create the CSV file and write the header row
    with open('file.csv', 'w', encoding='utf-8-sig', newline='') as file:
        csv_head = csv.writer(file)
        csv_head.writerow(fieldnames)
    key = '月饼'  # search keyword: mooncake
    for i in range(1, 4):
        url = 'https://s.taobao.com/search?q={}&s={}'.format(key, str(i * 44))
        get_data(url)
        time.sleep(5)  # pause between requests

I'm White and White, a programmer who loves sharing knowledge ❤️ If you are new to this area of programming and, after reading this post, find that you can't get the program working or want to learn more, leave a comment or send me a private message. [Thank you very much for your likes, favorites, follows, and comments.]