Preface:

This Thursday's post brings you a Taobao product data crawler and, following the usual routine, a round of visualization on the captured data. Without further ado, let's begin.

Development tools

Python version: 3.6.4

Related modules:

DecryptLogin module;

Pyecharts module;

And some of the modules that come with Python.

Environment setup

Install Python, add it to your environment variables, and use pip to install the required modules.

Data crawl

Since this is a small crawler case built on simulated login, the first step is naturally to implement a simulated login to Taobao. Using our open-source DecryptLogin library, this takes three lines of code:

@staticmethod
def login():
    lg = login.Login()
    infos_return, session = lg.taobao()
    return session

By the way, people often ask me to add cookie caching to the DecryptLogin library. You can achieve the same thing yourself with a few more lines of code:

import os
import pickle

# reuse a cached session if one exists; otherwise log in and cache it
if os.path.isfile('session.pkl'):
    self.session = pickle.load(open('session.pkl', 'rb'))
else:
    self.session = TBGoodsCrawler.login()
    with open('session.pkl', 'wb') as f:
        pickle.dump(self.session, f)

I don't really want to add this feature to the library itself, though I may add some other crawler-related features later; more on that another time. Anyway, back to the point. Next, we go to the web version of Taobao to capture some requests. Press F12 to open the developer tools, then type anything into Taobao's product search bar, like this:

A global search for a keyword such as search reveals links like this:

Let’s see what data it returns:

That looks right. If you can't find this API, try clicking the next-page button in the upper right corner again:

That will definitely capture the request interface. A quick test shows that although the request appears to carry a lot of parameters, only two actually need to be submitted, namely:

q: the product keyword
s: the offset of the current page
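As a hedged sketch of this finding (the helper name build_search_params is mine, not from the article; Taobao returns 44 items per page, so the offset for page n is (n-1)*44), the minimal query parameters could be built like this:

```python
def build_search_params(keyword, page=1, page_size=44):
    """Build the minimal query parameters for Taobao's search API.

    Only 'q' (the keyword) and 's' (the offset) are strictly required;
    'ajax' and 'ie' are kept to match the article's request.
    """
    offset = (page - 1) * page_size  # Taobao returns 44 items per page
    return {'q': keyword, 'ajax': 'true', 'ie': 'utf8', 's': str(offset)}

print(build_search_params('milk tea', page=3))
# {'q': 'milk tea', 'ajax': 'true', 'ie': 'utf8', 's': '88'}
```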

Based on this interface and our test results, we can now happily start fetching Taobao product data. The main code is implemented as follows:

'''external call'''
def run(self):
    search_url = 'https://s.taobao.com/search?'
    while True:
        goods_name = input('Please enter the name of the goods to crawl: ')
        offset = 0
        page_size = 44
        goods_infos_dict = {}
        page_interval = random.randint(1, 5)
        page_pointer = 0
        while True:
            params = {
                'q': goods_name,
                'ajax': 'true',
                'ie': 'utf8',
                's': str(offset)
            }
            response = self.session.get(search_url, params=params)
            if response.status_code != 200:
                break
            response_json = response.json()
            all_items = response_json.get('mods', {}).get('itemlist', {}).get('data', {}).get('auctions', [])
            if len(all_items) == 0:
                break
            for item in all_items:
                if not item['category']:
                    continue
                goods_infos_dict.update({len(goods_infos_dict)+1: {
                                            'shope_name': item.get('nick', ''),
                                            'title': item.get('raw_title', ''),
                                            'pic_url': item.get('pic_url', ''),
                                            'detail_url': item.get('detail_url', ''),
                                            'price': item.get('view_price', ''),
                                            'location': item.get('item_loc', ''),
                                            'fee': item.get('view_fee', ''),
                                            'num_comments': item.get('comment_count', ''),
                                            'num_sells': item.get('view_sales', '')
                                        }})
            print(goods_infos_dict)
            self.__save(goods_infos_dict, goods_name+'.pkl')
            offset += page_size
            if offset // page_size > 100:
                break
            page_pointer += 1
            if page_pointer == page_interval:
                time.sleep(random.randint(30, 60)+random.random()*10)
                page_interval = random.randint(1, 5)
                page_pointer = 0
            else:
                time.sleep(random.random()+2)
        print('[INFO]: Finished crawling %s, %s items in total' % (goods_name, len(goods_infos_dict)))

Data visualization

Now let's visualize the milk tea data we crawled. First, let's take a look at the nationwide distribution of milk tea sellers on Taobao:

Unexpectedly, Guangdong Province has the most milk tea shops. T_T
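As a hedged sketch of how the per-province counts behind such a map could be derived (count_by_province is a hypothetical helper, not part of the original code; it assumes the crawled 'location' field looks like '广东 广州', i.e. province then city):

```python
from collections import Counter

def count_by_province(goods_infos):
    """Count sellers per province from the crawled 'location' field.

    Assumes locations look like '广东 广州' (province, then city) or just '上海'.
    """
    counter = Counter()
    for info in goods_infos.values():
        location = info.get('location', '')
        if not location:
            continue
        province = location.split()[0]  # first token is the province
        counter[province] += 1
    return counter

# hypothetical sample data in the same shape as goods_infos_dict above
sample = {
    1: {'location': '广东 广州'},
    2: {'location': '广东 深圳'},
    3: {'location': '上海'},
}
print(count_by_province(sample))  # Counter({'广东': 2, '上海': 1})
```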

Let’s take a look at the top 10 sellers of milk tea on Taobao:

And the top 10 milk tea shops with the most comments on Taobao:

Take a look at the percentage of items in these stores that require shipping versus those that don’t:
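As a hedged sketch of how that ratio could be computed (shipping_split is a hypothetical helper; it assumes the crawled 'fee' field, taken from view_fee, is '0.00' for free-shipping items):

```python
def shipping_split(goods_infos):
    """Split items into free-shipping vs. paid-shipping counts.

    Assumes the crawled 'fee' field (view_fee) is '0.00' for free shipping.
    """
    free, paid = 0, 0
    for info in goods_infos.values():
        try:
            fee = float(info.get('fee', '0') or '0')
        except ValueError:
            continue  # skip malformed fee strings
        if fee == 0:
            free += 1
        else:
            paid += 1
    return free, paid

# hypothetical sample data in the same shape as goods_infos_dict above
sample = {1: {'fee': '0.00'}, 2: {'fee': '6.00'}, 3: {'fee': '0.00'}}
print(shipping_split(sample))  # (2, 1)
```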

Finally, take a look at the price range of milk-tea-related products:
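As a hedged sketch of how prices could be bucketed into ranges for such a chart (price_buckets and the bucket edges are illustrative choices of mine, not from the article; the 'price' field, taken from view_price, is assumed to be a numeric string like '29.90'):

```python
def price_buckets(goods_infos, edges=(10, 30, 50, 100)):
    """Bucket item prices into ranges for a bar or pie chart.

    Assumes the crawled 'price' field (view_price) is a numeric string.
    """
    labels = ['<10', '10-30', '30-50', '50-100', '>=100']
    counts = dict.fromkeys(labels, 0)
    for info in goods_infos.values():
        try:
            price = float(info.get('price', ''))
        except ValueError:
            continue  # skip items with a malformed price
        for edge, label in zip(edges, labels):
            if price < edge:
                counts[label] += 1
                break
        else:
            counts[labels[-1]] += 1  # no edge matched: top bucket
    return counts

# hypothetical sample data in the same shape as goods_infos_dict above
sample = {1: {'price': '9.90'}, 2: {'price': '29.90'}, 3: {'price': '128.00'}}
print(price_buckets(sample))
# {'<10': 1, '10-30': 1, '30-50': 0, '50-100': 0, '>=100': 1}
```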

That's all for this article; thank you for reading. Follow me for daily posts in the Python simulated-login series. The next article will cover a Jingdong product data crawler.

To thank my readers, I'd like to share some of the programming goodies I've collected recently and give something back to every reader, in the hope that they help you.

The goodies mainly include:

① 2000+ Python e-books (mainstream and classic titles are all there)

② The Python Standard Library (Chinese version)

③ Project source code (40 to 50 interesting, classic hands-on projects with source code)

④ Videos on Python basics, crawlers, web development, and big data analysis (suitable for beginners)

⑤ A Python learning roadmap (say goodbye to slow learning)

All done~ For the complete source code plus the goodies above, see my profile or send a private message.