In the previous articles, we learned how to write a Spider that extracts all the article links on a web page along with the target information on each page. In this article, we will focus on the Item class in Scrapy.

Before introducing Item, we should note that the main goal of a web crawler is to extract structured data from unstructured sources. Once the structured data has been extracted, how do we return it? The easiest way is to put the fields into a dictionary and return that to Scrapy. As useful as dictionaries are, they lack structure: nothing stops us from mistyping a field name. For example, we might define a field called comment_nums, but in another crawler pass that field as comment_num, with the s missing. An error then occurs when the dictionaries are processed uniformly in the pipeline.
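To make the pitfall concrete, here is a minimal sketch (plain Python, no Scrapy involved; the field names mirror the example above) showing how a mistyped dictionary key slips through silently and only surfaces later, far from its cause:

```python
# One spider returns the field as comment_nums ...
item_a = {"title": "First post", "comment_nums": 10}
# ... while another spider mistypes it as comment_num (missing the "s").
item_b = {"title": "Second post", "comment_num": 7}

def count_comments(item):
    # The pipeline expects comment_nums; a mistyped key silently
    # falls back to the default instead of raising an error here.
    return item.get("comment_nums", 0)

print(count_comments(item_a))  # 10
print(count_comments(item_b))  # 0 -- the typo silently loses data
```

The dictionary itself never complains; the bug only shows up as wrong output in downstream processing.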

To keep fields consistent, Scrapy provides the Item class, which lets us declare fields ourselves. For example, in our Scrapy crawler project we define an Item class containing title, release_date, url, and so on; every crawler then instantiates this same class, which makes field-name errors much harder to introduce, because each field is defined in one place and is unique.

An Item behaves much like a dictionary, but it is more capable. We instantiate the Item class inside the parse() function of the main Spider file, fill in its fields, and yield it. Scrapy then passes the Item directly into the pipeline, where we can save the data, deduplicate it, and perform other operations. These are the benefits Item brings us.
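As a rough sketch of what pipeline-side processing can look like, here is a hypothetical pipeline class (the name DuplicatesPipeline and the url field are assumptions, not from the original project) that deduplicates items by URL, following Scrapy's process_item convention:

```python
class DuplicatesPipeline:
    """Drops items whose url has been seen before (hypothetical example)."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Items support dict-style access, so this works the same
        # whether a plain dict or a scrapy.Item instance is yielded.
        url = item["url"]
        if url in self.seen_urls:
            return None  # in a real pipeline, raise scrapy.exceptions.DropItem
        self.seen_urls.add(url)
        return item
```

In a real project this class would be registered under ITEM_PIPELINES in settings.py, and duplicates would be dropped by raising scrapy.exceptions.DropItem rather than returning None.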

Next we go to the items.py file to define the item, as shown below.

By default, sample code is provided in this file. You can write your code directly into the sample class, or define a new class modeled on it. The class must inherit from scrapy.Item, which the default template already does. Now we define the specific fields in this Item according to the target information we want to obtain. The extraction of these fields was covered in previous articles; they include title, release_date, url, front_img_url, tag, voteup_num, collection_num, comment_num, content, and so on. As shown in the figure below.

An Item has only one field type, Field, which accepts any data type passed to it; in this respect it really does resemble a dictionary. In this file, defining an item is mostly a matter of changing the field name on the left, while the right side is always scrapy.Field(). Since this requires constant copying, PyCharm's Ctrl+D shortcut is handy: it duplicates the line under the cursor, which is equivalent to pressing Ctrl+C and then Ctrl+V in Windows.

With all the items defined in the crawler framework, we can now go to the spider file and fill in the specific item values.


Have you learned something from reading this article? Please share it with more people, and if you want to learn more, follow "IT sharing home"!