Welcome to your Scrapy tour. In this article, we aim to take you from a Scrapy beginner with little or no experience to a Scrapy expert confident enough to use this powerful framework to crawl large data sets from the web and other sources. We start by introducing Scrapy and showing you some of the great things you can do with it.

1.1 A first look at Scrapy

Scrapy is a robust framework for scraping data from a variety of sources. As a casual web user, you will often find yourself wanting to pull data from a website and browse it in a spreadsheet program such as Excel, so that you can access it offline or run calculations on it. As a developer, you constantly need to combine data from multiple sources, and you know all too well how complex obtaining and extracting that data can be. Easy or hard, Scrapy can help you get the extraction done.

Scrapy reflects years of experience extracting large amounts of data robustly and efficiently. With Scrapy, a single setting often achieves what other crawling frameworks require many classes, plug-ins, and configuration options to do.

From a developer’s perspective, you’ll also appreciate Scrapy’s event-based architecture. It lets us cascade operations that clean, format, and decorate data and store it in a database, with minimal performance degradation as long as we do it right. In this article, you’ll learn how to do just that. Technically, being event-based allows Scrapy to decouple latency from throughput by operating smoothly even with thousands of connections open.

To take an extreme example, suppose you need to pull listings from a site whose index pages each contain 100 listings. Scrapy makes it trivial to run 16 requests in parallel across the site; if each request takes one second on average to complete, you crawl 16 pages per second. Multiplying by the number of listings per page gives 1,600 listings per second. Now imagine that every listing has to be written to a massively concurrent cloud store, and each write takes 3 seconds on average (a very bad idea). To sustain the item throughput produced by 16 page requests per second, that is 1,600 writes per second, we would need to keep 1,600 × 3 = 4,800 write requests in flight at the same time. A traditional multithreaded application would need 4,800 threads, which is a miserable experience both for you and for the operating system. In Scrapy’s world, 4,800 concurrent requests are business as usual, as long as the operating system can handle them. Moreover, Scrapy’s memory requirement is close to the amount of data your items actually need, whereas in a multithreaded application the per-thread overhead is significant compared to the size of an item.
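To make this concrete, here is a minimal sketch of the Scrapy settings that control this concurrency, placed in a project’s settings.py; the setting names are standard Scrapy settings, but the values shown are illustrative rather than recommendations for any particular site.

```python
# settings.py -- a minimal sketch; values are illustrative only

# Maximum number of requests Scrapy keeps in flight across the whole crawl
CONCURRENT_REQUESTS = 16

# Cap on simultaneous requests to any single domain, to avoid hammering it
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# How many items from a single response may flow through the item
# pipelines in parallel (relevant when downstream writes are slow)
CONCURRENT_ITEMS = 100
```

Because requests and item writes are all driven by the same event loop, raising these numbers adds pending operations rather than operating-system threads.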

In short, slow or unpredictable websites, databases, or remote APIs won’t be devastating to Scrapy’s performance, because it can run many requests in parallel and manage them all from a single thread. Compared with traditional multithreaded applications, this means lower hosting costs, room to coexist with other applications, and simpler code with no synchronization requirements.

1.2 More reasons to like Scrapy

Scrapy has been around for more than five years, and is mature and stable. In addition to the performance benefits mentioned in the previous section, here are some reasons to love Scrapy.

  • Scrapy understands broken HTML

You can use Beautiful Soup or lxml directly with Scrapy, but Scrapy also provides selectors, a higher-level (primarily XPath) interface built on top of lxml. Selectors handle broken HTML and messy encodings efficiently.
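As a small illustration (not from the original text), the snippet below feeds deliberately broken HTML to a Scrapy Selector; the underlying lxml parser repairs the markup, so both XPath and CSS queries still work.

```python
from scrapy.selector import Selector

# Deliberately broken HTML: unclosed <li> tags and a stray <p>
broken_html = "<ul><li>First item<li>Second item<p>trailing text"

sel = Selector(text=broken_html)

# lxml fixes the markup under the hood, so queries behave as expected
print(sel.xpath("//li/text()").getall())  # expect ['First item', 'Second item']
print(sel.css("li::text").getall())       # same result via the CSS interface
```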

  • Community

Scrapy has a vibrant community. Just look at the mailing list at https://groups.google.com/forum/#!forum/scrapy and the thousands of questions tagged scrapy on Stack Overflow (http://stackoverflow.com/questions/tagged/scrapy). Most questions get answered within minutes. More community resources are listed at http://scrapy.org/community/.

  • Well-organized code maintained by the community

Scrapy requires you to organize your code in a standard way. You write a small number of Python modules, called spiders and pipelines, and any future improvements to the engine itself benefit your code automatically. If you search online, there are quite a few professionals with Scrapy experience, so you can easily find someone to maintain or extend your code. Whoever joins your team won’t face a long learning curve to understand the quirks of your custom crawler.
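To give a feel for what “a small number of Python modules” means, here is a minimal spider sketch; the URL, CSS expressions, and field names are hypothetical placeholders, not taken from the original text.

```python
import scrapy


class ListingSpider(scrapy.Spider):
    """A minimal spider sketch; the site and selectors are hypothetical."""
    name = "listings"
    start_urls = ["http://example.com/listings"]  # placeholder URL

    def parse(self, response):
        # Each matched block becomes one item handed to the pipelines
        for listing in response.css("div.listing"):
            yield {
                "title": listing.css("h2::text").get(),
                "price": listing.css("span.price::text").get(),
            }
        # Follow pagination, if present, with the same callback
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Everything else (scheduling, retries, throttling, item pipelines) is handled by the engine, which is exactly why improvements to Scrapy itself benefit code like this without changes.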

  • A growing set of high-quality features

If you take a quick look at the release notes (http://doc.scrapy.org/en/latest/news.html), you will notice that Scrapy keeps growing steadily, both in features and in stability and bug fixes.

1.3 The importance of mastering automated data scraping

For most of us, the curiosity and mental satisfaction of mastering a cool technology like Scrapy is motivation enough. As a bonus, while learning this excellent framework, we also get to enjoy the benefits of starting the development process from the data and the community rather than from the code.

1.3.1 Developing robust, high-quality applications and providing realistic schedules

To develop modern, high-quality applications, we need real, large data sets, if possible before we start writing code. Modern software development is all about processing large amounts of imperfect data in real time and extracting knowledge and value from it. When software is developed and then applied to large data sets, small errors and omissions are hard to detect and can lead to costly bad decisions. For example, while doing demographic research it is easy to discard the data of an entire state simply because its name is too long. Carefully capturing and using production-quality, real-world data sets during development, and even earlier during design exploration, helps us find and fix such errors and make informed engineering decisions.

As another example, suppose you want to design an Amazon-style “if you like this, you might also like that” recommendation system. If you can crawl and collect a real-world data set before you start, you will quickly become aware of the issues around invalid entries, discontinued items, duplicates, invalid characters, and performance bottlenecks caused by skewed distributions. That data will force you to design algorithms robust enough to handle everything from items bought by thousands of people to brand-new items with zero sales. Software developed in isolation, by contrast, may be confronted with ugly real-world data only after weeks of development. The two approaches may eventually converge, but your ability to commit to reliable schedule estimates, and the quality of your software, will differ significantly as the project progresses. Starting with data leads to a much more pleasant and predictable software development experience.

1.3.2 Rapid development of high-quality minimum viable products

For startups, real, large data sets are even more essential. You may have heard of the “lean startup”, a term coined by Eric Ries to describe the process of building a business under conditions of extreme uncertainty, such as a technology startup. A key concept of that framework is the Minimum Viable Product (MVP): a product with limited functionality that can be developed quickly and released to a limited number of customers to test their response and validate business assumptions. Based on the feedback received, the startup may choose to invest further or pivot to a more promising direction.

What is easy to overlook in this process is how closely you can connect your product to real data, and that is where Scrapy comes in. When we invite potential customers to try our mobile app, for example, we as developers or business owners ask them to evaluate the features and imagine how the app will look when it is finished. For non-experts, that may be too much imagination to ask for. The gap is the difference between an app that shows only “Product 1”, “Product 2”, and “User 433”, and another app that shows a “Samsung UN55J6200 55-inch TV”, a five-star review from user “Richard S.”, and a working link that takes you straight to the product details page (even though we haven’t actually written that page yet). It is hard to judge the functionality of an MVP objectively unless it uses real and exciting data.

One reason some startups treat data as an afterthought is that it’s expensive to collect. Sure, we often need to develop forms and administrative interfaces and spend time typing data, but we can also use Scrapy to crawl sites before writing code.

1.3.3 Google doesn’t use forms; crawling scales

Speaking of forms, let’s consider how they affect a product’s growth. Imagine if Google’s founders had built the first version of their engine around a form that every webmaster had to fill out, copying and pasting the text of every page on their site. They would then have had to accept a licensing agreement allowing Google to process, store, and display their content while keeping most of the advertising profit. Can you imagine the time and effort it would take to explain the idea and convince people to take part? Even if the market were desperate for a good search engine (and it was), that engine wouldn’t be Google, because its growth would be far too slow. Even the most sophisticated algorithms cannot make up for missing data. Instead, Google uses web crawlers that jump from link to link and fill its vast database. Webmasters don’t have to do anything; in fact, it takes some effort to prevent Google from indexing your pages.

While the idea of Google using forms sounds ridiculous, how many forms does a typical website make its users fill out? A login form, a new-listing form, a checkout form, and so on. How many of those forms are holding back the app’s growth? If you know your audience well enough, you probably already have clues about the other sites they use and likely already have accounts on. For example, a developer is likely to have Stack Overflow and GitHub accounts. So, with their permission, could you scrape those sites and, given nothing more than a username, automatically populate your site with their photos, profile details, and a small selection of their recent posts? Could you run a quick text analysis on the posts they care about most and adjust your site’s navigation, or the products and services you suggest, accordingly? I hope you can see how replacing forms with automated data scraping can let you serve your audience better and grow your site faster.

1.3.4 Discover and integrate into your ecosystem

Scraping data naturally leads you to discover and think about the communities that relate to your efforts. When you scrape a data source, questions arise naturally: Do I trust their data? Do I trust the company behind it? Should I talk to them about a more formal collaboration? Am I competing or cooperating with them? How much would it cost me to obtain this data from another source? These business risks exist anyway, but the scraping process helps you recognize them early and develop mitigation strategies.

You’ll also find yourself wondering what you can give back to those sites and communities. If you can send them free traffic, they should be happy. On the other hand, if your app doesn’t return some value to your data sources, your relationship with them may be short-lived, unless you talk to them and find a way to collaborate. Because you scrape data from several sources, you need to be prepared to build products that are friendlier to the existing ecosystem, that respect existing market players, and that disrupt the current market order only when it is genuinely worth the effort. Existing players might even help you grow faster: if your app draws on data from two or three different ecosystems, each with 100,000 users, your service might end up connecting those 300,000 users in a creative way that benefits every ecosystem. For example, if you found a startup that connects the rock-and-roll scene with the T-shirt printing community, your company becomes a blend of the two ecosystems, and both you and those communities benefit and grow from it.

1.4 Being a good citizen in a world full of crawlers

There are a few more things to be aware of when developing crawlers. Irresponsible web crawling can be annoying and, in some cases, even illegal. The two most important things to avoid are denial-of-service (DoS) behavior and copyright infringement.

In the first case, a typical visitor might view a new page every few seconds, while a typical web crawler may download dozens of pages per second, more than ten times the traffic an ordinary user generates. This can understandably make site owners very unhappy. Use throttling to reduce the traffic you generate to a level acceptable for an ordinary user. You should also monitor response times and, if they grow, reduce the intensity of your crawl. The good news is that Scrapy provides out-of-the-box implementations of all these features.
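As a sketch of the out-of-the-box throttling just mentioned, the settings.py entries below enable a fixed download delay plus the AutoThrottle extension, which backs off automatically as response times grow; the numbers are illustrative only, not recommendations for any particular site.

```python
# settings.py -- illustrative throttling values, not universal recommendations

# Pause roughly this many seconds between requests to the same site
DOWNLOAD_DELAY = 2

# Let the AutoThrottle extension adapt the delay to observed latency,
# slowing the crawl down when the server's response times increase
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```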

As for copyright, you obviously need to check the copyright notice of every site you scrape and make sure you understand what is and is not allowed. Most sites allow you to process information from their site as long as you don’t republish it under your own name. There is a handy User-Agent field in your requests that lets webmasters know who you are and what you are doing with their data. Scrapy, by default, makes requests using your BOT_NAME setting as the user agent. If that is a URL or otherwise clearly identifies your application, webmasters can visit your site to learn more about how you use their data. Another very important aspect is allowing any webmaster to block you from accessing certain areas of their site. For webmasters who rely on the web-standard robots.txt file (see www.google.com/robots.txt for an example), Scrapy provides functionality that respects it (RobotsTxtMiddleware). Finally, it is a good idea to give webmasters a way to tell you what they don’t want to appear in your crawls. At the very least, it must be easy for them to find a way to contact you and raise their concerns.
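Here is a minimal sketch of the identification and robots.txt settings discussed above; the bot name and URL are made-up examples, and setting USER_AGENT explicitly simply makes the identification unambiguous.

```python
# settings.py -- polite-identification sketch; names and URL are examples only

BOT_NAME = "mycrawler"

# Identify your crawler and give webmasters a way to learn more or reach you
USER_AGENT = "mycrawler (+http://www.example.com/crawler-info)"

# Enable the robots.txt middleware so disallowed paths are skipped
ROBOTSTXT_OBEY = True
```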

1.5 What Scrapy is not

Finally, it’s easy to misunderstand what Scrapy can do for you, mainly because the term data scraping and its related terms are somewhat vague and often used interchangeably. I will try to make these aspects clearer to prevent confusion and save you some time.

Scrapy is not Apache Nutch; that is, it is not a general-purpose web crawler. If Scrapy visits a site it knows nothing about, it cannot do anything meaningful with it. Scrapy is built to extract structured information and requires you to manually set up the appropriate XPath or CSS expressions. Apache Nutch, by contrast, takes generic pages and extracts information such as keywords from them. It may be better suited for some applications and less suited for others.

Scrapy is not Apache Solr, Elasticsearch, or Lucene; in other words, it has nothing to do with search engines. Scrapy is not intended to give you references to documents containing “Einstein” or anything else. You can use Scrapy to extract data and then insert it into Solr or Elasticsearch, as we do at the beginning of Chapter 9, but that is just one way of using Scrapy, not a feature embedded in it.
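As a rough sketch of that “extract with Scrapy, index elsewhere” pattern (not taken from the original text), an item pipeline along these lines could push each scraped item into Elasticsearch. It assumes the elasticsearch-py 8.x client, a local Elasticsearch node, and a made-up index name; it is ordinary client code, not anything built into Scrapy.

```python
# pipelines.py -- a sketch, assuming the elasticsearch-py 8.x client and a
# local Elasticsearch node; the "products" index name is made up.
from elasticsearch import Elasticsearch


class ElasticsearchPipeline:
    def open_spider(self, spider):
        self.es = Elasticsearch("http://localhost:9200")

    def process_item(self, item, spider):
        # Scrapy only delivers the item to this hook; indexing it is our
        # own responsibility, done with a regular client call.
        self.es.index(index="products", document=dict(item))
        return item
```

The pipeline would still need to be enabled through the project’s ITEM_PIPELINES setting.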

Finally, Scrapy is not a database like MySQL, MongoDB, or Redis. It neither stores nor indexes data; it only extracts it. Even so, you will probably insert the data Scrapy extracts into a database, and it works with a wide variety of them, which makes life a lot easier. Scrapy, however, is not a database, and its output could just as easily be a file on disk, or even nothing at all, though I’m not sure how useful that would be.

Master Python Crawler Framework Scrapy

Author: Dimitrios Kouzis-Loukas

Buy the paper book at https://item.jd.com/12292223.html

A Python 3 Scrapy tutorial that thoroughly analyzes how web crawling technology works, demonstrates Scrapy through real crawling examples, and covers everything from desktop to mobile crawling, including real-time crawling.

This book covers the basics of Scrapy, discussing how to extract data from arbitrary sources, how to clean it up, and how to use Python and third-party APIs to tailor it to your needs. The book also explains how to efficiently feed crawled data into databases, search engines, and streaming data processing systems (such as Apache Spark). By the end of this book, you’ll know how to crawl data and use it in your own applications.
