Scrapy is an open-source Python framework built specifically for web scraping by Scrapinghub co-founders Pablo Hoffman and Shane Evans. You might be asking yourself, “What does that mean?”
It means that Scrapy is a fully fledged web scraping solution that takes a lot of the work out of building and configuring your spiders, and best of all, it seamlessly deals with edge cases that you probably haven’t thought of yet.
Within minutes of installing the framework, you can have a fully functioning spider scraping the web. Out of the box, Scrapy spiders are designed to download HTML, parse and process the data, and save it in CSV, JSON or XML format.
There is also a wide range of built-in extensions and middlewares for handling cookies and sessions, as well as HTTP features such as compression, authentication, caching, user agents, robots.txt and crawl depth restriction. Scrapy is also easy to extend: you can add custom middlewares or pipelines to your web scraping project to get the specific functionality you require.
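As a rough illustration of how small a custom pipeline can be, the sketch below cleans up a price field on each scraped item. The class name, the "price" field and the "myproject" module path are hypothetical. The only part taken from Scrapy itself is the `process_item(self, item, spider)` method that the framework calls on every item.

```python
class PriceCleanerPipeline:
    """Hypothetical item pipeline that converts a 'price' string to a float."""

    def process_item(self, item, spider):
        raw = item.get("price")
        if isinstance(raw, str):
            # Strip a dollar sign and thousands separators, e.g. "$1,299.00".
            cleaned = raw.replace("$", "").replace(",", "").strip()
            item["price"] = float(cleaned)
        # Returning the item passes it on to the next pipeline stage.
        return item


# To enable it, the project's settings.py would list the class, for example:
# ITEM_PIPELINES = {"myproject.pipelines.PriceCleanerPipeline": 300}
```

Because a pipeline is just a plain Python class, it can be tested without running a crawl: `PriceCleanerPipeline().process_item({"price": "$1,299.00"}, None)` returns `{"price": 1299.0}`.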
One of the biggest advantages of using the Scrapy framework is that it is built on Twisted, an asynchronous networking library. What this means is that Scrapy spiders don’t have to make requests one at a time, waiting for each response before sending the next. Instead, they can have many HTTP requests in flight at once and parse each response as it is returned by the server. This significantly increases the speed and efficiency of a web scraping spider.
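The effect of not waiting on each request in turn can be sketched with a small simulation. Note the assumptions: this uses Python's standard-library asyncio rather than Twisted, and `fake_fetch` only sleeps instead of making a real HTTP request. It illustrates the general idea, not how Scrapy is implemented internally.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    # Stand-in for a network round trip; no real HTTP request is made.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_sequentially(urls):
    # Each request waits for the previous one to finish.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    # All requests are started together and awaited as a group.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro):
    # Run a coroutine to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder URLs; nothing is actually downloaded.
    urls = [f"https://example.com/page/{n}" for n in range(5)]
    _, sequential_time = timed(fetch_sequentially(urls))
    _, concurrent_time = timed(fetch_concurrently(urls))
    print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With five simulated requests, the sequential version takes roughly the sum of the delays, while the concurrent version takes roughly as long as the single slowest request.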
The learning curve for Scrapy is a bit steeper than that of, for example, BeautifulSoup. However, the Scrapy project has excellent documentation and an extremely active ecosystem of developers on GitHub and Stack Overflow who regularly release new plugins and can help you troubleshoot any issues you run into.
If you’d like to build your first Scrapy spider, then be sure to check out the Learn Scrapy tutorials.