How to Scrape Web Pages with Scrapy and Python
Web crawling is a fun and rewarding skill to develop. Here’s how to get a basic web crawler set up with the Scrapy framework for Python.
First you’ll need to install Scrapy. On a unixy platform, just execute pip install scrapy from the terminal and you will probably be ready to move on.
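If you want to double-check that the install worked, you can ask Scrapy to report its version (the exact output will depend on the version you installed):

scrapy version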
Installation of Scrapy on Windows is a bit more… uh, involved, but that shouldn’t come as much of a surprise. See my notes at the bottom of the page for more info.
Phew! Now that the hard work is out of the way, it’s smooth sailing all the way to Scrapetown.
Let’s create a new project with scrapy startproject my_scraping_project. This command creates a skeleton Scrapy project we can augment with custom code.
Then we’ll need to create the boilerplate code for our spider. Let’s create a spider to scrape the website http://quotes.toscrape.com.
Navigate into the my_scraping_project directory and enter the command scrapy genspider quotes quotes.toscrape.com. Scrapy will take care of the rest.
After creating the project and generating a spider, you should have the following directory structure.
.
└───my_scraping_project
    ├───scrapy.cfg
    └───my_scraping_project
        ├───items.py
        ├───middlewares.py
        ├───pipelines.py
        ├───settings.py
        ├───spiders
        │   ├───quotes.py
        │   └───__pycache__
        └───__pycache__
There are a number of files here, most of which are a bit outside the scope of this blog post. For now, let’s focus on the settings.py file. This is where you can control whether your spider obeys robots.txt rules, and the user agent your spider presents to the sites it visits.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
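While you’re in settings.py, you could also add an optional download delay to be extra polite to the site. This isn’t required for this tutorial – just a built-in setting worth knowing about:

# Optional: pause between requests so we don't hammer the site
DOWNLOAD_DELAY = 1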
Next let’s take a look at the quotes.py file. Initially this is just a skeleton class, but eventually it will contain code specific to our spider.
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
The scraping code will eventually live in the parse function.
The QuotesSpider class contains a couple of important variables. Most notable is the allowed_domains list. By specifying the domain name here, we limit the scope of the spider’s crawl: if the spider encounters a link that’s not part of the quotes.toscrape.com domain, it won’t visit it.
Next let’s take a look at some basic scraping syntax. We’ll do this through the interactive Scrapy shell. The Scrapy shell is a REPL where you can test your scraping code. This way you know everything works correctly before you commit the code to the quotes.py file.
Enter scrapy shell in your terminal window. You’ll see a bunch of debug messages. You can safely ignore them for now.
Type the following command into the shell.
fetch("http://quotes.toscrape.com")
You should find a DEBUG Crawled (200) success message in the output. This means that Scrapy successfully captured the page.
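As a side note, you can skip the separate fetch() call by passing the URL directly when you start the shell:

scrapy shell "http://quotes.toscrape.com"

Either way, you end up with the same fetched page to work with.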
Scrapy will automatically store the request you made in the request shell variable; you can see the details of the request by typing request into the shell.
>>> request
<GET http://quotes.toscrape.com>
Likewise, the results of the fetch are stored in the shell’s response variable.
>>> response
<200 http://quotes.toscrape.com>
But there’s a lot more to response than that – in fact it stores the entire page HTML from our GET request. We can access details of the page by parsing the response. This is done by specifying either XPath or CSS selectors.
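If you want to see that raw HTML for yourself, the response object exposes it as a string. Here’s a quick peek in the shell (output omitted – it’s the entire page):

>>> response.text[:200]   # the first 200 characters of the page's HTML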
So what’s the difference between the two? XPath selectors target nodes in the HTML document directly, while CSS selectors use the same syntax you’d use to style elements in a stylesheet. My general recommendation is to favor XPath selectors over CSS. Not only is it more difficult to make advanced selections with CSS selectors, but Scrapy translates all CSS selections to XPath at runtime anyway.
Let’s take a look at the page in Firefox or Chrome. The easiest way to see the source for the H1 tag is to right-click on the title and choose Inspect.
<div class="col-md-8">
    <h1>
        <a href="/" style="text-decoration: none">Quotes to Scrape</a>
    </h1>
</div>
This shows us that the text “Quotes to Scrape” is nested inside an a tag, which is itself nested inside the H1 tag, and so on. The XPath to access that text is //h1/a/text().
Let’s break that XPath down. The //h1 means “all H1 tags in the document” – and, naturally, there should only be one. Next, //h1/a means “the a tag in all H1 tags.” And by now you’ve probably guessed that //h1/a/text() means “the text contained in the a tag in all H1 tags.”
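If you’d like to convince yourself of each step, you can run the intermediate selections in the shell as well. Each one returns a list of Selector objects (outputs omitted here):

>>> response.xpath('//h1')            # every H1 tag in the document
>>> response.xpath('//h1/a')          # the a tag inside each H1
>>> response.xpath('//h1/a/text()')   # the text inside that a tag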
There are automated ways to generate the XPaths with various browser plugins – but I’ve found they don’t produce “clean” XPath selectors. They’re too specific, and difficult to work with in loop constructs. Experiment and find what works best for you.
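To give you an idea of what “too specific” means, a browser’s “Copy XPath” feature tends to hand back an absolute path – something like this hypothetical example – which breaks the moment the page layout shifts:

/html/body/div/div[1]/div[1]/h1/a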
So let’s plug our //h1/a/text() selector into Scrapy’s xpath() method. This is a shortcut Scrapy provides that makes acquiring the values more convenient.
>>> response.xpath('//h1/a/text()')
[<Selector xpath='//h1/a/text()' data='Quotes to Scrape'>]
Hmmm… close, but no cigar. To get the actual text “Quotes to Scrape” we can use either the extract() or extract_first() methods.
The extract() method will produce the results in a list, which is useful in some cases – but not this one. To get the string value, we’ll instead choose the extract_first() method.
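Just to see the difference, here’s what extract() returns for our selection – a list containing a single string:

>>> response.xpath('//h1/a/text()').extract()
['Quotes to Scrape']

extract_first(), on the other hand, hands us back the string itself: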
>>> response.xpath('//h1/a/text()').extract_first()
'Quotes to Scrape'
Dude! Sweet!
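For the curious, the same selection with a CSS selector looks like this, using Scrapy’s ::text pseudo-element to grab the text node:

>>> response.css('h1 a::text').extract_first()
'Quotes to Scrape'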
Now that we have a valid XPath selector that captures the text we want, let’s save it for reuse. Copy and paste it into the parse function in quotes.py.
Continue this way until you have all the desired data points. The snippet below includes a more advanced example: an XPath selector for the tags, extracted into a list.
def parse(self, response):
    # Grab the page title text from the H1 tag
    h1_tag = response.xpath('//h1/a/text()').extract_first()
    # Grab every tag name on the page into a list
    tags = response.xpath('//*[@class="tag-item"]/a/text()').extract()

    yield {'H1 Tag': h1_tag, 'Tags': tags}
We then yield a dictionary of the results.
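As an optional extension (the rest of this post sticks with the single-page version), you could add a few lines to the end of parse to follow the site’s “Next” link and crawl every page. Here’s a rough sketch, assuming the pagination link sits inside an <li class="next"> element as it does on quotes.toscrape.com, and that your Scrapy version provides response.follow:

    # Follow the "Next" pagination link, if there is one
    next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)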
Almost done. Now we’re ready to execute the spider. First, navigate to the top-level directory my_scraping_project, then run scrapy crawl quotes.
In the output, you should find:
...
{'Tags': ['love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile'], 'H1 Tag': 'Quotes to Scrape'}
...
There you have it: a functional web crawler that can scrape websites into individual, easily digestible data points. These can then be saved to .csv files or a database, or reused elsewhere in an application.
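Scrapy’s built-in feed exports will even handle the file-writing for you. For example, this command writes every yielded item to a CSV file:

scrapy crawl quotes -o quotes.csv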
Have fun.
Appendix
If you had problems installing Scrapy on Windows 10, you are not alone. In my situation, I encountered an error stating that a required Microsoft Visual C++ 14.0 binary package could not be located. If you get this error message too…
- See this Q/A about a similar issue: https://stackoverflow.com/questions/35025437/trouble-installing-scrapy-on-windows-64-bit
- The following site has precompiled Python library binaries for Windows: https://www.lfd.uci.edu/~gohlke/pythonlibs/. Find the right one and then use the pip install library-name.whl syntax (see link above) to install it.
- Alternatively, and the option I chose, install the Visual Studio 2017 Build Tools from here (warning: this will install about 4.5 GB of stuff): https://visualstudio.microsoft.com/visual-cpp-build-tools/
- You may also get a message about a missing pypiwin32 library. Run pip install pypiwin32 in a terminal to install it.