Web scrapers are invaluable tools for journalists, because they allow us to automate the retrieval of data from websites, PDFs, and the like. Once we’ve got that data tucked away in a spreadsheet or database, we can start working with it.
Here are Scraperwiki’s tutorials (in this case, for scrapers written in the programming language Python).
I don’t get an error, but when I press the letter T to start the process, nothing happens. I’m sending them an error report. Firefox 15.0.1 on Mac OSX.
Glen McGregor (national affairs reporter with the Ottawa Citizen newspaper) wrote a helpful article about scraping for journalists:
But the best way and most effective approach to real web‑scraping is to write your own custom computer scripts. Often, these are the only way to extract data from online databases that require user input, such as the vehicle recalls list or restaurant inspections site.
To do this, you will need to learn a little bit of computer programming using a language such as Python, Ruby, Perl or PHP. You only need to choose one.
Python, named after Monty not the snake, is my favourite for its simple syntax and great online support from Pythonistas. Ruby is also popular with data journalists. …
A program to scrape the vehicle recalls database would be written to submit a search term to the Transport website from a list of vehicle makes. It would capture the list of links the web server returns, then another part of the program would open each of these links, read the data, strip out all the HTML tags, and save the good stuff to a file.
Depending on the number of records and the speed of the server, it might take hours to run the program and assemble all the data in a single file. (For journalists not inclined to learn a computer language, Scraperwiki.com brings together programmers with people who need scraping work done.)
Read more > here.
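The steps McGregor describes (submit a search term, capture the links that come back, open each one, strip the HTML, keep the text) can be sketched roughly as follows. This is only an illustration under stated assumptions, not his program: the `make` query parameter, the example.com addresses, and the canned pages are all made up, and a stand-in `fake_fetch` function replaces real HTTP requests so the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextExtractor(HTMLParser):
    """Keep only the text between tags, i.e. strip out the HTML."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def build_search_url(base_url, make):
    # "make" is a hypothetical parameter name; a real site defines its own.
    return base_url + "?" + urlencode({"make": make})


def extract_links(html, page_url):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # Turn relative links such as /recall/1 into full addresses.
    return [urljoin(page_url, href) for href in parser.links]


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def scrape(makes, fetch, base_url):
    """Search each make, follow each result link, return the cleaned text."""
    rows = []
    for make in makes:
        search_url = build_search_url(base_url, make)
        for link in extract_links(fetch(search_url), search_url):
            rows.append(strip_tags(fetch(link)))
    return rows


# Made-up pages standing in for a real website.
FAKE_SITE = {
    "http://example.com/search?make=Acme":
        '<ul><li><a href="/recall/1">Recall 1</a></li></ul>',
    "http://example.com/recall/1":
        "<html><body><h1>Acme Roadster</h1><p>Brake hose may leak.</p></body></html>",
}


def fake_fetch(url):
    # Stand-in for a real HTTP request.
    return FAKE_SITE[url]


rows = scrape(["Acme"], fake_fetch, "http://example.com/search")
print(rows)
```

In a real scraper, `fake_fetch` would be replaced by a function that downloads the page, and `rows` would be written to a file (for example with Python's `csv` module) rather than printed.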
After showing us the basics of scraping a Web page, Nathan provides a script so that we can scrape a year’s worth of data from 365 separate Web pages (nice!).
But don’t think you have to type the script with your own little fingers. Like all good code book authors, Nathan has provided the code from the book as a download:
> Downloads page for Visualize This
NOTE that you will need to edit the second line of the script (use a plain-text editor) because of the error on page 32 (see explanation of the error).
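For a sense of what looping over a year of pages involves, here is a minimal sketch that builds one address per day. It is not Nathan's script: the URL pattern is hypothetical (substitute the pattern of the site you are scraping), and 2009 is used only as an example of a 365-day year.

```python
from datetime import date, timedelta

# Hypothetical URL pattern, not the one used in the book.
URL_TEMPLATE = "http://example.com/history/{year}/{month}/{day}/daily.html"


def daily_urls(year):
    """Yield one URL for each calendar day of the given year."""
    day = date(year, 1, 1)
    while day.year == year:
        yield URL_TEMPLATE.format(year=day.year, month=day.month, day=day.day)
        day += timedelta(days=1)


urls = list(daily_urls(2009))
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each address in `urls` would then be fetched and parsed in turn, which is why a single script can do the work of opening 365 pages by hand.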
> See all posts in this blog about this book
p. 31 DOES NOT WORK:
p. 32 DOES NOT WORK:
from BeautifulSoup import BeautifulSoup
USE THIS INSTEAD:
from bs4 import BeautifulSoup
p. 33 FURTHER EXPLANATION:
After you have found that the value you want (maximum temperature, which is 26°F*) is enclosed by span tags with class="nobr", you need to find out WHICH class="nobr" span you will be scraping. Nathan tells you it’s nobrs[5] … but how can you find that number (5) for yourself? (I will assume you know how arrays work.)
Memo to self: Journalism students are not likely to understand what an array is and how it works.
*You may see the temperatures in °C, depending on which country you’re in.
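One way to find that number for yourself is to print every matching span along with its position in the list, then look for the value you recognize. The sketch below is an illustration under assumptions: the HTML is a made-up fragment, not the real weather page, and it uses Python's built-in `html.parser` instead of Beautiful Soup so that it runs without installing anything. With Beautiful Soup the idea is the same: loop over the list of matching spans with `enumerate` and print each index next to its text.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span> with a given class, in page order."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0   # greater than 0 while inside a matching span
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A span nested inside a matching span.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.spans.append("")
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.spans[-1] += data


# Made-up fragment; the real page has more spans, so the index differs.
SAMPLE_HTML = """
<table>
  <tr><td>Mean Temperature</td>
      <td><span class="nobr"><span class="b">20</span> °F</span></td></tr>
  <tr><td>Max Temperature</td>
      <td><span class="nobr"><span class="b">26</span> °F</span></td></tr>
  <tr><td>Min Temperature</td>
      <td><span class="nobr"><span class="b">14</span> °F</span></td></tr>
</table>
"""

parser = SpanCollector("nobr")
parser.feed(SAMPLE_HTML)
parser.close()
nobrs = parser.spans

# Print the position of each span next to its contents.
for index, text in enumerate(nobrs):
    print(index, text)
```

In this small sample the maximum temperature (26) is printed next to index 1, because counting starts at 0. Running the same kind of loop on the real page is how you would confirm which index holds the value you want.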