The Data Science Toolkit, truly open tools for data. Via Brian Abelson.
Web scrapers are invaluable tools for journalists, because they allow us to automate the retrieval of data from websites, PDFs, and the like. Once we’ve got that data tucked away in a spreadsheet or database, we can start working with it.
Here are Scraperwiki’s tutorials (in this case, for scrapers written in the programming language Python).
(via macloo)
I don’t get an error, but when I press the letter T to start the process, nothing happens. I’m sending them an error report. Firefox 15.0.1 on Mac OS X.
“With Haystax, it’s easy to collect information from online databases and tables.”
Hat tip to Life and Code.
A tutorial written by Ben Welsh of the L.A. Times.
NOTE: Before you try this —
from BeautifulSoup import BeautifulSoup
— read this.
Glen McGregor (national affairs reporter with the Ottawa Citizen newspaper) wrote a helpful article about scraping for journalists:
But the best way and most effective approach to real web‑scraping is to write your own custom computer scripts. Often, these are the only way to extract data from online databases that require user input, such as the vehicle recalls list or restaurant inspections site.
To do this, you will need to learn a little bit of computer programming using a language such as Python, Ruby, Perl or PHP. You only need to choose one.
Python, named after Monty not the snake, is my favourite for its simple syntax and great online support from Pythonistas. Ruby is also popular with data journalists. …
A program to scrape the vehicle recalls database would be written to submit a search term to the Transport website from a list of vehicle makes. It would capture the list of links the web server returns, then another part of the program would open each of these links, read the data, strip out all the HTML tags, and save the good stuff to a file.
Depending on the number of records and the speed of the server, it might take hours to run the program and assemble all the data in a single file. (For journalists not inclined to learn a computer language, Scraperwiki.com brings together programmers with people who need scraping work done.)
Read more > here.
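The steps Glen describes (submit a search term, collect the links that come back, open each link, strip out the HTML tags, save the text) might be sketched like this. This is only a rough outline in Python 3 using the standard library, not his actual program. The `search` and `fetch` functions are hypothetical hooks where the real HTTP requests would go, and the pages and URLs below are made-up stand-ins so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class TextParser(HTMLParser):
    """Collect only the text that sits between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, href) for href in parser.hrefs]


def strip_tags(html):
    """Drop the HTML tags and collapse runs of whitespace."""
    parser = TextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def scrape(makes, search, fetch, out):
    """search(make) -> (results_html, base_url); fetch(url) -> record_html.
    Both are hypothetical hooks supplied by the caller."""
    writer = csv.writer(out)
    writer.writerow(["make", "url", "text"])
    for make in makes:
        results_html, base_url = search(make)
        for url in extract_links(results_html, base_url):
            writer.writerow([make, url, strip_tags(fetch(url))])


# Made-up stand-in pages; a real scraper would request these from the server.
FAKE_RESULTS = '<ul><li><a href="/recall/1">One</a></li><li><a href="/recall/2">Two</a></li></ul>'
FAKE_RECORDS = {
    "http://example.com/recall/1": "<h1>Recall 1</h1><p>Brake <b>hose</b> defect</p>",
    "http://example.com/recall/2": "<h1>Recall 2</h1><p>Airbag inflator defect</p>",
}


def fake_search(make):
    return FAKE_RESULTS, "http://example.com/search"


def fake_fetch(url):
    return FAKE_RECORDS[url]


out = io.StringIO()
scrape(["ExampleMake"], fake_search, fake_fetch, out)
print(out.getvalue())
```

Passing `search` and `fetch` in as arguments keeps the looping and cleaning logic separate from the network code, which is the part that differs most from one website to the next.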
ProPublica’s series of how-to guides explaining how they collected the data for their searchable app about pharmaceutical company payments to doctors (December 2010).
Results from Data Scraping
Okay, this is even better than the first one. I modified Nathan’s script to scrape both the maximum and minimum temperatures for 365 days (meaning 365 Web pages!) and dumped them into one comma-delimited text file. Then I imported it into Excel to make this graph. I just used the Excel chart tools to make it (Excel for Mac 2011).
Python (partial):
# Get temperatures from page (page is the HTML returned by urllib2.urlopen)
soup = BeautifulSoup(page)
# maxTemp = soup.body.nobr.b.string
# Look up the class="nobr" elements once, then pick max and min by position
nobrs = soup.findAll(attrs={"class":"nobr"})
maxTemp = nobrs[5].span.string
minTemp = nobrs[8].span.string
# Above I added a scrape for lowest temperature too
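For the looping part (one page per day, one row per day in a comma-delimited file), here is a rough sketch. It is not Nathan's script: it is written for Python 3, it reuses the URL pattern and the KBUF airport code from the book's example, and `get_temps` is a hypothetical hook where the page download and the BeautifulSoup lines above would go. The fake version below returns made-up values so the sketch runs without touching the network.

```python
import csv
import io
from datetime import date, timedelta

# URL pattern taken from the book's example; whether the site still
# serves pages at this address is an assumption.
URL_PATTERN = ("http://www.wunderground.com/history/airport/"
               "{airport}/{y}/{m}/{d}/DailyHistory.html")


def daily_urls(year, airport="KBUF"):
    """Yield (date, url) for every day of the given year."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day, URL_PATTERN.format(airport=airport, y=day.year,
                                      m=day.month, d=day.day)
        day += timedelta(days=1)


def write_temperatures(year, get_temps, out, airport="KBUF"):
    """get_temps(url) -> (max_temp, min_temp) is a hypothetical hook
    supplied by the caller."""
    writer = csv.writer(out)
    writer.writerow(["date", "max_temp", "min_temp"])
    for day, url in daily_urls(year, airport):
        max_temp, min_temp = get_temps(url)
        writer.writerow([day.isoformat(), max_temp, min_temp])


def fake_get_temps(url):
    # Made-up values standing in for a real download and parse.
    return "26", "14"


out = io.StringIO()
write_temperatures(2011, fake_get_temps, out)
rows = out.getvalue().splitlines()
print(len(rows) - 1, "days written")
print(rows[1])
```

Stepping through the calendar with `datetime` avoids hand-coding how many days each month has.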
Results from Data Scraping
So I’m pretty happy with today’s work: In a little less than 3 hours (including blogging about all this and looking up lots of related stuff), I was able to use Python to scrape 365 Web pages and export a comma-delimited file of the maximum recorded temperature for every day in 2011 for Gainesville, Florida.
I opened the file with Excel and used the built-in chart tools to create the graphic above, which is quite simple — but it’s showing all the data from that scrape! So cool!
You can view a Google Spreadsheets version > here.
After showing us the basics of scraping a Web page, Nathan provides a script so that we can scrape a year’s worth of data from 365 separate Web pages (nice!).
But don’t think you have to type the script with your own little fingers. Like all good code book authors, Nathan has provided the code from the book as a download:
> Downloads page for Visualize This
NOTE that you will need to edit the second line of the script (use a plain-text editor) because of the error on page 32 (see explanation of the error).
> See all posts in this blog about this book
p. 31 DOES NOT WORK:
urllib2.urlopen("www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html")
DOES WORK:
urllib2.urlopen("http://www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html")
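A likely reason for the difference: without the http:// at the front, Python has no way to tell that the string is a Web address at all. A small sketch (Python 3, where the URL tools live in `urllib.parse` rather than `urllib2`) shows how the two strings are taken apart:

```python
from urllib.parse import urlparse

without_scheme = "www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html"
with_scheme = "http://" + without_scheme

for url in (without_scheme, with_scheme):
    parts = urlparse(url)
    # scheme is the "http" part; netloc is the server name
    print(repr(parts.scheme), repr(parts.netloc))
```

For the first string both the scheme and the server name come back empty, so the whole thing is treated as a path. For the second, the scheme is http and the server is www.wunderground.com.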
p. 32 DOES NOT WORK (if the version you installed is Beautiful Soup 4):
from BeautifulSoup import BeautifulSoup
DOES WORK:
from bs4 import BeautifulSoup
p. 33 FURTHER EXPLANATION:
After you have found that the value you want (maximum temperature, which is 26°F) is enclosed by span tags with class="nobr", you need to find out WHICH class="nobr" you will be scraping. Nathan tells you it’s nobrs[5] … but how can you find that number (5) for yourself? (I will assume you know how arrays work.)
Memo to self: Journalism students are not likely to understand what an array is and how it works.
*You may see the temperatures in °C, depending on which country you’re in.
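One way to find the number for yourself is to print every class="nobr" element next to its position in the list and look for the value you recognize. The sketch below is an illustration under assumptions, not the book's method: it uses only the Python 3 standard library so it runs without installing anything, and the sample HTML is made up (only the 26°F maximum comes from the page discussed above). With BeautifulSoup, the same idea would be a loop over enumerate(soup.findAll(attrs={"class":"nobr"})).

```python
from html.parser import HTMLParser


class NobrCollector(HTMLParser):
    """Collect the text inside every <span class="nobr">, in page order."""
    def __init__(self):
        super().__init__()
        self.texts = []   # one string per nobr span
        self._depth = 0   # span nesting inside the current nobr span (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth > 0:
            self._depth += 1          # a span nested inside a nobr span
        elif "nobr" in (dict(attrs).get("class") or "").split():
            self._depth = 1           # a new nobr span begins
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.texts[-1] += data


def list_nobrs(html):
    """Return the cleaned-up text of each nobr span, in order."""
    parser = NobrCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.texts]


# Made-up sample; the real page has many more nobr spans than this.
sample = """
<table>
<tr><td>Mean</td><td><span class="nobr"><span class="b">20</span> °F</span></td></tr>
<tr><td>Max</td><td><span class="nobr"><span class="b">26</span> °F</span></td></tr>
<tr><td>Min</td><td><span class="nobr"><span class="b">14</span> °F</span></td></tr>
</table>
"""

for index, text in enumerate(list_nobrs(sample)):
    print(index, text)
```

In this made-up sample the maximum temperature turns up at position 1, because counting starts at 0. On the real page the same kind of listing is how you would confirm that the value you want sits at position 5.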
Scraping a Web page
So I’m on page 31 of Nathan’s book, and finally, everything is working. Ah!
Click the image above to see it full-size and readable. That is what the Python library BeautifulSoup can do for you. That is how Web pages are scraped.
But I realized something. The power of those few lines is obvious to me because I know HTML. I know what img and src mean. A lot of journalists have never learned HTML, so they would probably look at that and say, “Huh?”
Memo to self: Before teaching how to scrape, I must ensure that students know basic HTML.