Baby Steps in Data Journalism

Starting from zero, this Tumblr provides tools, links and how-to information for people just beginning to explore data journalism.
Posts tagged "scrape"



Software to extract data from PDFs. #journalism #data

because into each life a little PDF must fall, more's the pity


The Data Science Toolkit, truly open tools for data. Via Brian Abelson.


Web scrapers are invaluable tools for journalists, because they allow us to automate the retrieval of data from websites, PDFs, and the like.  Once we’ve got that data tucked away in a spreadsheet or database, we can start working with it.  

Here are ScraperWiki’s tutorials (in this case, for scrapers written in the programming language Python).
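At its simplest, a scraper just downloads a page and pulls out the pieces you care about. Here is a minimal sketch using only Python's standard library — the URL in the comment is a placeholder, not a real data source:

```python
# A minimal sketch of what a scraper does, using only the Python
# standard library. The URL shown below is a placeholder.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# To run it against a live page (hypothetical URL):
#   from urllib.request import urlopen
#   html = urlopen("http://example.com/data").read().decode("utf-8")
#   print(extract_links(html))
```

The tutorials linked above use BeautifulSoup instead of the raw parser, which makes this kind of extraction much less fiddly — but the idea is the same.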

(via macloo)

modfetish asks:
Haystax: is this bookmarklet working for you? It’s throwing an error to my console in Chrome and Firefox, on multiple computers.
babydatajournalism said:

I don’t get an error, but when I press the letter T to start the process, nothing happens. I’m sending them an error report. Firefox 15.0.1 on Mac OS X.

"With Haystax, it’s easy to collect information from online databases and tables."

Hat tip to Life and Code.

A tutorial written by Ben Welsh of the L.A. Times.

NOTE: Before you try this import:

from BeautifulSoup import BeautifulSoup

read this. (Newer installs use BeautifulSoup 4, where the import is instead: from bs4 import BeautifulSoup)

Glen McGregor (national affairs reporter with the Ottawa Citizen newspaper) wrote a helpful article about scraping for journalists:

But the best and most effective approach to real web-scraping is to write your own custom computer scripts. Often, these are the only way to extract data from online databases that require user input, such as the vehicle recalls list or restaurant inspections site.

To do this, you will need to learn a little bit of computer programming using a language such as Python, Ruby, Perl or PHP. You only need to choose one.

Python, named after Monty Python, not the snake, is my favourite for its simple syntax and great online support from Pythonistas. Ruby is also popular with data journalists. …

A program to scrape the vehicle recalls database would be written to submit a search term to the Transport website from a list of vehicle makes. It would capture the list of links the web server returns, then another part of the program would open each of these links, read the data, strip out all the HTML tags, and save the good stuff to a file.

Depending on the number of records and the speed of the server, it might take hours to run the program and assemble all the data in a single file. (For journalists not inclined to learn a computer language, brings together programmers with people who need scraping work done.)
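The workflow McGregor describes — submit a search term, capture the result links, strip the HTML, save the good stuff — can be sketched in a few small functions. This is only a sketch: the base URL and the "make" parameter are hypothetical placeholders, not the real Transport Canada site.

```python
# A rough sketch of the recall-scraper workflow described above:
# build a search URL for each vehicle make, then (in a fetch loop,
# omitted here) strip the HTML from each result page and append
# the text to a single comma-delimited file.
# The BASE_URL and the "make" parameter are hypothetical placeholders.
import csv
import re
from urllib.parse import urlencode

BASE_URL = "http://example.gc.ca/recalls/search"  # placeholder, not the real site


def search_url(make):
    """Build the query URL a form submission would generate."""
    return BASE_URL + "?" + urlencode({"make": make})


def strip_tags(html):
    """Throw away all HTML tags and collapse whitespace -- keep the good stuff."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def save_rows(rows, path):
    """Append scraped rows to a single comma-delimited file."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


# The fetch loop itself would call urllib.request.urlopen() on
# search_url(make) for each make in your list, pull the result links
# out of the response, then strip_tags() each linked page and
# save_rows() the results.
```

As McGregor notes, a run like this can take hours, so writing each row to the file as you go (rather than holding everything in memory) is the safer design.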

Read more > here.

ProPublica’s series of how-to guides explaining how they collected the data for their searchable app about pharmaceutical company payments to doctors (December 2010).

Results from Data Scraping

Okay, this is even better than the first one. I modified Nathan’s script to scrape both the maximum and minimum temperatures for 365 days (meaning 365 Web pages!) and dumped them into one comma-delimited text file. Then I imported it into Excel to make this graph. I just used the Excel chart tools to make it (Excel for Mac 2011).

Python (partial):

      # Get temperature from page
      soup = BeautifulSoup(page)
      # maxTemp = soup.body.nobr.b.string
      maxTemp = soup.findAll(attrs={"class":"nobr"})[5].span.string
      minTemp = soup.findAll(attrs={"class":"nobr"})[8].span.string
      # Above I added a scrape for lowest temperature too 

Results from Data Scraping

So I’m pretty happy with today’s work: In a little less than 3 hours (including blogging about all this and looking up lots of related stuff), I was able to use Python to scrape 365 Web pages and export a comma-delimited file of the maximum recorded temperature for every day in 2011 for Gainesville, Florida.

I opened the file with Excel and used the built-in chart tools to create the graphic above, which is quite simple — but it’s showing all the data from that scrape! So cool!
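The 365-page loop boils down to generating one URL per day of the year and scraping each in turn. Here is a sketch of the URL-building half; the URL pattern is an approximation of the Weather Underground history pages the book uses (and KGNV, the Gainesville airport station, is an assumption), so treat both as placeholders.

```python
# A sketch of the 365-page loop: build one URL per day of 2011,
# then scrape each page and append the result to one CSV file.
# The URL pattern and the KGNV station code are approximations --
# treat them as placeholders.
import csv
from datetime import date, timedelta


def day_urls(year, station="KGNV"):
    """Return one history-page URL for each day of the given year."""
    d = date(year, 1, 1)
    urls = []
    while d.year == year:
        urls.append(
            "http://www.wunderground.com/history/airport/%s/%d/%d/%d/DailyHistory.html"
            % (station, d.year, d.month, d.day)
        )
        d += timedelta(days=1)
    return urls


# The scraping loop would fetch each URL, pull out the max and min
# temperatures with BeautifulSoup (as in the snippet above), and
# write one row per day:
#   with open("temps_2011.csv", "w", newline="") as f:
#       writer = csv.writer(f)
#       writer.writerow(["date", "max_temp", "min_temp"])
#       for url in day_urls(2011):
#           ...  # fetch, parse, writer.writerow([...])
```

Using datetime to step through the calendar also means leap years come out right automatically if you rerun the scrape for a different year.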

You can view a Google Spreadsheets version > here.

After showing us the basics of scraping a Web page, Nathan provides a script so that we can scrape a year’s worth of data from 365 separate Web pages (nice!).

But don’t think you have to type the script with your own little fingers. Like all good code book authors, Nathan has provided the code from the book as a download:

> Downloads page for Visualize This

NOTE that you will need to edit the second line of the script (use a plain-text editor) because of the error on page 32 (see explanation of the error).

> See all posts in this blog about this book