Baby Steps in Data Journalism

Starting from zero, this Tumblr provides tools, links and how-to information for people just beginning to explore data journalism.
Posts tagged "tools"


Idea for teaching:

If each student sets up a free GitHub account —

They can make Gists, like this:

Could these be used for peer grading? Easy to share.

Once the student has a GitHub account, he/she can write code in CodePen and automatically save to Gist from there.

The “Visualize Execution” button below the code window lets you see a step-by-step graphical representation of what the code does. Wow! 

This is new for Overview:

This allows use of Overview without uploading the documents to DocumentCloud, and makes it much easier to import data from sources such as Twitter.



The Data Science Toolkit, truly open tools for data. Via Brian Abelson.

"With Haystax, it’s easy to collect information from online databases and tables."

Hat tip to Life and Code.

A very clear and simple guide that demonstrates WHY regular expressions are useful when you need to clean some data (that is, make it consistent).

You use a text-editor program to do this (e.g. TextWrangler on the Mac). There is no programming involved.
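For readers who later want to script the same find-and-replace idea rather than do it in a text editor, here is a throwaway sketch in Python. The phone numbers are invented; the point is that one pattern handles every inconsistent format at once.

```python
import re

# Invented example: the same phone number recorded three different ways.
rows = ["(352) 555-0199", "352.555.0142", "352-555-0187"]

# \D matches any non-digit, so one substitution normalizes all formats.
cleaned = [re.sub(r"\D", "", row) for row in rows]
print(cleaned)  # ['3525550199', '3525550142', '3525550187']
```

This is exactly what a text editor's regex find-and-replace does; the script just makes it repeatable across a whole dataset.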

Dan Nguyen is a developer/journalist for ProPublica, a non-profit investigative news organization. In this post, he introduces the following tools:

  1. Web inspector
  2. Google Refine
  3. Regular expressions

So none of these tools or concepts involve programming … yet. But they’re immediately useful on their own, opening new doors to useful data and giving beginners just enough to entice them into going further.

— Dan Nguyen

Glen McGregor (national affairs reporter with the Ottawa Citizen newspaper) wrote a helpful article about scraping for journalists:

But the most effective approach to real web scraping is to write your own custom computer scripts. Often, these are the only way to extract data from online databases that require user input, such as the vehicle recalls list or restaurant inspections site.

To do this, you will need to learn a little bit of computer programming using a language such as Python, Ruby, Perl or PHP. You only need to choose one.

Python, named after Monty not the snake, is my favourite for its simple syntax and great online support from Pythonistas. Ruby is also popular with data journalists. …

A program to scrape the vehicle recalls database would be written to submit a search term to the Transport website from a list of vehicle makes. It would capture the list of links the web server returns, then another part of the program would open each of these links, read the data, strip out all the HTML tags, and save the good stuff to a file.
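The steps McGregor describes can be sketched in Python using only the standard library. Everything here is an assumption for illustration — the URL, the `make` form field, and the link pattern are made up, and a real recalls database would differ — but the shape of the program (submit a search term, capture links, open each link, strip tags, save rows) is the same.

```python
import csv
import re
import urllib.parse
import urllib.request

SEARCH_URL = "https://example.gc.ca/recalls/search"  # hypothetical URL

def strip_tags(html):
    """Drop HTML tags and collapse whitespace, keeping the good stuff."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

def scrape(makes, fetch):
    """Submit each make as a search term, follow result links, return rows."""
    rows = []
    for make in makes:
        query = urllib.parse.urlencode({"make": make})
        listing = fetch(SEARCH_URL + "?" + query)
        # Capture the result links the server returns (pattern is invented).
        for link in re.findall(r'href="(/recalls/\d+)"', listing):
            page = fetch(urllib.parse.urljoin(SEARCH_URL, link))
            rows.append([make, link, strip_tags(page)])
    return rows

def fetch_live(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# To run against a real site, then save the good stuff to a file:
# rows = scrape(["Ford", "Toyota", "Honda"], fetch_live)
# with open("recalls.csv", "w", newline="") as f:
#     csv.writer(f).writerows(rows)
```

Passing the `fetch` function in as a parameter means you can test the logic on canned HTML before pointing it at a live server — handy when a full run takes hours.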

Depending on the number of records and the speed of the server, it might take hours to run the program and assemble all the data in a single file. (For journalists not inclined to learn a computer language, brings together programmers with people who need scraping work done.)

Read more > here.

Last weekend I was looking for ways to extract Twitter search data in a structured, easily manageable format. The two APIs I was using (Twitter Search and Backtweets) were giving good results – but as a non-developer I couldn’t do much with the raw data they returned. Instead, I needed to get the data into a format like CSV or XLS.

Some extensive Googling led me to this extremely useful post on Labnol, where I learnt about how to use the ImportXML function in Google Spreadsheets. Before too long I’d cracked my problem. In this post I’m going to explain how you can do it too.
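For readers who would rather script the conversion than use a spreadsheet, the same extract-and-save idea can be sketched with Python's standard library. The XML below is an invented stand-in for an API response; the structure of a real feed would differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample standing in for an XML response from a search API.
sample = """<results>
  <tweet><user>alice</user><text>hello</text></tweet>
  <tweet><user>bob</user><text>hi there</text></tweet>
</results>"""

# Pull the fields we care about out of each record.
root = ET.fromstring(sample)
rows = [[t.findtext("user"), t.findtext("text")] for t in root.iter("tweet")]

# Write the structured result as CSV (use open("tweets.csv", "w") for a file).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["user", "text"])
writer.writerows(rows)
print(out.getvalue())
```

This is the script-based equivalent of the ImportXML approach: point a parser at XML, pick out elements, and land the result in CSV.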

Click the link to learn how!

Damn, this is a VERY LONG list of links! But the resources here have great value. You owe it to yourself to at least scan the list and marvel at the wonderfulness that good journalism people make available — free of charge — to all of us.

Thank you, NICAR! And thanks to Chrys Wu for making this wonderful list.

ScienceOnline Bay Area event: Data Visualization and Data Journalism in Science, 19 April 2012.

Click the image to get the links.