Baby Steps in Data Journalism

Starting from zero, this Tumblr provides tools, links and how-to information for people just beginning to explore data journalism.

Finally! A free online book about how to do data journalism, written by experts and practitioners.

Both of these are introductory texts in computer science:

Introduction to Computing: Explorations in Language, Logic, and Machines, by David Evans, associate professor of computer science, University of Virginia (this book uses Python only in two chapters at the end)

How to Think Like a Computer Scientist, by Jeffrey Elkner, Allen B. Downey, and Chris Meyers (this book uses Python throughout; it’s online only, no PDFs)

Allen [Downey] had already written a first-year computer science textbook, How to Think Like a Computer Scientist. When I read this book, I knew immediately that I wanted to use it in my class. It was the clearest and most helpful computer science text I had seen. It emphasized the processes of thought involved in programming rather than the features of a particular language. Reading it immediately made me a better teacher.

How to Think Like a Computer Scientist was not just an excellent book, but it had been released under the GNU public license, which meant it could be used freely and modified to meet the needs of its user. Once I decided to use Python, it occurred to me that I could translate Allen’s original Java version of the book into the new language.

— Jeffrey Elkner, from the Preface, How to Think Like a Computer Scientist

Elkner is a high school math and computer science teacher in the Arlington County, Virginia, public schools.

Nowadays, it’s not out of the ordinary that I spend just as much time getting data in the format that I need as I do putting the visual part of a data graphic together. Sometimes I spend more time getting all my data in place.

Here’s a recap if you are just joining me: I started this blog on April 8, 2012, because I am reading Nathan’s book about how to make data graphics. I think data graphics are very important for journalism, now and in the future.

So when I reached the first exercise in the book (around page 31), I decided to open up Python, a programming language, on my MacBook and play along.

I had an older version of Python, so I installed a new one.

Then I installed something called ActiveTcl, which Nathan does not mention (on a Mac, IDLE, the editor that comes with Python, needs an up-to-date Tcl/Tk to run).

Finally, I installed a Python library called Beautiful Soup, which lets us scrape data from Web pages. This is done in Nathan’s first exercise. I ran into some problems, but I managed to solve them.
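If you want to make sure the install worked, here is a quick sanity test I'd run in the Python interpreter. This is my own check, not a step from the book; note the bs4 package name, which is explained in the page 32 fix further down.

# Quick test that Beautiful Soup is installed and can parse HTML.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<span class='nobr'>26</span>")
print(soup.span.string)  # should print: 26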

I completed the exercise, in which one scrapes data from 365 separate Web pages (hooray!). Then I made a graphic from my data, using MS Excel.
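For anyone who wants a preview, here is a condensed sketch of what the finished exercise does. It is not Nathan's exact script, and the station code, year, and output filename are my own stand-ins; the two fixes it builds in are explained in the posts below.

# Sketch of the 365-page scrape: visit one Weather Underground
# history page per day, pull the maximum temperature, and write
# one comma-delimited line per day.
import urllib2
from bs4 import BeautifulSoup  # bs4, per the page 32 fix below

station = "KGNV"  # stand-in airport code; swap in your own
year = 2011
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

out = open("wunder-data.txt", "w")
for month in range(1, 13):
    for day in range(1, days_in_month[month - 1] + 1):
        url = "http://www.wunderground.com/history/airport/%s/%d/%d/%d/DailyHistory.html" % (station, year, month, day)
        soup = BeautifulSoup(urllib2.urlopen(url))
        nobrs = soup.find_all("span", attrs={"class": "nobr"})
        max_temp = nobrs[5].span.string  # why index 5? see the page 33 note below
        out.write("%d/%d/%d,%s\n" % (year, month, day, max_temp))
out.close()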

I have indulged in a few distractions along the way.

That’s everything up to now.

I would be remiss if I did not link to Nathan’s wonderful blog, which “explores how designers, statisticians, and computer scientists are using data to understand ourselves better — mainly through data visualization.”

Results from Data Scraping

So I’m pretty happy with today’s work: In a little less than 3 hours (including blogging about all this and looking up lots of related stuff), I was able to use Python to scrape 365 Web pages and export a comma-delimited file of the maximum recorded temperature for every day in 2011 for Gainesville, Florida.

I opened the file with Excel and used the built-in chart tools to create the graphic above, which is quite simple — but it’s showing all the data from that scrape! So cool!

You can view a Google Spreadsheets version > here.

After showing us the basics of scraping a Web page, Nathan provides a script so that we can scrape a year’s worth of data from 365 separate Web pages (nice!).

But don’t think you have to type the script with your own little fingers. Like all good code book authors, Nathan has provided the code from the book as a download:

> Downloads page for Visualize This

NOTE that you will need to edit the second line of the script (use a plain-text editor) because of the error on page 32 (see explanation of the error).

> See all posts in this blog about this book

p. 31 DOES NOT WORK:

urllib2.urlopen("www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html")

DOES WORK:

urllib2.urlopen("http://www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html")

(urllib2 needs the full URL, including the http:// scheme; without it, urlopen raises a ValueError.)

p. 32 DOES NOT WORK:

from BeautifulSoup import BeautifulSoup

DOES WORK:

from bs4 import BeautifulSoup

(The book was written for Beautiful Soup 3; version 4 renamed the package to bs4, so the old import fails if you installed the current version.)

p. 33 FURTHER EXPLANATION:

After you have found that the value you want (maximum temperature, which is 26°F) is enclosed by span tags with class="nobr", you need to know how to find out WHICH class="nobr" you will be scraping. Nathan tells you it's nobrs[5] ... but how can you find that number (5) for yourself? (I will assume you know how arrays work.)

  1. View Source on the HTML page you want to scrape.
  2. Command-F to find text in the source.
  3. Type (in this case) the class you’re seeking: nobr
  4. Find repeatedly and count until you reach the maximum temperature value (26°F).* On the example page, you will have counted to 6. Why then does Nathan tell us to use 5? Because items in an array are numbered starting at 0. So the first item in your array named nobrs would be nobrs[0], and the sixth item is nobrs[5]. (There's a short code sketch of this after the footnote below.)

Memo to self: Journalism students are not likely to understand what an array is and how it works.

*You may see the temperatures in °C, depending on which country you’re in.
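Here is that counting written out as code, a minimal sketch using the fixed import from page 32. It assumes the example page's layout has not changed since the book was written.

# List every <span class="nobr"> on the book's example page, in
# document order, then index into the list.
import urllib2
from bs4 import BeautifulSoup

url = "http://www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html"
soup = BeautifulSoup(urllib2.urlopen(url))

nobrs = soup.find_all("span", attrs={"class": "nobr"})
print(len(nobrs))            # how many matches you counted through
print(nobrs[5].span.string)  # the sixth match: the max temperature (26)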

I wanted to do some of the things recommended in Nathan Yau's excellent book Visualize This: The FlowingData Guide to Design, Visualization, and Statistics (Wiley, 2011). After I had about 120 lines of notes in a text editor, I thought: "Hey, with so many links and stuff, I should turn this into a Tumblr!" So here it is.

Click the book cover to see it on Amazon.com.

Check out Nathan’s popular blog, FlowingData.