In my last post I wrote about how I learned to use Google Fusion tables to create an interactive map.

I made another map this morning, featuring some data that fellow interhacktive, Sam Creighton, sent my way yesterday.

This one visualises requests that Google have received from governments the world over, asking for user data to be handed over. The data I mapped is for the period Jan 1st – Jun 30th 2012.

Some interesting figures from a cursory look at the data:

  • Google complied with 90% of requests from the US
  • Hungary, Russia and Turkey submitted a combined total of 262 requests from Jan – Jun, but Google didn’t comply with any of them
  • The US submitted more requests in the first six months of this year (7,969) than it did in the entirety of 2009 and 2010 combined (7,867).

Geopolitics aside, the last point is particularly interesting as it hints at how rapidly the digital world has grown, even over the last few years.

Google blogged about the rise in government requests earlier this week:

The information we disclose is only an isolated sliver showing how governments interact with the Internet, since for the most part we don’t know what requests are made of other technology or telecommunications companies. But we’re heartened that in the past year, more companies like Dropbox, LinkedIn, Sonic.net and Twitter have begun to share their statistics too. Our hope is that over time, more data will bolster public debate about how we can best keep the Internet free and open.


I’ve heard a lot about Google Fusion Tables over the last few weeks. Simon Rogers, editor of the Guardian Datablog, must have mentioned them 10 times when he came to talk at City last month. Some of our other lecturers, including Paul Bradshaw and Gary Moskowitz, have also mentioned them in class or in passing.

To a fledgling interactive journalist, they sound pretty important – especially as I don’t really know what they are.

Being the industrious student that I am, I thought I would ignore all of the immediate work I should have been doing over the weekend, and instead have a play around.

I thought I would keep things simple and create an interactive map – the type of thing you see on the Guardian all the time.

I knew I would need some data so I decided to head over to whatdotheyknow.com to see if anyone had kindly submitted a single freedom of information request to all of London’s borough councils.

Luckily for me, a man called Hugh Roberts had recently used whatdotheyknow to FOI 165 councils across the country about dog shit.

Success!

Hugh had asked each of the 165 councils (including the London borough ones):

    • how many complaints have you received about dog fouling from 2005 to 2012?
    • how many fixed penalty notices for dog fouling did you issue during the same period?
    • what were the dates of any dog fouling campaigns the council participated in?

I decided to discard the third question for the moment and focus on the first two.

I created a Delicious bundle consisting of every whatdotheyknow response page for each of London’s 33 boroughs.

Unfortunately, Hugh probably didn’t attend Heather Brooke’s excellent talk on the FOI act last month, so he didn’t know to ask the councils to provide the data in Excel format. This meant laboriously trawling through every response letter, deciphering the myriad ways each council had chosen to display its information.

The irony that I was doing the digital equivalent of sifting through dog shit wasn’t lost on me.

I then totted up the figures and put them into a spreadsheet:

Not every council had successfully responded to Hugh, so I only had data for 26 of the 33 London boroughs. Once I had finished with the spreadsheet, I exported it as a CSV (comma separated values) file.
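To give a sense of the shape of the thing (the figures below are placeholders rather than the councils’ real responses), each row of the CSV held one borough and its two totals:

Borough,Complaints 2005-2012,Fixed penalty notices 2005-2012

Barking and Dagenham,[total complaints],[total notices]

Barnet,[total complaints],[total notices]

Bexley,[total complaints],[total notices]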

Next, I went to my Google Drive and clicked on Create –> More –> Fusion Table (experimental). I uploaded my CSV file, gave the table a name and a description, and clicked Finish.

The data was then imported into a Google table:

Then I went to research.google.com/tables and searched for ‘London boroughs’. I clicked on a result that looked like it would contain the borough outlines and copied the URL.

Back on my original table, I clicked on File –> Merge, and pasted the URL into the box that appeared. After clicking Next, I made sure that the ‘Borough’ categories matched up from both tables, clicked Next again, made sure every tick box was ticked, and finally clicked Merge.

Google then fuses (see where the fusion aspect comes into it? Clever stuff!) the two tables together. In the resulting window, there’s a tab called ‘Map of Outline’. Click it, and hey presto, your dog shit data should now be lovingly housed by beautiful, red borough boundaries.
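If the merging feels a bit abstract, think of it as the map-scale version of a lookup formula: both tables share a ‘Borough’ column, so each row of dog fouling data gets matched to the row holding that borough’s outline. A rough spreadsheet equivalent – assuming a second sheet called Outlines, with borough names in column A and outline data in column B – would be:

=VLOOKUP(A2, Outlines!A:B, 2, FALSE)

Fusion Tables just does that match for every row at once, and knows how to draw the resulting geometry on a map.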

Go to Tools –> Map styles to have a play with the colour and border settings.

If you can’t be bothered to make one yourself, have a look at my end result by clicking on the screenshot below (unfortunately there’s no way to embed the final product on WordPress blogs). I set up a gradient colour system so you can quickly glean a bit more information from it before interrogating the figures further.

As with all data journalism, the data is useless unless you use it to tell a story. My main aim was to learn a bit about the Google Fusion software and how to transform data from a spreadsheet into something more interesting, but I’ll follow this post up with some ideas on where I might head next if I were writing a story.

From what I can gather, data journalism seems to be both an art and a science. All of us on City’s Interactive MA will eventually approach similar problems in different ways.

The other students on the course are also running their own data blogs. They’re all listed in my blogroll on the right of every page, but here are some of their first posts on data journalism.

In our first session with Paul Bradshaw, we scraped the website horsedeathwatch.com into a Google spreadsheet.

If you haven’t signed up already, then register with Google Drive. Once you’re done, log in and click on the large red “Create” button in the left sidebar. Select “Spreadsheet”:

You’ll then be taken to your blank spreadsheet. Cell A1 should automatically have been selected. If not, click on it:

Next, go to horsedeathwatch.com and copy the URL. We’re going to use Google Docs’ ability to scrape raw data from HTML tables into your newly created spreadsheet.

In cell A1, type:

=importHTML("

Paste the URL of the website you’re scraping – in our case http://www.horsedeathwatch.com – and close the quotes. Your cell should now contain:

=importHTML("http://www.horsedeathwatch.com"

Next, put a comma after the quotation marks and type “table” (including the quotes).

Finally, put another comma, and type "1") (including the quotes and closed bracket). The final formula should look like this:

=importHTML("http://www.horsedeathwatch.com","table","1")

Press enter, and the data will be pulled into your spreadsheet, which should appear as below:

It’s worth then copying this data into a new sheet before saving the document, as at the moment your scrape is based on a live website. If the website changes, your data changes, so if you’ll be using this data for a particular story you’ll want to make sure it’s safe.

To do this, select all the data (cmd+A on Macs, ctrl+A on Windows), click on Edit, then Copy. Then open a new sheet by clicking on the plus button in the bottom left corner. Click on cell A1 of the new sheet, click Edit, then Paste Special, then Paste Values Only.

Some other “import” functions include:

  • =importdata (grabs data from a .txt or .csv file)
  • =importfeed (grabs data from an RSS feed)
  • =importxml (grabs data from an XML file)
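They all follow the same pattern as importHTML: point the formula at a URL and the contents get pulled into your sheet. Here are some example calls – the URLs are just placeholders, and note that importxml also wants an XPath query telling it which parts of the file you’re after:

=importdata("http://example.com/data.csv")

=importfeed("http://example.com/feed.rss")

=importxml("http://example.com/page.xml", "//title")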

[Image courtesy of wickes.co.uk]

It’s Sunday night. I have had four days to ruminate on what I learned in my first data journalism session with Paul Bradshaw, who runs the Online Journalism Blog and can be followed @paulbradshaw.

By lunch I was ready to throw my computer out of the window and quickly follow suit, but in the afternoon something wonderful happened – I made a scraper!

“What’s a scraper?” I hear all one of you shout. A scraper is a small computer program that pulls bits of information from a website and puts them into a nice, neat table for you.

Using the Olympic Games website as an example, if you wanted to collect data (name, age, country, sport and result, for example) on each and every athlete that competed in this year’s games, it would take a long, long time. Longer than a Leonard Cohen song, even (© Armando Iannucci).

Almost 11,000 athletes from 203 countries competed and each of them has a unique page on the london2012.com site. That’s a lot of browsing.

A scraper cuts out the slow middle man – you – and quickly pulls the required information into a Google spreadsheet or similar program.

In my next post I’ll cover how we created a simple scraper that collated data from horsedeathwatch.com. Lovely.

“What exactly is data journalism?” my flatmate asked earlier this evening.

“Ermmm…it’s sort of like normal journalism, only with lots of Excel spreadsheets,” I replied.

“No. Wait. That’s a shit way of describing it. There may not actually be many spreadsheets at all. Although there probably will be. I think. It’s sort of…you know…Google Fusion tables and Python and Ruby and scraping and APIs and…you know…all that stuff? But Excel is definitely involved. I think.”

And therein lies the problem: most aspects of data journalism are new to me. And when I say ‘most’ I mean ‘nearly all’.

This blog documents my changing ability to define data journalism, from a series of ‘ums and ahs’ to something that will fit into 140 characters.