A brief Storify demonstrating a typical sort of conversation I have had many times over the past year, with members of the data journalism community. I have yet to come away without answers to my questions or new ways of trying things.

Because this blog isn’t self-hosted, you’ll have to click on the image below to see the proper interactive version of the Storify.

Data journalism: a conversation (with tweets) · henryctaylor · Storify

Got some messy data that you need to clean up? Google Refine (technically known as Open Refine, but no one calls it that) is probably your best bet.

Using Google Refine, you can quickly transform unformatted, unrefined data like this: 

Screen Shot 2013 05 06 at 18 22 00

…into a formatted, column-separated table in Excel, like so:

Screen Shot 2013 05 06 at 18 22 11

The example I’m using isn’t a particularly good one, as it wouldn’t take long to copy, paste and edit the data in Excel, or even manually type the data out by hand, but the cleanup techniques work wonders if you’re working with a large dataset that contains lots of rows and/or columns.

To start with, download and install Refine, and open it up (it launches in your default browser). Head over to http://www.thepaperboy.com/uk-top-10-newspapers.cfm, then select and copy the “Top 10 UK Daily Newspapers” data, including the title.

In Refine, click on Create Project, then Clipboard, and paste your data into the box. Then click Next.

Your project should now look something like this:

Screen Shot 2013 05 06 at 18 34 49

Don’t worry about any of the options in the bottom half of the window; for now just change the project name from “clipboard” to something else and click Create Project. 

Your data has now been imported into Refine, and should look like this:

Screen Shot 2013 05 06 at 18 37 18

As you can see, there are some problems with our data, and it needs a bit of cleaning up before any visualisation can take place.

Problems:

  1. All of the data is in one column
  2. The current column is incorrectly named
  3. The 1st row “Top 10 UK Daily Newspapers” isn’t needed
  4. Unnecessary parentheses around each circulation figure
  5. Numbers before each newspaper name


Problem 1: All of the data is in one column

Refine allows you to split an existing column into 2 or more columns. A column can be split by what’s known as a “separator”, which simply means a character, or number of characters, that appear in each row. We can see that for each row of data, between the end of the newspaper name and the start of the circulation figure, we have a left parenthesis. This is our separator.

Click on the down arrow at the top of Column 1 and click Edit Column > Split into several columns.

In the Separator box, type a left parenthesis: (

Underneath, in the Split into box, type 2, as we want to end up with 2 separate columns – one containing the newspaper name and the other containing the circulation figures.

Click OK.

Screen Shot 2013 05 06 at 18 50 08

The data has now been split into two separate columns and should look like this (handily, the left parenthesis has also been removed, meaning we don’t have to remove it later on):

Screen Shot 2013 05 06 at 19 29 09

 

Problem 2: The columns are incorrectly named

Renaming columns in Refine is easy: simply click on the down arrow at the top of the column and click Edit column > Rename this column. Call your columns “Newspaper” and “Circulation”.

 

Problem 3: The 1st row “Top 10 UK Daily Newspapers” isn’t needed

Refine doesn’t automatically let you delete individual rows, but there is a way around this; we need to create a text facet. A text facet is probably Refine’s most simple feature, and is similar to a filter in Excel.

To remove the top row, click on either the star or flag to the left of Top 10 UK Daily Newspapers, then click on the drop-down arrow for the column titled “All”. Click Facet and then either Facet by star or Facet by flag, depending on which one you chose initially. On the left of the window, you’ll now see your text facet, for which you have two options: True or False. Click True to display only the flagged/starred rows. Finally, to remove the row, click on the drop-down arrow for the All column again and select Edit rows > Remove all matching rows. Click the x next to your “Flagged Rows” facet on the left to return to the main view.


Problem 4: Unnecessary parentheses around each circulation figure

The left parenthesis was removed earlier when we split the first column, but the circulation figures are still followed by a right parenthesis. To remove this we need to write a GREL expression, which means a bit of coded instruction that performs a specific task.

Hit the drop-down arrow on the Circulation column, and select Edit Cells > Transform.

In the box that appears, write the following:

replace(value, “)”, “”)

Screen Shot 2013 05 06 at 20 02 26

This tells Refine to look for any occurrence of the right parenthesis ) character, and replace it with “nothing”, meaning it gets deleted. Notice the end result appears in the preview window before you hit OK. If your GREL expression isn’t working properly, make sure you haven’t used too few or too many spaces. Whitespace is this programming language’s worst enemy.

 

Problem 5: Numbers before each newspaper name

Our data is now starting to take shape, but before we can go ahead and export, we need to remove those unnecessary numbers that appear before each newspaper name.

Screen Shot 2013 05 06 at 20 08 03

In order to do this, we need to use another GREL expression, but this one tells Refine to look for any numbers in the column and remove them.

Click the drop-down arrow at the top of the Newspaper column and select Edit Cells Transform again. This time, we’re going to use this expression:

replace(value, /[0-9]/, “”)

The two forward slashes donate a “regular expression” (otherwise known as regex) that instructs Refine to find and recognise any matching string of characters, words, or numbers, depending on the pattern that the user wants to identify.

As we want to remove any number, the regex that we have used is /[0-9]/, which instructs refine to locate any number between 0 and 9. The second part of our GREL expression works as before, replacing these numbers with “nothing”, which results in them being deleted.

Note: If you’ve come across regex before, you can also use \d in place of [0-9] to remove the numbers, although be aware that this isn’t directly equivalent. \d matches both 0123456789 characters and other digit characters too, including Eastern Arabic numerals, for example.

Screen Shot 2013 05 06 at 20 24 56

Our data is now cleaned up and ready to export! To do this, simply click on Export in the top right hand corner of the Refine page, and select your choice of Tab-separated value (.tsv), Comma-separated value (.csv), HTML table, Excel spreadsheet (.xls) or ODF Spreadsheet (for use with Open Office). Open it up in your program of choice and you’re ready to go.

As I mentioned before, the dataset in this example isn’t especially messy or complex, but when you inevitably come across a large set of badly formatted data, experiment with some of the techniques outlined above to see if you can clean things up before moving over to Excel.

Inspired by the chaps over at Google, Twitter unveiled their first Transparency Report in July last year, which aimed to shed more light on:

  • government requests for user information
  • government requests to withhold content, and
  • DMCA (Digital Millennium Copyright Act) takedown notices received from copyright holders.

The report also described how and when Twitter responded to these requests, but the take-home message was the rapidly increasing number of government appeals for information; “more in the first half of 2012 than in the entirety of 2011.”

6 months later in January this year, Twitter rolled out transparency.twitter.com – a new home for their transparency report and part of a drive to make the data more accessible.

Within a couple minutes of this tweet…

Screen Shot 2013 05 04 at 16 07 45

… I had gone onto the site and quickly copied the data for both removal requests and copyright notices into a Google spreadsheet. The data was split into 2 sets (Q1 – Q2 and Q3 – Q4) and so required a brief bit of formatting before it was ready to visualise.

Screen Shot 2013 05 04 at 17 26 06

After spending 10 minutes or so wrangling the data into the right format (read my earlier post regarding Google Fusion tables and getting hold of shape files) I had the following data, ready to chart, plot, map, or do whatever was needed:

Screen Shot 2013 05 05 at 13 20 19

Screen Shot 2013 05 04 at 17 31 41

There are a number of ways to visualise the above datasets, but to ensure a quick turnaround of the data, I opted for the built in Google options of Google Charts and Google Fusion tables. Both tools are free.

If you’re familiar with creating charts and graphs in Excel, Google’s Chart Wizard should be fairly intuitive. Even if you’re a bit rusty, the wizard previews any changes you make to your chart before you finalise, so play around with the options until you’ve got your visualisation the way you want it.

Be wary of choosing an unnecessary and overcomplicated type of chart; with relatively simple datasets like these, if your visualisation isn’t helping you to distinguish between different variables or interpret the data in a more efficient way, it’s not doing its job properly.

The ease and speed with which someone can use Google’s free tools to quickly turn around data meant that in 15 minutes of Twitter’s original tweet, I had produced and tweeted this:

Screen Shot 2013 05 05 at 13 36 23

Screen Shot 2013 05 05 at 13 43 54

…and in less than 60 minutes, I had produced and tweeted this:

Screen Shot 2013 05 05 at 13 43 26

Screen Shot 2013 05 05 at 13 44 16

As I don’t have a self-hosted WordPress blog, you’ll have to click on the above image to link to the actual Fusion map.

This fast turnaround of data is the sort of style that the Guardian datablog utilises effectively every day, thereby keeping their posts newsworthy. While custom-coded visualisations are often very impressive, they take time and effort to create. For simple data that you’ve accessed soon after its release, this is the way to go.

While working on The Hackney Post during City’s production weeks, I was seconded away from the web team for a day or two in order to create some visualisations for the paper.

This requires a slightly different approach to data visualisation. On the web you don’t necessarily need to display the data points on a bar graph, for example, as the reader can hover their mouse over the relevant bar and see the specific data point in a pop-up window. The same can be done with Google Fusion maps and other types of embedded visualisations to display more information in a cleaner way.

The same approach obviously doesn’t work for print, which means you have to think a little more carefully about how to cram all of that vital data into a visualisation without ruining its readability and therefore the reader’s subsequent comprehension.

A knowledge of photoshop can come in handy here, as you can tweak and fine-tune to your heart’s content – much more so than with the built-in options available to you with most free web tools.

For a story on drug raids in Hackney that was to appear in the next day’s paper, I created the map below, using crime rates available from the Met Police, a screenshot of a Google Fusion map I created containing the colour-coded Hackney Ward boundaries by crime rate, and the locations of those recently arrested under a Hackney Police operation.

hackney-crime-map-for-paper

The font for each ward name was chosen as it matched the Hackney Post’s new logo, which I also created as part of the paper’s redesign:

Hackney-Post-logo700

While the key could be a little clearer, the image caption in the paper also clarified what the reader was looking at, and how to interpret the map.

A versatile data journalist is also one that can work with traditional print media. Amassing any sort of transferrable skills in data journalism will make you a worthwhile addition to any organisation.

Follow

Get every new post delivered to your Inbox.

Join 581 other followers