Cleaning up data in Google Refine

Got some messy data that you need to clean up? Google Refine (technically known as Open Refine, but no one calls it that) is probably your best bet.

Using Google Refine, you can quickly transform unformatted, unrefined data like this: 

Screen Shot 2013 05 06 at 18 22 00

…into a formatted, column-separated table in Excel, like so:

Screen Shot 2013 05 06 at 18 22 11

The example I’m using isn’t a particularly good one, as it wouldn’t take long to copy, paste and edit the data in Excel, or even manually type the data out by hand, but the cleanup techniques work wonders if you’re working with a large dataset that contains lots of rows and/or columns.

To start with, download and install Refine, and open it up (it launches in your default browser). Head over to http://www.thepaperboy.com/uk-top-10-newspapers.cfm, then select and copy the “Top 10 UK Daily Newspapers” data, including the title.

In Refine, click on Create Project, then Clipboard, and paste your data into the box. Then click Next.

Your project should now look something like this:

Screen Shot 2013 05 06 at 18 34 49

Don’t worry about any of the options in the bottom half of the window; for now just change the project name from “clipboard” to something else and click Create Project. 

Your data has now been imported into Refine, and should look like this:

Screen Shot 2013 05 06 at 18 37 18

As you can see, there are some problems with our data, and it needs a bit of cleaning up before any visualisation can take place.

Problems:

  1. All of the data is in one column
  2. The current column is incorrectly named
  3. The 1st row “Top 10 UK Daily Newspapers” isn’t needed
  4. Unnecessary parentheses around each circulation figure
  5. Numbers before each newspaper name


Problem 1: All of the data is in one column

Refine allows you to split an existing column into 2 or more columns. A column can be split by what’s known as a “separator”, which simply means a character, or number of characters, that appear in each row. We can see that for each row of data, between the end of the newspaper name and the start of the circulation figure, we have a left parenthesis. This is our separator.

Click on the down arrow at the top of Column 1 and click Edit Column > Split into several columns.

In the Separator box, type a left parenthesis: (

Underneath, in the Split into box, type 2, as we want to end up with 2 separate columns – one containing the newspaper name and the other containing the circulation figures.

Click OK.

Screen Shot 2013 05 06 at 18 50 08

The data has now been split into two separate columns and should look like this (handily, the left parenthesis has also been removed, meaning we don’t have to remove it later on):

Screen Shot 2013 05 06 at 19 29 09

 

Problem 2: The columns are incorrectly named

Renaming columns in Refine is easy: simply click on the down arrow at the top of the column and click Edit column > Rename this column. Call your columns “Newspaper” and “Circulation”.

 

Problem 3: The 1st row “Top 10 UK Daily Newspapers” isn’t needed

Refine doesn’t automatically let you delete individual rows, but there is a way around this; we need to create a text facet. A text facet is probably Refine’s most simple feature, and is similar to a filter in Excel.

To remove the top row, click on either the star or flag to the left of Top 10 UK Daily Newspapers, then click on the drop-down arrow for the column titled “All”. Click Facet and then either Facet by star or Facet by flag, depending on which one you chose initially. On the left of the window, you’ll now see your text facet, for which you have two options: True or False. Click True to display only the flagged/starred rows. Finally, to remove the row, click on the drop-down arrow for the All column again and select Edit rows > Remove all matching rows. Click the x next to your “Flagged Rows” facet on the left to return to the main view.


Problem 4: Unnecessary parentheses around each circulation figure

The left parenthesis was removed earlier when we split the first column, but the circulation figures are still followed by a right parenthesis. To remove this we need to write a GREL expression, which means a bit of coded instruction that performs a specific task.

Hit the drop-down arrow on the Circulation column, and select Edit Cells > Transform.

In the box that appears, write the following:

replace(value, “)”, “”)

Screen Shot 2013 05 06 at 20 02 26

This tells Refine to look for any occurrence of the right parenthesis ) character, and replace it with “nothing”, meaning it gets deleted. Notice the end result appears in the preview window before you hit OK. If your GREL expression isn’t working properly, make sure you haven’t used too few or too many spaces. Whitespace is this programming language’s worst enemy.

 

Problem 5: Numbers before each newspaper name

Our data is now starting to take shape, but before we can go ahead and export, we need to remove those unnecessary numbers that appear before each newspaper name.

Screen Shot 2013 05 06 at 20 08 03

In order to do this, we need to use another GREL expression, but this one tells Refine to look for any numbers in the column and remove them.

Click the drop-down arrow at the top of the Newspaper column and select Edit Cells Transform again. This time, we’re going to use this expression:

replace(value, /[0-9]/, “”)

The two forward slashes donate a “regular expression” (otherwise known as regex) that instructs Refine to find and recognise any matching string of characters, words, or numbers, depending on the pattern that the user wants to identify.

As we want to remove any number, the regex that we have used is /[0-9]/, which instructs refine to locate any number between 0 and 9. The second part of our GREL expression works as before, replacing these numbers with “nothing”, which results in them being deleted.

Note: If you’ve come across regex before, you can also use \d in place of [0-9] to remove the numbers, although be aware that this isn’t directly equivalent. \d matches both 0123456789 characters and other digit characters too, including Eastern Arabic numerals, for example.

Screen Shot 2013 05 06 at 20 24 56

Our data is now cleaned up and ready to export! To do this, simply click on Export in the top right hand corner of the Refine page, and select your choice of Tab-separated value (.tsv), Comma-separated value (.csv), HTML table, Excel spreadsheet (.xls) or ODF Spreadsheet (for use with Open Office). Open it up in your program of choice and you’re ready to go.

As I mentioned before, the dataset in this example isn’t especially messy or complex, but when you inevitably come across a large set of badly formatted data, experiment with some of the techniques outlined above to see if you can clean things up before moving over to Excel.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: