
Scraping

In our first session with Paul Bradshaw, we scraped the website horsedeathwatch.com into a Google spreadsheet.

If you haven’t signed up already, register with Google Drive. Once you’re done, log in, click on the large red “Create” button in the left sidebar and select “Spreadsheet”.

You’ll then be taken to your blank spreadsheet. Cell A1 should automatically have been selected. If not, click on it.

Next, go to horsedeathwatch.com and copy the URL. We’re going to use Google Docs’ ability to scrape raw data from HTML tables into your newly created spreadsheet.

In cell A1, type:

=importHTML("

Paste the URL of the website you’re scraping (in our case http://www.horsedeathwatch.com) and close the quotes. Your cell should now contain:

=importHTML("http://www.horsedeathwatch.com"

Next, put a comma after the quotation marks and type "table" (including the quotes).

Finally, put another comma and type "1") (including the quotes and the closing bracket). The final formula should look like this:

=importHTML("http://www.horsedeathwatch.com","table","1")
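
For reference, the general pattern of the function (as best I understand it) is:

=importHTML(url, query, index)

Here url, query and index are placeholders rather than literal things to type: query is either “table” or “list”, and index is the position of the table or list on the page, counting from 1. That’s why we typed “1”, to grab the first table on the page.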

Press enter, and the data will be pulled into your spreadsheet.

It’s worth copying this data into a new sheet before saving the document, because at the moment your scrape is tied to a live website. If the website changes, your data changes, so if you’ll be using this data for a particular story you’ll want to make sure it’s safe.

To do this, select all the data (cmd+A on Macs, Ctrl+A on Windows) and copy it (cmd+C or Ctrl+C, or click on Edit, then Copy). Then open a new sheet by clicking on the plus button in the bottom left corner. Click on cell A1 of the new sheet, click Edit, then Paste special, then Paste values only.

Some other “import” functions include:

=importdata (grabs data from a .txt or .csv file)

=importfeed (grabs data from an RSS feed)

=importxml (grabs data from an XML file, using an XPath query to say which parts you want).
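
To give a flavour of how these look in practice, here are some sketches; the URLs are made-up placeholders, not real data sources:

=importdata("http://example.com/mydata.csv")

=importfeed("http://example.com/feed.rss")

=importxml("http://example.com/mydata.xml", "//title")

The second argument to =importxml is the XPath query; “//title” would pull every <title> element out of the file.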

(Image courtesy of wickes.co.uk)

It’s Sunday night. I have had four days to ruminate on what I learned in my first data journalism session with Paul Bradshaw, who runs the Online Journalism Blog and can be followed at @paulbradshaw.

By lunch I was ready to throw my computer out of the window and quickly follow suit, but in the afternoon something wonderful happened – I made a scraper!

“What’s a scraper?” I hear all one of you shout. A scraper is a small computer program that pulls bits of information from a website and puts them into a nice, neat table for you.

Using the Olympic Games website as an example, if you wanted to collect data (name, age, country, sport and result, for example) on each and every athlete that competed in this year’s games, it would take a long, long time. Longer than a Leonard Cohen song, even (© Armando Iannucci).

Almost 11,000 athletes from 203 countries competed and each of them has a unique page on the london2012.com site. That’s a lot of browsing.

A scraper cuts out the slow middle man – you – and quickly pulls the required information into a Google spreadsheet or similar program.

In my next post I’ll cover how we created a simple scraper that collated data from horsedeathwatch.com. Lovely.