Scraping data with the browser
Chrome's developer tools make it easy to scrape data from web pages. I'll demonstrate this by grabbing a list of ISO country codes and country names from Wikipedia.
Before we begin, some general tips for working with the console (OS X):
- Press up/down to navigate command history. This is great for iteratively building up a pipeline
- Press Alt + left/right to skip backwards/forwards one word
- Cmd + K to clear the console
- Ctrl + E to jump to the end of the line, Ctrl + A to jump to the start
- Ctrl + Enter to add a new line without running the current command
Finding the element to scrape
First off, we want to find the element that contains our data. Right click the data on the page and choose 'Inspect element', which highlights the containing element in the Elements panel.
Building a pipeline
Next we move to the console. Chrome places the element you inspected most recently in the `$0` variable, the next most recent in `$1`, and so on. It also aliases `querySelectorAll` as `$$(css [, startNode])`, so we can use these together to check out the rows in our table:
```
> $$("tr", $0)[0].innerHTML
"<td>Afghanistan</td><td>AF</td><td>AFG</td><td>004</td><td>ISO 3166-2:AF</td>"
```
Great! Looks like some useful data in that HTML. I normally slap `[0]` on the end of the pipeline to get this kind of insight into what's being produced.
To grab the data from the HTML in another one-liner we can use one of the Array additions: `map`. Unfortunately `$$` returns a `NodeList`, which is array-like but not an array. To work around this we grab `[].map` and `.call` it on the node list.
```
> [].map.call($$("tr", $0), mapper)[0]
```
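The `[].map.call` trick works because `map` only needs a `length` property and numeric indices on its `this` value. A minimal sketch outside the browser, using a plain object as a stand-in for a `NodeList` (the stand-in data and variable names are mine, not Chrome's):

```javascript
// NodeList is array-like: it has length and numeric indices, but no .map.
// Simulate one here; in DevTools you'd have the result of $$("tr", $0).
const nodeListLike = { 0: "a", 1: "b", 2: "c", length: 3 };

// Borrow Array.prototype.map and call it with the array-like as `this`:
const upper = [].map.call(nodeListLike, s => s.toUpperCase());
// upper is a real array: ["A", "B", "C"]

// In newer browsers, Array.from (ES2015) converts array-likes directly,
// and takes an optional mapping function:
const upper2 = Array.from(nodeListLike, s => s.toUpperCase());
```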
We'll need to define `mapper()`: a function that takes each element in turn and returns the data we want as a structured object. In this case we want the contents of the 1st and 2nd cells, so:
```
> function mapper(el) {
    return {
      country: $$("td", el)[0].innerText,
      code: $$("td", el)[1].innerText
    };
  }
> [].map.call($$("tr", $0), mapper)[0]
Object {country: "Afghanistan", code: "AF"}
```

Note that we use `$$` (`querySelectorAll`) rather than `$` (`querySelector`) here: `$` returns a single element, so indexing it with `[0]` wouldn't work.
Excellent, just what we want. Now we have an array full of tasty data, but how do we get it out? Chrome saves the day again with `copy()`, giving us the data-to-clipboard bridge we've always wanted. Since we want useful data we'll `JSON.stringify` it first, and our completed pipeline looks like:
```
> var data = [].map.call($$("tr", $0), mapper);
> copy(JSON.stringify(data))
```
What does our final data look like?
```
[{"country":"Afghanistan","code":"AF"},{"country":"Åland Islands","code":"AX"},/* ... */,{"country":"Zimbabwe","code":"ZW"}]
```
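Once it's on the clipboard, the JSON can be pasted into any script and parsed back into objects. As a sketch of how the scraped data might be consumed, here's a lookup table built from it; the two-entry `json` string below is a stand-in for the full clipboard contents:

```javascript
// Stand-in for the pasted clipboard contents (truncated to two entries):
const json = '[{"country":"Afghanistan","code":"AF"},{"country":"Zimbabwe","code":"ZW"}]';

// Parse back into an array of {country, code} objects:
const countries = JSON.parse(json);

// Build a code -> country name lookup table:
const byCode = {};
countries.forEach(c => { byCode[c.code] = c.country; });
// byCode.AF is "Afghanistan", byCode.ZW is "Zimbabwe"
```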
Here's the whole pipeline, built up bit by bit:
```
> var nodes = $$("tr", $0);
> function mapper(el) {
    return {
      country: $$("td", el)[0].innerText,
      code: $$("td", el)[1].innerText
    };
  }
> var data = [].map.call(nodes, mapper)
> copy(JSON.stringify(data))
```
Happy scraping!