octalzeroes

HTML scraping credits at DigitalOcean

published on 06 Mar 2014, tagged with sed awk xpath ruby

My website is hosted on a VPS provided by DigitalOcean (warning: ref link) and so far I'm very pleased. The website looks great and the control panel is very easy to navigate. New orders are actually deployed in less than a minute and they even have an API for your droplets. They're even kind enough to provide hosting for my domain! While all of those are great features, the one thing I miss is the ability to monitor my credits. I'm hoping that one day this will be part of the API, but until then, let's scrape some HTML!

I am going to demonstrate a couple of different approaches that can be used in order to extract the credit currently on your account. First we are going to have a look at the page at the following URL:

  • https://cloud.digitalocean.com/billing

At the time of writing, the page displays balance & usage and a billing history. Here is the relevant part of the HTML, including the credit.

<h2 class='section-header balance'>Balance & Usage</h2>
<div id='account_balance'>
<h3 class='credit'>
$4.80
<small>You have credit</small>
</h3>
</div>
<div id='current_charges'>

First we need a way to download the document. If you use Firefox, the Firebug add-on can be used: have a look in the Net tab, right-click the request and choose Copy as cURL.

The copied command passes an excessive number of arguments to curl. I stripped mine down a bit and ended up with the following.

% curl -s \
  -A 'Mozilla/5.0 ..' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -b '_digitalocean2_session_v2=[long_string]' \
  https://cloud.digitalocean.com/billing
  • -s, silent, don't show the progress meter
  • -A, the User-Agent string sent with the request
  • -H (--header), leaving this out would result in a redirect to the login page
  • -b, the cookie used to identify our account. Since this is passed as an argument to the program it will be visible to other users (for example through ps). If you're on a system with other users, consider passing a cookie file in the Netscape/Mozilla format instead
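To keep the session cookie off the command line, it can live in a file that only you can read. What follows is a minimal sketch and not something taken from DigitalOcean: the function names write_cookie_file and fetch_billing are made up, and I am assuming the cookie is scoped to cloud.digitalocean.com and marked secure. The '..' and [long_string] placeholders are left exactly as above.

```shell
# Write a one-cookie file in the Netscape format (7 tab-separated fields).
# $1 = path of the cookie file (hypothetical, pick your own)
# $2 = value of the session cookie copied from the browser
write_cookie_file() {
  (
    umask 077                      # a newly created file is owner-only
    {
      printf '# Netscape HTTP Cookie File\n'
      # domain, include-subdomains, path, secure, expiry (0 = session), name, value
      # (domain and secure flag are assumptions, adjust to your cookie)
      printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        cloud.digitalocean.com FALSE / TRUE 0 _digitalocean2_session_v2 "$2"
    } > "$1"
  )
  chmod 600 "$1"                   # also tighten a file that already existed
}

# Same request as before, but -b now names the file instead of the cookie.
# $1 = path of the cookie file
fetch_billing() {
  curl -s \
    -A 'Mozilla/5.0 ..' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -b "$1" \
    https://cloud.digitalocean.com/billing
}
```

curl treats the argument to -b as a file name when it contains no = character, so something like write_cookie_file ~/.do_cookie '[long_string]' followed by fetch_billing ~/.do_cookie should behave like the original command. The value still passes through your shell history when typed this way, so you may prefer to create the file in an editor.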

Unless a page is returned with a message about a redirect, we have successfully received our billing page. If you'd rather just save an offline copy of the file, this can be accomplished either by passing the -o argument followed by a filename or by redirecting the standard output to a file.

Now that we've successfully received the page, it can be passed down the shell pipeline to any of the following contestants:

sed

sed -n "/class='credit'/{n;p}"
  • -n, suppress printing of lines (we only want the one)
  • /class='credit'/, match lines containing the pattern class='credit'
    • {n;p}, read the next line into the pattern space and print
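To try the expression without touching the network, the HTML fragment from earlier can be fed in through a here-document. This is just a small sketch; the function names sample_html and credit_sed are made up for the example.

```shell
# Print the relevant part of the billing page shown earlier.
# The quoted 'EOF' keeps the shell from expanding $4 inside the document.
sample_html() {
  cat <<'EOF'
<h2 class='section-header balance'>Balance & Usage</h2>
<div id='account_balance'>
<h3 class='credit'>
$4.80
<small>You have credit</small>
</h3>
</div>
EOF
}

# The sed expression from above, wrapped so it can sit at the end of a pipeline.
credit_sed() {
  sed -n "/class='credit'/{n;p}"
}

sample_html | credit_sed   # prints $4.80
```

This relies on the amount sitting on its own line directly after the h3 tag. If the markup ever changes, the expression will silently print something else or nothing at all.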

awk

awk "/class='credit'/ { getline; print }"
  • /class='credit'/, match lines containing the pattern class='credit'
    • { getline; print }, move to the line following the pattern and print
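Since awk can edit the line before printing it, the leading $ can be dropped in the same step, which is handy if you want to compare the number later. A sketch under the same assumption about the markup; the function name credit_awk is made up, and the backslashes are only there because the program sits inside double quotes.

```shell
# Print the line after the h3 tag with leading whitespace and the $ removed.
# Inside the double quotes, \$ becomes $ and \" becomes " before awk sees them.
credit_awk() {
  awk "/class='credit'/ { getline; sub(/^[ \t]*[\$]/, \"\"); print }"
}

# Feed in the interesting lines of the fragment, one argument per line.
printf '%s\n' "<h3 class='credit'>" '$4.80' '<small>You have credit</small>' '</h3>' |
  credit_awk   # prints 4.80
```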

xmllint

This will actually parse the document which could be considered the proper way to solve this problem. The program can be found in the libxml2 package.

xmllint --html --xpath 'normalize-space(//*[contains(@class, "credit")]/text())' - 2>/dev/null
  • --html, parse as html, not xml
  • --xpath, the path to the element containing the credits
    • //*[contains(@class, "credit")]/text(), look for an element containing the class credit and get the inner text. the call to text() is necessary in order to avoid the contents of the <small> tag
    • normalize-space(), this will strip the newlines from the output
  • -, process standard input since the document will be received through the pipe
  • 2>/dev/null, the program will have some objections to the html so let's get rid of the output on stderr

ruby

As a Ruby enthusiast I am no stranger to Nokogiri. Nokogiri knows XPath, but to mix things up I will use a CSS selector.

require 'nokogiri'

doc = Nokogiri::HTML($stdin.read)
puts doc.css('h3.credit').children.first.text.strip
  • require the library and parse standard input to doc which will be an instance of the Nokogiri::HTML::Document class
  • doc.css('h3.credit'), find every h3 tag with a class of credit (there is only one on this page)
    • get the children nodes
    • we want the first node
    • output the text
    • strip the surrounding whitespace, including newlines

Conclusion

This is the bare minimum of code required. Things could go wrong along the way, or you might want to perform a task based on the amount of credit you currently have in your account. Below is a short script that will alert you when you're below $5 in credit.

#!/bin/zsh

credit=$(curl -s \
  -A 'Mozilla/5.0 ..' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -b '_digitalocean2_session_v2=[long_string]' \
  https://cloud.digitalocean.com/billing | sed -n "/class='credit'/{n;p}")

[[ ${credit:1} -lt 5 ]] && echo 'less than $5 on your account'
  • store credit in the variable credit
  • perform a test (the POSIX-compliant [ won't work because the value is floating point)
    • strip the leading $ character from the value of the variable credit
    • if credit is less than 5, continue to echo (the message is single-quoted so that $5 is printed literally)
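If the cookie has expired or the markup has changed, the variable will be empty or contain something that isn't a number, and the test above will not tell you. Here is a sketch of a slightly more defensive check that also works in a plain POSIX shell by letting awk do the floating point comparison. The function name below_threshold and its exit codes are my own invention.

```shell
# Succeed (exit 0) when the credit is below the threshold.
# $1 = credit as scraped, e.g. '$4.80'
# $2 = threshold, e.g. 5
# Returns 1 when the credit is at or above the threshold,
# and 2 when the scraped value does not look like a number.
below_threshold() {
  value=${1#\$}                    # drop a leading $ if present
  case $value in
    ''|*[!0-9.]*)                  # empty, or anything but digits and dots
      echo "could not parse credit: '$1'" >&2
      return 2
      ;;
  esac
  # awk compares the two values as floating point numbers
  awk -v v="$value" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

if below_threshold '$4.80' 5; then
  echo 'less than $5 on your account'
fi
```

In the script above you would pass "$credit" as the first argument. The number check is deliberately loose (it would let 1.2.3 through), so treat it as a guard against obviously broken input and nothing more.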