Parsing XML documents in the shell

published on 26 Sep 2014, tagged with xml zsh gaming steam awesomenauts

As previous posts might have already implied, I'm a gamer. That's why a lot of the work I do is directly related to the games I play and the data that revolves around them. For this article I've decided to give a little insight into Steam leaderboards, specifically for a game known as Awesomenauts.

Locating the data

The data for Steam leaderboards is available at the following url:

http://steamcommunity.com/stats/<appid>/leaderboards/?xml=1

Since we will examine the leaderboard data for Awesomenauts we will replace <appid> with 204300.

Parsing

We also need a program that can parse xml. I am going to use xmlstarlet. The examples below invoke it as xml; some distributions install the binary as xmlstarlet instead. xmlstarlet has many different commands you can use for extracting and manipulating data. The ones we will take a closer look at are sel (select) and el (elements). Starting with the latter, it can be used to list the elements that reside within the document. Let's find all unique elements:

% wget -qO- 'http://steamcommunity.com/stats/204300/leaderboards/?xml=1' | iconv -f iso-8859-15 -t utf-8 | xml el -u

xmlstarlet was not pleased with the encoding of the document, hence the use of iconv to convert it to utf-8.

The above command will return the following:

response
response/appFriendlyName
response/appID
response/leaderboard
response/leaderboard/display_name
response/leaderboard/displaytype
response/leaderboard/entries
response/leaderboard/lbid
response/leaderboard/name
response/leaderboard/onlyfriendsreads
response/leaderboard/onlytrustedwrites
response/leaderboard/sortmethod
response/leaderboard/url
response/leaderboardCount
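
A side note on the iconv step used above. Here is a minimal sketch of what the conversion does to a single non-ASCII byte. It assumes the input really is iso-8859-15, as in the command above; latin9_to_utf8 is a hypothetical helper name and the sample word is made up for illustration.

```shell
# Hypothetical helper wrapping the same iconv call used in the pipeline.
latin9_to_utf8() {
  iconv -f iso-8859-15 -t utf-8
}

# \351 (0xE9) is "é" in iso-8859-15. As a lone byte it is not valid utf-8,
# so a strict parser would reject it. iconv rewrites it as the two-byte
# utf-8 sequence 0xC3 0xA9.
printf 'Caf\351\n' | latin9_to_utf8
```

Piping the result through od -An -tx1 makes the byte-level change visible.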

Let's jump ahead a bit. What we want is the url where name starts with PLAYERRANK. To find these we can make use of the XPath function called starts-with().

% wget -qO- 'http://steamcommunity.com/stats/204300/leaderboards/?xml=1' | iconv -f iso-8859-15 -t utf-8 | \
  xml sel -t -v "response/leaderboard[starts-with(name, 'PLAYERRANK')]/url"

Quick overview of the arguments passed to xml:

-t (--template) starts a template; the options that follow it define what the template outputs
-v (--value-of) prints the value of the XPath expression
"response/leaderboard[starts-with(name, 'PLAYERRANK')]/url", following the path response/leaderboard, we retrieve the url of every leaderboard whose name starts with PLAYERRANK

Example output:

http://steamcommunity.com/stats/204300/leaderboards/89564/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/145095/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/331874/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/397491/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/483346/?xml=1

Extending

We now have the urls of the leaderboards representing each season. Unfortunately they are not listed in order, but we catch a break as the leaderboard id seems to increase with each season. That means all we need to do is sort the list by leaderboard id:

% wget -qO- 'http://steamcommunity.com/stats/204300/leaderboards/?xml=1' | iconv -f iso-8859-15 -t utf-8 | \
  xml sel -t -v "response/leaderboard[starts-with(name, 'PLAYERRANK')]/url" | \
  sed 's|.*leaderboards/\([^/]*\).*|\1|' | sort -n

sed extracts the id from each url (/leaderboards/<id>/?xml=1) and sort -n orders the ids numerically
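
The extract-and-sort stage can be tried without touching the network. This is only a sketch: leaderboard_ids is a hypothetical helper name, and the input is the example output from earlier, deliberately shuffled.

```shell
# Hypothetical helper: reads leaderboard urls on stdin, prints the ids
# in ascending numeric order.
leaderboard_ids() {
  sed 's|.*leaderboards/\([^/]*\).*|\1|' | sort -n
}

# The urls from the example output above, in shuffled order.
leaderboard_ids <<'EOF'
http://steamcommunity.com/stats/204300/leaderboards/397491/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/89564/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/483346/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/145095/?xml=1
http://steamcommunity.com/stats/204300/leaderboards/331874/?xml=1
EOF
```

Note that sort -n matters here: a plain sort would compare the ids as text and put 145095 before 89564. The helper only isolates the last two stages; the full pipeline above is unchanged.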

While we're at it, let's declare this a function:

% function awsmboards() { ... code from above ... }

That will make what I'm about to do way more readable. I want to create an array of season ids in zsh:

% seasons=(${(f)"$(awsmboards)"})

So, what just happened? We performed parameter expansion within an array declaration, calling the previously defined function through command substitution. The parameter expansion flag f splits the result at each newline, meaning each line of the output becomes its own element within the array.

Let's print the first 5 elements of the seasons array:

% echo ${seasons:0:5}
89564 145095 145658 165967 167738

At the time of writing we're at the end of season 11. Awesomenauts' very first season was season 0, and since zsh array indexes start at 1, the id of season 11 should be found at ${seasons[12]}. Let's have a look at the elements of this season's document:

% wget -qO- "http://steamcommunity.com/stats/204300/leaderboards/${seasons[12]}/?xml=1" | xml el -u
response
response/appFriendlyName
response/appID
response/entries
response/entries/entry
response/entries/entry/details
response/entries/entry/rank
response/entries/entry/score
response/entries/entry/steamid
response/entries/entry/ugcid
response/entryEnd
response/entryStart
response/leaderboardID
response/nextRequestURL
response/resultCount
response/totalLeaderboardEntries

It is fairly straightforward what each of these elements means. Let's write a function that returns the rank of a specific steamid for this season:

function awsmrank() {
  wget -qO- "http://steamcommunity.com/stats/204300/leaderboards/${seasons[12]}/?xml=1&steamid=$1" | \
  xml sel -t -v "response/entries/entry[steamid='$1']/rank" -n
}

What's new here is an additional parameter in the url that specifies the user whose rank we are interested in. The returned document will also include all of that user's friends who are ranked on the leaderboard. This is why the XPath expression specifically asks for the rank of the entry whose steamid matches the argument passed to the function.

Let's try out our new function:

% awsmrank 12345678901234567
125

Another thing worth mentioning is the value at response/entries/entry/details. Let's have a look at the details belonging to the user at the top of the leaderboard:

% wget -qO- "http://steamcommunity.com/stats/204300/leaderboards/${seasons[12]}/?xml=1&end=1" | xml sel -t -v "response/entries/entry/details" -n
0200000076080000a0050000a50e0000710400000a000000060000008f0100005c0000000800000000000000

What we are looking at is most likely information that is displayed on the leaderboard but doesn't fit into any of the other elements. This would be stats such as wins, losses and favourite naut.

% details=0200000076080000a0050000a50e0000710400000a000000060000008f0100005c0000000800000000000000
% echo $(( ${#details} % 8 ))
0
% for (( i=0;i<${#details};i+=8 )); do sub=${details:$i:8}; echo $sub; done
02000000
76080000
a0050000
a50e0000
71040000
0a000000
06000000
8f010000
5c000000
08000000
00000000

We assign the value to a variable
Let's assume every piece of information is 8 characters long; the length being divisible by 8 supports this
Print each part on its own line

Seems we have 11 values. What are they? Having a look at the ApplicationPersistent.log file located in the Awesomenauts directory reveals the following:

ColumnType:   LCT_ENTRY_VERSION   
ColumnType:   LCT_WINS   
ColumnType:   LCT_LOSSES   
ColumnType:   LCT_KILLS   
ColumnType:   LCT_DEATHS   
ColumnType:   LCT_PRESTIGE_LEVEL   
ColumnType:   LCT_FAVORITE_CLASS_INDEX   
ColumnType:   LCT_SEASON_WINS   
ColumnType:   LCT_SEASON_LOSSES   
ColumnType:   LCT_PRESTIGE_ICON_CONTRIBUTION_PRIMARY   
ColumnType:   LCT_PRESTIGE_ICON_CONTRIBUTION_SECONDARY

Let's try converting the hexadecimal values:

% for (( i=0;i<${#details};i+=8 )); do 
    sub=${details:$i:8}
    echo $(( 16#${sub:6:2}${sub:4:2}${sub:2:2}${sub:0:2} ))
  done
2
2166
1440
3749
1137
10
6
399
92
8
0

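loop over the length of the variable $details in 8 character increments
assign each 8 character chunk to the variable sub
perform arithmetic evaluation, where the 16# prefix lets the shell know it's a hexadecimal number
change endianness (reverse the byte order), since each value is stored with its least significant byte first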

A quick look at the leaderboard confirms that this is in fact correct.
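
To make the result easier to read, the column names and the decoded values can be printed side by side. The following is a minimal sketch, assuming the fields appear in the same order as the ColumnType lines in the log file; decode_details is a hypothetical helper name. It sticks to syntax that zsh and bash share.

```shell
# Hypothetical helper: prints "<column name> <decimal value>" for each
# 8 character field of a details string.
decode_details() {
  local details=$1 offset=0 name chunk
  # Assumption: field order matches the ColumnType order in the log file.
  for name in LCT_ENTRY_VERSION LCT_WINS LCT_LOSSES LCT_KILLS LCT_DEATHS \
      LCT_PRESTIGE_LEVEL LCT_FAVORITE_CLASS_INDEX LCT_SEASON_WINS \
      LCT_SEASON_LOSSES LCT_PRESTIGE_ICON_CONTRIBUTION_PRIMARY \
      LCT_PRESTIGE_ICON_CONTRIBUTION_SECONDARY; do
    chunk=${details:$offset:8}
    # Reverse the four bytes (little-endian), then read them as base 16.
    echo "$name $(( 16#${chunk:6:2}${chunk:4:2}${chunk:2:2}${chunk:0:2} ))"
    offset=$(( offset + 8 ))
  done
}

decode_details 0200000076080000a0050000a50e0000710400000a000000060000008f0100005c0000000800000000000000
```

For the details string above, the second line of output is LCT_WINS 2166 and the seventh is LCT_FAVORITE_CLASS_INDEX 6, matching the values decoded earlier.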

Most of these columns don't need much explanation except LCT_FAVORITE_CLASS_INDEX and the ones regarding CONTRIBUTION. I'm more interested in finding out which nauts people use, so I've compiled a list that maps each class index to a naut:

arr[1]=froggy
arr[2]=lonestar
arr[5]=leon
arr[6]=clunk
arr[3]=voltar
arr[12]=gnaw
arr[8]=coco
arr[9]=skolldir
arr[4]=yuri
arr[11]=rae
arr[7]=derpl
arr[16]=vinnie
arr[18]=genji
arr[14]=ayla
arr[20]=swiggins
arr[21]=mcpain
arr[19]=penny
arr[22]=sentry
arr[23]=skree

With this information we could write yet another function. Let's find out what the top nauts are in League 1:

function nauts250() { 
  wget -qO- "http://steamcommunity.com/stats/204300/leaderboards/${seasons[12]}/?xml=1&end=251" | \
  xml sel -t -v 'response/entries/entry/details' | \
  while read details; do 
    echo $arr[$(( 16#${details:48:2} ))]
  done | sort | uniq -c | sort -nr 
}

mostly the same as before, we change the end parameter in the url to only retrieve the top of the leaderboard (the first 250 entries)
this time around we're only interested in the byte containing the naut, and we know its position within the string (offset 48, the start of the seventh value)
sort the names, count the occurrences of each with uniq -c, then sort by count in descending order

Trying it out:

% nauts250
     29 leon
     29 froggy
     28 lonestar
     21 coco
     19 skolldir
     14 vinnie
     14 ayla
     12 rae
     11 skree
     11 penny
     10 clunk
      9 yuri
      9 mcpain
      8 genji
      7 swiggins
      7 sentry
      6 gnaw
      4 voltar
      2 derpl
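
The counting stage of nauts250 can also be tried offline. This is a sketch under stated assumptions: fake_details and tally_nauts are hypothetical helpers, the input is made up for illustration, and only three entries of the naut list are repeated to keep it short. It is written so that it runs in both zsh and bash.

```shell
# Three entries from the index-to-naut list above.
arr[1]=froggy
arr[5]=leon
arr[6]=clunk

# Hypothetical helper: builds an 88 character details string where only
# the favourite class field (starting at offset 48) is set. Every other
# field is zero, so this is NOT real leaderboard data.
fake_details() {
  printf '%048d%02x000000%032d\n' 0 "$1" 0
}

# Same counting stage as nauts250, reading details strings from stdin.
tally_nauts() {
  local details
  while read -r details; do
    echo "${arr[$(( 16#${details:48:2} ))]}"
  done | sort | uniq -c | sort -nr
}

# Made-up sample: three leon (5), two froggy (1), one clunk (6).
for idx in 5 1 5 6 1 5; do fake_details "$idx"; done | tally_nauts
```

This should list leon with a count of 3, then froggy with 2, then clunk with 1. The width of the padding in front of the counts depends on the uniq implementation.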