husk.org / blog. chaff. occasional witterings.

2009-01-08

Getting My Vox Off

computing 18:12:11

For the last two and a half years or so, I've been posting far more to my blog on Six Apart's Vox site than I have here. However, I've decided to stop that and move back here.

One of the reasons I'm doing so is that I found it far too difficult to access my content programmatically. Initially, I just wanted better details of what I'd posted and when, so I could flesh out my 2008 statistics, but it's now turned into a (possibly quixotic) journey to get my data back. (It feels especially important in this era of closing web services.)

Why is it so hard? For a start, it's tricky to figure out what API Vox offers. A bit of searching led me to believe that the only way I'd get anywhere was using Atom.

If you thought that Atom was like RSS, well, it is. However, the name also covers the Atom Publishing Protocol, which is completely different, except for when it's not. Now an IETF standard, APP (as I shall henceforth refer to it) allows you to retrieve, post and edit your entries on a service that supports it. Well, in theory it does.

In practice, the Atom API responses I was getting contained only 20 entries and no paging information. This is obviously a killer when you're trying to retrieve a year's worth of data (or everything, for that matter). Annoyingly, it's clear that the underlying system knows how many posts there are: the total is in an openSearch XML element at the top of the response. I've found nothing to tell me how to use queries to get the other entries, but at least this response did contain pointers to authoritative feeds for each entry.
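
To make the mismatch concrete, here's a minimal sketch of the check, not the exact code I ran. It compares the advertised total with the number of entries actually returned. The sample XML is made up, and both the OpenSearch 1.1 namespace and the totalResults element name are assumptions; the real response may well use a different namespace URI.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Assumption: the OpenSearch 1.1 namespace. Check the xmlns declaration
# in the real response and adjust if it differs.
OPENSEARCH = "http://a9.com/-/spec/opensearch/1.1/"


def summarise_collection(xml_text):
    """Return (advertised total, number of entries actually present)."""
    root = ET.fromstring(xml_text)
    total = root.findtext(f"{{{OPENSEARCH}}}totalResults")
    entries = root.findall(f"{{{ATOM}}}entry")
    return (int(total) if total is not None else None, len(entries))


# Invented sample data, shaped like a feed that knows more than it returns.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/">
  <openSearch:totalResults>412</openSearch:totalResults>
  <entry><id>tag:example.invalid,2008:entry-1</id></entry>
  <entry><id>tag:example.invalid,2008:entry-2</id></entry>
</feed>"""

total, returned = summarise_collection(SAMPLE)
print(f"feed claims {total} entries, response holds {returned}")
```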

In contrast, the public Atom feeds do contain straightforward paging, so I ended up falling back to these to determine Vox's internal IDs for each entry. Naturally, these feeds didn't contain links to the APP single-post feeds; instead, their entries had additional cruft, like "Email this" links at the end of each one.
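
Walking a paged feed looks roughly like the sketch below. I'm assuming the paging is exposed as an Atom link element with rel="next" (the usual convention), and the page names and IDs in the demo are invented. The fetch argument is a stand-in for whatever actually retrieves a page, so the example runs without touching the network.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def collect_entry_ids(start_url, fetch):
    """Follow rel="next" links from start_url, gathering every entry <id>.

    fetch is any callable that takes a URL and returns the feed XML.
    """
    ids, url, seen = [], start_url, set()
    while url and url not in seen:  # the seen set guards against loops
        seen.add(url)
        root = ET.fromstring(fetch(url))
        for entry in root.findall(f"{{{ATOM}}}entry"):
            entry_id = entry.findtext(f"{{{ATOM}}}id")
            if entry_id:
                ids.append(entry_id)
        # Assumption: the next page is advertised as <link rel="next">.
        url = next(
            (link.get("href")
             for link in root.findall(f"{{{ATOM}}}link")
             if link.get("rel") == "next"),
            None,
        )
    return ids


# Two invented pages standing in for the real feed.
PAGES = {
    "page1": '<feed xmlns="http://www.w3.org/2005/Atom">'
             '<link rel="next" href="page2"/>'
             '<entry><id>post-a</id></entry>'
             '<entry><id>post-b</id></entry></feed>',
    "page2": '<feed xmlns="http://www.w3.org/2005/Atom">'
             '<entry><id>post-c</id></entry></feed>',
}

print(collect_entry_ids("page1", PAGES.__getitem__))
```

Against a live feed, fetch could be something like `lambda u: urllib.request.urlopen(u).read()`, since `ET.fromstring` accepts bytes as well as text.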

The upshot was that I was able to get all my posts out of the site by paging through the Atom syndication feeds to get the IDs, which I then fed to the APP API to fetch each post as a lump of XML. This turns out to be entirely useless for blog import purposes, but it'll do as an archive (and I can always work on convertors later).
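
The archiving step itself is simple enough to sketch. Given a list of IDs, it fetches each post and writes it to its own file. The fetch_entry callable is hypothetical: it stands in for the authenticated APP request, whose URL pattern I'm deliberately not guessing at here, and the ID in the demo is invented.

```python
import re
import tempfile
from pathlib import Path


def archive_entries(entry_ids, fetch_entry, out_dir):
    """Save the XML for each entry ID as its own file under out_dir.

    fetch_entry is a hypothetical callable: given an ID, it returns
    that post's XML as a string.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for entry_id in entry_ids:
        # IDs may contain colons, commas or slashes; make them filename-safe.
        safe = re.sub(r"[^A-Za-z0-9._-]+", "_", entry_id)
        path = out / f"{safe}.xml"
        path.write_text(fetch_entry(entry_id), encoding="utf-8")
        written.append(path)
    return written


# Demo with a fake fetcher and a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = archive_entries(
        ["tag:example.invalid,2008:a1"],
        lambda entry_id: f"<entry><id>{entry_id}</id></entry>",
        tmp,
    )
    print([p.name for p in paths])
```

Keeping one file per post means a failed fetch only costs that post, and the raw XML is still there whenever I get round to writing a converter.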

However, posts are a fairly small part of my content. The APP content list also contained pointers to embedded resources and comments, but as yet I haven't worked hard at fetching these. I'm even further from copying down my entire library, which contains books, videos and other items that aren't used directly in posts.

So, what's the conclusion? Vox does have a way of getting your core data out, but it's almost entirely undocumented and seems inconsistent. In the end I had to rely on two different chunks of Atom XML to get my data, and even then it was merely a subset of my content. All in all, I found it too frustrating by far.
