Download all IATI data, lightning fast!

On Thursday, John Adams (IATI TAG chair) asked on IATI Discuss:

What’s the current recommended way to download the entire IATI dataset in XML? Separate files are OK.

By Friday, I’d made a Minimum Viable Product:

First iteration

By Monday, it looked a bit more polished:

Second iteration

Wait, what? But… Why?

IATI Data Dump provides a downloadable zip file of all XML data on the IATI registry, updated daily. While the raw XML is around 7 gigabytes, it compresses down to just 350 megabytes (a whopping 95% saving!). And (without getting too technical) because this takes one HTTP request instead of ~6,000 (one per dataset), it is muuuuch faster to download. So with a broadband internet connection, you can download the lot in under a minute.
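For the curious, here is a minimal Python sketch of the arithmetic behind that saving, plus one way you might peek inside the zip once you have it. The function names are made up for illustration, and I'm assuming the archive simply contains `.xml` files; nothing here is the actual IATI Data Dump code.

```python
import io
import zipfile


def saving_percent(raw_bytes: int, compressed_bytes: int) -> float:
    """Percentage of space saved by compressing raw_bytes down to compressed_bytes."""
    return 100 * (1 - compressed_bytes / raw_bytes)


def list_xml_members(zip_bytes: bytes) -> list[str]:
    """Return the names of the XML files inside an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]


# Figures quoted above: roughly 7 GB of raw XML, roughly 350 MB zipped.
print(round(saving_percent(7_000_000_000, 350_000_000)))  # 95
```

You would pass `list_xml_members` the bytes of the downloaded zip, fetched with whatever HTTP client you prefer.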

A raw data dump is a really basic requirement for a new IATI datastore. I.e.:

As an analyst,
I need access to all data on the IATI registry, unprocessed and unfiltered,
so that I can analyse it holistically in order to generate insights.

Or even:

As an IATI tool developer,
I need access to all data on the IATI registry, unprocessed and unfiltered,
so that I can process it before presenting it to a user.

It’s so basic, in fact, that most IATI tools and portals already implement it. d-portal does it (†), OIPA does it, the IATI Dashboard does it, the IATI Datastore does it. So at the moment, all of these tools (and lots more) visit the registry and make a list of every publisher and the location of every dataset for each publisher. Then each of them visits the servers of every publisher, downloading each dataset individually. None of these tools makes the unprocessed and unfiltered output available as a bulk download. So rather than duplicating the work, why not do it once and share?
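The "do it once and share" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not how any of the tools above actually work: `build_dump`, the `(publisher, dataset, url)` tuples, and the `publisher/dataset.xml` layout inside the zip are all hypothetical, and the dataset list is assumed to have already been gathered from the registry.

```python
import io
import zipfile
from typing import Callable, Iterable, Tuple


def build_dump(
    datasets: Iterable[Tuple[str, str, str]],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Fetch each (publisher, dataset, url) once and pack the results into one zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for publisher, dataset, url in datasets:
            try:
                xml = fetch(url)
            except OSError:
                # The publisher's server was unreachable; this sketch just skips it.
                continue
            # Hypothetical layout: one folder per publisher, one file per dataset.
            archive.writestr(f"{publisher}/{dataset}.xml", xml)
    return buffer.getvalue()
```

Passing `fetch` in as a parameter keeps the sketch testable offline. In real use it could be something like `lambda url: urllib.request.urlopen(url).read()`, whose network errors are subclasses of `OSError`.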

Hold on… Doesn’t this create a single point of failure, Andy?

How perceptive of you! Yes, that’s certainly true. But note that with the IATI Registry API, we already have a single point of failure (and indeed we’ve hit upon this problem recently). The difference here, though, is that we have a fallback option – downloading every dataset individually. IATI Data Dump just provides a speedy shortcut.

Is it finished?

It’s never finished! But you’re welcome to use it. This is intended more as an illustration of a feature that the proposed IATI datastore could provide.

In the short term, the big piece that’s missing is a clear log of what happened when fetching the data. Perhaps a publisher’s data is mysteriously missing from the zip. Where did it go? It’s likely their server had a problem and was unreachable. But this information should be made available somewhere. I’ve made a ticket for that; I’ll address it very soon.
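To make the idea concrete, here is a rough Python sketch of what such a fetch log might record. The ticket isn't implemented yet, so the function names and the log fields (`dataset`, `url`, `ok`, `error`) are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple, Union

LogEntry = Dict[str, Union[str, bool, Optional[str]]]


def fetch_with_log(
    datasets: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes],
) -> List[LogEntry]:
    """Try every (name, url) pair and record whether each fetch succeeded."""
    log: List[LogEntry] = []
    for name, url in datasets:
        entry: LogEntry = {"dataset": name, "url": url, "ok": True, "error": None}
        try:
            fetch(url)  # a real run would also store the downloaded XML
        except OSError as error:
            # Record why the dataset is missing instead of failing silently.
            entry["ok"] = False
            entry["error"] = str(error)
        log.append(entry)
    return log


def missing_datasets(log: List[LogEntry]) -> List[str]:
    """Names of datasets that could not be fetched."""
    return [str(entry["dataset"]) for entry in log if not entry["ok"]]
```

With a log like this published alongside the zip, "where did it go?" becomes a lookup rather than a mystery.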

†: In fact, d-portal previously relied on the IATI Datastore for this. But it didn’t scale well, so they switched to downloading the data directly.