Tor Address: http://ladhzzr73kxkifxg.onion/
Open Web Address: https://ladhzzr73kxkifxg.tor2web.org/
The South African Press Association (SAPA) for years was the cornerstone of broad news coverage in South Africa. No single news organisation could afford the number of journalists required to cover our broad country, and the elegant solution of pooling resources meant that SAPA became the record-keeper of South African history.
On March 21 2015, SAPA closed down. No longer having the support of its founding organisations, themselves under pressure in an increasingly hostile business environment, SAPA was deemed unsustainable. Its assets, including the SAPA archives, were sold off to Sekunjalo Investment Holdings.
We were concerned about the archives of SAPA -- essentially an historic record of some of South Africa's greatest (and smallest) moments -- being unavailable to the media. Going further, we believe that such a valuable resource should be available publicly, not just to the media. And we were particularly concerned about Sekunjalo owning this history, since this organisation has proved to be hostile to a free press, and its own distinct business and political interests have clashed with the free media.
This is why we bring you SapaFiles. We believe that knowledge should be free, and that this particular archive is a nationally important trove of history that belongs in the public domain.
Unfortunately we don't have anywhere near a complete archive. In the crazy few days before SAPA closed, we frantically scraped their website, which was not the fastest thing in the world. We even made it fall over a few times.
Of the digital archive, we retrieved about half of the FOUR MILLION articles. This searchable archive contains 1,884,042 documents currently indexed.
We have most of the articles up until 2007, and most of those from 2015.
If you have a more complete set, please consider sharing it with us so that we can include it in this archive.
The online archives also only stretched back as far as 1998. If you have older digitised SAPA content, please let us know.
SAPA is not easy to scrape, and so some of the articles are missing headlines. And since there are so many of them, converting the HTML into something usable is a big task. Please don't send us reports of problems with single articles right now. We'll be refining the processor and rebuilding the index over time. In fact, that makes a nice segue into...
Well, apart from completing the collection, we're also planning on having some fun with this data, such as entity extraction, automatic and user tagging, and relationship mapping/networks.
We also managed to get some of the multimedia files, so expect those to be available soon too.
There's a lot we can do with this, but it's only been a month, and this is a side-project for most of us.
Let us know if there's something in particular you want.
Not from Sekunjalo you don't. Use at your own risk.
You can get hold of us at firstname.lastname@example.org. We promise to keep all correspondence strictly confidential, unless you're sending us take-down notices or other threats. Then we'll put that up on the site for ridicule.