a wayback machine (pywb) on a cheap, shared host
For a long time the only free (I’m unaware of commercial ones) web archive replay software was the Wayback Machine (now OpenWayback). It’s stable and mature software, with a strong community behind it.
To use it you need to be comfortable deploying a Java web application; that is not so difficult, and the documentation is exhaustive.
But there is a new player in the game: pywb, developed by Ilya Kreymer, a former Internet Archive developer.
It is built in Python, is simpler than Wayback, and is now used in a professional archiving project at Rhizome.
Pywb’s simplicity and clean design make it very easy to deploy, even on shared hosts.
Nowadays it seems that no one uses shared hosting anymore: virtual servers are often cheaper, and with dozens of orchestration and provisioning tools it’s even easier to bootstrap a full machine.
Despite this, I still prefer a shared host when the application stack allows it: fewer things to worry about.
So I tried to install pywb on DreamHost, a well-known cheap provider that offers deployment of Ruby/Rack and Python/WSGI applications via mod_passenger. In a few minutes I can have my own Wayback Machine.
The following steps are specific to DreamHost, but you should be able to replicate this installation on any shared host that supports deploying Python apps (FastCGI, uWSGI, Passenger).
add a new domain in your dreamhost panel, and set document root in
tmp directory for passenger,
warcs to store warc files and
- install pywb via pip (inside the virtualenv)
- edit pywb config file
- edit passenger startup file
- put some warc files in ~/wayback/warcs and generate a sorted cdx
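The Passenger startup file from the steps above might look roughly like the following minimal sketch. It assumes the virtualenv lives in ~/wayback/venv (a hypothetical path) and that pywb exposes its WSGI callable as pywb.apps.wayback.application; check both against your own installation. The fallback app is only there to make an import problem visible in the browser instead of a bare Passenger error.

```python
# passenger_wsgi.py: a sketch under the assumptions stated above.
import os
import sys

# Assumption: the virtualenv was created in ~/wayback/venv (hypothetical path).
VENV_PYTHON = os.path.expanduser("~/wayback/venv/bin/python")

# Passenger starts this file with the system Python; re-exec it under the
# virtualenv interpreter so that the pip-installed pywb becomes importable.
if os.path.exists(VENV_PYTHON) and sys.executable != VENV_PYTHON:
    os.execl(VENV_PYTHON, VENV_PYTHON, *sys.argv)

try:
    # Assumption: pywb ships its WSGI callable in this module.
    from pywb.apps.wayback import application
except ImportError as exc:
    _error = "pywb could not be imported: {}".format(exc)

    def application(environ, start_response):
        """Fallback WSGI app that reports the import problem as plain text."""
        body = _error.encode("utf-8")
        start_response("500 Internal Server Error", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
```

Passenger only needs a module-level name called `application`, so whichever branch runs, the file stays a valid WSGI entry point.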
Try searching the /test collection for the URL http://twitter.com/atomotic; you’ll get these results.
And Memento is also available:
$ curl "http://wayback.literarymachin.es/test/timemap/*/twitter.com/atomotic"
<http://wayback.literarymachin.es/test/timemap/*/http://twitter.com/atomotic>; rel="self"; type="application/link-format"; from="Wed, 22 Oct 2014 16:30:30 GMT",
<http://twitter.com/atomotic>; rel="original",
<http://wayback.literarymachin.es/test/http://twitter.com/atomotic>; rel="timegate",
<http://wayback.literarymachin.es/test/20141022163030/http://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:30 GMT",
<http://wayback.literarymachin.es/test/20141022163031/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:31 GMT",
<http://wayback.literarymachin.es/test/20141022163042/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:42 GMT",
<http://wayback.literarymachin.es/test/20141022163355/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:33:55 GMT",
<http://wayback.literarymachin.es/test/20141022163710/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:37:10 GMT"
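A TimeMap like the one above is plain application/link-format text, so a client can read it with very little code. The following is a minimal sketch using only the Python standard library; the function names `parse_timemap` and `mementos` are made up for illustration, and the regular expressions only cover the simple quoted-parameter form shown in the response, not every variation the link format allows.

```python
import re
from email.utils import parsedate_to_datetime

# One link: <url> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (url, params) pairs, one per link in the TimeMap."""
    links = []
    for match in LINK_RE.finditer(text):
        url, raw_params = match.groups()
        links.append((url, dict(PARAM_RE.findall(raw_params))))
    return links


def mementos(text):
    """Return (datetime, url) pairs for the memento links, oldest first."""
    found = []
    for url, params in parse_timemap(text):
        if "memento" in params.get("rel", "").split():
            found.append((parsedate_to_datetime(params["datetime"]), url))
    return sorted(found)
```

Feeding the body of the curl response to `mementos()` yields one (datetime, url) pair per capture, which is a convenient starting point for listing or comparing snapshots.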
Why do I think this will be useful? Archiveteam does a great job of running a distributed crawling organization, but publishing is still centralized at the Internet Archive. What if we began to publish thousands of small web archives, aggregating them with the Memento protocol?