A wayback machine (pywb) on a cheap, shared host
2014-10-23 tags: webarchiving web preservation python
For a long time the only free implementation of web archive replay software (I'm not aware of any commercial ones) has been the Wayback Machine (now OpenWayback). It is stable, mature software with a strong community behind it.
To use it you need to be comfortable deploying a Java web application; this is not so difficult, and the documentation is exhaustive.
But there is a new player in the game: pywb, developed by Ilya Kreymer, a former Internet Archive developer.
It is built in Python, is simpler than Wayback, and is now used in a professional archiving project at Rhizome.
Pywb's simplicity and clean design make it very easy to deploy, even on shared hosts.
Nowadays it seems that hardly anyone uses shared hosting anymore: virtual servers are often cheaper, and with dozens of orchestration and provisioning tools it's even easier to bootstrap a full machine.
Despite this, I still prefer a shared host when the application stack allows it: fewer things to worry about.
So I tried to install pywb on DreamHost, a well-known cheap provider that offers deployment of Ruby/Rack and Python/WSGI applications via Passenger (mod_passenger). In a few minutes I can have my own wayback machine.
The following steps are specific to DreamHost, but you should be able to replicate this installation on any shared host that supports deploying Python apps (FastCGI, uWSGI, Passenger).
Steps:
- add a new domain in your DreamHost panel, and set the document root to /home/{USER}/wayback/public
- init the virtualenv
    $ cd ~
    $ virtualenv wayback
- create the public and tmp directories for Passenger, warcs to store WARC files, and cdx for the indexes
    $ mkdir -p wayback/{public,tmp,warcs,cdx}
- install pywb via pip (inside the virtualenv)
    $ source wayback/bin/activate
    $ pip install pywb
- edit the pywb config file ~/wayback/config.yaml (full documentation)
    collections:
        test: ./cdx/
    archive_paths: ./warcs/
    enable_http_proxy: true
    static_routes:
        static/default: pywb/static/
    enable_cdx_api: true
    enable_memento: true
    framed_replay: true
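- edit the Passenger startup file ~/wayback/passenger_wsgi.py
    import sys, os
    # Re-exec under the virtualenv's interpreter if Passenger started a different one
    INTERP = os.path.join(os.environ['HOME'], 'wayback', 'bin', 'python')
    if sys.executable != INTERP:
        os.execl(INTERP, INTERP, *sys.argv)
    # Make the application directory importable
    sys.path.append(os.getcwd())
    # Passenger looks for a WSGI callable named "application"
    from pywb.apps.wayback import application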
- put some WARC files in ~/wayback/warcs and generate a sorted CDX index
    $ cdx-indexer --sort ~/wayback/cdx ~/wayback/warcs
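Before pointing a browser at the new domain, it can help to call the WSGI callable once in-process, so that import or configuration errors show up in your shell rather than in Passenger's logs. The following is only a minimal sketch using the standard library, assuming Python 3; `smoke_test` and `demo_app` are names I made up for illustration, and `demo_app` is a stand-in, not pywb itself.

```python
from wsgiref.util import setup_testing_defaults


def smoke_test(app, path="/"):
    """Call a WSGI app once in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers
        return lambda data: None  # legacy write() callable, unused here

    result = app(environ, start_response)
    try:
        body = b"".join(result)
    finally:
        if hasattr(result, "close"):
            result.close()
    return captured["status"], captured["headers"], body


def demo_app(environ, start_response):
    # Hypothetical stand-in for the real application, for illustration only
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from " + environ["PATH_INFO"].encode("utf-8")]


if __name__ == "__main__":
    status, headers, body = smoke_test(demo_app, "/test/")
    print(status, body)
```

With pywb installed, the same helper could presumably be run from ~/wayback against the `application` object imported in passenger_wsgi.py instead of `demo_app`.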
Try searching the /test collection for the URL http://twitter.com/atomotic, and you'll get these results.
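To give an idea of why the index is generated with --sort: a sorted CDX file can be searched with a plain binary search on the URL key. Here is a rough sketch, assuming Python 3 and the usual space-separated CDX layout in which the first three fields are the canonicalized URL key, a 14-digit timestamp and the original URL. The sample lines are invented for illustration (they are not output from my archive), and `parse_cdx_line` and `find_captures` are hypothetical helpers, not part of pywb.

```python
from bisect import bisect_left
from collections import namedtuple
from datetime import datetime

CdxEntry = namedtuple("CdxEntry", "urlkey timestamp original rest")


def parse_cdx_line(line):
    """Parse one CDX line; return None for the header or malformed lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(" CDX"):
        return None
    parts = line.split(" ")
    if len(parts) < 3:
        return None
    urlkey, ts, original = parts[:3]
    return CdxEntry(urlkey, datetime.strptime(ts, "%Y%m%d%H%M%S"), original, parts[3:])


def find_captures(sorted_lines, urlkey):
    """Binary-search a sorted list of CDX lines for every capture of urlkey."""
    prefix = urlkey + " "
    hits = []
    for line in sorted_lines[bisect_left(sorted_lines, prefix):]:
        if not line.startswith(prefix):
            break  # sorted input: no further matches are possible
        hits.append(parse_cdx_line(line))
    return hits


# Invented sample index, already sorted (the header sorts first)
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "com,twitter)/atomotic 20141022163030 http://twitter.com/atomotic text/html 200 AAAA - - 1000 0 example.warc.gz",
    "com,twitter)/atomotic 20141022163031 https://twitter.com/atomotic text/html 200 BBBB - - 1000 1000 example.warc.gz",
    "org,example)/ 20141022170000 http://example.org/ text/html 200 CCCC - - 500 2000 example.warc.gz",
]

if __name__ == "__main__":
    for entry in find_captures(SAMPLE, "com,twitter)/atomotic"):
        print(entry.timestamp.isoformat(), entry.original)
```

The real lookup in pywb is more involved (it canonicalizes the URL and reads from files on disk), so take this only as an illustration of the idea.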
And Memento is also available:
$ curl "http://wayback.literarymachin.es/test/timemap/*/twitter.com/atomotic"
<http://wayback.literarymachin.es/test/timemap/*/http://twitter.com/atomotic>; rel="self"; type="application/link-format"; from="Wed, 22 Oct 2014 16:30:30 GMT",
<http://twitter.com/atomotic>; rel="original",
<http://wayback.literarymachin.es/test/http://twitter.com/atomotic>; rel="timegate",
<http://wayback.literarymachin.es/test/20141022163030/http://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:30 GMT",
<http://wayback.literarymachin.es/test/20141022163031/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:31 GMT",
<http://wayback.literarymachin.es/test/20141022163042/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:42 GMT",
<http://wayback.literarymachin.es/test/20141022163355/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:33:55 GMT",
<http://wayback.literarymachin.es/test/20141022163710/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:37:10 GMT"
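A TimeMap in this link format is easy to consume from a script. The sketch below, assuming Python 3, pulls the mementos out of a response like the one above using two regular expressions. It is not a complete link-format parser (it assumes every parameter value is double-quoted, as in this output), and `parse_timemap` and `mementos` are names I picked for illustration.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <url> followed by any number of ; name="value" parameters
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (url, params) pairs from a link-format TimeMap."""
    return [(url, dict(PARAM_RE.findall(raw))) for url, raw in LINK_RE.findall(text)]


def mementos(text):
    """Return (datetime, url) pairs for the entries whose rel includes 'memento'."""
    found = []
    for url, params in parse_timemap(text):
        if "memento" in params.get("rel", "").split():
            found.append((parsedate_to_datetime(params["datetime"]), url))
    return found


# First lines of the TimeMap shown above
SAMPLE = '''<http://wayback.literarymachin.es/test/timemap/*/http://twitter.com/atomotic>; rel="self"; type="application/link-format"; from="Wed, 22 Oct 2014 16:30:30 GMT",
<http://twitter.com/atomotic>; rel="original",
<http://wayback.literarymachin.es/test/http://twitter.com/atomotic>; rel="timegate",
<http://wayback.literarymachin.es/test/20141022163030/http://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:30 GMT",
<http://wayback.literarymachin.es/test/20141022163031/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:31 GMT"'''

if __name__ == "__main__":
    for when, url in mementos(SAMPLE):
        print(when.isoformat(), url)
```

Matching whole `<url>; name="value"` entries rather than splitting on commas matters here, because the datetime values themselves contain commas.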
Why do I think this will be useful? Archive Team does a great job of running a distributed crawling organization, but publishing is still centralized at the Internet Archive. What if we began publishing thousands of small web archives, aggregating[1][2][3] them with the Memento protocol?
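To make the aggregation idea a bit more concrete, here is a toy sketch, assuming Python 3 and that each small archive has already given us its mementos as a list of (datetime, url) pairs sorted by time. The archive hostnames are made up, `aggregate` and `closest` are hypothetical helpers, and this is only an illustration of the idea, not a description of how existing Memento aggregators work.

```python
import heapq
from datetime import datetime


def aggregate(*memento_lists):
    """Merge per-archive memento lists (each sorted by datetime) into one timeline."""
    return list(heapq.merge(*memento_lists))


def closest(timeline, target):
    """Pick the memento nearest in time to target, roughly what a TimeGate does."""
    return min(timeline, key=lambda item: abs(item[0] - target))


# Invented data: two small archives holding captures of the same page
archive_a = [
    (datetime(2014, 10, 22, 16, 30, 30),
     "http://archive-a.example.org/test/20141022163030/http://twitter.com/atomotic"),
    (datetime(2014, 10, 22, 16, 37, 10),
     "http://archive-a.example.org/test/20141022163710/http://twitter.com/atomotic"),
]
archive_b = [
    (datetime(2014, 10, 22, 16, 33, 55),
     "http://archive-b.example.org/web/20141022163355/http://twitter.com/atomotic"),
]

if __name__ == "__main__":
    timeline = aggregate(archive_a, archive_b)
    for when, url in timeline:
        print(when.isoformat(), url)
    print("closest:", closest(timeline, datetime(2014, 10, 22, 16, 34, 0))[1])
```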