Pywb 2.0 - docker quickstart

2018-01-31 tags: webarchiving pywb docker

Four years have passed since I first wrote about pywb: it was a young tool at the time, but already usable and extremely simple to deploy. Since then a lot of work has been done by Ilya Kreymer (and others), resulting in all the new features available with the 2.0 release.

Also, some very big webarchiving initiatives have moved to pywb in these years: Webrecorder itself, Rhizome, Perma, Arquivo PT in Portugal, the Italian National Library in Florence, and others I'm missing.

For many years I've used pywb for my personal private webarchive on a shared host, with the setup described here. Nowadays shared hosts are all but defunct, and cloud virtual machines are cheaper than ever.

The simplest way to run your own pywb instance today is probably docker. Here is a quick tutorial:

  • pull the docker image

    docker pull webrecorder/pywb
    
  • create a directory to keep the collection

    mkdir ~/webarchive; cd ~/webarchive
    
  • initialise the collection (replace my-collection with any name you prefer)

    docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager init my-collection
    
  • add archived content by copying in WARCs you have previously created

    cp $file.warc.gz ~/webarchive/collections/my-collection/archive
    
  • index the collection

    docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager reindex my-collection
    

    A CDXJ index will be created at ~/webarchive/collections/my-collection/indexes/index.cdxj

  • start pywb, which will listen on localhost:8080

    docker run -d --name pywb -v ~/webarchive:/webarchive -p 8080:8080 webrecorder/pywb
    open http://localhost:8080   # macOS; on Linux use xdg-open or visit the URL in a browser
    

Easy!
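
If you add WARCs often, the steps above can be wrapped in a few shell functions. This is only a minimal sketch, not something shipped with pywb: the variable names ARCHIVE_DIR, COLLECTION and DOCKER and the function names wb_manager, warc_dir and add_warcs are my own, and it assumes the same directory layout and docker image used in the tutorial.

```shell
#!/bin/sh
# Sketch only: wraps the tutorial's commands in reusable functions.
# ARCHIVE_DIR, COLLECTION and DOCKER are hypothetical names, not pywb settings.
ARCHIVE_DIR="${ARCHIVE_DIR:-$HOME/webarchive}"
COLLECTION="${COLLECTION:-my-collection}"
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo to print the command instead of running it

# Run wb-manager in a throwaway container with the archive directory mounted.
wb_manager() {
    "$DOCKER" run --rm -v "$ARCHIVE_DIR:/webarchive" webrecorder/pywb wb-manager "$@"
}

# Print the directory where the collection's WARC files live.
warc_dir() {
    printf '%s\n' "$ARCHIVE_DIR/collections/$COLLECTION/archive"
}

# Copy the given WARC files into the collection, then rebuild its index.
add_warcs() {
    cp -- "$@" "$(warc_dir)" && wb_manager reindex "$COLLECTION"
}
```

After sourcing the file, `wb_manager init my-collection` creates the collection and `add_warcs one.warc.gz two.warc.gz` copies the files and reindexes in one go. Setting DOCKER=echo is a cheap way to check what would be run before touching a real archive.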

Again, why has pywb been so important in the webarchiving scene? Because it focuses on individuals, making it easy to create, curate and maintain personal web archives!