pywb 2.0 - docker quickstart

Four years have passed since I first wrote about pywb: it was a young tool at the time, but already usable and extremely simple to deploy. Since then a lot of work has been done by Ilya Kreymer (and others), resulting in all the new features available with the 2.0 release.
Also, some very big web archiving initiatives have adopted pywb in these years: Webrecorder itself, Rhizome, Perma, Arquivo PT in Portugal, the Italian National Library in Florence, and others I'm missing.

For many years I've used pywb for my personal private web archive on a shared host, with the setup described here. Nowadays shared hosts are all but defunct, and cloud virtual machines are cheaper than ever.

The simplest way to run your own pywb instance today is probably docker. Here is a quick tutorial:

  • pull the docker image

      docker pull webrecorder/pywb
    
  • create a directory to keep the collection

      mkdir ~/webarchive; cd ~/webarchive
    
  • initialise the collection (replace my-collection with any name you prefer)

      docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager init my-collection
    
  • add archived contents, copying WARCs you have previously created

      cp $file.warc.gz ~/webarchive/collections/my-collection/archive
    
  • index the collection

      docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager reindex my-collection
    

    A CDXJ index will be created at ~/webarchive/collections/my-collection/indexes/index.cdxj

  • start it: pywb will run on localhost:8080

      docker run -d --name pywb -v ~/webarchive:/webarchive -p 8080:8080 webrecorder/pywb
      open http://localhost:8080    # macOS; on Linux use xdg-open, or just visit the URL in a browser
    

Easy!
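
If you repeat these steps often, they can be collected in a small script. What follows is only a minimal sketch, not something shipped with pywb: the variable names ARCHIVE_DIR, COLLECTION and DRY_RUN and the helper functions are my own inventions for this example, while the docker and wb-manager commands are the same ones used in the tutorial. By default it only prints what it would do, so you can check the commands before running them for real.

```shell
#!/bin/sh
# Sketch: the quickstart steps wrapped in small functions.
# ARCHIVE_DIR, COLLECTION and DRY_RUN are hypothetical names for this example.
ARCHIVE_DIR="${ARCHIVE_DIR:-$HOME/webarchive}"
COLLECTION="${COLLECTION:-my-collection}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = execute them

# Print a command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Run wb-manager in a throwaway container with the archive directory mounted.
wb_manager() {
    run docker run --rm -v "$ARCHIVE_DIR:/webarchive" webrecorder/pywb wb-manager "$@"
}

# Directory where the WARC files of the collection are stored.
warc_dir() {
    echo "$ARCHIVE_DIR/collections/$COLLECTION/archive"
}

# Pull the image, create and fill the collection, index it, start pywb.
setup_archive() {
    run docker pull webrecorder/pywb
    run mkdir -p "$ARCHIVE_DIR"
    wb_manager init "$COLLECTION"
    # Copy every WARC passed as an argument into the collection.
    for warc in "$@"; do
        run cp "$warc" "$(warc_dir)/"
    done
    wb_manager reindex "$COLLECTION"
    run docker run -d --name pywb -v "$ARCHIVE_DIR:/webarchive" -p 8080:8080 webrecorder/pywb
}

setup_archive "$@"
```

Saved as, say, pywb-setup.sh (again a made-up name), it would be called as `sh pywb-setup.sh first.warc.gz second.warc.gz` to preview the commands, and with `DRY_RUN=0` in front to actually execute them.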

Again, why has pywb been so important in the web archiving scene? Because it focuses on individuals, making it easy to create, curate and maintain personal web archives!