ArchivIIIFy
2020-07-05 · tags: digital libraries, internet archive, books, iiif, solr, ocr, tesseract

A short guide to downloading digitized books from the Internet Archive and rehosting them on your own infrastructure using IIIF, with full-text search.
I'm an avid explorer of the Internet Archive (I also contribute to it with some scans of my zine collection), and I habitually download the content I find valuable to my own disks so that I can browse and read it offline.
The following guide is a quick tutorial describing some scripts and infrastructure pieces (Docker) I've developed lately to download digitized books and rehost them locally with IIIF, giving me a better viewer (where I can annotate content) and also full-text search (note that IA already has full-text search, and it's good).
To start clone this repository https://github.com/atomotic/archiviiify and fire up the docker compose stack. It will start these containers:
- nginx, which proxies the other services and hosts the Mirador viewer
- iipsrv (with openjpeg to decode JPEG2000) for serving IIIF images
- memcached, used by iipsrv
- Solr with the OCR highlighting plugin (thanks! @jbaiter_)
- the search API: a simple Deno application that translates Solr responses into IIIF search responses
The steps needed:
- Download images from Internet Archive
- Generate IIIF Manifest
- Generate OCR
- Index to Solr
- View and have fun
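Put together, the pipeline is just the scripts described below run in order (a sketch; the item name is an example):

```shell
# end-to-end sketch using the scripts from ./scripts
ITEM=codici-immaginari-1
./scripts/get "$ITEM"      # download and unzip JP2 images into ./data
./scripts/iiif "$ITEM"     # generate the IIIF manifest in www/manifests
./scripts/ocr "$ITEM"      # OCR with Tesseract into a hOCR file
./scripts/ocr-fix "$ITEM"  # rename ocr_page ids to image filenames
./scripts/index "$ITEM"    # index the hOCR into Solr
```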
Disclaimer: there are a lot of moving parts (and not enough glue). I'll write a proper Makefile at some point. For each of the following steps there is a shell script in ./scripts.
Download images
The Internet Archive automatically derives other formats when something is ingested: after digitized books are uploaded (as a PDF or a ZIP of images), they are converted to JPEG2000 (full text is also extracted, among other things).
JPEG2000 images are ready to be used with the IIIF server; there is no need to convert them again to pyramidal formats.
To download use the internetarchive cli:
ia list -l -f "Single Page Processed JP2 ZIP" ITEM
example:
ia list -l -f "Single Page Processed JP2 ZIP" codici-immaginari-1
https://archive.org/download/codici-immaginari-1/codici-immaginari-1_jp2.zip
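The ZIP filename follows a predictable ITEM_jp2.zip pattern, so the download URL can be built directly; a tiny hypothetical helper (jp2_zip_url is my name, not part of the ia CLI):

```shell
# build the download URL for an item's JP2 ZIP
# (assumes the ITEM_jp2.zip naming convention holds)
jp2_zip_url() {
  echo "https://archive.org/download/$1/${1}_jp2.zip"
}
jp2_zip_url codici-immaginari-1
# → https://archive.org/download/codici-immaginari-1/codici-immaginari-1_jp2.zip
```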
Run the script that downloads and unzips the images into ./data:
./scripts/get ITEM
Generate IIIF manifest
JP2 images in the ./data directory are served by the iipsrv container following this pattern:
data/item/file.jp2 → http://localhost:8094/iiif/item/file.jp2/info.json
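The path-to-URL mapping can be sketched as a small helper function (iiif_info_url is a hypothetical name, not something shipped with the repo):

```shell
# map a path under ./data to the iipsrv info.json URL it is served at
iiif_info_url() {
  local rel=${1#data/}   # strip the leading data/ prefix
  echo "http://localhost:8094/iiif/${rel}/info.json"
}
iiif_info_url data/item/file.jp2
# → http://localhost:8094/iiif/item/file.jp2/info.json
```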
To generate the IIIF manifest run (Deno is required to be installed locally):
./scripts/iiif ITEM
The manifest is saved to www/manifests and published at http://localhost:8094/manifests/ITEM.json
I found Deno extremely useful for quick prototyping. The script that generates the manifest is very simple (and incomplete). Better ways and libraries exist to produce IIIF Presentation manifests; look at manifesto.
Generate OCR
The Internet Archive also runs OCR and extracts full text with ABBYY, but that format is not supported by the OCR highlighting plugin. I tried to convert it using this XSL (Saxon needed, not xsltproc), but the result is not good enough: the required ocrx_word classes are missing. I haven't looked deeply (XSLT gives me headaches), so I gave up and re-OCRed with Tesseract 4.
Run:
./scripts/ocr ITEM
This script creates a file with the list of images:
~ find data/ITEM/*.jp2 > ITEM.list
and runs Tesseract (you need to specify the proper language model):
~ tesseract -l ita ITEM.list ITEM hocr
This can take some time; to speed things up, GNU parallel could be used to generate hOCR for every single image, combining the results afterwards with hocr-combine.
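A sketch of that parallel variant, assuming tesseract, GNU parallel, and hocr-combine (from hocr-tools) are installed; paths and flags may need adjusting:

```shell
# OCR each page separately, then merge the per-page hOCR files into one
ITEM=codici-immaginari-1
find "data/$ITEM" -name '*.jp2' | sort | \
  parallel tesseract -l ita {} {.} hocr   # writes one .hocr next to each .jp2
hocr-combine "data/$ITEM"/*.hocr > "ocr/$ITEM.hocr"
```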
A small fix is needed for the resulting hOCR: Tesseract names ocr_page classes page_{1..n}; I prefer to name them with the full name of the original image file, which is also contained in the canvas identifier in the IIIF manifest:
<div class='ocr_page' id='page_1' ...
↳
<div class='ocr_page' id='file_0000.jp2' ...
Run
./scripts/ocr-fix ITEM
hOCR is XHTML, so it would be advisable to use a proper parser (or XSLT). The previous script instead uses some CLI voodoo out of laziness (parallel, pup, and sd required):
#!/usr/bin/env bash
ITEM=$1
parallel -j1 sd -f w {1} {2} "ocr/$ITEM.hocr" \
  ::: $(pup '.ocr_page attr{id}' < "ocr/$ITEM.hocr") \
  :::+ $(find "data/$ITEM"/*.jp2 -exec basename {} \;)
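To illustrate the rename on its own, here is a self-contained variant using sed instead of sd/pup (the sample file and names are made up for the demo):

```shell
# a one-line hOCR sample with Tesseract's default page id
cat > /tmp/sample.hocr <<'EOF'
<div class='ocr_page' id='page_1' title='bbox 0 0 2481 3508'>
EOF
# rewrite the id to the original image filename
sed "s/id='page_1'/id='file_0000.jp2'/" /tmp/sample.hocr > /tmp/sample-fixed.hocr
grep -o "id='[^']*'" /tmp/sample-fixed.hocr
# → id='file_0000.jp2'
```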
Index to Solr
The hOCR file is ready to be indexed into Solr:
POST solr/ocr/update
{
  "id": "ITEM",
  "ocr_text": "/ocr/ITEM.hocr",
  "source": "IA"
}
Run
./scripts/index ITEM
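The indexing step amounts to one JSON document POSTed to Solr's update handler; a sketch of building and sending it (the curl part is commented out since it needs the stack running):

```shell
ITEM=codici-immaginari-1
# the update document: id, path to the hOCR file, and a source tag
payload="{\"id\": \"$ITEM\", \"ocr_text\": \"/ocr/$ITEM.hocr\", \"source\": \"IA\"}"
echo "$payload"
# POST it (uncomment with Solr running):
# curl -H 'Content-Type: application/json' \
#   -d "[$payload]" 'http://localhost:8983/solr/ocr/update?commit=true'
```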
Go to the Solr admin at http://localhost:8983 to try some queries, or reach the iiif search api at http://localhost:8094/search/ITEM?q=....
The query can be tweaked here
View
Open http://localhost:8094/mirador?manifest=ITEM
and enjoy reading your book with Mirador 3! This tutorial is not exclusive to the Internet Archive; it can be used to publish any content with IIIF.
A video that shows how it works:
Send your love to Internet Archive: use it and donate!