Epub linkrot
2015-03-03 tags: epub webarchiving linkrot digital preservation
Linkrot also affects epub files (who would have thought! :)).
How to check the health of external links in epub books (required tools: a shell, atool, pup, curl, GNU parallel).
extract all external links
$ acat -F zip {FILE.epub} "*.xhtml" "*.html" \
| pup 'a attr{href}' \
| grep -E "^http" | sort -u \
> links.txt
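Links that differ only by their "#fragment" point to the same page, so checking each of them wastes requests. A minimal sketch of an optional cleanup step, assuming one URL per line as in links.txt; the function name normalize_links and the output file name are hypothetical, not part of the original workflow:

```shell
# Hypothetical helper: normalize a list of URLs read from stdin.
# Drops "#fragment" suffixes, keeps only http(s) URLs, removes duplicates.
normalize_links() {
  sed 's/#.*$//' \
    | grep -E '^https?://' \
    | LC_ALL=C sort -u
}

# Usage (file names are only examples):
#   normalize_links < links.txt > links-normalized.txt
```

Whether dropping fragments is appropriate depends on the book: for single-page sites that route on the fragment, you may prefer to keep the original list.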
check http status
$ echo "http_code,url" > links-status.csv
$ parallel -j 10 'curl -k -L -s -o /dev/null -w "%{http_code}" {}; echo ,{}' :::: links.txt >> links-status.csv
links-status.csv is a CSV file that contains the http_code and the original url of each link. With csvkit (and jq) installed you can run this simple analysis:
$ csvstat --freq -c http_code links-status.csv | jq .
This gives you a summary view by http_code. The following example comes from a book I bought in 2011 (14 links gone):
{
"403": 1,
"301": 2,
"404": 14,
"200": 95
}
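The summary tells you how many links are gone, but not which ones. A minimal sketch for listing them, assuming the CSV has a header row followed by "http_code,url" lines; the function name broken_links is hypothetical, and treating every code outside 2xx/3xx as broken (including the 000 that curl reports when it gets no response) is a choice made here, not something the original workflow prescribes:

```shell
# Hypothetical helper: print "code url" for every row whose status
# is not 2xx or 3xx. Everything after the first comma is treated as
# the URL, so commas inside URLs survive.
broken_links() {
  awk 'NR > 1 {
    i = index($0, ",")
    if (i == 0) next                 # skip malformed rows
    code = substr($0, 1, i - 1)
    url = substr($0, i + 1)
    sub(/^ /, "", url)               # tolerate "code, url" with a space
    if (code !~ /^[23][0-9][0-9]$/) print code, url
  }' "$1"
}

# Usage:
#   broken_links links-status.csv
```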
what to do?
So you've discovered that your beloved ebooks are full of rotten links. Here is what you could do:
- You are a compulsive reader and digital book hoarder? Archive the links from each book yourself right after you buy it (removing DRM from your own books is also a safe thing to do).
- You are an author or self-publisher (or whatever the equivalent hipster term is)? Archive the links from the book you are writing in the Internet Archive Wayback Machine (using Save Page Now), then link to the archived versions. Robust Links are not an option for now, unless some epub reader client is implementing them right now.
- You are a big publisher? Consider running your own web archive and offering it as a service to your authors.
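For the Save Page Now route mentioned in the list above, a minimal sketch of submitting a whole links.txt from the shell. It assumes the public https://web.archive.org/save/ URL prefix accepts a plain GET request for the page to capture; the function names spn_url and archive_links are hypothetical, and the 5-second pause is a guess at polite behaviour, not a documented limit:

```shell
# Hypothetical helper: build the Save Page Now request URL for one link.
spn_url() {
  printf 'https://web.archive.org/save/%s\n' "$1"
}

# Hypothetical helper: submit every link in a file (one URL per line)
# and print "http_code,url" for each request.
archive_links() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue        # skip empty lines
    code=$(curl -s -o /dev/null -w '%{http_code}' "$(spn_url "$url")")
    printf '%s,%s\n' "$code" "$url"
    sleep 5                          # assumed delay between requests
  done < "$1"
}

# Usage:
#   archive_links links.txt > archive-status.csv
```

The requests run one at a time on purpose: firing them in parallel, as in the status check, is likely to get throttled.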