<p><em>literary machines</em> (<a href="https://literarymachin.es/rss.xml">https://literarymachin.es/rss.xml</a>), by raffaele: digital libraries, books, archives. Feed built 2022-02-23.</p>
<h2 id="archiviiify">Archiviiify (2020-07-05)</h2>
<p>A short guide to downloading digitized books from <a href="https://www.archive.org">Internet Archive</a> and rehosting them on your own infrastructure using <a href="https://iiif.io">IIIF</a> with <a href="https://dbmdz.github.io/solr-ocrhighlighting/">full-text search</a>.</p>
<p>I’m an avid explorer of Internet Archive (I also contribute to it with some scans from my zine collection), and I usually download the content I find valuable to my own disks so that I can browse and read it offline.<br />
The following is a quick tutorial describing some scripts and infrastructure pieces (docker) I’ve developed lately to download digitized books and rehost them locally with IIIF, giving me a better viewer (where I can annotate content) and full-text search (note: IA already has full-text search, and it is good).</p>
<p>To start, clone this repository <a href="https://github.com/atomotic/archiviiify">https://github.com/atomotic/archiviiify</a> and fire up the docker compose stack. It will start these containers:</p>
<ul>
<li><strong>nginx</strong>, which proxies the various services and hosts the <a href="https://projectmirador.org/">Mirador</a> viewer</li>
<li><a href="https://github.com/ruven/iipsrv"><strong>iipsrv</strong></a> (with <a href="https://github.com/uclouvain/openjpeg">openjpeg</a> to decode JPEG2000) for serving IIIF images</li>
<li><strong>memcached</strong>, used by iipsrv</li>
<li><strong>solr</strong> with the <a href="https://dbmdz.github.io/solr-ocrhighlighting/">ocr highlighting plugin</a> (thanks, <a href="https://twitter.com/jbaiter_">@jbaiter_</a>!)</li>
<li>the <strong>search api</strong>: a simple <a href="https://deno.land">Deno</a> application that translates the Solr response into a IIIF search response</li>
</ul>
<p>The steps needed:</p>
<ol>
<li><a href="#download-images">Download images</a> from Internet Archive</li>
<li><a href="#generate-iiif-manifest">Generate IIIF Manifest</a></li>
<li><a href="#generate-ocr">Generate OCR</a></li>
<li><a href="#index-to-solr">Index to Solr</a></li>
<li><a href="#view">View</a> and have fun</li>
</ol>
<p><strong>Disclaimer</strong>: there are a lot of moving parts (and not enough glue). I’ll write a proper Makefile at some point. For each of the following steps there is a shell script in <code class="language-plaintext highlighter-rouge">./scripts</code>.</p>
<h3 id="download-images">Download images</h3>
<p>Internet Archive automatically derives other formats when something is ingested: digitized books, after they are uploaded (as a pdf or a zip of images), are converted to <a href="https://en.wikipedia.org/wiki/JPEG_2000">JPEG2000</a> (full text is also extracted, and other things are generated).
JPEG2000 images are ready to be used with the IIIF server; there is no need to convert them again to <a href="https://iipimage.sourceforge.io/documentation/images/">pyramidal</a> formats.<br />
To download them use the <a href="https://github.com/jjjake/internetarchive">internetarchive cli</a>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ia list <span class="nt">-l</span> <span class="nt">-f</span> <span class="s2">"Single Page Processed JP2 ZIP"</span> ITEM</code></pre></figure>
<p>example:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ia list <span class="nt">-l</span> <span class="nt">-f</span> <span class="s2">"Single Page Processed JP2 ZIP"</span> codici-immaginari-1
https://archive.org/download/codici-immaginari-1/codici-immaginari-1_jp2.zip</code></pre></figure>
<p>Run the script that downloads and unzips the images into <code class="language-plaintext highlighter-rouge">./data</code>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">./scripts/get ITEM</code></pre></figure>
<h3 id="generate-iiif-manifest">Generate IIIF manifest</h3>
<p>JP2 images from the <code class="language-plaintext highlighter-rouge">./data</code> directory are served by the iipsrv container following this pattern:</p>
<p><code class="language-plaintext highlighter-rouge">data/item/file.jp2</code> → <code class="language-plaintext highlighter-rouge">http://localhost:8094/iiif/item/file.jp2/info.json</code></p>
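The mapping above can be sketched as a small Python helper (the base URL is the one used by the docker compose stack in this setup):

```python
from pathlib import Path

BASE = "http://localhost:8094/iiif"

def info_json_url(jp2_path: str) -> str:
    """Map a local JP2 path under ./data to the IIIF info.json URL served by iipsrv."""
    p = Path(jp2_path)
    item, filename = p.parts[-2], p.name  # e.g. "item", "file.jp2"
    return f"{BASE}/{item}/{filename}/info.json"

print(info_json_url("data/item/file.jp2"))
# → http://localhost:8094/iiif/item/file.jp2/info.json
```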
<p>To generate the IIIF manifest run (<a href="https://deno.land">Deno</a> must be installed locally):</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">./scripts/iiif ITEM</code></pre></figure>
<p>The manifest is saved to <code class="language-plaintext highlighter-rouge">www/manifests</code> and published to<br />
<code class="language-plaintext highlighter-rouge">http://localhost:8094/manifests/ITEM.json</code></p>
<p>I found <a href="https://deno.land">Deno</a> extremely useful for quick prototyping. The <a href="https://github.com/atomotic/archiviiify/blob/master/scripts/make-manifest.js">script to generate the manifest</a> is very simple (and incomplete). Better ways and libraries exist to produce IIIF Presentation manifests; look at <a href="https://iiif-commons.github.io/manifesto/">manifesto</a>.</p>
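For illustration, the shape of such a manifest can be sketched in Python. This is a bare-bones IIIF Presentation 2.x structure, not the author’s script: the canvas identifiers and the hard-coded width/height are assumptions (real code would read sizes from each image’s info.json):

```python
def make_manifest(item, images, base="http://localhost:8094"):
    """Build a minimal IIIF Presentation 2.x manifest for a list of JP2 files."""
    canvases = []
    for name in images:
        image_id = f"{base}/iiif/{item}/{name}"
        canvas_id = f"{base}/manifests/{item}/canvas/{name}"  # hypothetical pattern
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": name,
            # placeholder sizes; real code reads them from the image's info.json
            "width": 1000, "height": 1500,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{image_id}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "service": {"@id": image_id,
                                "profile": "http://iiif.io/api/image/2/level1.json"},
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{base}/manifests/{item}.json",
        "@type": "sc:Manifest",
        "label": item,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }

m = make_manifest("ITEM", ["file_0000.jp2", "file_0001.jp2"])
```

Dumping `m` with `json.dumps` gives a manifest Mirador can load, modulo the placeholder sizes.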
<h3 id="generate-ocr">Generate OCR</h3>
<p>Internet Archive also runs OCR and extracts full text with ABBYY, but that is not a <a href="https://dbmdz.github.io/solr-ocrhighlighting/formats/">format supported</a> by the ocr highlighting plugin. I tried to convert it using this <a href="https://raw.githubusercontent.com/OCR-D/format-converters/master/abbyy2hocr.xsl">xsl</a> (saxon needed, not xsltproc), but the result is not sufficient: the required <code class="language-plaintext highlighter-rouge">ocrx_word</code> classes are missing. I haven’t looked deeply (XSLT gives me headaches), so I gave up and re-OCRed everything with <a href="https://tesseract-ocr.github.io/">Tesseract 4</a>.</p>
<p>Run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">./scripts/ocr ITEM</code></pre></figure>
<p>The previous script creates a file with the list of images:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ find data/ITEM/<span class="se">\*</span>.jp2 <span class="o">></span> ITEM.list</code></pre></figure>
<p>and runs Tesseract (you need to specify the proper language model):</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ tesseract <span class="nt">-l</span> ita ITEM.list ITEM hocr</code></pre></figure>
<p>This can take some time; to speed things up, <a href="https://www.gnu.org/software/parallel/">GNU parallel</a> could be used to generate an hocr file for every single image and then combine the results with <a href="https://github.com/tmbdev/hocr-tools#hocr-combine">hocr-combine</a>.<br />
A small fix is needed in the resulting hocr: Tesseract names the <code class="language-plaintext highlighter-rouge">ocr_page</code> classes <code class="language-plaintext highlighter-rouge">page_{1..n}</code>; I prefer to use the full name of the original image file, which is also contained in the canvas identifier in the IIIF manifest:</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"> <span class="nt"><div</span> <span class="na">class=</span><span class="s">'ocr_page'</span> <span class="na">id=</span><span class="s">'page_1'</span> <span class="err">...</span></code></pre></figure>
<p>↳</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"> <span class="nt"><div</span> <span class="na">class=</span><span class="s">'ocr_page'</span> <span class="na">id=</span><span class="s">'file_0000.jp2'</span> <span class="err">...</span></code></pre></figure>
<p>Run</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">./scripts/ocr-fix ITEM</code></pre></figure>
<p><a href="https://en.wikipedia.org/wiki/HOCR">hOCR</a> is XHTML, so it would be advisable to use a proper parser (or xslt). The previous script uses some kind of cli voodoo out of laziness (<a href="https://www.gnu.org/software/parallel/">parallel</a>, <a href="https://github.com/ericchiang/pup">pup</a>, <a href="https://github.com/chmln/sd">sd</a> required):</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#!/usr/bin/env bash</span>
<span class="nv">ITEM</span><span class="o">=</span><span class="nv">$1</span>
parallel <span class="nt">-j1</span> sd <span class="nt">-f</span> w <span class="o">{</span>1<span class="o">}</span> <span class="o">{</span>2<span class="o">}</span> <span class="s2">"ocr/</span><span class="nv">$ITEM</span><span class="s2">.hocr"</span> <span class="se">\</span>
::: <span class="si">$(</span>pup .ocr_page attr<span class="o">{</span><span class="nb">id</span><span class="o">}</span> <<span class="s2">"ocr/</span><span class="nv">$ITEM</span><span class="s2">.hocr"</span><span class="si">)</span> <span class="se">\</span>
:::+ <span class="si">$(</span>find data/<span class="nv">$ITEM</span>/<span class="se">\*</span>.jp2 <span class="nt">-exec</span> <span class="nb">basename</span> <span class="o">{}</span> <span class="se">\;</span><span class="si">)</span></code></pre></figure>
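The same renaming can be sketched with Python’s standard library instead; this is still regex-based rather than a proper XML parse, but it preserves page order. The sample string and file names are illustrative:

```python
import re

def fix_page_ids(hocr: str, image_names: list) -> str:
    """Replace Tesseract's page_{1..n} ids with the original image file
    names, in page order, so they match the canvas identifiers in the
    IIIF manifest."""
    names = iter(image_names)
    return re.sub(r"(class='ocr_page' id=')page_\d+(')",
                  lambda m: m.group(1) + next(names) + m.group(2),
                  hocr)

sample = ("<div class='ocr_page' id='page_1' ...>"
          "<div class='ocr_page' id='page_2' ...>")
fixed = fix_page_ids(sample, ["file_0000.jp2", "file_0001.jp2"])
```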
<h3 id="index-to-solr">Index to Solr</h3>
<p>The hOCR file is ready to be indexed to Solr:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST solr/ocr/update
{
  "id": "ITEM",
  "ocr_text": "/ocr/ITEM.hocr",
  "source": "IA"
}
</code></pre></div></div>
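A rough sketch of building that request body in Python; the field names follow the example above, while the exact update endpoint and the container-side path are assumptions of this setup:

```python
import json

def solr_ocr_doc(item: str) -> str:
    """Build the JSON body for indexing one item's hOCR file into the
    Solr core; the ocr-highlighting plugin resolves the path and indexes
    the OCR text."""
    return json.dumps([{
        "id": item,
        "ocr_text": f"/ocr/{item}.hocr",  # path as mounted in the solr container
        "source": "IA",
    }])

payload = solr_ocr_doc("ITEM")
# POST this to http://localhost:8983/solr/ocr/update with
# Content-Type: application/json (via curl or urllib.request)
```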
<p>Run</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">./scripts/index ITEM</code></pre></figure>
<p>Go to the Solr admin at http://localhost:8983 to try some queries, or reach the IIIF search api at <code class="language-plaintext highlighter-rouge">http://localhost:8094/search/ITEM?q=....</code><br />
The query can be tweaked <a href="https://github.com/atomotic/archiviiify/blob/master/iiif-search-api/main.js#L17-L21">here</a>.</p>
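The translation that the Deno search api performs can be sketched roughly like this; the snippet shape and the canvas URI pattern are hypothetical simplifications, not the actual app code:

```python
def solr_to_iiif(item, snippets, base="http://localhost:8094"):
    """Map OCR-highlighting snippets (text, page id, highlight region)
    to a minimal IIIF Content Search annotation list."""
    resources = []
    for n, s in enumerate(snippets):
        x, y, w, h = s["region"]
        resources.append({
            "@id": f"{base}/search/{item}/annotation/{n}",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": s["text"]},
            # hypothetical canvas URI pattern matching the generated manifest
            "on": f"{base}/manifests/{item}/canvas/{s['page']}#xywh={x},{y},{w},{h}",
        })
    return {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@type": "sc:AnnotationList",
        "resources": resources,
    }

result = solr_to_iiif("ITEM",
    [{"text": "zine", "page": "file_0000.jp2", "region": (10, 20, 100, 30)}])
```

Mirador consumes this annotation list to paint the highlight boxes on the matching canvases.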
<h3 id="view">View</h3>
<p>Open <code class="language-plaintext highlighter-rouge">http://localhost:8094/mirador?manifest=ITEM</code> and enjoy reading your book with Mirador 3! This tutorial is not exclusive to Internet Archive; it can be used to publish any content in IIIF.</p>
<p>A video that shows how it works:</p>
<video width="500" height="300" controls="">
<source src="https://literarymachin.es/archiviiify.mp4" type="video/mp4" />
Your browser does not support the video tag.
</video>
<p>Send your love to Internet Archive: use it and <a href="https://archive.org/donate/">donate</a>!</p>
<h2 id="pywb2">pywb 2.0 - docker quickstart (2018-01-31)</h2>
<p>Four years have passed since I <a href="https://literarymachin.es/pywb-wayback-machine/">first wrote</a> about <a href="http://pywb.readthedocs.io/en/latest/">pywb</a>: it was a young tool at the time, but already usable and extremely simple to deploy.
Since then a lot of work has been done by Ilya Kreymer (and others), resulting in all the new features available in the <a href="https://webrecorder.github.io/2018/01/30/pywb-release.html">2.0 release</a>.<br />
Some very big webarchiving initiatives have also moved to <strong>pywb</strong> in these years: <a href="https://webrecorder.io">Webrecorder</a> itself, <a href="http://rhizome.org/">Rhizome</a>, <a href="https://perma.cc/">Perma</a>, <a href="http://arquivo.pt/">Arquivo PT</a> in Portugal, the <a href="https://twitter.com/bncfirenze/status/844219966505320450">Italian National Library</a> in Florence (and others I’m missing).</p>
<p>For many years I’ve used <strong>pywb</strong> for my personal private webarchive on a shared host, with the setup described <a href="https://literarymachin.es/pywb-wayback-machine">here</a>. Nowadays shared hosts are mostly defunct, and cloud virtual machines are even cheaper.</p>
<p>The simplest way to run your own pywb instance today is probably <strong>docker</strong>.
Here is a quick tutorial:</p>
<ul>
<li>
<p>pull the docker image</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker pull webrecorder/pywb
</code></pre></div> </div>
</li>
<li>
<p>create a directory to keep the collection</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir ~/webarchive; cd ~/webarchive
</code></pre></div> </div>
</li>
<li>
<p>initialise the collection (rename <em>my-collection</em> as you prefer)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager init my-collection
</code></pre></div> </div>
</li>
<li>
<p>add archived contents, copying WARCs you have previously created</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cp $file.warc.gz ~/webarchive/collections/my-collection/archive
</code></pre></div> </div>
</li>
<li>
<p>index the collection</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager reindex my-collection
</code></pre></div> </div>
<p>a <a href="https://github.com/oduwsdl/ORS/wiki/CDXJ">CDXJ</a> index will be created in <code class="language-plaintext highlighter-rouge">~/webarchive/collections/my-collection/indexes/index.cdxj</code></p>
</li>
<li>
<p>start it: pywb will run on localhost:8080</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> docker run -d --name pywb -v ~/webarchive:/webarchive -p 8080:8080 webrecorder/pywb
open http://localhost:8080
</code></pre></div> </div>
</li>
</ul>
<p>Easy!</p>
<p>Again, why has pywb been so important in the webarchiving scene?
Because it focuses on individuals, and on the ease of creating, curating and maintaining personal web archives!</p>
<iframe src="https://digipres.club/@despens/99443704321052297/embed" class="mastodon-embed" style="max-width: 100%; border: 0" width="600" height="400"></iframe>
<script src="https://digipres.club/embed.js" async="async"></script>
<h2 id="anonymous-webarchiving">Anonymous webarchiving (2017-10-05)</h2>
<p>Webarchiving activities, like any other activity involving an HTTP client, leave traces: the web server you are visiting or crawling will save your IP address in its logs (or, even worse, may decide to ban your IP). This is usually not a problem; there are plenty of good reasons for a web server to keep logs of its visitors.<br />
But sometimes you may need to protect your identity when visiting or saving something from a website, and there are many sensitive professions that need this protection: activists, journalists, political dissidents.<br />
<a href="https://www.torproject.org/">Tor</a> was invented for this, and today offers good protection for browsing the web anonymously.<br />
<strong>Can we also archive the web through Tor?</strong></p>
<p>It is actually not difficult: we need the Tor daemon running, and then we have to proxy our webarchiving client through it. Every crawler (Heritrix, wget, wpull) can be configured to use a proxy.</p>
<p>Here I want to use <a href="https://github.com/ikreymer/pywb"><strong>pywb</strong></a>, a python implementation of the wayback machine (I wrote about it in the past!), with a <a href="http://pywb.readthedocs.io/en/develop/manual/usage.html#using-pywb-recorder"><strong>new recorder feature</strong></a> that will soon be released (kudos to <a href="https://twitter.com/IlyaKreymer">@IlyaKreymer</a> and <a href="https://twitter.com/webrecorder_io">@webrecorder</a>).</p>
<p>A quick guide for macos, easy to adapt to GNU/Linux:</p>
<h3 id="install-and-run-tor">Install and run TOR</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ brew install tor
~ echo "TestSocks 1" | tee ~/.torrc
~ tor -f ~/.torrc
</code></pre></div></div>
<p>Keep the daemon running in the foreground. Check its output (after the last step) and verify that it is logging something like this, to be sure there are no leaks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Oct 05 12:25:41.000 [notice] Your application (using socks5 to port 42) instructed Tor to take care of the DNS resolution itself if necessary. This is good.
</code></pre></div></div>
<h3 id="configure-torsocks">Configure torsocks</h3>
<p>verify that you have version 2.2.0:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ torsocks --version
Torsocks 2.2.0
</code></pre></div></div>
<p>change the default configuration:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ <span class="nv">TORSOCKS_CONF</span><span class="o">=</span>/usr/local/Cellar/torsocks/2.2.0/etc/tor/torsocks.conf
~ gsed <span class="nt">-i</span> <span class="s1">'/AllowInbound/s/^#//g'</span> <span class="nv">$TORSOCKS_CONF</span>
~ gsed <span class="nt">-i</span> <span class="s1">'/AllowOutboundLocalhost/s/^#//g'</span> <span class="nv">$TORSOCKS_CONF</span></code></pre></figure>
<h3 id="install-pywb">Install pywb</h3>
<p>install pywb from the develop branch</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ pip3 install git+https://github.com/ikreymer/pywb@develop
</code></pre></div></div>
<p>create an archive</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ mkdir webarchive
~ cd webarchive
~ wb-manager init anonymous-archive
~ echo "recorder:live" | tee config.yaml
</code></pre></div></div>
<h3 id="run-pywb-behind-tor">Run pywb behind TOR</h3>
<p>set your shell to use torsocks by default; all network activity will be proxied through Tor:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ . torsocks on
</code></pre></div></div>
<p>run pywb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ wayback --live -a --auto-interval 10
</code></pre></div></div>
<p>record your site:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://localhost:8080/anonymous-archive/record/{URL-TO-RECORD}
</code></pre></div></div>
<p><strong>important</strong>: always use a dedicated browser for this, to avoid leaks by extensions or other custom settings. Also make sure to disable DNS Prefetch:</p>
<ul>
<li>Firefox: <code class="language-plaintext highlighter-rouge">about:config</code> ➜ set <code class="language-plaintext highlighter-rouge">network.dns.disablePrefetch</code> to <code class="language-plaintext highlighter-rouge">true</code></li>
<li>Chrome: <em>Settings</em> ➜ <em>Advanced</em> ➜ <em>Privacy and security</em> ➜ turn off “<em>Use a prediction service to load pages more quickly</em>”</li>
</ul>
<p>Browse the site; everything will be recorded inside<br />
<code class="language-plaintext highlighter-rouge">./collections/anonymous-archive</code><br />
You can replay the recordings with pywb itself or with <a href="https://github.com/webrecorder/webrecorderplayer-electron">Webrecorder Player</a></p>
<p><strong>Beware: double check every step and make sure to test it with a known website where you can check the access log to verify that the IP address that is hitting the server is not yours.</strong><br />
Or, even better, record <a href="https://check.torproject.org">https://check.torproject.org</a> and verify that this message appears:</p>
<div class="image-wrapper">
<img src="/assets/images/pywb-tor.png" alt="pywb recording check.torproject" />
</div>
<h2 id="open-bni">Open BNI (2016-09-03)</h2>
<p>On <a href="http://mailman.wikimedia.it/pipermail/bibliotecari/2016-May/003789.html">30 May 2016</a> the free release of the <a href="https://it.wikipedia.org/wiki/Bibliografia_nazionale_italiana">Bibliografia Nazionale Italiana</a> (BNI, the Italian National Bibliography) was announced. The opening of this catalogue was welcomed (even with the limitation of PDF-only files), and, as a layman in library science, I also asked a <a href="http://mailman.wikimedia.it/pipermail/bibliotecari/2016-May/003790.html">question</a> about the actual use case of the BNI.<br />
On <a href="http://mailman.wikimedia.it/pipermail/bibliotecari/2016-September/003831.html">30 August 2016</a> the release of the 2015 and 2016 volumes in UNIMARC and MARCXML formats was also announced.<br />
Intrigued by the catalogue, I started exploring it, thinking about possible transformations (RDF triples) or enrichments with/towards other data (Wikidata).</p>
<p>Here is how to explore the data with <a href="http://basex.org">basex</a> and <a href="https://it.wikipedia.org/wiki/XQuery">xquery</a> and a <a href="https://jupyter.org/">jupyter</a> notebook:</p>
<ul>
<li>download the XML files (I limit the analysis to monographs):</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ curl <span class="nt">-OJLs</span> <span class="s2">"http://bni.bncf.firenze.sbn.it/bniweb/scaricaxml.jsp?mese=01&anno=2015&serie=Monografie"</span>
~ curl <span class="nt">-OJLs</span> <span class="s2">"http://bni.bncf.firenze.sbn.it/bniweb/scaricaxml.jsp?mese=02&anno=2015&serie=Monografie"</span>
~ curl <span class="nt">-OJLs</span> <span class="s2">"http://bni.bncf.firenze.sbn.it/bniweb/scaricaxml.jsp?mese=03&anno=2015&serie=Monografie"</span>
~ curl <span class="nt">-OJLs</span> <span class="s2">"http://bni.bncf.firenze.sbn.it/bniweb/scaricaxml.jsp?mese=01&anno=2016&serie=Monografie"</span>
~ <span class="nb">sha1sum </span>Monografie<span class="k">*</span>.xml
7c226c88daefd7b145ebb0bc01d621ba9f3ea9b3 Monografie201501.xml
204134fef0f5275f466feb9c6a018c794fadd07b Monografie201502.xml
bdbcab246290b9d2e0db3b7279bd32ea20ea6ef3 Monografie201503.xml
c8e56442bc5c8a1e7fb9e31731108ba586993c17 Monografie201601.xml</code></pre></figure>
<ul>
<li>install <a href="http://basex.org">basex</a></li>
</ul>
<p>debian/ubuntu: <code class="language-plaintext highlighter-rouge">~ apt-get install basex</code><br />
macos: <code class="language-plaintext highlighter-rouge">~ brew install basex</code></p>
<ul>
<li>create the database and load the XML files</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ basex <span class="nt">-c</span> <span class="s2">"create database bni"</span>
~ basex <span class="nt">-i</span> bni <span class="nt">-c</span> <span class="s2">"add Monografie201501.xml; add Monografie201502.xml;</span><span class="se">\</span><span class="s2">
add Monografie201503.xml; add Monografie201601.xml"</span></code></pre></figure>
<ul>
<li>start the database server</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ basexserver</code></pre></figure>
<ul>
<li>install jupyter notebook and the python client library for basex</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">~ pip3 <span class="nb">install </span>jupyter
~ pip3 <span class="nb">install </span>basexclient
~ jupyter notebook</code></pre></figure>
<p>In a new notebook you can then start extracting data with XQuery. Here is an example of a simple function that counts the occurrences of a path (XPath):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">BaseXClient</span> <span class="kn">import</span> <span class="n">BaseXClient</span>
<span class="n">session</span> <span class="o">=</span> <span class="n">BaseXClient</span><span class="p">.</span><span class="n">Session</span><span class="p">(</span><span class="s">'127.0.0.1'</span><span class="p">,</span> <span class="mi">1984</span><span class="p">,</span> <span class="s">'admin'</span><span class="p">,</span> <span class="s">'admin'</span><span class="p">)</span>
<span class="n">publisher</span> <span class="o">=</span> <span class="s">'//rec/df[@t="210"]/sf[@c="c"]'</span>
<span class="n">publisher_city</span> <span class="o">=</span> <span class="s">'//rec/df[@t="210"]/sf[@c="a"]'</span>
<span class="n">subject</span> <span class="o">=</span> <span class="s">'//rec/df[@t="606"]/sf[@c="a"]'</span>
<span class="k">def</span> <span class="nf">count</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">limit</span><span class="p">):</span>
<span class="n">q</span> <span class="o">=</span> <span class="s">'''let $db := db:open("bni")
let $result :=
for $publisher in distinct-values($db{0})
let $count := count(index-of($db{0}, $publisher))
order by $count descending
return concat($publisher, ", ", $count)
for $limited at $lim in subsequence($result, 1, {1})
return $limited'''</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">limit</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="nb">iter</span><span class="p">():</span>
<span class="k">print</span><span class="p">(</span><span class="n">item</span><span class="p">)</span></code></pre></figure>
<p>The 10 publishers and the cities with the largest number of publications:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">count</span><span class="p">(</span><span class="n">publisher</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">count</span><span class="p">(</span><span class="n">publisher_city</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>or the 30 most-used subjects (<a href="http://unimarc-it.wikidot.com/606">field 606</a>):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">count</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span></code></pre></figure>
<p>The ready-to-use notebook can be downloaded from <a href="https://github.com/atomotic/bni-xquery/blob/master/bni-xquery.ipynb">atomotic/bni-xquery</a></p>
<h3 id="conclusioni">Conclusions</h3>
<p>What can be done with this data? Certainly enriching Wikidata; for example, there are very few items for <a href="http://tinyurl.com/j9xfkqz">Italian publishers</a>.</p>
<p>Can the BNI be considered a first step towards the complete opening of the OPAC catalogue? Let’s hope so.</p>
<h2 id="epub-linkrot">Epub linkrot (2015-03-03)</h2>
<p><a href="https://en.wikipedia.org/wiki/Link_rot">Linkrot</a> also affects epub files (who would have thought! :)).<br />
Here is how to check the health of external links in epub books (required tools: a shell, <a href="https://savannah.nongnu.org/projects/atool">atool</a>, <a href="https://github.com/EricChiang/pup">pup</a>, <a href="http://www.gnu.org/software/parallel">gnu parallel</a>).</p>
<h3 id="extract-all-external-links">extract all external links</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>acat <span class="nt">-F</span> zip <span class="o">{</span>FILE.epub<span class="o">}</span> <span class="s2">"*.xhtml"</span> <span class="s2">"*.html"</span> <span class="se">\</span>
  | pup <span class="s1">'a attr{href}'</span> <span class="se">\</span>
  | egrep <span class="s2">"^http"</span> | <span class="nb">sort</span> | <span class="nb">uniq</span> <span class="se">\</span>
  <span class="o">></span> links.txt</code></pre></figure>
<h3 id="check-http-status">check http status</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"http_code, url"</span> <span class="o">></span> links-status.csv
<span class="nv">$ </span>parallel <span class="nt">-j</span> 10 <span class="s1">'curl -k -L -s -o /dev/null -w "%{http_code}" {}; echo ", {}\n"'</span> :::: links.txt <span class="o">>></span> links-status.csv</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">links-status.csv</code> is a CSV containing <code class="language-plaintext highlighter-rouge">http_code</code> and the original <code class="language-plaintext highlighter-rouge">url</code>. Installing <a href="https://csvkit.readthedocs.org/en/0.9.0/">csvkit</a> you can perform this simple analysis:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>csvstat <span class="nt">--freq</span> <span class="nt">-c</span> http_code links-status.csv | jq .</code></pre></figure>
<p>And you obtain a summary view by <code class="language-plaintext highlighter-rouge">http_code</code>. The following example is extracted from a book I bought in 2011 (14 links gone):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"403": 1,
"301": 2,
"404": 14,
"200": 95
}
</code></pre></div></div>
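The same per-status tally can be computed without csvkit, e.g. with Python’s `Counter`, reading the `links-status.csv` produced above:

```python
import csv
from collections import Counter

def status_summary(rows):
    """Count occurrences of each HTTP status code."""
    return Counter(r["http_code"].strip() for r in rows)

# with the file from the previous step:
# with open("links-status.csv") as f:
#     print(status_summary(csv.DictReader(f, skipinitialspace=True)))

sample = [{"http_code": "200"}, {"http_code": "404"}, {"http_code": "200"}]
print(status_summary(sample))  # Counter({'200': 2, '404': 1})
```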
<h3 id="what-to-do">what to do?</h3>
<p>So you’ve discovered that your beloved ebooks are full of rotten links.<br />
What you could do:</p>
<ul>
<li><em>Are you a compulsive reader and digital book hoarder?</em> Archive the links from a book yourself after you’ve bought it (removing DRM from your own books is also a safe thing to do)</li>
<li><em>Are you an author or self-publisher (or the equivalent hipster term for it)?</em> Archive the links from the book you are writing to the Internet Archive Wayback Machine (using <a href="http://gijn.org/2015/01/27/introducing-the-research-desk-secrets-of-the-wayback-machine/">Save Page Now</a>), then link the archived version. <a href="http://robustlinks.mementoweb.org/">Robust Links</a> are not an option for now, unless some epub reader client implements them.</li>
<li><em>Are you a big publisher?</em> Consider managing your own web archive and offering it as a service to your authors</li>
</ul>
</ul>raffaeleLinkrot also affects epub files (who would have thought! :)). How to check the health of external links in epub books (required tools: a shell, atool, pup, gnu parallel).SKOS Nuovo Soggettario, APIs and autocomplete2015-02-26T11:00:00-06:002015-02-26T11:00:00-06:00https://literarymachin.es/skos-autocomplete<p>How to build an API for an autocomplete form using the terms of the Nuovo Soggettario, with Redis Sorted Sets and Nginx+Lua.</p>
<div class="image-wrapper">
<img src="/assets/images/nuovosoggettario.jpg" alt="ns skos" />
</div>
<p>The <a href="http://thes.bncf.firenze.sbn.it">Nuovo Soggettario</a>, available in <a href="http://thes.bncf.firenze.sbn.it/dati/NS-SKOS.zip">SKOS</a> format (CC BY), can easily be used to build APIs for normalization, data-entry, and cataloguing services. It is a fairly small dataset, small enough that sophisticated tools such as Solr or ElasticSearch are unnecessary.</p>
<p>The Redis <a href="http://redis.io/topics/data-types#sorted-sets">Sorted Sets</a> data type and the <a href="http://redis.io/commands/ZRANGEBYLEX#details-on-strings-comparison">ZRANGEBYLEX</a> command are a very effective and simple way to build <a href="http://autocomplete.redis.io/">autocomplete</a> systems.</p>
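<p>ZRANGEBYLEX works because every member is inserted with the same score, so the sorted set degenerates into a plain lexicographically ordered list and a prefix query becomes a range scan. The same idea in pure Python with <code class="language-plaintext highlighter-rouge">bisect</code> (illustrative only, no Redis required):</p>

```python
import bisect

# terms as they would be stored in the sorted set (all with score 0)
terms = sorted([
    "archivi capitolari",
    "archivi comunali",
    "biblioteche",
    "archivi correnti",
])

def autocomplete(prefix, limit=5):
    # equivalent of: ZRANGEBYLEX key [prefix "[prefix\xff" LIMIT 0 limit
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_right(terms, prefix + "\xff")
    return terms[lo:hi][:limit]

print(autocomplete("archiv"))
# → ['archivi capitolari', 'archivi comunali', 'archivi correnti']
```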
<h2 id="creazione-dellindice">building the index</h2>
<p><a href="https://github.com/atomotic/nuovosoggettario-skos-redis">nuovosoggettario-skos-redis</a> contains a demo script in Python (required modules: lxml and redis):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/atomotic/nuovosoggettario-skos-redis.git
$ cd nuovosoggettario-skos-redis
$ pip install -r requirements.txt
</code></pre></div></div>
<p>Download the thesaurus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir xml
$ wget http://thes.bncf.firenze.sbn.it/dati/NS-SKOS.zip
$ unzip NS-SKOS.zip -d xml
$ rm NS-SKOS.zip
</code></pre></div></div>
<p>Indexing (use <a href="http://www.gnu.org/s/parallel">gnu parallel</a> or xargs to load the files in parallel):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ redis-server &
$ parallel -j 8 /usr/bin/env python index.py {} ::: xml/*.xml
</code></pre></div></div>
<p>The example indexes the <strong>prefLabel</strong> and all the <strong>altLabel</strong>s
(yes, I know, <a href="https://github.com/atomotic/nuovosoggettario-skos-redis/blob/master/index.py">index.py</a> cheats by parsing the raw XML, but RDF parsing with rdflib is far slower).</p>
<p>An example search:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ redis-cli --raw
127.0.0.1:6379> ZRANGEBYLEX autocomplete [archiv "[archiv\xff" LIMIT 0 5
archivi capitolari:{"label":"Archivi capitolari", "id":"http://purl.org/bncf/tid/17165"}
archivi comunali:{"label":"Archivi comunali", "id":"http://purl.org/bncf/tid/32025"}
archivi correnti:{"label":"Archivi correnti", "id":"http://purl.org/bncf/tid/52282"}
archivi di autorità di nomi e titoli:{"label":"Archivi di autorità di nomi e titoli", "id":"http://purl.org/bncf/tid/2260"}
archivi di autorità:{"label":"Archivi di autorità", "id":"http://purl.org/bncf/tid/2261"}
</code></pre></div></div>
<p>Memory in use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ redis-cli info | grep used_memory_human
used_memory_human:9.57M
</code></pre></div></div>
<h2 id="api-di-ricerca">search API</h2>
<p>The web API can be written in any language. Following the post <a href="http://www.cucumbertown.com/craft/autocomplete-using-redis-nginx-lua/">Redis on steroids: Autocomplete using Redis, Nginx and Lua</a>, I decided to try a <a href="http://wiki.nginx.org/HttpLuaModule">Lua</a> script inside Nginx.</p>
<p>On Debian (testing, sid) it is enough to install <code class="language-plaintext highlighter-rouge">nginx</code> and <code class="language-plaintext highlighter-rouge">nginx-extras</code>; otherwise you have to build <a href="http://openresty.org/">Openresty</a> by hand.</p>
<p><a href="https://github.com/openresty/lua-resty-redis">lua-resty-redis</a> has not yet been updated to support the ZRANGEBYLEX command, so it has to be added:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir /srv/nginx-lua; cd /srv/nginx-lua
$ wget https://raw.githubusercontent.com/openresty/lua-resty-redis/master/lib/resty/redis.lua
$ sed -i 's/\"zscan\"/\"zscan\",\"zrangebylex\"/g' redis.lua
</code></pre></div></div>
<p>The search script, /srv/nginx-lua/<strong>autocomplete.lua</strong>:</p>
<figure class="highlight"><pre><code class="language-lua" data-lang="lua"><span class="kd">local</span> <span class="n">redis</span> <span class="o">=</span> <span class="nb">require</span> <span class="s2">"redis"</span>
<span class="kd">local</span> <span class="n">red</span> <span class="o">=</span> <span class="n">redis</span><span class="p">:</span><span class="n">new</span><span class="p">()</span>
<span class="n">red</span><span class="p">:</span><span class="n">set_timeout</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">header</span><span class="p">.</span><span class="n">content_type</span> <span class="o">=</span> <span class="s1">'text/plain'</span><span class="p">;</span>
<span class="kd">local</span> <span class="n">ok</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">red</span><span class="p">:</span><span class="n">connect</span><span class="p">(</span><span class="s2">"127.0.0.1"</span><span class="p">,</span> <span class="mi">6379</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">ok</span> <span class="k">then</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">ngx</span><span class="p">.</span><span class="n">HTTP_SERVICE_UNAVAILABLE</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="s2">"Redis down"</span><span class="p">)</span>
<span class="k">return</span>
<span class="k">end</span>
<span class="kd">local</span> <span class="n">q</span><span class="o">=</span><span class="n">ngx</span><span class="p">.</span><span class="n">req</span><span class="p">.</span><span class="n">get_uri_args</span><span class="p">().</span><span class="n">q</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">q</span> <span class="k">then</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">ngx</span><span class="p">.</span><span class="n">HTTP_BAD_REQUEST</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="s2">"arguments missing"</span><span class="p">)</span>
<span class="k">return</span>
<span class="k">end</span>
<span class="kd">local</span> <span class="n">res</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">red</span><span class="p">:</span><span class="n">zrangebylex</span><span class="p">(</span><span class="s2">"autocomplete"</span><span class="p">,</span> <span class="s2">"["</span><span class="o">..</span><span class="n">q</span><span class="p">,</span> <span class="s2">"["</span><span class="o">..</span><span class="n">q</span><span class="o">..</span><span class="s2">"\xff"</span><span class="p">,</span><span class="s2">"LIMIT"</span><span class="p">,</span> <span class="s2">"0"</span><span class="p">,</span> <span class="s2">"100"</span><span class="p">)</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="s2">"["</span><span class="p">)</span>
<span class="n">table</span><span class="p">.</span><span class="n">foreach</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="k">function</span><span class="p">(</span><span class="n">k</span><span class="p">,</span><span class="n">v</span><span class="p">)</span> <span class="n">ngx</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="nb">string.match</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="s2">"{.*}"</span><span class="p">)</span> <span class="o">..</span> <span class="s2">","</span> <span class="p">)</span> <span class="k">end</span><span class="p">)</span>
<span class="n">ngx</span><span class="p">.</span><span class="n">say</span><span class="p">(</span><span class="s2">"{\"</span><span class="n">label</span><span class="err">\</span><span class="s2">":\"</span><span class="err">\</span><span class="s2">", \"</span><span class="n">id</span><span class="err">\</span><span class="s2">":\"</span><span class="err">\</span><span class="s2">"}]"</span><span class="p">)</span></code></pre></figure>
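<p>Note the last two lines of the Lua script: each hit is emitted with a trailing comma, so a final empty sentinel object is appended to keep the output valid JSON. In a language with a JSON encoder the same response would simply be built in memory — a Python sketch of the equivalent logic, using the member format shown in the redis-cli example above:</p>

```python
import json

# raw sorted-set members, as shown in the redis-cli example above:
# "lowercased term:{json payload}"
members = [
    'archivi capitolari:{"label":"Archivi capitolari", "id":"http://purl.org/bncf/tid/17165"}',
    'archivi comunali:{"label":"Archivi comunali", "id":"http://purl.org/bncf/tid/32025"}',
]

# keep only the JSON payload after the first "{" (what the Lua
# string.match(v, "{.*}") does), then serialize the whole list at once
hits = [json.loads(m[m.index("{"):]) for m in members]
response = json.dumps(hits)
print(response)
```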
<p>Virtual host configuration in nginx; the script is served at <strong>/ns-bncf/autocomplete</strong>:</p>
<figure class="highlight"><pre><code class="language-nginx" data-lang="nginx"><span class="k">lua_package_path</span> <span class="s">"/srv/nginx-lua/?.lua</span><span class="p">;</span>;<span class="k">"</span><span class="p">;</span>
<span class="k">server</span> <span class="p">{</span>
<span class="kn">listen</span> <span class="mi">80</span> <span class="s">default_server</span><span class="p">;</span>
<span class="kn">root</span> <span class="s">....</span><span class="p">;</span>
<span class="kn">index</span> <span class="s">index.html</span> <span class="s">index.htm</span><span class="p">;</span>
<span class="kn">server_name</span> <span class="s">....</span><span class="p">;</span>
<span class="kn">location</span> <span class="n">/</span> <span class="p">{</span>
<span class="kn">try_files</span> <span class="nv">$uri</span> <span class="nv">$uri</span><span class="n">/</span> <span class="p">=</span><span class="mi">404</span><span class="p">;</span>
<span class="p">}</span>
<span class="kn">location</span> <span class="p">=</span> <span class="n">/ns-bncf/autocomplete</span> <span class="p">{</span>
<span class="kn">content_by_lua_file</span> <span class="n">/srv/nginx-lua/autocomplete.lua</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Testing the API; an array of JSON objects is returned:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl http://atomotic.com/ns-bncf/autocomplete?q=archiv
</code></pre></div></div>
<p>At this point <a href="https://twitter.github.io/typeahead.js">typeahead</a> (or, alternatively, jquery-autocomplete) can be used to build an autocompleting select box.</p>
<p><strong>DEMO</strong>: <a href="http://atomotic.com/ns-bncf">http://atomotic.com/ns-bncf</a></p>
<h2 id="possibili-utilizzi">Possible uses</h2>
<ul>
<li>a WordPress plugin that uses the thesaurus terms as categories</li>
<li>EPrints <a href="http://wiki.eprints.org/w/Autocompletion#external_source">autocompletion</a> from an external source</li>
<li>a reconciliation service for OpenRefine</li>
<li>….</li>
</ul>raffaeleHow to build an API for an autocomplete form using the terms of the Nuovo Soggettario, with Redis Sorted Sets and Nginx+Lua.Serve deepzoom images from a zip archive with openseadragon2014-11-23T04:00:00-06:002014-11-23T04:00:00-06:00https://literarymachin.es/deepzoom-server<p><a href="http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS">vips</a> is a fast image processing system. Versions <a href="http://lists.andrew.cmu.edu/pipermail/openslide-users/2014-June/000832.html">higher than 7.40</a> can generate static tiles of big images in <a href="http://en.wikipedia.org/wiki/Deep_Zoom">deepzoom</a> format, saving them directly into a zip archive.</p>
<p>As simple as: <code class="language-plaintext highlighter-rouge">$ vips dzsave big.jpg image.zip</code><br />
(note: if you compile vips yourself, make sure you have <a href="https://github.com/jcupitt/libvips/issues/173">libgsf1-dev</a> installed)</p>
<p>A zip archive is more convenient than having thousands of small image files scattered across a filesystem, so I was looking for
a simple way to serve them with openseadragon directly from the zip, without extracting it.<br />
In my desire to learn <a href="http://golang.org">Go</a> better, I’ve built <strong><a href="http://github.com/atomotic/deepzoom-osd-server">deepzoom-osd-server</a></strong>, a small web application that embeds openseadragon.</p>
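<p>The core trick — reading individual tiles straight out of the zip without unpacking it — is cheap because a zip archive has a central directory that allows random access to its members. A Python sketch of the idea (the actual server is written in Go; the tile path below is illustrative):</p>

```python
import io
import zipfile

# build a tiny stand-in archive in memory
# (a real one comes from `vips dzsave`)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("image_files/0/0_0.jpeg", b"fake-tile-bytes")

# "serve" a single tile: open the archive and read just that member;
# only the central directory and the one entry are touched
with zipfile.ZipFile(buf) as zf:
    tile = zf.read("image_files/0/0_0.jpeg")

print(len(tile))
```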
<h2 id="compile">Compile:</h2>
<p>Install the <code class="language-plaintext highlighter-rouge">gom</code> package manager:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ go get github.com/mattn/gom
</code></pre></div></div>
<p>Clone and compile:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/atomotic/deepzoom-osd-server.git
$ cd deepzoom-osd-server
$ make
</code></pre></div></div>
<p>The <a href="https://github.com/atomotic/deepzoom-osd-server/blob/master/Makefile">Makefile</a> will download the latest <a href="https://github.com/openseadragon/openseadragon/releases/download/v1.1.1/openseadragon-bin-1.1.1.tar.gz">release</a> of openseadragon, bundle all dependencies into <code class="language-plaintext highlighter-rouge">_vendor</code>, and build the binary with all static assets embedded (with <a href="https://github.com/tebeka/nrsc">nrsc</a>).</p>
<h2 id="run">Run</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./deepzoom-osd-server
-- missing dzi directory, creating.
-- running on http://127.0.0.1:8080
</code></pre></div></div>
<p>Now you can generate some deepzoom images and put them into the <code class="language-plaintext highlighter-rouge">dzi</code> directory just created.</p>
<p>The server will expose:</p>
<ul>
<li>
<p>a <code class="language-plaintext highlighter-rouge">/dzi</code> endpoint to explore the content of the zip archive</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> http://localhost:8080/dzi/{ZIPPED-DZI}.zip/{ZIPPED-DZI}.dzi
</code></pre></div> </div>
</li>
<li>
<p>a <code class="language-plaintext highlighter-rouge">/view</code> endpoint with openseadragon</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> http://localhost:8080/view/{ZIPPED-DZI}
</code></pre></div> </div>
</li>
</ul>
<div class="image-wrapper">
<img src="/assets/images/deepzoom-osd-server-screenshot.png" alt="deepzoom screenshot" />
</div>
<p><strong>This is an experiment</strong>; I’m not sure it is a sane idea to use for real things. I should stress-test the application (with <a href="http://httpd.apache.org/docs/2.2/programs/ab.html">ab</a> or <a href="https://github.com/tsenart/vegeta">vegeta</a>) and monitor its memory consumption.</p>
<p>Anyhow, if you are curating a digital library today and need to publish big, high-quality images, you should absolutely look at the <a href="http://iiif.io">IIIF</a> specs and use a more
suitable server such as <a href="http://iipimage.sourceforge.net/">IIPImage</a> or <a href="https://github.com/pulibrary/loris">Loris</a>.</p>raffaelevips is a fast image processing system. Versions higher than 7.40 can generate static tiles of big images in deepzoom format, saving them directly into a zip archive.a wayback machine (pywb) on a cheap, shared host2014-10-23T06:00:00-05:002014-10-23T06:00:00-05:00https://literarymachin.es/pywb<p>For a long time the only free implementation of web archive replay software (I’m unaware of commercial ones) has been the <a href="http://archive-access.sourceforge.net/projects/wayback/">Wayback Machine</a> (now <a href="http://netpreserve.org/openwayback">OpenWayback</a>). It’s stable, mature software with a strong community behind it.<br />
To use it you need to be comfortable deploying a Java web application; not that difficult, and the <a href="https://github.com/iipc/openwayback/wiki">documentation</a> is exhaustive.<br />
But there is a new player in the game, <a href="https://github.com/ikreymer/pywb"><strong>pywb</strong></a>, developed by <a href="https://webrecorder.io">Ilya Kreymer</a>, a former Internet Archive developer.<br />
Built in Python, considerably simpler than Wayback, and now used in a production archiving project at <a href="http://bits.blogs.nytimes.com/2014/10/19/a-new-tool-to-preserve-moments-on-the-internet/">Rhizome</a>.</p>
<p><strong>Pywb</strong>’s simplicity and clean design make it very easy to deploy, even on shared hosts.
Nowadays it seems that no one uses shared hosting anymore: virtual servers are often cheaper, and with dozens of orchestration and provisioning tools it’s easier than ever to bootstrap a full machine.<br />
Despite this, I still prefer a shared host when the application stack allows it: fewer things to worry about.</p>
<p>So I tried to install <strong>pywb</strong> on <em>dreamhost</em>, a well-known cheap provider that supports deploying ruby/rack and python/wsgi applications via mod_passenger. <strong>In a few minutes I can have my own wayback machine</strong>.<br />
The following steps are specific to <em>dreamhost</em>, but you should be able to replicate this installation on any shared host that can deploy Python apps (fastcgi, uwsgi, passenger).</p>
<h3 id="steps">Steps:</h3>
<ul>
<li>
<p>add a new domain in your dreamhost panel, and set the document root to <code class="language-plaintext highlighter-rouge">/home/{USER}/wayback/public</code></p>
</li>
<li>
<p>init virtualenv</p>
</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">cd</span> ~
<span class="nv">$ </span>virtualenv wayback</code></pre></figure>
<ul>
<li>create the <code class="language-plaintext highlighter-rouge">public</code> and <code class="language-plaintext highlighter-rouge">tmp</code> directories for passenger, <code class="language-plaintext highlighter-rouge">warcs</code> to store WARC files, and <code class="language-plaintext highlighter-rouge">cdx</code> for indexes</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> wayback/<span class="o">{</span>public,tmp,warcs,cdx<span class="o">}</span></code></pre></figure>
<ul>
<li>install pywb via pip (inside the virtualenv)</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">source </span>wayback/bin/activate
<span class="nv">$ </span>pip <span class="nb">install </span>pywb</code></pre></figure>
<ul>
<li>edit pywb config file <code class="language-plaintext highlighter-rouge">~/wayback/config.yaml</code> (full <a href="https://github.com/ikreymer/pywb/blob/master/config.yaml">documentation</a>)</li>
</ul>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">collections</span><span class="pi">:</span>
<span class="err"> </span><span class="na">test</span><span class="pi">:</span> <span class="s">./cdx/</span>
<span class="na">archive_paths</span><span class="pi">:</span> <span class="s">./warcs/</span>
<span class="na">enable_http_proxy</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">static_routes</span><span class="pi">:</span>
<span class="err"> </span><span class="s">static/default</span><span class="pi">:</span> <span class="s">pywb/static/</span>
<span class="na">enable_cdx_api</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">enable_memento</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">framed_replay</span><span class="pi">:</span> <span class="no">true</span></code></pre></figure>
<ul>
<li>edit passenger startup file <code class="language-plaintext highlighter-rouge">~/wayback/passenger_wsgi.py</code></li>
</ul>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">sys</span><span class="p">,</span> <span class="n">os</span>
<span class="n">INTERP</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'HOME'</span><span class="p">],</span> <span class="s">'wayback'</span><span class="p">,</span> <span class="s">'bin'</span><span class="p">,</span> <span class="s">'python'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">sys</span><span class="p">.</span><span class="n">executable</span> <span class="o">!=</span> <span class="n">INTERP</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">execl</span><span class="p">(</span><span class="n">INTERP</span><span class="p">,</span> <span class="n">INTERP</span><span class="p">,</span> <span class="o">*</span><span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span>
<span class="n">sys</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getcwd</span><span class="p">())</span>
<span class="kn">from</span> <span class="nn">pywb.apps.wayback</span> <span class="kn">import</span> <span class="n">application</span></code></pre></figure>
<ul>
<li>put some WARC files in <code class="language-plaintext highlighter-rouge">~/wayback/warcs</code> and generate a sorted CDX</li>
</ul>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>cdx-indexer <span class="nt">--sort</span> ~/wayback/cdx ~/wayback/warcs </code></pre></figure>
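<p>A CDX index is just sorted, space-delimited plain text, one line per capture. A rough Python sketch of looking up the captures for a URL key (the field layout below is a common one — check your indexer’s actual output before relying on positions):</p>

```python
# hypothetical CDX lines; a real file is produced by `cdx-indexer --sort`
cdx = """\
com,twitter)/atomotic 20141022163030 http://twitter.com/atomotic text/html 200
com,twitter)/atomotic 20141022163031 https://twitter.com/atomotic text/html 200
org,example)/ 20141001000000 http://example.org/ text/html 200
"""

def captures(urlkey):
    # because the file is sorted by urlkey, real tools binary-search it;
    # a linear scan is enough to show the idea
    for line in cdx.splitlines():
        fields = line.split()
        if fields[0] == urlkey:
            yield fields[1]  # the 14-digit capture timestamp

print(list(captures("com,twitter)/atomotic")))
```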
<ul>
<li>run! <a href="http://wayback.literarymachin.es/">http://wayback.literarymachin.es/</a></li>
</ul>
<p>Try searching the <code class="language-plaintext highlighter-rouge">/test</code> collection for the url <em>http://twitter.com/atomotic</em>; you’ll get these <a href="http://wayback.literarymachin.es/test/*/twitter.com/atomotic">results</a>.</p>
<p>And <a href="http://www.mementoweb.org/">Memento</a> is also available:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl "http://wayback.literarymachin.es/test/timemap/*/twitter.com/atomotic"
<http://wayback.literarymachin.es/test/timemap/*/http://twitter.com/atomotic>; rel="self"; type="application/link-format"; from="Wed, 22 Oct 2014 16:30:30 GMT",
<http://twitter.com/atomotic>; rel="original",
<http://wayback.literarymachin.es/test/http://twitter.com/atomotic>; rel="timegate",
<http://wayback.literarymachin.es/test/20141022163030/http://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:30 GMT",
<http://wayback.literarymachin.es/test/20141022163031/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:31 GMT",
<http://wayback.literarymachin.es/test/20141022163042/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:30:42 GMT",
<http://wayback.literarymachin.es/test/20141022163355/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:33:55 GMT",
<http://wayback.literarymachin.es/test/20141022163710/https://twitter.com/atomotic>; rel="memento"; datetime="Wed, 22 Oct 2014 16:37:10 GMT"%
</code></pre></div></div>
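<p>The TimeMap is returned in <code class="language-plaintext highlighter-rouge">application/link-format</code>: each entry is a <code class="language-plaintext highlighter-rouge">&lt;uri&gt;</code> followed by <code class="language-plaintext highlighter-rouge">;</code>-separated parameters. A quick-and-dirty Python sketch of extracting memento URIs and datetimes from such a response (regex-based, not a full parser):</p>

```python
import re

# two entries in the shape shown by the curl output above
timemap = (
    '<http://twitter.com/atomotic>; rel="original",\n'
    '<http://wayback.literarymachin.es/test/20141022163030/http://twitter.com/atomotic>; '
    'rel="memento"; datetime="Wed, 22 Oct 2014 16:30:30 GMT",\n'
)

# capture (uri, datetime) for memento entries only
mementos = re.findall(r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"', timemap)
print(mementos)
```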
<p>Why do I think this will be useful? <a href="http://archiveteam.org/">Archiveteam</a> does a great job of running a <a href="http://tracker.archiveteam.org/">distributed</a> crawling organization, but publishing is still centralized at the Internet Archive. What if we began to publish thousands of small web archives, aggregating<a href="http://blog.dshr.org/2013/03/re-thinking-memento-aggregation.html">[1]</a><a href="http://inkdroid.org/journal/2012/05/03/way-way-back/">[2]</a><a href="http://inkdroid.org/journal/2013/09/30/preserving-linked-data/">[3]</a> them with the Memento protocol?</p>raffaeleFor a long time the only free implementation of web archive replay software (I’m unaware of commercial ones) has been the Wayback Machine (now OpenWayback). It’s stable, mature software with a strong community behind it. To use it you need to be comfortable deploying a Java web application; not that difficult, and the documentation is exhaustive. But there is a new player in the game, pywb, developed by Ilya Kreymer, a former Internet Archive developer. Built in Python, considerably simpler than Wayback, and now used in a production archiving project at Rhizome.Open data from the Anagrafe Biblioteche2014-09-22T11:00:00-05:002014-09-22T11:00:00-05:00https://literarymachin.es/opendata-anagrafe-biblioteche<p>How to use the open data from the <a href="http://anagrafe.iccu.sbn.it/">Anagrafe delle Biblioteche Italiane</a> and plot the library addresses on a web map.</p>
<p>A CSV file with the <a href="http://opendata.anagrafe.iccu.sbn.it/territorio.zip">registry and territorial data</a> can be downloaded from the site’s <a href="http://anagrafe.iccu.sbn.it/opencms/opencms/open_data/">open data</a> page. Other datasets are available that can be used to enrich the description of a given library; in this example I limit myself to the general data and the geographic coordinates.
The CSV content can easily be imported into a relational database, from which the data of interest can then be extracted.</p>
<p><a href="https://github.com/dinedal/textql">textql</a> is an extremely useful tool that lets you run SQL queries directly against a CSV (behind the scenes textql simply loads the data into a temporary sqlite database).</p>
<p>With this single shell command I can extract all rows of the CSV where the field <code class="language-plaintext highlighter-rouge">comune</code>=<code class="language-plaintext highlighter-rouge">Bologna</code> and save the resulting output to a new CSV file.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="nv">$ </span>textql <span class="nt">-source</span> territorio.csv <span class="nt">-header</span> <span class="nt">-dlm</span><span class="o">=</span><span class="s2">";"</span> <span class="nt">-lazy-quotes</span> <span class="nt">-output-header</span> <span class="nt">-sql</span><span class="o">=</span><span class="s1">'SELECT __codice_isil_ AS ISIL, denominazione, indirizzo, telefono, email, url, REPLACE(latitudine,",",".") AS latitudine, REPLACE(longitudine,",",".") AS longitudine FROM tbl WHERE comune="Bologna" AND (latitudine !="" AND latitudine !="0") and (longitudine !="" AND latitudine !="0")'</span> <span class="o">></span> bologna.csv</code></pre></figure>
<p>The SQL query (shown more readably below) also discards rows where the coordinates are empty (or report 0,0) and replaces the decimal separator of the coordinates, <code class="language-plaintext highlighter-rouge">,</code>, with <code class="language-plaintext highlighter-rouge">.</code></p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SELECT</span>
<span class="n">__codice_isil_</span> <span class="k">AS</span> <span class="n">ISIL</span><span class="p">,</span> <span class="n">denominazione</span><span class="p">,</span> <span class="n">indirizzo</span><span class="p">,</span> <span class="n">telefono</span><span class="p">,</span> <span class="n">email</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span>
<span class="k">REPLACE</span><span class="p">(</span><span class="n">latitudine</span><span class="p">,</span><span class="nv">","</span><span class="p">,</span><span class="nv">"."</span><span class="p">)</span> <span class="k">AS</span> <span class="n">latitudine</span><span class="p">,</span>
<span class="k">REPLACE</span><span class="p">(</span><span class="n">longitudine</span><span class="p">,</span><span class="nv">","</span><span class="p">,</span><span class="nv">"."</span><span class="p">)</span> <span class="k">AS</span> <span class="n">longitudine</span>
<span class="k">FROM</span>
<span class="n">tbl</span>
<span class="k">WHERE</span>
<span class="n">comune</span><span class="o">=</span><span class="nv">"Bologna"</span>
<span class="k">AND</span>
<span class="p">(</span><span class="n">latitudine</span> <span class="o">!=</span><span class="nv">""</span> <span class="k">AND</span> <span class="n">latitudine</span> <span class="o">!=</span><span class="nv">"0"</span><span class="p">)</span>
<span class="k">AND</span>
<span class="p">(</span><span class="n">longitudine</span> <span class="o">!=</span><span class="nv">""</span> <span class="k">AND</span> <span class="n">latitudine</span> <span class="o">!=</span><span class="nv">"0"</span><span class="p">)</span>
<span class="p">;</span></code></pre></figure>
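<p>The same filter-and-normalize step can also be done without textql — a standard-library Python sketch (the sample rows are invented; column names follow the query above):</p>

```python
import csv
from io import StringIO

# invented sample with the fields used by the query above
sample = '''"__codice_isil_";"denominazione";"comune";"latitudine";"longitudine"
"IT-BO0001";"Biblioteca A";"Bologna";"44,4949";"11,3426"
"IT-MI0001";"Biblioteca B";"Milano";"45,4642";"9,1900"
"IT-BO0002";"Biblioteca C";"Bologna";"";""
'''

reader = csv.DictReader(StringIO(sample), delimiter=";")
rows = [
    # normalize the decimal separator, like the REPLACE() in the SQL
    {**r,
     "latitudine": r["latitudine"].replace(",", "."),
     "longitudine": r["longitudine"].replace(",", ".")}
    for r in reader
    if r["comune"] == "Bologna"
    and r["latitudine"] not in ("", "0")
    and r["longitudine"] not in ("", "0")
]
print(rows)
```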
<p>We convert the resulting CSV to <a href="https://en.wikipedia.org/wiki/GeoJSON">GeoJSON</a> using the <a href="http://csvkit.readthedocs.org/en/latest/scripts/csvjson.html">csvjson</a> tool from the <a href="http://csvkit.readthedocs.org/en/latest/index.html">csvkit</a> suite:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> <span class="nv">$ </span>csvjson <span class="nt">-d</span><span class="s2">";"</span> <span class="nt">--lat</span> latitudine <span class="nt">--lon</span> longitudine bologna.csv | jq <span class="nb">.</span> <span class="o">></span> bologna.geojson</code></pre></figure>
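<p>If you would rather not depend on csvkit for this step, building the GeoJSON by hand is straightforward: each row becomes a Point feature. A Python sketch (note that GeoJSON coordinate order is [longitude, latitude]):</p>

```python
import json

# rows as produced by the previous filtering step (invented sample)
rows = [{"ISIL": "IT-BO0001", "denominazione": "Biblioteca A",
         "latitudine": "44.4949", "longitudine": "11.3426"}]

features = [{
    "type": "Feature",
    "geometry": {
        "type": "Point",
        # GeoJSON wants [lon, lat], not [lat, lon]
        "coordinates": [float(r["longitudine"]), float(r["latitudine"])],
    },
    # everything except the coordinates goes into properties
    "properties": {k: v for k, v in r.items()
                   if k not in ("latitudine", "longitudine")},
} for r in rows]

geojson = {"type": "FeatureCollection", "features": features}
print(json.dumps(geojson))
```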
<p>At this point the GeoJSON file can be used with any map-rendering library (for example <a href="http://leafletjs.com/examples/geojson.html">Leaflet</a>), or, for an instant visualization, it can be uploaded to a <a href="http://defunkt.io/gist/">gist</a>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>gist bologna.geojson
https://gist.github.com/9d4ed56efcf4f9fc2c61</code></pre></figure>
<p>The gist is immediately viewable and renders a navigable map (on a mapbox layer) with the points of our libraries (in the example, those of the municipality of Bologna):
<a href="https://gist.github.com/9d4ed56efcf4f9fc2c61">https://gist.github.com/9d4ed56efcf4f9fc2c61</a></p>
<script src="https://gist.github.com/atomotic/9d4ed56efcf4f9fc2c61.js"></script>raffaeleHow to use the open data from the Anagrafe delle Biblioteche Italiane and plot the library addresses on a web map.JSON API of the SBN OPAC2014-09-05T13:00:00-05:002014-09-05T13:00:00-05:00https://literarymachin.es/sbn-json-api<p>A few months ago ICCU released a <a href="https://play.google.com/store/apps/details?id=it.inera.opacmobile">mobile app</a> to search the <a href="http://opac.sbn.it">OPAC SBN</a>.
Although not graphically appealing, the app works well, and I find very useful the ability to search for a book by scanning its barcode with the phone’s camera, and to bookmark favourites.<br />
Curious about how it works, I decided to analyze its HTTP traffic.</p>
<div class="image-wrapper">
<img src="/assets/images/sbn-mobile-screenshot-2.png" alt="screenshot" />
</div>
<p>With <a href="http://mitmproxy.org/">mitmproxy</a> running on my laptop, I configured the Android device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Settings > Wi-Fi > Modify Network > Show advanced options > Proxy: Manual
</code></pre></div></div>
<p>setting <code class="language-plaintext highlighter-rouge">Proxy hostname</code> and <code class="language-plaintext highlighter-rouge">Proxy port</code> to the laptop’s address on my network and port <code class="language-plaintext highlighter-rouge">:8080</code>.</p>
<p>Running some searches in the app, I could inspect its HTTP traffic and saw JSON data flowing to the endpoint <code class="language-plaintext highlighter-rouge">http://opac.sbn.it/opacmobilegw</code>.</p>
<div class="image-wrapper">
<img src="/assets/images/mitmproxy.png" alt="mitmproxy" />
</div>
<p>Here are some of the main APIs:</p>
<h2 id="ricerca-libera">Free search</h2>
<p>URL: <code class="language-plaintext highlighter-rouge">http://opac.sbn.it/opacmobilegw/search.json?any={STRING}&type=0&start=0&rows=3</code></p>
<p>Searches for a <code class="language-plaintext highlighter-rouge">{STRING}</code> across the whole catalog (paginating results with the <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">rows</code> parameters), returning a series of records in the following format (along with other useful information, such as facets and subjects).</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"autorePrincipale"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Comici, Emilio"</span><span class="p">,</span><span class="w">
</span><span class="nl">"citazioni"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"codiceIdentificativo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT</span><span class="se">\\</span><span class="s2">ICCU</span><span class="se">\\</span><span class="s2">RAV</span><span class="se">\\</span><span class="s2">2002745"</span><span class="p">,</span><span class="w">
</span><span class="nl">"livello"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Monografia"</span><span class="p">,</span><span class="w">
</span><span class="nl">"localizzazioni"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"luogoNormalizzato"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"nomi"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"note"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"numeri"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"progressivoId"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
</span><span class="nl">"pubblicazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Milano : Corriere Della Sera, 2014"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tipo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Testo a stampa"</span><span class="p">,</span><span class="w">
</span><span class="nl">"titolo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Alpinismo eroico / Emilio Comici"</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
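A client paging through these results might build its request URLs like this (a minimal sketch; <code>search_url</code> is a hypothetical helper, only the query parameters come from the observed traffic, and the actual HTTP call is left out):

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://opac.sbn.it/opacmobilegw/search.json"

def search_url(query, start=0, rows=3):
    # parameters as seen in the app's traffic: any, type, start, rows
    return SEARCH_ENDPOINT + "?" + urlencode({"any": query, "type": 0, "start": start, "rows": rows})

# page through results by bumping `start` in steps of `rows`
print(search_url("alpinismo eroico", start=0, rows=3))
```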
<h2 id="ricerca-per-isbn">Search by ISBN</h2>
<p>URL: <code class="language-plaintext highlighter-rouge">http://opac.sbn.it/opacmobilegw/search.json?isbn={ISBN}</code></p>
<p>Example: <a href="http://opac.sbn.it/opacmobilegw/search.json?isbn=9788842092995">/search.json?isbn=9788842092995</a></p>
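The same idea applied to the ISBN endpoint (again a hypothetical helper; the example request above sends the bare digits, so stripping hyphens and spaces first is a reasonable precaution):

```python
from urllib.parse import urlencode

def isbn_search_url(isbn):
    # the observed request carries the ISBN as a bare digit string
    bare = isbn.replace("-", "").replace(" ", "")
    return "http://opac.sbn.it/opacmobilegw/search.json?" + urlencode({"isbn": bare})

print(isbn_search_url("978-88-420-9299-5"))
```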
<h2 id="metadati-di-un-singolo-record-bid">Metadata for a single record (BID)</h2>
<p>URL: <code class="language-plaintext highlighter-rouge">http://opac.sbn.it/opacmobilegw/full.json?bid={BID}</code></p>
<p>Calling this with a <strong>BID</strong> (from the previous response <code class="language-plaintext highlighter-rouge">"codiceIdentificativo": "IT\\ICCU\\RAV\\2002745"</code>; note that the backslashes are single in the actual value) returns a JSON record with the book’s metadata, including its localizzazioni (the libraries that hold it), complete with geographic coordinates and therefore ready to be plotted on a map.</p>
<p>Example: <a href="http://opac.sbn.it/opacmobilegw/full.json?bid=IT\ICCU\RAV\2002745">/full.json?bid=IT\ICCU\RAV\2002745</a></p>
<p>The response:</p>
<figure class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nl">"autorePrincipale"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Comici, Emilio"</span><span class="p">,</span><span class="w">
</span><span class="nl">"citazioni"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"standard"</span><span class="p">:</span><span class="w"> </span><span class="s2">"mla"</span><span class="p">,</span><span class="w">
</span><span class="nl">"valore"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Comici, Emilio. Alpinismo eroico Milano Corriere Della Sera, 2014"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"standard"</span><span class="p">:</span><span class="w"> </span><span class="s2">"apa"</span><span class="p">,</span><span class="w">
</span><span class="nl">"valore"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Comici, E. (2014). Alpinismo eroico Milano Corriere Della Sera."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"codiceIdentificativo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT</span><span class="se">\\</span><span class="s2">ICCU</span><span class="se">\\</span><span class="s2">RAV</span><span class="se">\\</span><span class="s2">2002745"</span><span class="p">,</span><span class="w">
</span><span class="nl">"collezione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca della montagna ; 8"</span><span class="p">,</span><span class="w">
</span><span class="nl">"descrizioneFisica"</span><span class="p">:</span><span class="w"> </span><span class="s2">"170 p. ; 19 cm"</span><span class="p">,</span><span class="w">
</span><span class="nl">"linguaPubblicazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ITALIANO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"livello"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Monografia"</span><span class="p">,</span><span class="w">
</span><span class="nl">"localizzazioni"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Canale d'Agordo"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BIBLIOTECA COMUNALE DI CANALE D'AGORDO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-BL0089"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">46.3606418</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">11.9148422</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BL"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VIACQ"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Associazione Italiana Cultura Sport"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca del Centro Informazione Documentazione"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-BO0630"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">44.4769143</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">11.4094361</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"CID-AICS"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"UBOXA"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Firenze"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca delle Oblate"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-FI0104"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">43.772209</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">11.2600206</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"FI"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"RT1AA"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Latina"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca comunale Aldo Manuzio"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-LT0048"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">41.4675967</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">12.9037</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"LT"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"RMSA2"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Milano"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca nazionale Braidense"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-MI0185"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">45.471946</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">9.187845</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MI"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MILNB"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Rimini"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca civica Gambalunga"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-RN0013"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">44.0616558</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">12.5678351</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"RN"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"RAVRI"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Pecetto Torinese"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca civica"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-TO0152"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">45.0170177</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">7.7491581</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"TO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"TO13T"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Trieste"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Biblioteca comunale Stelio Mattioni"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-TS0268"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">45.6164974</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">13.8230741</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"TS"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"TSAU2"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"comune"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Iesolo"</span><span class="p">,</span><span class="w">
</span><span class="nl">"denominazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BIBLIOTECA CIVICA DI JESOLO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"isil"</span><span class="p">:</span><span class="w"> </span><span class="s2">"IT-VE0124"</span><span class="p">,</span><span class="w">
</span><span class="nl">"latitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">45.5367875</span><span class="p">,</span><span class="w">
</span><span class="nl">"longitudine"</span><span class="p">:</span><span class="w"> </span><span class="mf">12.6391389</span><span class="p">,</span><span class="w">
</span><span class="nl">"provincia"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VE"</span><span class="p">,</span><span class="w">
</span><span class="nl">"sbn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VIAVJ"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"luogoNormalizzato"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"nomi"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Comici, Emilio"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"note"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"Edizione speciale per Corriere della Sera."</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"numeri"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"paesePubblicazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ITALIA"</span><span class="p">,</span><span class="w">
</span><span class="nl">"pubblicazione"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Milano : Corriere Della Sera, 2014"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tipo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Testo a stampa"</span><span class="p">,</span><span class="w">
</span><span class="nl">"titolo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Alpinismo eroico / Emilio Comici"</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
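Since each localizzazione already carries coordinates, turning a record’s holdings into a map-ready FeatureCollection takes only a few lines (a minimal sketch over the response shape shown above; <code>holdings_to_geojson</code> is a hypothetical helper and the sample record is trimmed to one library):

```python
import json

def holdings_to_geojson(record):
    """Map the 'localizzazioni' of a full.json record to a GeoJSON FeatureCollection."""
    features = []
    for loc in record.get("localizzazioni", []):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # longitude first, latitude second, per the GeoJSON spec
                "coordinates": [loc["longitudine"], loc["latitudine"]],
            },
            "properties": {
                "denominazione": loc.get("denominazione"),
                "comune": loc.get("comune"),
                "isil": loc.get("isil"),
            },
        })
    return {"type": "FeatureCollection", "features": features}

record = {"localizzazioni": [{"comune": "Milano",
                              "denominazione": "Biblioteca nazionale Braidense",
                              "isil": "IT-MI0185",
                              "latitudine": 45.471946,
                              "longitudine": 9.187845}]}
print(json.dumps(holdings_to_geojson(record), indent=2))
```

The output can be fed straight to Leaflet or uploaded to a gist, as in the previous post.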
<h1 id="disclaimer">Disclaimer</h1>
<p>These APIs are not publicly documented, so they may change. Nor are their terms of use known, so I cannot be sure they may be freely used to build external applications.
If you want to experiment with them anyway, avoid aggressive scraping, and email <a href="mailto:opac.contatti@iccu.sbn.it">ICCU</a> to express your wish to see this kind of service made public, documented, and released under suitable open licenses.</p>raffaele