digital libraries,
web preservation,
books,
archives.

Build a static search for an Internet Archive Collection with Pagefind

2026-03-07 tags: pagefind internet archive search static digital library minimal computing

Pagefind caught my attention about a year ago, and since then I've adopted it in several hobby projects (nothing work-related): some blogs built with static generators like Hugo or Zola, some old HTML content distributed on CD-ROM, and some mailing list archives where I converted mbox files to HTML and then indexed them.

The tool is great, better for my needs than other JavaScript search libraries (though it's not really fair to compare them, since they're quite different). Pagefind is a search tool that runs entirely in the browser with zero server-side dependencies. It indexes your content into a compact binary index, using WASM to run search in the browser.

It can't completely replace server-side search technologies like Solr or Elasticsearch, mainly because the index can't be updated incrementally. But for many small to medium digital libraries or collections that are rarely updated once completed, it's an extremely good tool: very fast, easy to integrate into web pages, and requires almost no maintenance.

Until now I was convinced that the only way to build an index was by reading content from existing HTML files. That changed when I listened to this Python in Digital Humanities podcast, where David Flood mentioned:

Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files.

I'd completely missed that Pagefind has a Python API (and a Node one too), which makes it easy to build an index from any data source.


Here's a basic example: building a search index for an Internet Archive collection.

I'm using the Pagefind pre-release here, which introduces a new UI with web components.

Init

uv init .
uv add internetarchive
uv add --prerelease=allow 'pagefind[bin]'

Directory to save the index and serve the UI

mkdir ./web

Python code: create an index from metadata of this collection (that is actually a collection of subcollections in Internet Archive, Italian content, related to radical movements)

import asyncio
import logging
import os

import internetarchive
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "DEBUG"))
log = logging.getLogger(__name__)


async def main():
    config = IndexConfig(output_path="./web/pagefind")

    async with PagefindIndex(config=config) as index:
        log.info("Searching collection:radical-archives ...")
        results = internetarchive.search_items(
            "collection:radical-archives",
            fields=["identifier", "title", "description"],
        )

        count = 0
        for item in results:
            identifier = item.get("identifier", "")
            title = item.get("title", identifier)
            description = item.get("description", "")
            url = f"https://archive.org/details/{identifier}"
            thumbnail = f"https://archive.org/services/img/{identifier}"

            if isinstance(description, list):
                description = " ".join(description)

            await index.add_custom_record(
                url=url,
                content=description or title,
                language="en",
                meta={
                    "title": title,
                    "description": description,
                    "image": thumbnail,
                },
            )
            count += 1
            log.debug("indexed %s: %s", identifier, title)

        log.info("Indexed %d items. Writing index ...", count)

    log.info("Done. Index written to ./web/pagefind")


if __name__ == "__main__":
    asyncio.run(main())

HTML UI in ./web/index.html

<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="UTF-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>pagefind-ia</title>
		<link href="/pagefind/pagefind-component-ui.css" rel="stylesheet">
		<script src="/pagefind/pagefind-component-ui.js" type="module"></script>
	</head>
	<body>
		<pagefind-modal-trigger></pagefind-modal-trigger>
		<pagefind-modal>
			<pagefind-modal-header>
				<pagefind-input></pagefind-input>
			</pagefind-modal-header>
			<pagefind-modal-body>
				<pagefind-summary></pagefind-summary>
				<pagefind-results show-images></pagefind-results>
			</pagefind-modal-body>
			<pagefind-modal-footer>
				<pagefind-keyboard-hints></pagefind-keyboard-hints>
			</pagefind-modal-footer>
		</pagefind-modal>
	</body>
</html>

Result: easy to embed it anywhere!