Thanks — really appreciate that, and glad it worked well for a random article.
That’s a great suggestion. A side-by-side text + page view would be very nice for exactly the reasons you mention (verifying the text and seeing the original layout). I haven’t built that yet, but I’ve considered it.
Also helpful to hear that the links to the scans weren’t immediately obvious — I should probably make them a bit clearer. This may also not be obvious, but you can click the vol:page links in the left margin and go directly to the scan of whatever page you're reading.
I know exactly what you mean — I had the same experience with CD-ROM encyclopedias. There’s something about just browsing and falling into articles that’s hard to replicate.
Part of the motivation here was to bring that kind of exploration back, but with the original 1911 text and structure.
Good catch, thanks. That’s a font coverage issue. I’ll either swap in a fallback font for missing glyphs or normalize those cases. It only sounds trivial; this project is full of items like that.
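For the curious, the "normalize those cases" option could look roughly like the sketch below. It is only an illustration, not the site's actual code: the `COVERED` set stands in for whatever the real font supports, and the function names are made up for this example. It finds characters the font lacks and replaces them with their Unicode compatibility decomposition when that decomposition is fully covered. Whether a decomposed form renders well still depends on the font.

```python
import unicodedata

def uncovered_chars(text, covered):
    """Return the distinct non-space characters in `text` that are not in `covered`."""
    return sorted({ch for ch in text if ch not in covered and not ch.isspace()})

def normalize_for_font(text, covered):
    """Replace uncovered characters with their NFKD decomposition when every
    character of that decomposition is covered; otherwise keep the original."""
    out = []
    for ch in text:
        if ch in covered or ch.isspace():
            out.append(ch)
            continue
        candidate = unicodedata.normalize("NFKD", ch)
        if all(c in covered for c in candidate):
            out.append(candidate)
        else:
            out.append(ch)  # still uncovered: leave it for a fallback font
    return "".join(out)

# Hypothetical coverage (an assumption for this example): printable ASCII only.
COVERED = set(map(chr, range(0x20, 0x7F)))

sample = "\ufb01ne print, \u03a9 resistance"  # "fi" ligature and Greek capital omega
print(normalize_for_font(sample, COVERED))
print(uncovered_chars(sample, COVERED))
```

The ligature has a covered decomposition and gets replaced, while the omega has none and is reported so a fallback font can handle it.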
– ~37k articles reconstructed from the original volumes
– section-level structure (contents are clickable within articles)
– cross-references extracted and linked
– contributors indexed and searchable
– original volume + page references preserved and shown while reading
– links to the original scans for each page
– ancillary material included (prefaces, abbreviations, etc.)
– topic index reproduced and cross-linked
– full-text search with article metadata (length, volume, etc.)
Most of the work was in parsing and reconstruction: headings, multi-page articles, tables, math, languages, footnotes, plates, and all the small edge cases that come up in a work like this.
The goal was to make something that feels like the original, but is actually usable.
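To give a flavor of the cross-reference step, here is a deliberately simplified sketch. It assumes references appear in the plain text as "(see Title)" and that article URLs look like `/article/<slug>`; both are assumptions made for this example, and the real text has many more forms than this handles.

```python
import re

# Assumed pattern: "(see Title)" with a capitalized title and no nested parentheses.
SEE_REF = re.compile(r"\(see ([A-Z][^()]*?)\)")

def slugify(title):
    """Lowercase the title and join runs of letters/digits with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def link_cross_references(text, known_titles):
    """Wrap '(see Title)' references in links when Title is a known article."""
    index = {t.lower(): t for t in known_titles}

    def replace(match):
        target = index.get(match.group(1).strip().lower())
        if target is None:
            return match.group(0)  # unknown target: leave the text untouched
        # "/article/..." is a hypothetical URL scheme for illustration.
        return f'(see <a href="/article/{slugify(target)}">{match.group(1)}</a>)'

    return SEE_REF.sub(replace, text)

print(link_cross_references("The palace (see Crete) was large.", ["Crete"]))
```

Only references that resolve to a known title are linked; anything else is left exactly as it was, which keeps a bad match from producing a dead link.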
I’d especially appreciate feedback on:
– search quality
– navigation (sections, cross-references)
– anything that looks structurally off
Happy to answer questions about the pipeline or data model.
Thanks for adding this! Do you plan to add back-links in the article pages (and perhaps in contributors pages) pointing to the chapters in the Reader's Guide that mention them, similar to what's done for the subject-based index?
Not a bad idea. I'll see what I can work out on that score. But I imagine the far more common path is from the Guide to the encyclopedia than the reverse.
Thanks so much for sharing this. It looks fantastic. A couple of questions, if you don't mind: what license are you releasing this under, if any? Is there any way to download it? The reason someone might want to download it is for use as training data.
The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.
If you have a specific use case in mind (especially for training), I’d be interested to hear more.
Regarding the specific use case, here is what I was thinking: I had Gemma 4 (a small but highly capable offline model released by Google) make a public domain CC0 encyclopedia of some core science and technology concepts[1]. I thought it was pretty good.
Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopedia Britannica.
I would use the entries as training data and train it to write in the same style. There isn't a specific use case behind this; I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of the 1911 Britannica.
I've wanted to do something like this for The Encyclopédie, a hugely relevant text to the Enlightenment. If you ever get around to adding a rough "How I (generally) Made This" section, that'd be appreciated! Site looks great :)
> Is there any way to download it? The reason someone might want to download it is for use as training data.
Another reason would be to be able to keep running/using it even if the main site were to go down eventually for whatever reason; or to operate a mirror of it, for redundancy (linking back to the original, of course).
As a feature request, would it be possible for your pipeline to also create an EPUB? Then people could easily access and search through the document even if your site goes down. EPUB uses compression by default, so the file size might not even be too bad for the full encyclopedia.
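To make the request a bit more concrete: an EPUB is a ZIP archive with a fixed layout, so a pipeline that already has per-article text could emit one with the standard library alone. The sketch below is a minimal, assumption-laden example (the `write_epub` function, the `OEBPS/` layout, and the file names are my own choices, and I have not run the output through EPUBCheck). The key container rule is that the `mimetype` entry comes first and is stored uncompressed; everything else can be deflated.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def xhtml_page(title, body_html):
    """Wrap an HTML fragment in a minimal XHTML document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>{body_html}</body>\n</html>\n"
    )

def write_epub(target, book_title, articles):
    """Write a small EPUB 3 file. `target` is a path or binary file object;
    `articles` is a non-empty list of (title, plain_text) pairs."""
    names = [f"a{i}.xhtml" for i in range(len(articles))]
    manifest = "\n".join(
        f'    <item id="a{i}" href="{n}" media-type="application/xhtml+xml"/>'
        for i, n in enumerate(names)
    )
    spine = "\n".join(f'    <itemref idref="a{i}"/>' for i in range(len(names)))
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{escape(book_title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
{manifest}
  </manifest>
  <spine>
{spine}
  </spine>
</package>
"""
    toc = "".join(
        f'<li><a href="{n}">{escape(t)}</a></li>'
        for n, (t, _) in zip(names, articles)
    )
    nav = xhtml_page("Contents", f'<nav epub:type="toc"><h1>Contents</h1><ol>{toc}</ol></nav>')
    deflated = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(target, "w") as zf:
        # The mimetype entry must be first and stored without compression.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML, compress_type=deflated)
        zf.writestr("OEBPS/content.opf", opf, compress_type=deflated)
        zf.writestr("OEBPS/nav.xhtml", nav, compress_type=deflated)
        for n, (t, text) in zip(names, articles):
            # Blank lines separate paragraphs in the plain-text input.
            paras = "".join(f"<p>{escape(p.strip())}</p>" for p in text.split("\n\n") if p.strip())
            page = xhtml_page(t, f"<h1>{escape(t)}</h1>{paras}")
            zf.writestr(f"OEBPS/{n}", page, compress_type=deflated)
```

Calling `write_epub("sample.epub", "Sample", [("Abacus", "First paragraph.\n\nSecond paragraph.")])` should produce a one-article file. A full edition would also need tables, math, plates, and footnotes handled, which is presumably where the real effort would go.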
Very nice. I actually spent a bit of time browsing a few topics, which is something I rarely do these days!
A few things... when I click an article and try to jump to a new topic, the top search box (labeled "Search titles and full text...") doesn't work. Second, when I first came to the site, I was a bit stuck. It took a bit of time to realize I needed to click on "Articles" or even "Topics" to start browsing. Not sure why, maybe I expected the image to let me enter the site somehow...?
A legal terms question here too: several major world economies operate under very different rules regarding datasets and publication rights. I am in the USA (California). Will there be terms for me, given that I am not a giant deep-pockets FAANG, just a book person? Commercial use terms for "small business" scale?
The 1911 text itself is public domain, so anyone is free to use it.
What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.
> What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
> They only release books that are in the public domain.
Not necessarily. Project Gutenberg does provide some works still under US copyright, such as F. P. Walter’s 1999 translation of Twenty Thousand Leagues Under the Seas: https://gutenberg.org/ebooks/2488
Thanks again.