ahaspel's comments

Thanks — really appreciate that, and glad it worked well for a random article.

That’s a great suggestion. A side-by-side text + page view would be very nice for exactly the reasons you mention (verifying the text and seeing the original layout). I haven’t built that yet, but I’ve considered it.

Also helpful to hear that the links to the scans weren’t immediately obvious — I should probably make them a bit clearer. This may also not be obvious, but you can click the vol:page links in the left margin and go directly to the scan of whatever page you're reading.

Thanks again.


I know exactly what you mean — I had the same experience with CD-ROM encyclopedias. There’s something about just browsing and falling into articles that’s hard to replicate.

Part of the motivation here was to bring that kind of exploration back, but with the original 1911 text and structure.


Do you happen to use a language model to translate or format your comments?

Just me. I spent a lot of time thinking about this, so I like talking about it.

Try Jenghiz Khan. That's how they used to spell it then. Or just plain Khan and scroll the results.


Interesting! Thanks

That’s exactly the use case I had in mind. The 11th is full of gems like that, but they’ve never been easy to point people to.

That’s high praise. Those are both great projects and this one is definitely in the same spirit.

Good catch, thanks. That’s a font coverage issue. I’ll either swap in a fallback font for missing glyphs or normalize those cases. It only sounds trivial; this project is full of items like that.
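
For readers curious what the "normalize those cases" option could look like, here is a minimal Python sketch. It is not the site's actual code. It assumes a hypothetical `covered` set listing the characters the main font can render, and it uses Unicode NFKD decomposition as one possible normalization; anything that still cannot be rendered is left untouched for a fallback font to handle.

```python
import unicodedata


def normalize_for_font(text: str, covered: set[str]) -> str:
    """Rewrite characters the font lacks, when a decomposition is renderable.

    `covered` is a hypothetical set of characters the main font supports;
    a real pipeline would read this from the font's character map.
    """
    out = []
    for ch in text:
        if ch in covered:
            out.append(ch)
            continue
        # NFKD splits ligatures and precomposed forms into simpler parts.
        decomposed = unicodedata.normalize("NFKD", ch)
        if all(part in covered for part in decomposed):
            out.append(decomposed)
        else:
            # No renderable decomposition: keep the character as-is so a
            # fallback font can pick it up.
            out.append(ch)
    return "".join(out)
```

For example, with a font that only covers lowercase ASCII letters, the "ﬁ" ligature would become the two letters "fi", while "æ" (which has no decomposition) would pass through unchanged.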

I rebuilt the 1911 Encyclopædia Britannica into a clean, structured, navigable site:

https://britannica11.org/

What it does:

– ~37k articles reconstructed from the original volumes
– section-level structure (contents are clickable within articles)
– cross-references extracted and linked
– contributors indexed and searchable
– original volume + page references preserved and shown while reading
– links to the original scans for each page
– ancillary material included (prefaces, abbreviations, etc.)
– topic index reproduced and cross-linked
– full-text search with article metadata (length, volume, etc.)

Most of the work was in parsing and reconstruction: headings, multi-page articles, tables, math, languages, footnotes, plates, and all the small edge cases that come up in a work like this.

The goal was to make something that feels like the original, but is actually usable.

I’d especially appreciate feedback on:

– search quality
– navigation (sections, cross-references)
– anything that looks structurally off

Happy to answer questions about the pipeline or data model.


You might want to add The Reader's Guide to the Encyclopaedia Britannica, PD text available at https://www.gutenberg.org/ebooks/74039 and scans at https://archive.org/details/readersguidetoen00londuoft - It would fit naturally with the Ancillary material that includes the topic-based index.

It would indeed. I will see about working this in; it's highly pertinent.

The Reader's Guide has been added to the ancillary material. Thanks for the excellent suggestion.

Thanks for adding this! Do you plan to add back-links in the article pages (and perhaps in contributors pages) pointing to the chapters in the Reader's Guide that mention them, similar to what's done for the subject-based index?

Not a bad idea. I'll see what I can work out on that score. But I imagine the far more common path is from the Guide to the encyclopedia than the reverse.

Thanks so much for sharing this. It looks fantastic. A couple of questions, if you don't mind: what license are you releasing this under, if any? Is there any way to download it? The reason someone might want to download it is for use as training data.

Wikisource has the original scans available in the public domain, and their enriched text under CC-BY-SA: https://en.wikisource.org/wiki/EB1911

Thanks!

The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.

If you have a specific use case in mind (especially for training), I’d be interested to hear more.


Regarding the specific use case, I was thinking this: I had Gemma 4 (a small but highly capable offline model released by Google) make a public domain cc0 encyclopedia of some core science and technology concepts[1]. I thought it was pretty good.

Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopedia Britannica.

I would use the entries as training data and train it to talk in the same style. There isn't a specific use case for why, I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of 1911 Britannica.

[1] https://stateofutopia.com/encyclopedia/

[2] To talk like a pirate! https://www.youtube.com/live/WuCxWJhrkIM


That’s a fun idea — I can see the appeal of that style.

The underlying text is public domain, but the structured version here is something I put together for the site. I haven’t released a bulk dataset yet.

If you end up experimenting with it, I’d love to hear how it turns out — and I’m still figuring out what structured access might look like.


I've wanted to do something like this for The Encyclopédie, a text hugely relevant to the Enlightenment. If you ever get around to adding a rough "How I (generally) Made This" section, that'd be appreciated! Site looks great :)

Thanks for the kind words. I've had a few requests for a technical appendix (i.e., "how I built this") and it is in the works.

> Is there any way to download it? The reason someone might want to download it is for use as training data.

Another reason would be to be able to keep running/using it even if the main site were to go down eventually for whatever reason; or to operate a mirror of it, for redundancy (linking back to the original, of course).


There’s an escaping issue in tables of contents. See, e.g., “Roosevelt's” in the “United States” article. https://britannica11.org/article/27-0635-united-states-the/u...

This is now fixed, along with several more serious rendering errors in "United States". Thanks a lot for pointing it out.
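
The thread does not say what the escaping bug actually was. One common cause of stray entities around apostrophes is escaping a title twice, so purely as an illustration, here is a small Python sketch of an idempotent escape. The function name `escape_once` is hypothetical and is not taken from the site's pipeline.

```python
import html


def escape_once(title: str) -> str:
    """Return an HTML-escaped title without double-escaping.

    Any entities already present are decoded first, then the plain text
    is escaped exactly once, so calling this repeatedly is harmless.
    """
    plain = html.unescape(title)
    return html.escape(plain, quote=True)
```

The trade-off in this sketch is that a heading which is meant to display a literal entity string would be decoded too, so it only suits titles that are ordinary text.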

Really nice. Well done.

As a feature request, would it be possible for your pipeline to also create an EPUB? Then people could easily access and search through the document even if your site went down. EPUB uses compression by default, so the file size might not even be too bad for the full encyclopedia.


Very nice. I actually spent a bit of time browsing a few topics, which is something I rarely do these days!

A few things... when I click an article and try to jump to a new topic, the top search box (labeled "Search titles and full text...") doesn't work. Second, when I first came to the site, I was a bit stuck. It took a bit of time to realize I need to click on "Articles" or even "Topics" to start browsing. Not sure why, maybe I expected the image to let me enter the site somehow...?


A legal terms question here as well: several major world economies operate under very different rules regarding datasets and publication rights. I am in the USA (California). Will there be terms for me, given that I am not a giant deep-pockets FAANG, just a book person? Commercial use terms at "small business" scale?

The 1911 text itself is public domain, so anyone is free to use it.

What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.

For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.


> What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.

If you live in the U.S. I recommend you read No Sweat of the Brow Copyright: https://www.gutenberg.org/help/no_sweat_copyright.html


It's been on Project Gutenberg for over 20 years: https://www.gutenberg.org/ebooks/13600

They only release books that are in the public domain.


> They only release books that are in the public domain.

Not necessarily. Project Gutenberg does provide some works still under US copyright, such as F. P. Walter’s 1999 translation of Twenty Thousand Leagues Under the Seas: https://gutenberg.org/ebooks/2488



I guess such an old edition is in the public domain

Nice job. How about wikipedia-style links to other articles for topics mentioned within another article?
