Standalone hypertext documents

Posted on 2024-09-27 by Jens Pitkänen

Let’s go back to 1991 and assume the world wide web is a collection of documents, a vast world-wide library, where documents can link to each other, and form a grand web of information. A vast library of hypertext, as the name of the protocol would suggest. What a nice vision.

Now, a bit closer to our current reality, note that the web is not just text. CSS and JS were slapped onto HTML to make it more widely useful, and to allow more self-expression and creativity. Both can be included in HTML, but are often stored in separate resources that get loaded after the initial HTML, or pushed alongside the initial HTML if the server is modern and cool and supports the newest protocols.

But note: the moment a page requires other resources to look how it’s supposed to look, it’s no longer a singular document! This is evident if you try to right-click and “Save page as…” on pretty much any web page — it will download the HTML to your computer, and try its best to also download every resource it refers to, maybe into a directory next to the HTML file. Now, if you wanted to share this page like a document, with someone over a different protocol, or archive it for yourself, you might zip it up with the resources to keep them together. What a mess. My own article archive directory does not look nice, largely due to this.

Consider the otherwise very annoying format, PDF. Download a PDF file, just the one, open it, and it probably looks as it was intended! Of course, they’re otherwise very annoying, defining the layout so strictly that it’s incredibly hard to read most PDFs on any screen smaller than 13 inches diagonally. But focusing on how the PDF has everything needed to display the contents as intended, I think it does a good job! I think many pages on the world wide web — at least pages that still pretend to be “documents” — should have this feature as well.

Standalone/inlined/resources-embedded HTML

HTML does allow for self-contained documents! There’s <style> tags, and there’s <script> tags that don’t include the src attribute. It’s just that, for probably good caching-related reasons, we tend to link to resources rather than including them straight in the HTML. Well, a few bytes of CSS or JS probably won’t hurt your internet bill too much, given all the JS being hauled on Modern Websites™ in any case. Not to mention videos or livestreams. I think any non-moving media will be just fine, bandwidth-wise. And you can cache the resulting HTML: we’re talking about documents here, not dynamic websites that show different content on every load.
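A minimal sketch of what that inlining could look like, in Python. The function name inline_css_and_js is made up for this example, and it assumes the tags are written exactly as <link rel="stylesheet" href="..."> and <script src="..."></script> with relative paths; a real page would be better served by a proper HTML parser.

```python
import re
from pathlib import Path

def inline_css_and_js(html_text, root):
    """Swap linked stylesheets and scripts for inline <style>/<script> elements.

    Hypothetical helper: only handles the exact tag shapes matched below,
    with hrefs/srcs that are relative to `root`.
    """
    root = Path(root)

    def inline_link(match):
        css = (root / match.group(1)).read_text(encoding="utf-8")
        return f"<style>{css}</style>"

    def inline_script(match):
        js = (root / match.group(1)).read_text(encoding="utf-8")
        # A literal "</script" inside the code would close the element early.
        js = js.replace("</script", "<\\/script")
        return f"<script>{js}</script>"

    html_text = re.sub(r'<link rel="stylesheet" href="([^"]+)"\s*/?>', inline_link, html_text)
    html_text = re.sub(r'<script src="([^"]+)"></script>', inline_script, html_text)
    return html_text
```

Calling inline_css_and_js(page_html, "public/") would return the same markup with the file contents pasted in place of the links.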

Alright, we’ve included our CSS and JS resources in the HTML, what about fonts and images? Data URLs! Yes, really. I don’t really like linking to Google Fonts anyway, so anyone visiting my pages would be pulling in the font files in any case — not much of a difference whether they come as their own files, or with the main HTML file. Same applies for the images, but even more, since those wouldn’t be cached even in the Google Fonts case. Of course, it’s unfortunate that you’ll have to increase the size of the resources to 133% of the original due to base64 encoding, but maybe we can save space in other ways…
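As a rough illustration of the data URL idea and the base64 overhead, here is a small Python sketch. The helper name to_data_url and the stand-in payload are assumptions made for the example, not part of any particular tool.

```python
import base64

def to_data_url(resource_bytes, mime_type):
    """Build a data: URL that carries the resource itself, base64-encoded."""
    encoded = base64.b64encode(resource_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# 3072 arbitrary bytes standing in for a font or image file
payload = bytes(range(256)) * 12
url = to_data_url(payload, "font/woff2")

# base64 turns every 3 bytes into 4 characters
encoded_size = len(base64.b64encode(payload))
print(encoded_size / len(payload))
```

The resulting string can go anywhere a URL would, such as an src attribute or a url(...) in CSS.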

Fonts

There are ways to pick out just the glyphs used on a particular page! That way, even though you’ll be serving the fonts on each page, it’ll be a small part of the entire font. And you never need to serve the entire font files! At least on my pages, the biggest files I serve, by a good margin, are fonts. Fonts are big for an understandable reason: they need to encode drawing instructions for a large number of different-looking glyphs, and that’s a lot of information! However, any one particular page probably doesn’t contain that many different characters, at least in my experience writing English (which admittedly might be skewing the numbers a bit).

Additionally, thanks to the web being a wild west of odd HTML, browsers have to accept all kinds of unorthodox patterns: you can actually include stylesheets at the bottom of your <body>! This is actually the main reason I ended up attempting this whole operation: the main content of my pages, the text, still loads on very slow connections without much issue, because the fonts are at the end of the document. This way, you get the important part very fast, can start reading it, and leave the fonts loading for however long it takes.
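The placement could be sketched like this in Python. The helper name append_font_face is hypothetical, and the sketch assumes the font bytes are already WOFF2 and that the page has at most one closing </body> tag that matters.

```python
import base64

def append_font_face(html_text, font_bytes, family):
    """Put an @font-face rule with an inlined font right before </body>.

    Hypothetical helper: assumes `font_bytes` is a WOFF2 file.
    """
    encoded = base64.b64encode(font_bytes).decode("ascii")
    style = (
        "<style>@font-face{"
        f'font-family:"{family}";'
        f'src:url("data:font/woff2;base64,{encoded}") format("woff2");'
        "}</style>"
    )
    # Split at the last </body> so the text content comes first in the byte stream.
    head, sep, tail = html_text.rpartition("</body>")
    if not sep:
        # No closing tag found: append at the very end instead.
        return html_text + style
    return head + style + sep + tail
```

Since the stylesheet is the last thing in the file, everything above it has already arrived by the time the font bytes start coming in.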

fonttools

The easiest way I found to make subsets of fonts is the fonttools Python library. It’s also available as a command line tool (kind of evident from the API), but I had already started writing a Python script for the inlining logic anyway, for parsing the HTML and going through the relevant files. Here’s some code to showcase the basic TTF font subsetting API, and how I use it:

import html
import re
import tempfile
import unicodedata

import fontTools.subset

def compress_font(resource_bytes, charset):
    # fontTools.subset.main works on file paths, so go through temp files
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(resource_bytes)
        tmp.flush()  # make sure the bytes are on disk before fontTools opens the path
        with tempfile.NamedTemporaryFile(suffix=".woff2") as tmp_subsetted:
            fontTools.subset.main([
                tmp.name,
                f'--unicodes={charset}',
                f'--output-file={tmp_subsetted.name}',
                # The output format follows this flag, not the file suffix (needs the brotli package)
                '--flavor=woff2',
                # Fail loudly if the font lacks a requested character
                '--no-ignore-missing-unicodes',
                '--with-zopfli',
            ])
            return tmp_subsetted.read()

charset = set()
char_source = some_html + maybe_some_css_if_you_have_content_text_in_there
# Collect html entities (named and numeric) as the chars they represent;
# a few named entities expand to more than one code point
entity_chars = ''.join(html.unescape(m[0]) for m in re.finditer(r'&#?[A-Za-z0-9]+;', char_source))
# Add all non-control chars to the charset
for c in char_source + entity_chars:
    if unicodedata.category(c).startswith("C"):
        continue # control characters
    charset.add("U+%04x" % ord(c))
charset = ','.join(sorted(charset))

my_ttf_font_bytes = ...
subsetted_font_bytes = compress_font(my_ttf_font_bytes, charset)

Note: I’ve spliced this together from my longer resource-inlining script, so this is more like pseudocode, to use as a reference. No license. (Or maybe any license?) Use however you like.

Ideally, there’d be an option in the popular static site generators that you could turn on to have the generator do this for you — I don’t know, maybe there is! But hey, if you also have your own generator, there you go.

Images

Images are a bit more problematic than fonts, mostly due to how they’re included in HTML. For images, the source (= in this case, the entire image file) has to be in the <img> tag. This means that on a very slow connection, you can only read up to the first image, then wait for it to download in its entirety, and only then get the rest of the text — until the next image, where it happens again. Ideally you’d place the image where you want it, and provide the source later, at the end of the file. But alas, I didn’t come up with a way to do that with HTML.

I had a limit of 10000 bytes for image inlining for a while, but ended up removing that limit. Even with all the images included, the biggest page on this blog (my entire undergraduate thesis, about 20 pages of text and diagrams in PDF form) is only a bit over one megabyte. That’s not too bad! On a slow connection, it might take a while to load, but there’s certainly enough text that you’d have something to read and ponder while waiting for the images to arrive. On a really slow connection, it might be annoying, but you could always just save the page and let it download while doing other things. I assume that’s how you all dealt with files on the internet in 1995. In any case, using the DSL simulator in Firefox, which I think is an appropriate “worst thing to expect in a situation where the user isn’t browsing on their phone in the woods”, the thesis loads fast enough that the images are already loaded when you scroll to them, even if you’re only skimming.
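For reference, image inlining with an optional size limit might look roughly like the Python below. The function name inline_images and the max_bytes parameter are invented for this sketch, and it assumes double-quoted src attributes pointing at local files relative to a root directory.

```python
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html_text, root, max_bytes=None):
    """Replace local <img src="..."> paths with data URLs.

    Hypothetical helper: images larger than `max_bytes` (if given) stay linked.
    """
    root = Path(root)

    def replace_src(match):
        src = match.group(2)
        # Leave already-inlined and remote images alone.
        if src.startswith("data:") or "://" in src:
            return match.group(0)
        data = (root / src).read_bytes()
        if max_bytes is not None and len(data) > max_bytes:
            return match.group(0)
        mime_type = mimetypes.guess_type(src)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return f'{match.group(1)}data:{mime_type};base64,{encoded}"'

    # Group 1: everything up to and including src=", group 2: the path itself.
    return re.sub(r'(<img\b[^>]*?\ssrc=")([^"]*)"', replace_src, html_text)
```

Passing max_bytes=10000 would mimic the old limit, and leaving it as None inlines everything.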

Results

Well, my HTML files sure used to be smaller. However, instead of serving a megabyte of fonts to every visitor, and half a megabyte of images, I’m just serving a single HTML file that is about one megabyte in the worst case. My blog’s front page is 234kB. Most posts are similarly sized, since most of the bytes go to images, videos, and fonts, and many of my posts have no images or videos. The Quake post is almost 600kB due to the video, and as previously stated, the worst case is my undergrad thesis, at 1.14MB.

My sites’ PageSpeed Insights score improved! Not that it matters too much, but they do provide fun little numbers to try and optimize. Since my HTTP server is not configured to use HTTP/2 and push the relevant resources, previously browsers had to get the HTML, see that there’s CSS to download, then download that, then find out that there’s fonts to download, and finally start downloading all the resources required. Very bad for latency, especially given that the server is a Raspberry Pi 2B, and even if it’s trying its best, every request does have a bit of unavoidable lag. Now, browsers can simply request /index.html and they get a stream of bytes that will eventually result in the entire page loading. I like this on just the theory level as well, in that all the required downloads (in this case, just the 1 file) can be started as soon as the browser has done the TLS handshake and sent the initial GET. Fewer round trips! And no server-side special configuration needed.
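To put the round trip argument into numbers, here is a deliberately simplified model in Python. The 100 ms round trip time is a made-up figure for illustration, and the model ignores connection setup, transfer time, and parallel requests entirely.

```python
def sequential_fetch_latency(chain_depth, round_trip_ms):
    """Time spent waiting on requests that can only start after the previous one finished."""
    return chain_depth * round_trip_ms

ROUND_TRIP_MS = 100  # hypothetical value, not a measurement

# Before: HTML, then the CSS it links to, then the fonts the CSS links to
before = sequential_fetch_latency(3, ROUND_TRIP_MS)
# After: a single HTML file with everything inlined
after = sequential_fetch_latency(1, ROUND_TRIP_MS)

print(before, after)
```

Whatever the real round trip time is, the dependency chain multiplies it, which is why flattening the chain to one request helps.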

But most importantly, the pages are standalone documents now, downloadable as HTML files, viewable as is! No extra files! You don’t even need system fonts, since all the fonts required to display the page are included. Maybe I’ll need to post some 4K video some day, but even in that case, I think it’ll be better to include a low-res version or just a thumbnail inline, and link to the high-res version.

Update, 2024-11-15: As I’m writing this addition, my internet is down! A novel experience in the well-connected city of Helsinki, Finland. Thanks to RSS, while I’m off the internet, I can still read articles fetched before — great! However, many RSS feeds I follow are somewhat hard to read, as they use images, and those images link to resources on the internet. Moral of the story: inline (even low resolution versions of) your images so that your RSS feed is readable without fetching additional resources. Though alternatively, I suppose RSS readers could prefetch all the resources needed.