pandoc and website preservation
Back when I used Firefox/Chromium, I used their print function to save full web pages to PDF files. For example, Paul V. Bolotoff wrote articles on the history of DEC Alpha CPUs, but his website is long gone, and the only copy of the articles I could find is in the archive section of someone's personal site. Then I found out I could use pandoc to accomplish this task:
pandoc http://oldsite.com/oldarticle.html -o oldarticle.epub
I find the EPUB family of formats better suits my needs. PDF is heavy, text in a PDF is fixed, and PDF files can't easily be converted into other formats.
This works fine for simple sites, but for more recent sites it produces a document with overlapping images all over the place. A workaround is to convert the file (with pandoc) to RTF format and then convert it back to EPUB.
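Roughly like this, reusing the file name from above (the round trip needs a pandoc recent enough to have an RTF reader; the output name is just a placeholder):
pandoc oldarticle.epub -o oldarticle.rtf
pandoc oldarticle.rtf -o oldarticle-fixed.epub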
Is there a more elegant solution? Different flags for pandoc, or another software package?
2025-11-28 · 5 months ago · 👍 Half_Elf_Monk, curry
10 Comments ↓
EPUB is a ZIP archive containing HTML files, so it can use the same things that are possible with HTML. I don't know what pandoc does, but I would expect it should be possible to copy the HTML files, pictures, CSS, and fonts; if the page uses JavaScript, or external references to pictures/CSS/fonts, then it might be more difficult; you might need to execute the scripts to modify the HTML and then use the modified HTML as the output. I don't know what program handles all of that automatically.
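You can see this for yourself by listing the archive with any ZIP tool, e.g. (using the file name from the example above):
unzip -l oldarticle.epub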
😎 decant [OP] · Nov 28 at 07:03:
zzo38, pandoc doesn't do JavaScript at all, so it is good for old or simple sites only.
wget and just view with a browser?
There is probably some way to wedge it into an EPUB. Although modern web sites contain so much crap that maybe copying the text by hand and pasting it elsewhere is better for small projects.
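If you go the wget route, something along these lines should pull a single page together with its images and CSS for offline viewing (untested flag combination, adjust as needed; the URL is the one from the original post):
wget --page-requisites --convert-links --adjust-extension --span-hosts http://oldsite.com/oldarticle.html
Then pandoc can be pointed at the downloaded HTML file instead of the live URL.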
🌲 Half_Elf_Monk · Nov 28 at 20:08:
I really like this idea, for a lot of reasons: anti-censorship, data archiving/hoarding, local backups, etc. I tried it with a Wikipedia page and got a blank EPUB with a "please set a user agent" message. Overall, this is a cool idea that seems worth exploring.
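Maybe pandoc's --request-header option would get around that? Just a guess I haven't tried yet, using the DEC Alpha article mentioned above as an example:
pandoc --request-header="User-Agent: Mozilla/5.0" https://en.wikipedia.org/wiki/DEC_Alpha -o alpha.epub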
🌲 Half_Elf_Monk · Nov 28 at 20:10:
Worth exploring ESPECIALLY if websites could be pandoc-ed out of JavaScript range. Imagine being able to browse a parallel web, even of recipe sites, where the 80 MB of bloat has been screened down to the 1 MB of content it really was, now hyperlinked to the converted versions of other pages. Scrapers + pandoc + smolweb might restore the old ways.
Pandoc has a lot of options, including specifying CSS to be included in the output. In principle, you could specify the CSS file(s) from the old site via a link. You'd have to look at the source for the page to find out where the CSS is located. Alternatively, you could open the EPUB in an editor like Sigil or even Emacs and add the original or your own CSS as appropriate. Disclaimer: I haven't tried this myself.
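For example, something like this might work (also untested; the stylesheet path here is made up, you'd copy the real one from the page source and fetch it locally first):
wget http://oldsite.com/css/style.css
pandoc http://oldsite.com/oldarticle.html --css=style.css -o oldarticle.epub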
Another option is to have an AI assistant look at the complicated HTML document and create a text-only, or even gemtext, version that matches the original as closely as possible...
😎 decant [OP] · Nov 29 at 01:50:
@Half_Elf_Monk For Wikipedia, I use two tricks: 1. a ZIM archive and the Python library zimply (I don't know if zimply works with a minimal text browser). 2. There is always gempedia; I just save the .gmi file for good articles. But as @stack said, there are many irredeemable JavaScript-heavy sites where I will just download with wget or just ctrl-c ctrl-v the text part.
🌲 Half_Elf_Monk · Dec 12 at 16:16:
@stack - That would work, but I simply don't trust the LLMs to get details right. @decant - interesting. Realistically, shouldn't there be a way to simply download a snapshot of Wikipedia as a whole? As mostly text, it shouldn't be that big. Why are there no local Wikipedia browsers?
@Half_Elf_Monk -- there is a stripped-down Wikipedia distribution that people often put on local devices, featured on Gizmodo and such... Can't remember where to get it, and I'm not at a decent computer now, but I am sure you can easily find it.