Comment by π² Half_Elf_Monk
Re: "pandoc and website preservation"
I really like this idea, for a lot of reasons: anti-censorship, data archiving/hoarding, local backups, etc. I tried it with a Wikipedia page and got a blank epub with a "please set a user agent" flag. Overall, this is a cool idea that seems worth exploring.
2025-11-28 · 5 months ago
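An aside on that "please set a user agent" flag: pandoc can send custom HTTP headers when it fetches a URL directly, via its `--request-header` option, which is the usual workaround. A minimal sketch — the URL and output filename are placeholders, and the fetch itself is commented out since it needs pandoc and network access:

```shell
# Wikipedia (and many other sites) reject requests with no User-Agent,
# which is why pandoc hands back a near-empty document by default.
URL="https://en.wikipedia.org/wiki/Pandoc"

# The actual fetch; uncomment to run (requires pandoc and a network):
# pandoc "$URL" \
#   --request-header="User-Agent: Mozilla/5.0 (X11; Linux x86_64)" \
#   -o article.epub
echo "would fetch: $URL"
```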
Half_Elf_Monk · Nov 28 at 20:10:
Worth exploring ESPECIALLY if websites could be pandoc-ed out of javascript range. Imagine being able to browse a parallel web, even of recipe sites, wherein the 80mb of bloat had been screened into the 1mb of content that it really was, now intra-hyperlinked to link to the other versions of it. Scrapers + pandoc + smolweb might restore the old ways.
k8quinn · Nov 28 at 20:50:
Pandoc has a lot of options, including specifying CSS to be included in the output. In principle, you could specify the CSS file(s) from the old site via a link. You'd have to look at the source for the page to find out where the CSS is located. Alternatively, you could open the epub in an editor like Sigil or even Emacs and add the original (or your own) CSS as appropriate. Disclaimer: I haven't tried this myself.
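The option being described is pandoc's `--css`; for EPUB output the referenced stylesheet gets bundled into the book. A sketch under assumed filenames (`page.html` saved from the site, `site.css` pulled from it — both placeholders), with the conversion commented out since it requires pandoc:

```shell
# Bundle a stylesheet from the original site into the generated EPUB.
INPUT="page.html"
CSS="site.css"

# Uncomment to run (requires pandoc):
# pandoc "$INPUT" --css="$CSS" -o styled.epub
echo "would convert $INPUT using stylesheet $CSS"
```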
stack · Nov 28 at 21:19:
Another option is to have an AI assistant look at the complicated HTML document and create a text-only, or even gemtext version that matches as much as possible...
decant [OP] · Nov 29 at 01:50:
@Half_Elf_Monk For Wikipedia, I use two tricks: 1. a ZIM archive plus the Python library zimply — I don't know if zimply works with a minimal text browser. 2. There is always gempedia; I just save the .gmi file for good articles. But as @stack said, there are many irredeemable JavaScript-heavy sites where I will just download with wget or just ctrl-c ctrl-v the text part.
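On the wget route: for grabs where the links and assets should stay usable offline, GNU wget's mirroring flags do most of the work. A hedged sketch — the URL is a placeholder, and the download is commented out since it needs network access:

```shell
# Mirror a site for offline reading:
#   --mirror            recursive download with timestamping
#   --convert-links     rewrite links to point at the local copies
#   --page-requisites   also fetch the CSS/images each page needs
#   --adjust-extension  add .html extensions where needed
SITE="https://example.org/articles/"

# Uncomment to run (requires wget and a network):
# wget --mirror --convert-links --page-requisites --adjust-extension "$SITE"
echo "would mirror: $SITE"
```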
Half_Elf_Monk · Dec 12 at 16:16:
@stack - That would work, but I simply don't trust the LLMs to get details right. @decant - interesting. Realistically, shouldn't there be a way to simply download a snapshot of Wikipedia as a whole? As mostly text, it shouldn't be that big. Why are there no local Wikipedia browsers?
stack · Dec 12 at 19:31:
@Half_Elf_Monk -- there is a stripped-down Wikipedia distribution; people often put it on local devices, featured on Gizmodo and such... Can't remember where to get it, and I'm not at a decent computer now, but I am sure you can easily find it.
Original Post
pandoc and website preservation
Back when I used firefox/chromium, I used their print function to save full web pages to PDF files. For example, Paul V Bolotoff wrote articles on the history of DEC Alpha CPUs, but his website is long gone; the only copy of the articles I could find is in the archive section of someone's personal site. But I found out I could use pandoc to accomplish this task: pandoc [http link] -o oldarticle.epub I find the epub family of formats better suits my needs. PDF is...