Comment by 🌲 Half_Elf_Monk

Re: "pandoc and website preservation"

In: u/decant

I really like this idea, for a lot of reasons: anti-censorship, data archiving/hoarding, local backups, etc. I tried it with a Wikipedia page and got a blank epub with a "please set a user agent" flag. Overall, this is a cool idea that seems worth exploring.
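A minimal sketch of one possible workaround for the blank epub, assuming the problem is only the missing User-Agent header. Pandoc has a `--request-header` option for HTTP requests; the function names, the example URL, and the user agent string below are hypothetical, and whether Wikipedia accepts a given string is not something this sketch verifies.

```python
import subprocess


def build_pandoc_command(url, output_path, user_agent):
    """Build the argv list for fetching `url` with an explicit User-Agent."""
    return [
        "pandoc",
        # pandoc sends this header with the HTTP request for the page
        f"--request-header=User-Agent:{user_agent}",
        url,
        "-o",
        output_path,
    ]


def convert(url, output_path, user_agent="my-archiver/0.1 (hypothetical)"):
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    subprocess.run(build_pandoc_command(url, output_path, user_agent), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is fetched by default.
    print(build_pandoc_command(
        "https://en.wikipedia.org/wiki/DEC_Alpha",  # example URL, an assumption
        "alpha.epub",
        "my-archiver/0.1 (hypothetical)",
    ))
```

Calling `convert(...)` would actually invoke pandoc, which has to be installed and on the PATH.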

🌲 Half_Elf_Monk

2025-11-28 · 5 months ago

6 Later Comments ↓

🌲 Half_Elf_Monk [✝️] · Nov 28 at 20:10:

Worth exploring ESPECIALLY if websites could be pandoc-ed out of javascript range. Imagine being able to browse a parallel web, even of recipe sites, wherein the 80mb of bloat had been screened into the 1mb of content that it really was, now intra-hyperlinked to link to the other versions of it. Scrapers + pandoc + smolweb might restore the old ways.
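As a rough illustration of getting a page "out of javascript range" before handing it to pandoc, here is a sketch that drops `<script>`, `<style>`, and `<noscript>` elements using only Python's standard `html.parser`. The class and function names are hypothetical, and this is a simplification: comments and the doctype are also dropped, and real pages may need a proper readability-style extractor.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while skipping script, style, and noscript elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        # convert_charrefs=False so entities like &amp; pass through unchanged
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim
        if self.skip_depth == 0 and tag not in self.SKIP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_scripts(html):
    """Return `html` with script/style/noscript elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string could then be written to a file and converted with pandoc as usual.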

😺 k8quinn · Nov 28 at 20:50:

Pandoc has a lot of options including specifying CSS to be included in the output. In principle, you could specify the CSS file(s) from the old site via a link. You'd have to look at the source for the page to find out where the CSS is located. Alternatively, you could open the epub in an editor like Sigil-ebook or even Emacs and add original/own CSS as appropriate. Disclaimer: I haven't tried this myself.
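A small sketch of the CSS idea, equally untested against a real site. It uses pandoc's `--css` option, which can be repeated, and assumes the old site's stylesheet has already been saved locally; the file names and the function name are hypothetical.

```python
import subprocess


def build_styled_epub_command(source, output_path, css_files):
    """Build a pandoc argv list that adds one --css option per stylesheet."""
    cmd = ["pandoc", source, "-o", output_path]
    for css in css_files:
        # --css may be given several times to include multiple stylesheets
        cmd.append(f"--css={css}")
    return cmd


if __name__ == "__main__":
    # "oldarticle.html" and "site.css" are placeholder names for local copies.
    cmd = build_styled_epub_command("oldarticle.html", "oldarticle.epub", ["site.css"])
    print(cmd)
    # Uncomment to actually run pandoc (must be installed):
    # subprocess.run(cmd, check=True)
```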

🚀 stack · Nov 28 at 21:19:

Another option is to have an AI assistant look at the complicated HTML document and create a text-only, or even gemtext version that matches as much as possible...

😎 decant [OP] · Nov 29 at 01:50:

@Half_Elf_Monk For Wikipedia, I use two tricks: 1. A zim archive and the Python library zimply, though I don't know if zimply works with a minimal text browser. 2. There is always gempedia; I just save the .gmi file for good articles. But as @stack said, there are many irredeemable JavaScript-heavy sites where I will just download with wget or just ctrl-c ctrl-v the text part.

🌲 Half_Elf_Monk [✝️] · Dec 12 at 16:16:

@stack - That would work, but I simply don't trust the LLMs to get details right. @decant - interesting. Realistically, shouldn't there be a way to simply download a snapshot of Wikipedia as a whole? As mostly-text, it shouldn't be that big. Why are there no local Wikipedia browsers?

🚀 stack · Dec 12 at 19:31:

@Half_Elf_Monk -- there is a stripped-down Wikipedia distribution; people often put it on local devices, as featured on Gizmodo and such... Can't remember where to get it, and not at a decent computer now, but I am sure you can easily find it.

Original Post

😎 decant

pandoc and website preservation: Back when I used firefox/chromium, I used their print function to save a full web page to a PDF file. For example, Paul V Bolotoff wrote articles on the history of DEC Alpha CPUs, but his website is long gone; the only copy of the article I could find is in the archive section of someone's personal site. But I found out I could use pandoc to accomplish this task: pandoc [http link] -o oldarticle.epub I find the epub family of formats better suits my needs. PDF is...

💬 10 comments · 2 likes · 2025-11-28 · 5 months ago