Comment by 🚀 stack

Re: "pandoc and website preservation"

In: u/decant

wget and just view with a browser?

There is probably some way to wedge it into an EPUB. Although modern web sites contain so much crap that maybe copying the text by hand and pasting it elsewhere is better for small projects.

🚀 stack

2025-11-28 · 5 months ago

7 Later Comments ↓

🌲 Half_Elf_Monk · Nov 28 at 20:08:

I really like this idea, for a lot of reasons: anti-censorship, data archiving/hoarding, local backups, etc. I tried it with a Wikipedia page and got a blank epub with a "please set a user agent" flag. Overall, this is a cool idea that seems worth exploring.
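The user-agent failure can be worked around by downloading the HTML yourself and then handing the saved file to pandoc. A minimal Python sketch using only the standard library; the UA string and contact address are placeholders you would replace:

```python
from urllib.request import Request, urlopen

# Placeholder UA -- Wikipedia asks for something descriptive with contact
# info, which is likely why an anonymous direct fetch came back blank.
USER_AGENT = "personal-archiver/0.1 (contact: you@example.org)"

def fetch_html(url: str) -> bytes:
    """Download a page with an explicit User-Agent header set."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return resp.read()

# Usage sketch (then feed the saved file to pandoc):
#   open("page.html", "wb").write(fetch_html("https://en.wikipedia.org/wiki/EPUB"))
#   pandoc page.html -o page.epub
```

Two steps instead of one, but it keeps pandoc out of the HTTP business entirely.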

🌲 Half_Elf_Monk · Nov 28 at 20:10:

Worth exploring ESPECIALLY if websites could be pandoc-ed out of javascript range. Imagine being able to browse a parallel web, even of recipe sites, wherein the 80mb of bloat had been screened into the 1mb of content that it really was, now intra-hyperlinked to link to the other versions of it. Scrapers + pandoc + smolweb might restore the old ways.
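The "screening the bloat into the content" step can be sketched with nothing but the standard library: a parser that keeps visible text and drops `<script>`/`<style>` bodies. A toy illustration of the idea, not a replacement for pandoc or a real readability tool:

```python
from html.parser import HTMLParser

# Elements whose text content is never "content" for a reader.
SKIP = {"script", "style", "noscript"}

class TextScreen(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements we are nested inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def screen(html: str) -> str:
    """Return only the visible text of an HTML document."""
    p = TextScreen()
    p.feed(html)
    return "\n".join(p.chunks)
```

Running `screen()` over a typical recipe page would keep the prose and drop the script payloads wholesale, which is most of where the 80 MB goes.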

😺 k8quinn · Nov 28 at 20:50:

Pandoc has a lot of options, including specifying CSS to be included in the output. In principle, you could specify the CSS file(s) from the old site via a link; you'd have to look at the page source to find out where the CSS is located. Alternatively, you could open the epub in an editor like Sigil or even Emacs and add the original or your own CSS as appropriate. Disclaimer: I haven't tried this myself.

🚀 stack · Nov 28 at 21:19:

Another option is to have an AI assistant look at the complicated HTML document and create a text-only, or even gemtext version that matches as much as possible...

😎 decant [OP] · Nov 29 at 01:50:

@Half_Elf_Monk For Wikipedia, I use two tricks: 1. a ZIM archive plus the Python library zimply; I don't know if zimply works with a minimal text browser. 2. there is always gempedia; I just save the .gmi file for good articles. But as @stack said, there are many irredeemable JavaScript-heavy sites where I will just download with wget or just ctrl-c ctrl-v the text part.

🌲 Half_Elf_Monk · Dec 12 at 16:16:

@stack - That would work, but I simply don't trust the LLMs to get details right. @decant - interesting. Realistically, shouldn't there be a way to simply download a snapshot of Wikipedia as a whole? As mostly text, it shouldn't be that big. Why are there no local Wikipedia browsers?

🚀 stack · Dec 12 at 19:31:

@Half_Elf_Monk -- there is a stripped-down Wikipedia distribution; people often put it on local devices, and it's been featured on Gizmodo and such... Can't remember where to get it, and not at a decent computer now, but I am sure you can easily find it.

Original Post

😎 decant

pandoc and website preservation: Back when I used firefox/chromium, I used their print function to save a full web page to a PDF file. For example, Paul V Bolotoff wrote articles on the history of DEC Alpha CPUs, but his website is long gone; the only copy of the article I could find is in the archive section of someone's personal site. But I found out I could use pandoc to accomplish this task: pandoc [http link] -o oldarticle.epub I find the epub family of formats better suits my needs. PDF is...

💬 10 comments · 2 likes · 2025-11-28 · 5 months ago