Comment by 🦂 zzo38

Re: "FOSS infrastructure is under attack by AI companies"

I can think of a few ideas. However, if you apply any of them consistently, the scrapers might change their behaviour to work around it.

Confuse scrapers by using links to URLs that depend on the user-agent and IP address. If they keep changing these (which the article suggests they do), then the links will not work; instead the server can respond with an error message in plain text format, with instructions to manually find the correct URL. Perhaps limit this to requests for HTML files only.
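
A minimal sketch of the idea above, assuming the server signs each link with an HMAC over the path, User-Agent and IP address. The secret, the `t` query parameter, the function names and the wording of the error message are all hypothetical, not something the article or any particular server prescribes.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice generate it randomly and keep it private.
SECRET_KEY = b"replace-with-a-random-server-secret"


def link_token(path: str, user_agent: str, ip_address: str) -> str:
    """Derive a short token that is only valid for this path, User-Agent and IP."""
    message = "\n".join((path, user_agent, ip_address)).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]


def make_link(path: str, user_agent: str, ip_address: str) -> str:
    """Build the URL to put into an HTML page served to this particular client."""
    return f"{path}?t={link_token(path, user_agent, ip_address)}"


def check_request(path: str, token: str, user_agent: str, ip_address: str):
    """Return (status, plain-text body) for a request that follows such a link."""
    expected = link_token(path, user_agent, ip_address)
    # Compare as bytes in constant time; a client-supplied token may contain anything.
    if hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8")):
        return 200, "OK"
    return 403, (
        "This link was generated for a different User-Agent or IP address.\n"
        "Please go back to the front page and follow the link from there."
    )
```

A client that keeps the same User-Agent and address gets working links; one that rotates either between fetching the page and following the link gets the plain-text error.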

Requiring JavaScript, CSS, pictures, sufficiently fast computers, etc., is not a good idea in my opinion, unless perhaps the page has a <noscript> block that explains how to access it using another protocol or how to manually correct the issue. (You might limit this to requests for HTML files too, so that downloading files using curl, wget, etc. will still work.)

You might also check whether the User-Agent names a browser that is known to implement pictures, CSS, JavaScript, cookies, etc. If the client claims to be such a browser but does not behave like one, return an error message without any links. The message should explain what is wrong and suggest that users who deliberately disabled these features change their user-agent setting so that the request will be accepted.
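
A rough sketch of that check, assuming the server already tracks, per client, how many HTML pages it has requested and whether it has sent back a cookie and fetched the stylesheet. The marker list, the threshold and the function names are assumptions for illustration only.

```python
# Hypothetical substrings that mark a User-Agent as claiming to be a full browser.
BROWSER_MARKERS = ("Firefox/", "Chrome/", "Safari/")

ERROR_TEXT = (
    "Your User-Agent claims a browser with cookies and CSS, but neither was used.\n"
    "If you disabled these features on purpose, change your user-agent setting\n"
    "to one that does not claim them, and the request will be accepted."
)


def claims_full_browser(user_agent: str) -> bool:
    """True if the User-Agent string names a browser expected to support CSS, cookies, etc."""
    return any(marker in user_agent for marker in BROWSER_MARKERS)


def decide(user_agent: str, page_views: int,
           sent_cookie_back: bool, fetched_stylesheet: bool):
    """Return (status, plain-text body or None) for the next HTML request."""
    if not claims_full_browser(user_agent):
        return 200, None  # curl, wget, text browsers etc. make no such claim
    if page_views < 2:
        return 200, None  # a first visit cannot have returned a cookie yet
    if sent_cookie_back and fetched_stylesheet:
        return 200, None  # behaviour matches the claim
    return 403, ERROR_TEXT  # error page deliberately contains no links
```

Clients that honestly identify themselves as something simpler are never rejected by this check; only a mismatch between the claim and the observed behaviour is.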

🦂 zzo38

2025-03-20 · 1 year ago

4 Later Comments ↓

🐦 wasolili [...] · 2025-03-21 at 01:37:

@HanzBrix if I read the article correctly, the web scrapers used by some of the more egregious LLM companies are going through residential proxies (which I assume are offered by a botnet operator, given the nature of the residential proxy business)

though upon rereading I realize you were probably mentioning blocking cloud providers as a response to comments about blocking the crawlers of gemspace, not about the crawlers mentioned in the article, in which case, ignore me :)

❤️ smps · 2025-03-22 at 18:27:

Bumped into this today. Maybe useful? gemini://alexschroeder.ch/2025-03-21-defence-summary

👻 ps · 2025-03-22 at 18:34:

That is ironic:

sr.ht has a JS-less interface, and they can't simply integrate the Anubis PoW solution.

🦂 zzo38 · 2025-03-24 at 00:58:

I set up port knocking on the HTTP server. (Maybe I should specify which port number to use on the gopher and/or scorpion server, which do not themselves require port knocking. Note that if you use the wrong port number, you will be locked out (in some cases), to prevent access by port scanning.)
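
A minimal sketch of the bookkeeping behind such a setup, not the actual configuration described above: it assumes a single knock port and a set of trap ports that cause a lockout. The class name, method names and port numbers are hypothetical.

```python
class KnockTracker:
    """Tracks which client addresses have knocked and which are locked out."""

    def __init__(self, knock_port: int, trap_ports: set):
        self.knock_port = knock_port
        self.trap_ports = set(trap_ports)
        self.allowed = set()      # addresses that knocked on the right port
        self.locked_out = set()   # addresses that touched a trap port

    def on_connection(self, ip: str, port: int) -> str:
        """Record a connection attempt to a non-HTTP port and report the outcome."""
        if ip in self.locked_out:
            return "locked_out"
        if port == self.knock_port:
            self.allowed.add(ip)
            return "knocked"
        if port in self.trap_ports:
            # A port scanner probing trap ports loses access, even after a valid knock.
            self.allowed.discard(ip)
            self.locked_out.add(ip)
            return "locked_out"
        return "ignored"

    def may_use_http(self, ip: str) -> bool:
        """True if the HTTP server should accept connections from this address."""
        return ip in self.allowed and ip not in self.locked_out
```

For example, with `KnockTracker(knock_port=4321, trap_ports={22, 23})`, a client that connects to port 4321 first is let through to HTTP, while one that probes port 23 is refused even if it later finds the right port.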

🛰️ lufte

FOSS infrastructure is under attack by AI companies — Is the era of anonymous browsing coming to an end?

💬 14 comments · 1 like · 2025-03-20 · 1 year ago