Comment by 🐦 wasolili
Re: "FOSS infrastructure is under attack by AI companies"
my favorite part of the article is the guy complaining about the proof-of-work screen having an anime character on it because his girlfriend would be mad at him if she saw it on his computer.
mostly residential IP addresses
Probably paying a botnet operator for the privilege. Satire idea: an article purporting to be from a credit card fraudster complaining that LLM crawlers have driven the cost of residential proxies up.
I read a pretty simple suggested fix for this on the web1.1: just blanket-block all subnets assigned to the major cloud providers.
@HanzBrix the article says that in many cases the flood comes from residential IPs in unrelated subnets, and that each IP makes only one request.
2025-03-20
🦂 zzo38 · 2025-03-20 at 23:54:
I can think of a few ideas. However, if you are consistent then they might change their scraping to work around it.
Confuse scrapers by using links to URLs that depend on the user-agent and IP address. If they keep changing these (which the article suggests they do), then the links will not work; instead the server can respond with an error message in plain text format, with instructions to manually find the correct URL. Perhaps limit this to requests for HTML files only.
Requiring JavaScript, CSS, pictures, sufficiently fast computers, etc., is not a good idea in my opinion, unless perhaps there is a <noscript> block explaining how to access the site by another protocol or how to manually correct the issue. (You might also limit this to requests for HTML files, so that downloading files with curl, wget, etc. still works.)
You might also check whether the User-Agent claims to be a browser known to implement pictures, CSS, JavaScript, cookies, etc. If it claims to but doesn't, return an error message without any links, explaining what is wrong and suggesting that the user change the user-agent setting (if they deliberately disabled these features) so that the request will be accepted.
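The first idea above (links that depend on the user-agent and IP address) could be sketched roughly like this. This is only an illustration, not anything from the article or zzo38's server: the names (`SECRET_KEY`, `make_link`, `verify_link`) and the token scheme are assumptions.

```python
import hashlib
import hmac

# Server-side secret; a real deployment would load this from config.
SECRET_KEY = b"change-me"

def client_token(ip: str, user_agent: str) -> str:
    """Derive a short token from the requesting client's IP and User-Agent."""
    msg = f"{ip}|{user_agent}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

def make_link(path: str, ip: str, user_agent: str) -> str:
    """Render a link that is only valid for this IP/User-Agent pair."""
    return f"{path}?t={client_token(ip, user_agent)}"

def verify_link(token: str, ip: str, user_agent: str) -> bool:
    """A scraper that rotated its IP or User-Agent since fetching the
    page presents a stale token and fails this check."""
    return hmac.compare_digest(token, client_token(ip, user_agent))
```

On failure the server would return the plain-text error with manual instructions, as described above, rather than a hard block.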
🐦 wasolili [...] · 2025-03-21 at 01:37:
@HanzBrix if I read the article correctly, the web scrapers used by some of the more egregious LLM companies are proxying through residential proxies (which I assume are offered by a botnet operator, given the nature of the residential proxy business)
though upon rereading I realize you were probably mentioning blocking cloud providers as a response to comments about blocking the crawlers of gemspace, not about the crawlers mentioned in the article, in which case, ignore me :)
❤️ smps · 2025-03-22 at 18:27:
Bumped into this today. Maybe useful? gemini://alexschroeder.ch/2025-03-21-defence-summary
That is ironic: sr.ht has a JS-less interface, so they can't simply integrate the Anubis PoW solution.
🦂 zzo38 · 2025-03-24 at 00:58:
I set up port knocking on the HTTP server. (Maybe I should specify which port number to use in the gopher and/or scorpion server, which do not themselves require port knocking. Note that if you use the wrong port number, you will in some cases be locked out, in order to prevent access by port scanning.)
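For readers unfamiliar with port knocking: the client "knocks" on a sequence of closed ports, and only then does the firewall open the real service port for that IP. A minimal client-side sketch, with a made-up knock sequence (zzo38's actual ports and setup are not public):

```python
import socket

# Hypothetical knock sequence; the real sequence would be server-specific.
KNOCK_SEQUENCE = [7000, 8000, 9000]

def knock(host: str, ports=KNOCK_SEQUENCE, timeout: float = 0.3) -> None:
    """Attempt a TCP connection to each knock port in order.

    The connections are expected to fail (the ports are closed); the
    firewall only watches for the SYN packets arriving in sequence.
    """
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
        except OSError:
            pass  # refused/filtered is the normal case for knock ports
        finally:
            s.close()
```

After `knock("example.org")`, an ordinary HTTP request to the now-open port would follow. Knocking on the wrong sequence matches nothing, which is why a wrong port number can lock you out, as noted above.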
Original Post
FOSS infrastructure is under attack by AI companies — Is the era of anonymous browsing coming to an end?