The bot situation on the internet is actually worse than you could imagine. Here's why.

54 points by kfwyre a day ago on tildes | 21 comments

[OP] kfwyre | a day ago

As you may know, we take anti-bot measures very seriously on Glade Art; protecting our fellow users from having their art trained on is one of our top priorities. We also like to troll bots by trapping them in endless labyrinths of useless data, commonly referred to as "honeypots" or "digital tar pits." After 6.8 million requests in the last 55 days at the time of writing, we have some substantial data, so stand by and let us share it with you. : )
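
For anyone curious what a tar pit actually looks like, the idea is tiny: serve procedurally generated junk pages that link only to more junk pages, so a link-following crawler never escapes. A toy sketch (illustrative only, not our actual implementation; the route and word list are made up):

```python
# A toy "digital tar pit": every URL under /trap/ resolves to a page of
# procedurally generated junk text plus links deeper into the trap, so a
# link-following crawler never runs out of pages. Illustrative only.
import hashlib
import random

from flask import Flask

app = Flask(__name__)

@app.route("/trap/<path:slug>")
def trap(slug: str) -> str:
    # Seed the RNG from the URL so every page is unique but stable.
    seed = int(hashlib.sha256(slug.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    words = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]
    junk = " ".join(rng.choice(words) for _ in range(200))
    links = " ".join(
        f'<a href="/trap/{slug}/{rng.randrange(10**6)}">more</a>'
        for _ in range(5)
    )
    return f"<html><body><p>{junk}</p><p>{links}</p></body></html>"

if __name__ == "__main__":
    app.run()
```

A real deployment would typically also drip the response out slowly, so each trapped bot burns connection time as well as bandwidth.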

delphi | 21 hours ago

Maybe a hot take, but I really couldn't care less. My own websites and apps (like delphi.tools) all compile to static HTML and have no server component, so this isn't really impacting my resources, and everything on there is stuff I willingly put on the internet. It would feel weird to say "oh well, bots don't get to read my blog!", as if I didn't know this could happen when I uploaded to the damn web. I get that if you're running a web app you have to protect yourself from bots putting a strain on your infra, but this problem ranges from trivial to undetectable for anyone who just puts their stuff on the web and doesn't bloat their page with analytics and tracking that can phone home.

Exellin | 20 hours ago

How do you feel about the fact that these bots ignore robots.txt? Even if it's "weird" to declare pages off-limits when they're already on the internet, the fact that so many people set up their bots to ignore website owners' requests is the unethical breaking point for me.

delphi | 16 hours ago

Not to be a joyless cynic, but I always thought it was deeply naive to expect potential bad actors to respect the honour system. I can't say I'm surprised. It's not good form, certainly, but I'm not going to say that it's unethical. If ethics were a consideration, the standard would have made provisions for direct enforcement of these rules instead of asking nicely.
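
To make that concrete: the only "enforcement" of robots.txt is the crawler voluntarily checking it before each fetch, something like this (a sketch; the URL and user agent string are made up):

```python
# The only "enforcement" of robots.txt is the crawler choosing to run a
# check like this before fetching. The URL and user agent are made up.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

if rp.can_fetch("ExampleCrawler/1.0", "https://example.com/blog/post"):
    print("allowed - a polite bot fetches the page")
else:
    print("disallowed - but nothing stops an impolite bot fetching anyway")
```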

GoatOnPony | 14 hours ago

As a counterexample from someone who is starting to put content on the internet: I do care, regardless of the resource impact. I'm not running analytics or using any non-static resources (at least not currently), but I want people to interact with what I write and produce, not bots. Call it vanity perhaps, but I'm not putting things on the internet out of pure altruism - I want some amount of validation, credit, and feedback. Most bots today don't provide that, and more often provide the opposite, in that they insert themselves between my work and its potential audience. If the return (monetary or via ego boost) on putting things on the internet goes negative, then people (myself included) will find alternative distribution channels, likely ones less free, widespread, or available, which would be sad. So even if bots aren't directly costing me money, they are still part of a web shifting towards more intermediaries, which I'd like to avoid.

delphi | 13 hours ago

Clarifying question: how exactly does excluding bots help with your goal of getting more eyes on your work? Seems like it wouldn't make a difference.

GoatOnPony | 10 hours ago

Caveat up front: I don't really have any reliable data to back up the following, and I could easily turn out to be wrong about the direction of the internet. Prognosticating is error-prone!

Some bots are fine: search crawlers, RSS/Atom feed readers, etc. are in theory net directors of traffic, or at least wouldn't detract from it. The bots at issue on the current internet have a different purpose, though: they're LLM training-data scrapers, RAG query-answering bots, and other ingestors of data with no (or negative) interest in sending traffic to my website. Their aim is to provide an alternative that users never need to leave; they are building a generic competitor to all other websites - a competitor which is well funded and wants your traffic. Trying to make their lives difficult is a very small, probably ineffective, but maybe collectively useful way to delay them from taking the content, and it gives people who come to my website directly a benefit. I view it as attempting to stop them from keeping everyone in their walled gardens while the rest of us can only feed their machine.

Having said all that, I don't think excluding bots is a particularly effective approach - I'd rather try to find audiences who actually want human content instead.

raze2012 | 5 hours ago

My own websites and apps (like delphi.tools) all compile to static HTML and have no server component,

Making it publicly available on the internet means someone is spinning up a server. Even if you don't care about the scraping, you will care when your website slows to a crawl because bots are effectively DDoSing your site, right?

The caveat is that if you stay small, this may not happen. But any amount of visibility (even just some Facebook post linking to your site for some reason) can change that.

this problem ranges from trivial to undetectable for anyone who just puts their stuff on the web and doesn't bloat their page with analytics and tracking that can phone home.

It's tough because the topic here involves artists. Artists need to show off their work to get more work, but now they can't show off too much without malicious actors essentially stealing it.

I don't think most artists can spin up their own website. Even if they could, doing so works against the visibility they need to advertise themselves.

It's not good form, certainly, but I'm not going to say that it's unethical

Ethics is often disconnected from law. I wouldn't say "it's not unethical because it's legal".

Eji1700 | a day ago

Yeah, scrapers are a wonderful example of all sorts of holes in our current infrastructure, standards, and legislation.

"Please don't scrape my site" is basically all the teeth you've ever really had unless you're behind something like cloudflare. The way traffic is handled makes it very hard to identify bad actors, and the laws mean that these AI companies can just pay Asian nations for these vast swaths of data that are totally definitely legally acquired.

It's going to be a very tricky thing to handle, because while it COULD be done right, it could also be used as yet another excuse to jam horrible practices into place for future control.

Bwerf | a day ago

Reading the article, proof-of-work seems to be a tooth that you do have. I don't see why smaller actors can't use that.
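
The core mechanism is simple enough to sketch in a few lines - here's a toy hashcash-style version, just to show the shape of the idea (not whatever the article's site actually deploys; the challenge string is made up):

```python
# Toy hashcash-style proof of work (illustrative; not the article's actual
# scheme). The server hands out a challenge and a difficulty; the client
# must find a nonce whose SHA-256 hash starts with that many zero hex
# digits. Verifying costs one hash; solving costs ~16^difficulty hashes.
import hashlib
from itertools import count

def solve(challenge: bytes, difficulty: int) -> int:
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: bytes, difficulty: int, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve(b"server-issued-challenge", 4)  # difficulty 4: fast
assert verify(b"server-issued-challenge", 4, nonce)
```

Each extra digit of difficulty multiplies the expected client work by 16 while verification stays constant, which is the whole appeal - and also why a badly tuned difficulty can lock slow devices out entirely.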

Eji1700 | a day ago

It is, but I'm wary of the overhead (especially if you're already trying to do minimal or no JS), the long-term effectiveness (I suspect part of its success is due to its lack of deployment), and possible knock-on effects (not sure how much it's going to change things if EVERY site is jamming PoW in).

Edit:

Coincidentally, I accidentally clicked the link again when going to see the topic on Hacker News and it gave me a difficulty-8 PoW. After about a minute my phone still hadn't made any progress.

Edit 2:

Oh, and now that I'm actually in the comments section, it seems I'm far from the only one with this experience.

hobbes64 | 19 hours ago

Speaking of "minimal or no JS", the blog entry requires JS to read. It's mentioned in the text that this is part of the bot mitigation. I can see that to protect the art from being scraped, but maybe the blog part shouldn't be protected in that way because I may not want to disable NoScript plugin to read a blog entry.

And by the way, once I temporarily enabled scripts on the site, I noticed that I very much dislike its green-on-black aesthetic. It's not accessible. Also, the image at the top of the blog entry had artifacts or something, so it shimmered in an annoying way. I realize this is off-topic noise, but it really made me want to leave the site.

sparksbet | 14 hours ago

I'm sure it's possible to use JS like they do and still have the site be accessible to those using screen readers... but if you made me put down money I'd definitely bet against this site being accessible on that front, too.

post_below | 23 hours ago

"Please don't scrape my site" is basically all the teeth you've ever really had unless you're behind something like cloudflare

There's so much more you can do beyond saying please. Automated traffic and websites/apps have been in an arms race since before the dawn of the commercial internet, and the balance has never really changed: sometimes automated traffic gets ahead a little; more often, identification and blocking are a little ahead. If you don't want to be scraped, and you have the time and expertise (or resources to rent time and expertise), you can block the vast majority of it. If you just want to block the larger part of it, you don't need any of the above, just an out-of-the-box solution (like Cloudflare).
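
For one concrete example from that toolbox: a per-IP token bucket is about the oldest trick in the arms race, and it still cuts out a lot of dumb volume (a sketch of the general idea, not any particular product; the rate numbers are arbitrary):

```python
# A per-IP token bucket: each client gets `rate` requests per second with
# bursts up to `burst`; anything beyond that gets dropped or challenged.
# Sketch of the general idea, not any particular product's implementation.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float = 2.0, burst: float = 10.0):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)   # current tokens per IP
        self.last = defaultdict(time.monotonic)    # last-seen time per IP

    def allow(self, ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[ip]
        self.last[ip] = now
        # Refill for the elapsed time, capped at the burst size.
        self.tokens[ip] = min(self.burst,
                              self.tokens[ip] + elapsed * self.rate)
        if self.tokens[ip] >= 1:
            self.tokens[ip] -= 1
            return True
        return False

limiter = TokenBucket()
print(limiter.allow("203.0.113.7"))  # True until the bucket drains
```

Of course, a distributed botnet sidesteps per-IP limits, which is exactly why the identification side of the arms race matters too.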

Cloudflare collected a lot of techniques that people were already using and made them easily accessible but they aren't the only, or even the best, way to deal with bots.

At the end of the day, botnet operators aren't usually particularly bright or technically proficient. They're after volume, not quality. People who are skilled can make more money elsewhere (with the possible exception of state-sponsored operations).

The volume of bots, though - that is definitely going up, and quickly.

skybrian | a day ago

I wonder if it's done by botnets or if people in Asia are being paid to run these things at home?

Eji1700 | a day ago

The answer is likely "yes".

There's TONS of bot farms all over the world because they're somewhat easy to run. You need the capital for the devices you need/want and all the required infrastructure, but then it just... does the thing, with some minimal maintenance by a local (who may not even be good at troubleshooting, and may just be given basic instructions or have things handled remotely).

As for the botnets, well, naturally? The whole point of a botnet is to do something that requires a lot of devices, and a good way to get a lot of devices is to run in the background of someone else's system (obviously, seldom legally). Scraping is probably a hell of a lot more lucrative and less risky than a DDoS or whatever, and as with any resource like this, you're probably looking to prevent idle time and downtime. You've got the "machine," so you want it running as much as possible to generate revenue to cover costs.

raze2012 | 5 hours ago

https://finance.yahoo.com/news/click-farms-internet-china-154440209.html

The answer is indeed "yes". These click farms existed over a decade ago, simply to drive traffic to an app or site. I can only imagine the dark arts involved now that AI scraping is so in demand.

chocobean | 23 hours ago

Does Tildes use similar tarpits? I'm a logged-in user, so I don't see any proof-of-work requests; are there any for regular human browsing? How scraped is the data on Tildes?

davek804 | 21 hours ago

On the scraping front, you can just put your username into a search engine along with site:tildes.net. Your profile will almost certainly be found, along with body text from a reasonably recent post.

Yours:
Comment on The feckless opposition in ~society chocobean 10 hours, 10 minutes ago Link Parent

Mine:
It was refreshing to search online tonight after reading this historical piece about the science behind man-made diamonds and see that it has genuinely changed finally.

Skybrian:
"You cannot make a return on investment if you don't have access to the U.S. market," Bancel told Bloomberg, noting that high-level headwinds have made the


I didn't look at how recent any of these three actually are, but they're certainly examples of public pages being scraped.

raze2012 | 5 hours ago

For me, most of the top results were from some 4-5 months ago, but one was from 5 days ago, from the "Sora is cancelled" topic. This is just a gut feeling, but I'm guessing crawlers don't scrape a Tildes thread until it gets a certain number of comments or votes.

Protected | 11 hours ago

The Internet started out as a distributed, democratized, cooperative endeavor. This lack of centralization provided a level playing field whose participants could all have access to exposure and connection for a negligible amount of investment. Conversely, any information or service could be at your fingertips if you could find it. The benefits to our species and to our global society were massive (or so I would argue).

By making it so that only huge participants can afford to be connected - most small players must hide behind, and therefore be reliant on, the larger ones, as recommended in this very article - the Internet's out-of-control bot problem directly erodes that original premise. It's an aggressive and harmful attack on global society, democracy, and prosperity. It's not about copyright or the unauthorized use of one's content, but about the network effect of it all, regardless of whether a few people's technological needs happen to let them coexist with this horrible dystopian situation.

Thus I would argue that logic dictates that abusive bot use should be prosecuted and heavily sanctioned by the law. I'm talking serious jail time.