I hope you leave it on the WAF. If they're only just deciding to respect robots.txt, which has been internet infrastructure forever, then it's probably still incredibly amateur software with 'Amazon-priorities' rather than 'responsible internet traffic' priorities.
Robots.txt is lame, BTW; there is no way to enforce it. It's up to the bot to decide whether to crawl, and in most cases they don't care.
Cloudflare has a nice technique to address the bot problem (if you use their name servers): it respects and applies your robots.txt while sending the remaining bots into a deep black hole.
Yes, we know, its purpose is to guide the bots, not forcibly block them.
That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.
robots.txt is a prime example of people misunderstanding and misusing a tool. The file was designed to help crawlers, by pointing them to the content most worth indexing and helping them avoid wasting resources on useless pages.
The people trying to use it to block or limit bots are uninformed and/or misinformed.
Robots.txt is great if you're trying to run an above board operation. Much easier than trying to guess how a webmaster wishes the crawler to behave, and then getting angry emails when you guess wrong.
It's not great. It used to be very common for a robots.txt to Disallow * while allowing Googlebot, which just entrenches the search engine monopoly. In response, other search engines simply applied the Googlebot rules instead of the rules for their own crawlers.
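A robots.txt of the kind described above might look like this (a minimal sketch; the directives are standard Robots Exclusion Protocol syntax, but the specific policy shown is just an illustration of the pattern):

```
# Let Google crawl everything
User-agent: Googlebot
Allow: /

# Lock out every other crawler
User-agent: *
Disallow: /
```

A crawler that honors this literally is shut out of the whole site, which is exactly why some other engines chose to apply the Googlebot section rather than their own.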
Eh, not really my experience running an internet search engine and a crawler. It happens occasionally, but mostly people seem to focus on what they perceive as nuisance crawlers if they do disallow any specific UAs.
Huh, I get a lot of traffic from Amazonbot (relative to humans), and try as I might, it gets stuck in an accidental tarpit: it sits there blasting every variation of my recent pages, because MediaWiki lists many links. I have those links appropriately marked nofollow, and robots.txt warns the bot not to waste its time, but it just goes and sticks itself on nonsense internal pages.
The traffic isn't a problem. I've got Cloudflare in front and the machine itself is relatively overpowered, and downtime isn't critical. But I'd just like the thing to be able to spider me properly. Someone did point out to me that maybe I wasn't receiving actual Amazonbot but some other spider: https://news.ycombinator.com/item?id=46352723
Amazonbot is specifically the user agent they use when crawling to "provide more accurate information to customers" (whatever that means; it could be anything, by the sound of it) and also when they scrape data for AI training, according to https://developer.amazon.com/amazonbot
> Amazonbot is used to improve our products and services. This helps us provide more accurate information to customers and may be used to train Amazon AI models.
I was wondering about this, and it makes me think this is all a mistruth, unless they plan to drop that pricing tactic.
They've been getting some heat on it lately, but I find it hard to believe they're going to give up entirely? And if so, what's to stop someone from just flouting their rules on pricing, and then doing the robots.txt thing to prevent issues?
Good place to ask, saw a new AWS User agent in logs today: Amazon-Quick-on-Behalf-of-$HEXID
I found a mention on some user agent trackers but no official documentation. Does anyone know if it's documented? Asking because I'm seeing decent traffic (30 GB/week) from this.
> Crawling behavior [...] Crawler identification: Identifies itself with user-agent string "aws-quick-on-behalf-of-<UUID>" in request headers.
Maybe people found a way of using it as a loophole for something or Amazon Quick is just picking up in usage, and your website is popular amongst whoever uses that sort of stuff.
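For anyone wanting to count or filter this traffic in access logs, here's a quick sketch. The UA format is assumed from the strings quoted above; both an "Amazon-Quick-on-Behalf-of-$HEXID" and an "aws-quick-on-behalf-of-<UUID>" spelling have been reported, so the match is deliberately case-insensitive and loose about the trailing identifier:

```python
import re

# Hypothetical matcher, assuming the user-agent formats quoted in this
# thread: "(amazon|aws)-quick-on-behalf-of-" followed by a hex ID or UUID.
AWS_QUICK_UA = re.compile(
    r"(?:amazon|aws)-quick-on-behalf-of-[0-9a-f][0-9a-f-]*",
    re.IGNORECASE,
)


def is_aws_quick(user_agent: str) -> bool:
    """Return True if the user agent looks like the Amazon Quick crawler."""
    return AWS_QUICK_UA.search(user_agent) is not None
```

Something like this could drive a log-analysis script or a server-side block rule, with the caveat that a UA string is trivially spoofable either way.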
I just put Anubis in front of my self-hosted forge this morning because AmazonBot had helped itself to 750 GiB (!) of traffic to my public repos this month!
> We are writing to inform you that starting Monday, June 15, 2026, crawl preferences for Amazonbot will be managed solely through the industry-standard directives.
I just do this for the IP ranges of Amazon, OpenAI, Huawei and other companies that run these insane crawlers: it's 100% effective and it doesn't annoy real users with a captcha or some PoW thing. There's simply no reason for them to reach my homeserver other than to scrape the hell out of it.
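One way to automate that kind of blocklist is to generate server rules from Amazon's published ip-ranges.json (https://ip-ranges.amazonaws.com/ip-ranges.json, linked elsewhere in this thread). A rough sketch that turns the parsed file into nginx `deny` directives; the `"prefixes"`/`"ip_prefix"`/`"service"` keys are the real schema of that file, but which service names you block is a judgment call:

```python
def deny_rules(ip_ranges: dict, services=("AMAZON",)) -> list[str]:
    """Convert entries from Amazon's ip-ranges.json into nginx `deny`
    directives. "AMAZON" is the catch-all service name in that file;
    pass other names (e.g. "EC2") to narrow or widen the block."""
    rules = set()
    for entry in ip_ranges.get("prefixes", []):
        if entry.get("service") in services:
            rules.add(f"deny {entry['ip_prefix']};")
    return sorted(rules)
```

The output can be written to a file and pulled into the server config with an `include`; re-running it on a schedule keeps the list current as Amazon rotates ranges.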
Yup, mostly. There are more ranges for the Amazon store too.
It would be rather nifty if Amazon and other companies would confine AI to specific CIDR or a dedicated ASN but I would not hold my breath on that one. AI crawlers will likely muddy the waters for everyone else.
It's good that you mentioned this; smear campaigns are definitely not a new thing, and I suspect a lot of this DDoS'ing that's going on is a plot to accelerate towards Big Tech's authoritarian dystopia. Basically extortion.
Is it just me, or is it extra unethical and self-serving when crawlers from, say, Amazon (Amazonbot) decide to incessantly crawl AWS-hosted websites? Same goes for Google and Microsoft crawlers crawling GCP and Azure.
By that, I mean the types of crawls that can eat up significant usage.
If you run Meta Ads, it's notorious for DDoSing your website with bots. Basically, their ad manager sends dozens of clicks for each variant of an ad you post.
bstsb | 13 hours ago
This bit made me laugh. Was the email drafted in Outlook? Was it sent to some sort of forwarding mailbox, or did they just BCC every customer in?
jdiff | 9 hours ago
My guess would be some sort of internal forwarding mailing list, yeah.
jacobn | 13 hours ago
Did end up just adding them to our WAF blocklist, which is weirdly ironic - hosting on their infra & using their services to block their AI scraper...
tardedmeme | 8 hours ago
Google only respected it because blocking Google from crawling your site used to hurt you more than it hurt Google.
namegulf | 13 hours ago
You're talking about one (yes, the biggest), but the millions of other bots that don't follow it must be a bigger story.
iLoveOncall | 12 hours ago
It has AI agents included, so I guess this can just come from it searching the web based on user requests.
phdelightful | 12 hours ago
At least, it claimed to be AmazonBot…
[OP] xena | 12 hours ago
They will in the future, but not today.
Bender | 11 hours ago
[1] - https://ip-ranges.amazonaws.com/ip-ranges.json
faangguyindia | 6 hours ago
I've also seen Google bots with AWS IP ranges. You gotta look at their ASN/ISP/ORG
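The standard way to tell a genuine crawler from an impostor claiming its IP range is forward-confirmed reverse DNS, which both Google and Amazon document for verifying their bots: the IP's PTR record must end in the crawler's official domain and resolve back to the same IP. A sketch, with the resolver functions injectable so it can be tested offline (the defaults use real DNS via the `socket` module):

```python
import socket


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: tuple,
    reverse=lambda ip: socket.gethostbyaddr(ip)[0],
    forward=lambda host: socket.gethostbyname(host),
) -> bool:
    """Forward-confirmed reverse DNS check: the PTR hostname must end
    in one of the expected domains AND resolve back to the same IP."""
    try:
        host = reverse(ip)          # PTR lookup
    except OSError:
        return False
    if not host.endswith(allowed_suffixes):
        return False                # hostname outside the official domains
    try:
        return forward(host) == ip  # forward-confirm to defeat fake PTRs
    except OSError:
        return False
```

For Googlebot the expected suffixes would be `(".googlebot.com", ".google.com")`; for Amazonbot, Amazon documents `.crawl.amazonbot.amazon` hostnames. A bot that matches a cloud ASN but fails this check is not the crawler it claims to be.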