What's old is new again. The solution RSS offered was structure for an otherwise unstructured challenge (trying to figure out updates on a site). That value grew exponentially when connected to AI (providing the signals of when do I need to look at this site/podcast again). Smart marketing.
I kinda don't like RSS because when I add a new feed I often want the whole blog archive downloaded, and there's usually a limit on how far back the posts go (configured arbitrarily by each site)
Unless someone has a fix for whatever settings I've been using
For the Premium Archive tier, NewsBlur attempts to download a blog's entire backlog to backfill stories, whether it's exposed through paging or RFC 5005. Here's more info about how NewsBlur does it: https://blog.newsblur.com/2022/07/01/premium-archive-subscri...
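For anyone curious what the RFC 5005 part looks like in practice: an archived feed points at its older pages with a `prev-archive` link, and a client walks those links backwards. A minimal sketch, assuming an Atom feed. The sample XML, the example.org URLs, and the `find_archive_link` helper are made up for illustration; this is not NewsBlur's actual code.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_archive_link(feed_xml, rel="prev-archive"):
    """Return the href of the first feed-level <link> with the given rel, or None."""
    root = ET.fromstring(feed_xml)
    for link in root.findall(f"{ATOM_NS}link"):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# Hypothetical feed document, only here to exercise the helper.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <link rel="self" href="https://example.org/feed.xml"/>
  <link rel="prev-archive" href="https://example.org/archive/2021.xml"/>
</feed>"""

print(find_archive_link(SAMPLE))
```

A backfill would fetch the returned URL, collect its entries, and repeat until no `prev-archive` link is left. Sites that don't publish these links still need the paging or scraping fallbacks.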
I had a similar issue where I wanted to read a newly-followed blog from the beginning instead of the point in time when I started to follow, so I created https://refeed.to/
i mean, i still read hacker news primarily via RSS in feedly. i kind of never stopped using it, and everybody is much more generous with their feeds nowadays than back in google reader times. bearblog, etc. RSS rules
Anyone know the best practices for keeping AI crawlers off your RSS feeds? I know robots.txt works for the well-behaved bots. Other tools like interstitial captchas don't as the feed readers break if you send them anything but XML.
Putting just the post intro in the feed and linking to the website feels like a safer approach, assuming you have bot protections on the website, but that's a poor experience for people who want to read in their feed reader.
I have some aggressive filters in Caddy that block the worst offenders by CIDR range, and also filter by user agent to remove any honest facebook and amazon bots. Otherwise, maybe strong rate limits by IP?
Edit:
Longer term, the approach might be - provide a separate RSS feed with full content but gated by a query parameter, then only give that URL to known-good consumers via email verification or patreon subscription, etc.
It would suck that people would have to pay more to consume content in their preferred way, but depending on your needs it might be a reasonable compromise.
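Rough sketch of both ideas (CIDR/user-agent blocking, plus a full-content feed gated by a query parameter), written as plain Python rather than Caddy config. Everything named here is hypothetical: the CIDR range is a documentation range, the user-agent substrings and the token are placeholders, and the function names are mine.

```python
import hmac
import ipaddress

# Hypothetical values for illustration only.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_UA_SUBSTRINGS = ("facebookexternalhit", "amazonbot")
VALID_TOKENS = {"example-subscriber-token"}


def is_blocked(client_ip, user_agent):
    """True if the client IP is in a blocked range or the UA matches a blocked bot."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in BLOCKED_UA_SUBSTRINGS)


def feed_variant(token):
    """Serve the full-content feed only when the query parameter holds a known token."""
    if token is not None:
        supplied = token.encode("utf-8")
        # compare_digest avoids leaking how much of the token matched via timing.
        if any(hmac.compare_digest(supplied, t.encode("utf-8")) for t in VALID_TOKENS):
            return "full"
    return "summary"
```

Anonymous readers still get a valid (summary) feed, so feed readers never see anything but XML, and a leaked token can be revoked without touching the public URL.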
I have almost 40 feeds I subscribe to and they're my primary way of getting information I care about without being exposed to ads or other things I don't want to see.
I must add that I self-host FreshRSS to fetch news and GitHub repos updates so I can update my stuff, everything in-house, controlled by me.
RSS makes life so much easier. Some feeds only provide the bare minimum, while others provide the whole post so I can read everything right there without opening a website.
Also, some podcasts support it, so I have a list of podcasts that I listen to and can go back without having to go from website to website.
With US tech companies harvesting people's data, the subscription mess, cars that are no longer cars but computers on wheels, and now AI, even folks with bare-minimum knowledge are self-hosting things.
All you need is a second-hand, dirt-cheap Dell SFF computer from eBay. Install Proxmox on it, and even if it comes with only 8GB you can still spin up a few Proxmox LXC containers (small like Docker but far better).
People are going back to buying physical media and older models of things; wired headphones are at an all-time high.
MP3 players are at an all-time high: no phone, no subscription, just music.
The 90s and early 2000s are so back, and that is a good thing. People themselves are putting a hard brake on technology.
Her problems are the problems of a polling-based protocol and really if she does not like the RSS protocol she should stop publishing it and stand up an ActivityPub or PubSubHubbub service instead.
A big part of the value of Google Reader and the ecosystem around it was that Google could poll your RSS feed once and everyone could read it... A huge win for the Rachels!
> Her problems are the problems of a polling-based protocol and really if she does not like the RSS protocol she should stop publishing it and stand up an ActivityPub or PubSubHubbub service instead.
Bit odd to take potshots at a third party blog on this discussion, why single out Rachel?
And more to the point, the dynamics here might be due to RSS being polling-based, but if feed readers implemented the RSS logic correctly it wouldn't matter nearly as much, would it?
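To make "implemented correctly" concrete: a minimal sketch, assuming it mostly means honoring HTTP conditional requests, i.e. sending back the `ETag` and `Last-Modified` values from the previous fetch and treating a 304 as "nothing to do". The function names and the `cache` dict are hypothetical, and no network call is made here; plug the headers into whatever HTTP client you use.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def update_cache(cache, status, response_headers, body):
    """Record a response; return True only when the feed changed and needs re-parsing."""
    if status == 304:
        # Not modified: keep the cached body, nothing to parse.
        return False
    if status == 200:
        cache["etag"] = response_headers.get("ETag")
        cache["last_modified"] = response_headers.get("Last-Modified")
        cache["body"] = body
        return True
    # Errors and other statuses: leave the cache alone and try again later.
    return False
```

With this in place an unchanged feed costs the publisher a 304 with no body instead of the full document on every poll.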
(1) Rachel complains more than most. Most people realize it is easier for you to speed up your server/lower your costs than to expect people to implement RSS "correctly"
(2) You can use a cache or be correct, pick one! I think of all the lame cache busting methods that are still in use because it took web browsers more than 15 years to get caching mostly right.
(3) If you'd been reading Rachel as opposed to asking why I pointed Rachel out your questions would be answered!
(4) Polling based systems come in two speeds: too fast and too slow and it is possible to be both at the same time
I built a site that's similar in concept to Hacker News, but is entirely fed by RSS feed content, that is then bullet-pointed summarized on the article page: https://engineered.at/
But I also extract topics automatically from the content too with LLMs, to allow for dynamic topic pages that users can separately subscribe to to tune their feeds.
Haven't promoted it much, but it's pretty amazing what you can do for a couple bucks a month. And my main thesis with this site is that by locking the content to only rss feeds of known blogs, you dramatically reduce the spam submission risk (basically eliminate it). Doesn't handle the spam comment side of things, but that's a different problem.
Figured it out, had a random block of Firefox versions less than 147 in my ApplicationController for some reason. Of course my home internet went down though so I’ll push in a few.
I presume you’re politely asking in order to block? Which is fine, I get it. On my phone right now but can update later.
I do want to ask though (and I should make this clear in a FAQ or something): the way I check RSS feeds uses adaptive scheduling, so I intentionally don’t check feeds of sites too rapidly. Then the summarization is based on the full article content but I never render that full content on the site (to avoid traffic hijacking concerns). Given that: what’s the concern?
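For readers wondering what "adaptive scheduling" can mean: one simple version backs off when a poll finds nothing new and speeds up when it does, within fixed bounds. This is a sketch under my own assumptions (the halving/1.5x factors, the 15-minute floor, and the 24-hour ceiling are arbitrary), not necessarily how the site above does it.

```python
def next_poll_interval(current_s, had_new_items, min_s=900, max_s=86400):
    """Return the next polling interval in seconds.

    Shrinks the interval after a poll that found new items and grows it after
    an empty poll, clamped to [min_s, max_s].
    """
    if had_new_items:
        proposed = current_s / 2
    else:
        proposed = current_s * 1.5
    return int(min(max_s, max(min_s, proposed)))
```

A blog that posts weekly drifts toward the daily ceiling, while a busy feed settles near the floor, so quiet sites see very little polling traffic.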
I do appreciate you addressing the concerns about traffic hijacking, but at the same time I really don't like having my content run through a text mangler like an LLM. I get the use case, but at the end of the day it's my content and I'm a bit prickly.
That said, I'm not necessarily planning to immediately block your crawlers; I intend to just add them to a list I maintain for personal reference. I'm mostly interested in correlating the crawling traffic that I see with various sources. I have been gathering data about crawling activity and sources that I display on an embedded map on my site. I have Caddy annotate traffic with a header indicating what the crawler is, and if the fleet behaves nicely then they don't get added to the blocklist.
If I set my UA to "FUCKIT" I can use the site perfectly fine. Why is there a User Agent Filter that disables the whole website? This should be maybe a warning, not a complete block.
You know, I had set up some analytics filtering based on geoip because I was getting crazy spam traffic from China and Singapore, but that should only be affecting analytics, not the whole site. Mind if I ask where you're located? (you can email me privately if preferred: me@dchuk.com)
This looks great, I've wanted something like this for a while. Finding how to click through to the actual item in the feed was a high point of friction for me.
I went to a topic and then clicked on the header of something I was interested in, expecting to be brought to the blog post directly. Needing to click on that same title again to be brought to the post was unintuitive to me. I searched around the page, went back and forth a few times, and eventually figured it out.
As a user I would love to be able to click directly through to the article FROM the topic feed. I would expect the comments link to go to the page that the header currently brings me to. This would match my expectations from using sites like reddit/HN.
A one or two liner summary directly on the topics feed would be really great I think.
we spent a decade killing structured feeds in favor of algorithmic timelines and now we're rebuilding them because the algorithms need structured feeds. the circle of life, but for protocols.
I have this idea: instead of browsing completely random things on the internet pushed by what other people are interested in (or want to promote), create an LLM that scans through your backlog of projects YOU want to do, searches the internet for projects/articles about those things, and then creates a feed from that.
I'm not sure why I keep reading HN, 99% of the content is uninteresting, probably 99.9% now that every article is about AI. maybe I just like clicking on things.
This is going to happen, but it's too expensive for your LLM to do the scanning, and instead someone needs to build and maintain the index while allowing other people to subscribe to concepts. The problem is no one has sorted out the embedding space this all lives in.
It could be that soon we're gonna get a fully personalized briefing on the topics that we're interested in, or maybe a new kind of feed, replacing social media.
keywords are a start but not enough imo - consider a concept subscription such as "any of my political representatives making statements about firearm control"
That's why I reached for Apple's own local LLM to fool with similar ideas like this: https://pageforth.com. Apple is better than I expected at this. Right now it filters through things like hacker news articles and whatever else you point it at to summarize and find things that match your interests. Apple's LLM reminds me of Claude like 3 years ago. It's weak for sure. But useful for small dose kind of problems.
Agree on RSS as the right shape, and worth adding the cost angle nobody's quantified here yet. Having an LLM read a 50KB HTML page is ~$0.03 of gpt-4o input. Polling 1000 sources hourly = ~$720/day, almost all of it tokenizing layout chrome the model throws away. RSS-shaped feeds cut that by 90%+ because they strip to deltas. The harder blocker is the supply side though: publishers earn pennies per human pageview from ads and ~$0 from agent polls, so unless feeds become licensed paid endpoints, the publisher incentive runs against your "publish an RSS feed for your content" recommendation. Just like that :)
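The arithmetic above can be reproduced in a few lines. Assumptions, none of which come from the comment itself: roughly 4 bytes per token, an input price of $2.50 per million tokens (which is what makes a 50KB page land near $0.03), and a hypothetical 5KB feed payload standing in for the "90%+" reduction.

```python
BYTES_PER_TOKEN = 4             # assumption: rough average for English/HTML text
USD_PER_MILLION_TOKENS = 2.50   # assumption: input price implied by the ~$0.03 figure


def cost_per_fetch(page_bytes):
    """Approximate input cost in USD of feeding one fetched document to the model."""
    tokens = page_bytes / BYTES_PER_TOKEN
    return tokens / 1_000_000 * USD_PER_MILLION_TOKENS


def daily_cost(sources, polls_per_day, page_bytes):
    """Approximate USD per day for polling every source at the given rate."""
    return sources * polls_per_day * cost_per_fetch(page_bytes)


html_daily = daily_cost(1000, 24, 50_000)  # full HTML pages, polled hourly
feed_daily = daily_cost(1000, 24, 5_000)   # hypothetical 5KB feed payloads

print(f"HTML: ${html_daily:.0f}/day, feed: ${feed_daily:.0f}/day")
```

Without rounding the per-page cost down to $0.03, the HTML case comes out closer to $750/day than $720/day, which is the same ballpark. The order-of-magnitude gap to the feed case is the point.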
alextillman | 21 days ago
erelong | 21 days ago
happytoexplain | 21 days ago
conesus | 21 days ago
phyzix5761 | 21 days ago
jayemar | 20 days ago
0gs | 21 days ago
b3ing | 21 days ago
8organicbits | 20 days ago
solid_fuel | 20 days ago
phyzix5761 | 21 days ago
frigidwalnut | 21 days ago
phyzix5761 | 20 days ago
Protesilaos: https://protesilaos.com/codelog.xml and https://protesilaos.com/commentary.xml
HN: https://hnrss.org/frontpage
Sacha Chua: https://sachachua.com/blog/feed/index.xml
David Revoy: https://www.davidrevoy.com/feed/rss
Davep: https://blog.davep.org/feeds/all.atom.xml
xkcd: https://xkcd.com/atom.xml
YouTube - Michelle Khare: https://www.youtube.com/feeds/videos.xml?channel_id=UCGGZ_PO...
YouTube - TmarTn2: https://www.youtube.com/feeds/videos.xml?channel_id=UC36MGPf...
sophiabits | 20 days ago
phyzix5761 | 20 days ago
_-_-__-_-_- | 20 days ago
h4kunamata | 21 days ago
Where? Not within the homelab space.
h4kunamata | 21 days ago
One place to govern them all, RSS still king.
8organicbits | 20 days ago
https://trends.google.com/explore?q=%2Fm%2F0n5tx&date=all&ge...
PunchyHamster | 20 days ago
h4kunamata | 20 days ago
PaulHoule | 21 days ago
https://rachelbythebay.com/w/2024/05/27/feed/
but coming from an aggressively anticommercial world view. She collects evidence that real world feed readers don't implement RSS correctly
https://rachelbythebay.com/w/2026/02/23/readers/
solid_fuel | 20 days ago
PaulHoule | 20 days ago
oliculipolicula | 20 days ago
(1) is gender-proportionate
(2) Helps people to see that the phenom is more general. applies to any quasisocial SaaS that has an underappreciated hardtech moat or 90% learn-by-doing experts go for suboptimal tradeoffs which noobs would never consider.
(3) Maybe "complains" should be "insists", or even softer "maintains"
rvz | 20 days ago
[0] https://www.reddit.com/r/modnews/comments/1tq9vxo/protecting...
eli | 20 days ago
dchuk | 20 days ago
EDIT: I also open sourced a Rails engine I made to power this site if anyone is interested: https://github.com/dchuk/source_monitor
shaunpud | 20 days ago
dchuk | 20 days ago
EDIT: just checked in firefox, I don't see an issue. can you email me at me@dchuk.com and maybe I can debug with you?
Joe_Cool | 20 days ago
UA being blocked for example:
Did mess with it some more:
Allowed:
406: Maybe just remove it?
dchuk | 20 days ago
dchuk | 20 days ago
dchuk | 20 days ago
solid_fuel | 20 days ago
dchuk | 20 days ago
solid_fuel | 20 days ago
Joe_Cool | 20 days ago
dchuk | 20 days ago
Joe_Cool | 20 days ago
IP address has no effect on the User Agent block though...
dchuk | 20 days ago
Joe_Cool | 20 days ago
devinpower | 20 days ago
dchuk | 20 days ago
_pdp_ | 20 days ago
hparadiz | 20 days ago
https://technex.us/.rss
https://github.com/hparadiz/technexus/blob/release/src/Contr...
I would enjoy a JSON based refresh of the format.
ramgine | 20 days ago
themafia | 20 days ago
Get your rapacious hands away from my website please.
> and actively degrades programmatic access.
That's your problem. You choose these tools. If they can't function without ripping everyone else off then why do you persist in using them?
grobibi | 20 days ago
Can someone recommend a way to create an RSS feed from a site that has none?
senectus1 | 20 days ago
daxfohl | 20 days ago
sperandeo | 20 days ago
analogpixel | 20 days ago
acgourley | 20 days ago
iknowstuff | 20 days ago
DSemba | 20 days ago
Projects like OpenClaw and Hermes already show that this can work whether the source is RSS or simply a website the agent visits.
Even Google now envisions this, since they recently announced "information agents" (https://blog.google/products-and-platforms/products/search/s...) that will keep working in the background. They surely have an index they can use, but I wonder whether that's necessary? AI agents like Claude Code suggest it's possible to use simple keyword searches, without maintaining vector indexes - https://www.tigerdata.com/blog/why-cursor-is-about-to-ditch-...
I'm actually working on the briefing idea myself: https://briefin.com
acgourley | 20 days ago
nate | 20 days ago
hendler | 20 days ago
notnullorvoid | 20 days ago
nreece | 20 days ago
Systems and agents need to monitor and extract public web content into fresh structured data for their ingestion, intelligence workflows and analysis.
* Shameless plug * Our data infrastructure layer for businesses and AI turns continuously updated websites into a stream of structured data.
https://newsloth.com
amai | 20 days ago
Nowadays AI agents also don't read ads. Let's see how that is going, but the ad industry isn't amused about that.
eugeneonai | 20 days ago