fanf | 11 days ago
Sadly we never find out why systemd-resolved is dropping NXDOMAIN responses.
tuxes | 11 days ago
It was a well written blog post which had built up suspense. Disappointing to not see the root cause. I'd say it's not even established that it's systemd-resolved that is broken.
intelfx | 11 days ago
I noticed resolved dropping NXDOMAINs multiple times already, but never bothered to investigate. Might this be the final push?
jamesog | 10 days ago
I've seen systemd-resolved do weird things with DNSSEC-enabled domains before. Perhaps the circumstances in which I saw that weirdness match this case, but I no longer have my notes from debugging it. I've learned not to trust systemd-resolved (or dnsmasq) at all and always replace it with good old Unbound.
fanf | 10 days ago
This domain isn’t signed and the article says systemd-resolved’s DNSSEC validator was turned off.
But I seem to have found a bug: takeonme.org is hosted by Cloudflare, and although the authoritative servers return NXDOMAIN for most query types, they return NODATA for DNSKEY. But I would be surprised if that’s relevant to this article’s issue.
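For anyone who wants to see the NXDOMAIN vs NODATA distinction concretely, here is a minimal sketch that classifies a raw DNS response from its 12-byte header alone. The function name `classify_response` is made up for illustration, and the logic is deliberately simplified: it treats any NOERROR response with an empty answer section as NODATA, which ignores referrals and CNAME chains.

```python
import struct

RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def classify_response(packet: bytes) -> str:
    """Classify a raw DNS response using only its header.

    Hypothetical helper: NXDOMAIN means the name does not exist at all,
    while NODATA means the name exists but has no records of the
    queried type (NOERROR with an empty answer section).
    """
    if len(packet) < 12:
        raise ValueError("DNS header is 12 bytes; packet is too short")

    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags word

    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"
    if rcode == RCODE_NOERROR and ancount == 0:
        return "NODATA"  # simplification: could also be a referral
    if rcode == RCODE_NOERROR:
        return "ANSWER"
    return f"RCODE {rcode}"
```

Feeding it the responses for a DNSKEY query and, say, an A query on the same name would show whether the two disagree about the name's existence.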
jamesog | 10 days ago
Ah, I misread then, I thought I read that DNSSEC was in play.
I still don't trust systemd-resolved. :-)
Garbi | 10 days ago
There is now a follow-up project on my whiteboard
I learned I need a whiteboard in my home lab.
oliverpool | 11 days ago
Regarding the staging fallback: Caddy will not use a certificate retrieved on staging, it is only used as a way to check if the challenge is solvable, without being hindered by the rate-limiting of LE prod. Once staging is successful, Caddy retries against prod immediately.
Regarding the monitoring: a soon-to-expire certificate should trigger an Uptime-Kuma alert if configured correctly ([ ] Certificate Expiry Notification).
white-star | 10 days ago
I started removing systemd-resolved from my Linux machines. Too much troubleshooting complexity. I don't need a third or fourth way to cache DNS between my ISP, router, and apps. What is the point of it? Didn't ask for it.
0x2ba22e11 | 9 days ago
The post suggests using log-based alerts to check whether TLS certificate renewal is working. I suggest that you'll get more bang for your buck by alerting when any of your certificates has less than a week left before it expires. It'll catch the same problem, and also problems like "certbot renewed successfully but didn't manage to install the new certificate" or "caddy didn't pick the new certificate up in a timely fashion".
Good job having site monitoring that caught the invalid cert in prod tho. Could have been worse.
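A rough sketch of that kind of expiry check, using only the Python standard library. The function names and the 7-day default threshold are illustrative assumptions, not taken from the post or from any particular monitoring tool. It looks at the certificate the server actually presents, which is why it would also catch the "renewed but not installed" cases.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def needs_alert(not_after, threshold_days=7.0, now=None):
    """True when the certificate expires in under `threshold_days`."""
    return days_until_expiry(not_after, now) < threshold_days


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the notAfter string of the certificate served by `host`.

    The default context verifies the chain, so an already expired or
    otherwise invalid certificate raises ssl.SSLCertVerificationError.
    A caller should treat that exception as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run from cron or a monitoring agent, something like `needs_alert(fetch_not_after("example.org"))` would be the whole check.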
heavyrain266 | 10 days ago
Why would you want an init system to handle DNS resolution? That thing is a huge pile of junk, it even tries to replace sudo through the run0 gimmick.
yaxley_peaks | 10 days ago
systemd-resolved is not part of the init system though. It is a DNS resolver daemon that just happens to be developed by and be a part of the systemd suite of software.