The future of SRE will be the company putting some amount of money on a prediction market against the site going down and you get to take home the winnings as long as the site stays up.
Hah, I know the feeling. I installed Ubuntu on a PC recently, it obviously happened to be one of the days they got DDOSed and apt repos were unreachable. I had other things to take care of, so I put it aside for the next week or so. It didn't help very much, cause after picking it back up, halfway through, Snapcraft went down.
I vibe coded a script that interacts with both Gitlab and Github via their APIs and I've been using it pretty heavily since this morning. I crossed the streams! Goodness, I didn't know it would be _this_ bad!
Insane, we have to come up with contingency plans now for long-duration GitHub outages because we can't safely do deployments. For a service we're paying thousands of $ per year for even though we host runners ourselves...
Same here. You’d think they could at least separate out the GitHub-hosted and self-hosted runners, so you’re still able to dispatch jobs to the self-hosted runners if the GitHub-hosted ones are down.
I’m not convinced they actually do, because GHE on the cloud tends to have the same problems as the main outages. Probably costs extra to be “single tenant” or whatever
Depending on how many thousands of $ per year, it would probably be cheaper and more reliable to self-host GitLab. It's better in terms of organisational structure (you can have one, including access and secret inheritance), and (personal view) Gitlab-CI is better than GitHub Actions because it doesn't push you towards a JavaScript/NPM style dependency hell. And it's actually fairly easy to self-host, with options from a single machine with an omnibus package that handles everything to a full-blown autoscaling Kubernetes deployment.
Same thoughts - we use an action to ship to production, it builds an image, pushes it to ECS which triggers a deployment.
We can't be blocked here. Seems silly that we settled on this, but GitHub had been reliable enough for many years. Things are sliding down the pan as of late.
Maybe we need a split between source management and distribution? The former looks like git[hub] to me, the latter maybe more like a Linux distro repo?
It's always best to be portable - always be able to do builds and releases locally (at least, once you get the keys - it shouldn't be possible by default), then add things like github actions on top as convenience.
It's funny, when we were acquired they started moving us to Github actions but it seems that maybe we should stay on our old crusty self-hosted Jenkins setup...
oh man spent so much time trying to debug what's going on. I have a complex setup with GitHub Actions and self hosted runners so I thought it's something broken in my CI setup
It's (a) they're under massively increased load because everyone's vibing up new projects these days, (b) they've been in a weird frankenstein "on azure but also we have our own control plane" state for years and they're pushing to no longer have that be the case.
I don't think vibecoding at Github has much to do with it.
Your speculation is that their competitors would naturally not see a commensurate increase in instability while “only” handling 20% of the same crisis?
I don’t buy the excuse. I want to hitch my wagon to those “mysteriously lucky” competitors. (And have. And haven’t had similar issues to Github, since.)
Competitors would be long tail, so a different mode of traffic entirely. Maybe they get spikes that are more easily whack-a-moled than the constant hammering that GitHub receives.
I think it’s much more than 80%, it’s probably the default recommendation and folks who aren’t technical would just accept it. Probably closer to 95% or more
Isn't the relative increase more of interest? If someone was only owning 10% of the market, and they've only gotten 8% (percentage points) of the 20%-not-GH LLM-related increase, they'd still be seeing a very similar spike compared to their baseline as GitHub.
I started using an agent (Codex) on my repo and it went from a few dozen clones to thousands (3383 this week). I dunno what the agents are doing to clone the repo so many times -- I'm not running 3000 agents or prompts, maybe 10 or so this week. But if this is typical, a 1000x increase in usage across the board can't be good on the system.
GitHub had a blog post about this recently. They reported a significant uptick in volume (repos created, PRs, etc.), which they attribute to AI usage and tooling.
Microsoft has boasted that 30% of their code is written by AI.[1] However, we can only guess whether AI-generated code is the issue, or something else, or a combination of things.
That being said, there was a noticeable trend starting around 2022,[2] and they’ve also been doing a big migration to Azure. It’s likely a combination of things.
This gets posted every time GitHub is down. This chart is not accurate. It is based on data scraped from GitHub's status page and that data is missing historical incidents from the pre-Microsoft era.
Yeah, it’s not even consistent with their own incident history. I spot checked it and consistently found incidents with downtime/elevated error rates in months listed as 100.00000% uptime on that chart.
The unofficial and official charts are both lying. The GitHub one ignores actual outages and the unofficial ones count minor display bugs in minor features as a “github outage”.
It could be many things. Microsoft mismanaging stuff. Azure. Vibe-coded Github. So much AI slop being committed it adds an extra burden on the servers, etc.
All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.
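As a rough illustration of the "errors > baseline for x minutes" shape described above, here is a minimal Python sketch. The class name, the baseline of 5 and the 3-minute window are made up for the example; this is not GitHub's (or anyone's) actual alerting config.

```python
from collections import deque


class SustainedErrorAlert:
    """Fire only when the per-minute 500 count stays above `baseline`
    for `window` consecutive minutes (all values here are hypothetical)."""

    def __init__(self, baseline: int, window: int):
        self.baseline = baseline
        self.window = window
        self.recent = deque(maxlen=window)  # most recent per-minute counts

    def observe(self, errors_this_minute: int) -> bool:
        """Record one minute of data; return True if the alert should fire."""
        self.recent.append(errors_this_minute)
        return (len(self.recent) == self.window
                and all(n > self.baseline for n in self.recent))


alert = SustainedErrorAlert(baseline=5, window=3)
per_minute_500s = [2, 3, 1, 8, 12, 30, 40]  # errors climb from minute 4 on
fired = [alert.observe(n) for n in per_minute_500s]
print(fired)  # [False, False, False, False, False, True, True]
```

With these made-up numbers, users see elevated errors from the fourth minute, but the rule first fires on the sixth. That gap is the lag between "it's broken for me" and "it's officially an incident".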
Do you know of a single service at a single company that actually does that?
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I've been oncall for a different G service that nearly paged on every error. It used the standard error budget tooling, but on hundreds of user buckets because the engineering around locality-specific configuration was... suspect. Many of these buckets had single-digit users. If a user was on their phone and lost signal, I was paged. Very poor oncall experience.
I worked at a large fintech moving billions of dollars in volume a day.
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
Client network timeout shouldn't result in 500. With 408 and retry you should, dependent on the business criteria, get either an upsert (transaction is retried) or 422 (validation that given entry already exists).
Even if it's "DB in datacenter I tried to save to was hit by meteor" event, you can cater for this not to result in 500 (ie - DB unreachable, retry in a couple of minutes); the question is if you want to.
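To make the 408-then-retry idea concrete, here is a small Python simulation. Everything in it is an assumption for illustration: the `PaymentStore` class, the idempotency key, and the choice of 200 for the upsert case are not taken from any real payments API.

```python
class PaymentStore:
    """Toy in-memory store keyed by a client-supplied idempotency key."""

    def __init__(self, upsert: bool = True):
        self.by_key = {}
        self.upsert = upsert  # business rule: return existing entry, or reject

    def create(self, key: str, payload: dict):
        """Return (status, body) for a create request."""
        if key in self.by_key:
            if self.upsert:
                return 200, self.by_key[key]  # retry is harmless
            return 422, {"error": "entry already exists"}
        self.by_key[key] = payload
        return 201, payload


def call_with_retry(send, attempts: int = 3):
    """Call `send` again while it reports 408 (request timeout)."""
    status, body = 408, None
    for _ in range(attempts):
        status, body = send()
        if status != 408:
            break
    return status, body


def lossy(store, key, payload):
    """Build a sender whose first response is lost, so the client sees a 408
    even though the write already happened on the server."""
    seen = []

    def send():
        result = store.create(key, payload)
        seen.append(result)
        return (408, None) if len(seen) == 1 else result

    return send


store = PaymentStore(upsert=True)
status, body = call_with_retry(lossy(store, "order-42", {"amount": 100}))
print(status, len(store.by_key))  # 200 1
```

Either way the retry never produces a second record and never needs a 500: with `upsert=False` the same retry ends in a 422 instead.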
The sub-service at IBM cloud I worked on had an insanely small error budget such that pages were nearly constant. On call was hell week until a few of us insisted on fixing the issues. The "few" of us were contractors. The employees seemed more than willing to just let the pages continue.
Some companies pay more if people are paged. It can create a perverse incentive not to fix problems or, in extreme cases, to watch things going wrong, waiting for the page, and then being ready to fix it straight away.
Which makes me think a small number of random issues that happen even though nothing is broken is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
Bitflips are something that can happen in consumer-grade RAM, so that tracks (and it's comforting that wayward cosmic rays are a substantial reason for an application's crashes!), but on enterprise servers, they will run ECC RAM that is very resistant to bit flips.
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
Bitflips specifically may not be; things like network issues, noisy neighbors, row/rack/host maintenance (leading to a downed and migrated host) absolutely are things that happen at high frequency at scale and cause your background level of errors to be more than 0.
It’s where monitoring for 9s is more important at that scale than absolute errors. So long as degradation is graceful or retried it should not be a massive problem.
It does require constant tuning and adjustment though.
that is absolutely not the case for any system of size and scale. that would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.
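For anyone unfamiliar with error budgets, the arithmetic is simple. A hedged Python sketch, with a made-up 99.9% SLO and made-up request counts; the function names and the 50% paging threshold are assumptions, not any particular vendor's tooling:

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    return round((1 - slo) * total_requests)


def budget_consumed(slo: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return failed / ((1 - slo) * total_requests)


def should_page(slo: float, total_requests: int, failed: int,
                threshold: float = 0.5) -> bool:
    """Page only once a meaningful share of the budget is gone."""
    return budget_consumed(slo, total_requests, failed) >= threshold


# Hypothetical month: 1,000,000 requests against a 99.9% SLO.
print(error_budget(0.999, 1_000_000))                   # 1000
print(f"{budget_consumed(0.999, 1_000_000, 250):.0%}")  # 25%
print(should_page(0.999, 1_000_000, 250))               # False
```

So 250 scattered 500s in that month is a quarter of the budget and nobody gets woken up, while the same 250 errors against a five-nines target would blow through the budget many times over.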
It depends what you're monitoring. If it's response codes from user generated queries, then I'd agree with you.
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
Yeah, no, nobody runs cloud services like that. At AWS most alarms required failures in 3 consecutive 5 minute periods. Critical things could be on 3 consecutive 1 minute windows - but that alarm starts a 15 minute escalation for the oncall engineer to check in, and they have to validate the issue isn't a false alarm before updating the status page would even be considered
Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:
If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
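Those two extremes can be sketched as a tiny routing function. This is only an illustration in Python: the `Signal` fields, the source names and the thresholds are invented for the example and are not any particular product's configuration.

```python
from dataclasses import dataclass

# Hypothetical set of checks considered worth a 3am wake-up.
CRITICAL_SOURCES = {"db-healthcheck"}


@dataclass
class Signal:
    source: str                # e.g. "db-healthcheck" or "form-validation"
    consecutive_failures: int  # N failed checks in a row
    minutes_failing: int       # M minutes since the first failure


def route(signal: Signal, min_failures: int = 3, min_minutes: int = 5) -> str:
    """Return "page" for sustained failures of a critical check,
    otherwise "ticket" (something to look at during working hours)."""
    critical = signal.source in CRITICAL_SOURCES
    sustained = (signal.consecutive_failures >= min_failures
                 and signal.minutes_failing >= min_minutes)
    return "page" if critical and sustained else "ticket"


print(route(Signal("db-healthcheck", 5, 10)))  # page
print(route(Signal("form-validation", 1, 0)))  # ticket
```

The hard part in practice is deciding what belongs in the critical set, not the code.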
Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
I've worked in large orgs where we could (and at some times did) have around the world rotations. They don't work well. It's very hard to maintain real team cohesion, and you end up with really superficial operations. People tend not to dig in really deep, find good fixes, etc. Lots of superficial bandages.
One team can't troubleshoot AND FIX every possible subsystem, so you just end up with lots (growing to hundreds) of people "on-call" anyway.
As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).
Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.
I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have 0 500s. Not when it still sees development.
You only do this when you’re trying to use incident management as a hammer to make a point to somebody whom you have otherwise failed to convince to fix something through persuasive argument. Ie, it’s punitive.
No, monitoring for HTTP response code is a subset of observability and not one that generally gives you the best insights into which subsystems are misbehaving nor why.
There are synthetic tests, where you can generate API request calls or even simulate an entire user journey. These allow you to control the user agent and the payloads, and thus you know anything that errors back is an actual error. These are triggered by the observability platform (think like running a cron-job) and thus you're not tied to user activity to see when problems arise.
There are other metrics outside of HTTP response codes too. Think like free RAM, CPU usage, disk space, etc. This is just naming some obvious ones because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally have alerts when an irregular trend is showing that things are likely to fail too. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
Then you have more traditional stuff like logs. This will also be bespoke to the application. But you'd expect errors in logs to get surfaced quickly. Assuming Github have good hygiene in what's being logged.
Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.
(this is just a super high level view of observability too)
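A synthetic check of the kind mentioned above can be very small. Here is a minimal Python sketch using only the standard library; the user agent string and the idea of treating anything other than HTTP 200 as a failure are assumptions for the example, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen):
    """Send one synthetic GET with a known user agent.

    Returns (ok, detail). Because we control the request entirely,
    any failure is worth looking at rather than being user noise.
    """
    req = urllib.request.Request(
        url, headers={"User-Agent": "synthetic-probe/0.1"})  # made-up UA
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses
        return False, f"HTTP {exc.code}"
    except OSError as exc:                 # DNS, refused, timeout, etc.
        return False, f"network error: {exc}"
```

You would call something like `probe("https://example.com/health")` from a scheduler (the URL is a placeholder) and feed the result into whatever alerting rule you prefer.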
> Even a synthetic probe needs a few failures to trigger an alert.
It doesn't "need" that. That's just how most people set it up because it’s an easy sane default that allows for network jitter without inexperienced engineers thinking about different conditions triggering different types of responses.
If you’re measuring internal APIs from an observability solution that has nodes already inside your network enclave, then there is a strong argument for alerting early.
> You should not alert on cpu, ram, etc
That’s not true as an absolute statement. As a generalisation, it heavily depends on the system you’re monitoring and how it behaves under pressure.
But in any case, I wasn’t suggesting CPU alerts were the end goal. I said:
> these types of metrics are generally bespoke to the type of application you're monitoring.
Ie you’ll use metrics but those metrics will be highly specific.
The CPU examples were an illustration as to what a “metric” is (it might seem obvious but not everyone is an expert) but the point was HTTP response codes aren't the only types of metrics one should be capturing and watching.
Ah, yes, I misunderstood. And I have seen cases where a direct CPU alert makes sense, but 99 times out of 100 times I see it, it's nothing but trouble. Worse, I tend to see the cpu alert when there are no end to end synthetic alerts, 500 alerts, queue depth alerts, etc.
If your requests are fast and cheap, you can probe frequently relative to your goals, but often that's not really possible (think, long SQL queries, or scheduling a container/pod). There you need several datapoints, or possibly fewer augmented with other signals.
Talking about long SQL queries, I quite like throwing CPU alerts on database servers. They'll be a low priority alert (ie no out of hours "pagers") so just something that goes into a slack channel. But they're a good indicator of when developers have poorly optimized SQL, or the DB schema is poorly defined (eg missing indexes), or the DB server itself is poorly sized.
This wouldn't be something you'd expect to need in production and definitely not something you'd rely on as a notice of a production outage. But it is an example of one of those 1% occasions where a CPU alert does add value to the overall observability of the application.
But this also ties into your excellent point about how you'd use CPU and other data points to build a picture of what's happening in your application.
> All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.
Is it true that official service status pages are updated automatically?
I'm not arguing with what you're saying, but it does make me wonder: What exactly is the point of the status page, if "it is normal for users to already see errors before GitHub officially counts it as an outage"?
Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at, and can feel like they are "doing something"? Or is it more of a way to save you from confirming for yourself what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"
Yes, this can be really frustrating when you’re trying to get work done. There needs to be more competition and better alternatives and the LLMs need to offer easier connection to these alternatives.
It's an eye opener. Think about it - today, it was a mistake. But, what if it really happened? What if you really lost access to all your years of hard work? It's a wake up call. A blessing in disguise to store what matters to you the most locally, backed up offline. Never trust any single provider. Be it MS or Google or Apple. RAID is the way.
People should use something that keeps a local copy of their code and just copies it to Github and to other contributors with a sync process to push and pull changes. Some sort of 'distributed source control system' maybe. Then people would only need a 'hub' to connect to people, and it'd be easier to move somewhere else.
This gets tiresome. Github is a lot more than a host for Git repositories. If you want to suggest that people use something else, you need to suggest a replacement that has the features people use Github for.
yeah, #1, it is free private file storage, and #2, it's a download portal for free as in beer software replacing paid offerings. that's what it is for 99.99% of people.
being a host for git repositories has never been its core competency. neither has its groupware offering.
does it even serve OSS well? a very interesting criterion is, "Have mature or adopted end-user-facing OSS recently merged a large PR from an unallied contributor?" The answer is overwhelmingly no. This is why there is so much innovation in this space.
I think you missed the joke, which is that the parent poster you're replying to is suggesting a 'solution' to the problem which evolved in complexity until he was just describing Github again.
What you just described is Fossil. It has an auto-sync feature that makes everything feel distributed.
Just set up a Kubernetes deployment and you’re set.
But as others mention, GitHub’s primary strength is collaboration. If you want decentralized, solve this by creating a decentralized collaboration tool on top of fossil and/or git.
For example, how to do pull requests and code reviews?
I think they were intending to evoke the image of RAID rather than literally referring to a redundant array of inexpensive disks. You host your code on Github, Gitlab, and at home, then you survive a Github outage. It's a redundant array. Not sure it's inexpensive, though.
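In that "redundant array of hosts" spirit, a hedged Python sketch of pushing the same branch to several remotes. The remote names `github`, `gitlab` and `home` are hypothetical and would need to exist in the repo already (via `git remote add`); the `run` parameter is only there so the logic can be tried without touching a real repo.

```python
import subprocess


def push_commands(remotes, branch="main"):
    """Build one `git push <remote> <branch>` command per remote."""
    return [["git", "push", remote, branch] for remote in remotes]


def push_everywhere(remotes, branch="main", run=subprocess.run):
    """Push to every remote and return the names of those that failed,
    so one host being down does not block the others."""
    failed = []
    for cmd in push_commands(remotes, branch):
        result = run(cmd)
        if result.returncode != 0:
            failed.append(cmd[2])  # cmd[2] is the remote name
    return failed


if __name__ == "__main__":
    down = push_everywhere(["github", "gitlab", "home"])
    if down:
        print("push failed for:", ", ".join(down))
```

This only covers the git history itself. Issues, PRs and CI config are the parts that do not replicate this easily.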
This happened to me as well—thankfully not my personal account that I use for work, but the organization associated with an open source project I worked on was suspended. It similarly took 2 months for GitHub to restore the organization.
> Our team is currently experiencing an unexpectedly high volume of tickets which has resulted in longer response times than we prefer. We acknowledge the long wait and apologize for the experience.
> Sometimes our abuse detecting systems highlight accounts that need to be manually reviewed. We've cleared the restrictions from your account…
Fully self-hosted IMO can be an overcorrection. The issue isn’t “relying on other people”—it’s relying on GitHub, when they’ve made it clear they don’t care about uptime and they don’t care about support turn-around-time.
Well yes, my git repositories sit on my laptop, that's the entire point. If github banned my country because its president has a tizzy, I can push my entire commit history to another company. Same with anyone else who's working on it.
It would be a pain as I'd have to set up a few integrations again, but github is far lower down the risk scale than the vast majority of SAAS providers
Another outage at GitHub with actions and pages not working thanks to the AI agents Copilot and Tay.ai creating more issues. Last time this happened was 6 days ago. [0]
This time today it was caused by friendly fire by the automatic suspension of the GitHub Actions bot which is now a "Ghost" user. Since there is no CEO of GitHub to contact it we are just going to see more [1] of this again.
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self hosted as I said 6 years ago...[2]
Are there any GitHub Actions-compatible CI services out there that don't rely on their infrastructure? I know of depot's but no others; are these resilient to these outages or do they still lose functionality? I imagine the latter but I don't know.
github actions themselves can be self hosted, it's quite nice actually to be able to keep your same patterns as cloud hosted actions and with one line change to the yaml have it running on your own hardware. I do this for actions that take 6-7 hours so I am not burning through the 3000 minutes that come free with my account.
Founder of Depot here. To my knowledge, we are the first engine to support different syntaxes in this compatible way via Depot CI [0]. Great time to try it out and let us know your thoughts! We’ve built a lot of cool stuff into it like parallel steps, custom images, and a full CLI/API interface so you can literally do everything without going into the web app.
As someone who partially uses depot but was still affected by this github issue, we obviously haven't moved over enough. We use your runners but github is still blocking us.
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Yes, triggering Depot CI via the CLI is the sure fire way to avoid all dependencies on GitHub.
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream Actions control plane, but it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
Are you able to bring your own runners? Our org is heavily invested in self-hosted runners at this point and has gotten pretty tremendous value from it. I think we'd be wise to get away from GitHub's control plane but keep running jobs in our own infra.
Is there a tier for open source organizations? Do I have to admin any of AWS that runs behind the scenes or can I pay a fixed price to depot and get it to solve everything out of my way?
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
We currently use external runners (Blacksmith.sh), but that didn't shield us from this as GitHub actions is still the control plane for triggering and monitoring them.
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
It's big enough that every time it goes down, it surely stops somebody from pushing a fix for what they currently have broken, so I wonder if status page services see some kind of ripple from github outages.
About an hour ago I was having trouble browsing repo files in the browser and I thought "A disturbance in the force, is Github down?" Refreshed HN and loaded up their status site. Nada.
(Ofc, in a sensible universe, we just brush that off to a JS/Firefox glitch or my ISP.)
And yet, here I am. My code is not compiling, my AI isn't vibing, nonetheless I can't work! Two more hours before I can get off!
I am trying to refrain from my "off topic" rants... but such microsoft github abuse is generating so much hate due to their dominant market position, it is hard.
Github measures/reports the SLA of the individual services.
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality the majority of people only use 3 or 4 of the core services the majority of the time but since there's no "core services" SLA/uptime the usability of github for the majority of people is slightly obfuscated.
Part of it is that it considers downtime in any of the services GitHub provides as GitHub being down. So if GitHub had 100 different services, and only one of them was down at any given time (but at least one was always down), then it would show 0% uptime.
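The 100-services thought experiment is easy to check. A minimal Python sketch, where the service names, the 100-minute window and the one-minute outages are all invented to mirror the example above:

```python
def per_service_uptime(outages, total_minutes):
    """outages maps service name -> set of minute indexes it was down."""
    return {name: 1 - len(mins) / total_minutes
            for name, mins in outages.items()}


def any_service_uptime(outages, total_minutes):
    """Uptime when a minute counts as 'down' if at least one service
    was down during it (the aggregate-chart way of counting)."""
    down_minutes = set().union(*outages.values())
    return 1 - len(down_minutes) / total_minutes


total = 100
# Service i is down only during minute i, so exactly one is down at any time.
outages = {f"svc{i}": {i} for i in range(100)}

print(min(per_service_uptime(outages, total).values()))  # 0.99
print(any_service_uptime(outages, total))                # 0.0
```

Every individual service reports 99% uptime while the "anything degraded" view reports 0%, which is why the official and unofficial charts can disagree so badly without either being computed wrong.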
Whilst you're waiting for it to come back, try out AGENT-CI (which is a project I built), which runs GitHub Actions on your machine: https://agent-ci.dev. (Open source, etc.)
No, it's not like "act", because it uses the standard Github runner. The difference is that the control plane is an emulation of api.github.com, and because of this we can do all kinds of nice things:
Caching in ~0 ms. Pause on failure, so you can let your AI agent fix it and retry without pushing.
I had an extremely bad experience trying to set up act on my Macbook. If this is something that actually works (and doesn't steal my credentials), I'm willing to try it despite AI non-features.
Yea, I've had only barely-success on only a few projects with act. Usually due to steps/scripts that use github-internal APIs, but afaict far from always.
I like that it exists, but what a freaking mess that it's necessary and so difficult to do.
I did not say that, what I said was: It's not like `act` because it's not a rewrite of the runner. It's the standard runner... So the one that actually runs GitHub Actions.
I have tried to use act many times, and many times I've failed.
P.S. pause on failure is also helpful for humans, but I'm trying to be realistic about where the future of programming is going...
`github-actions[bot]` was disabled for some time, if that's the actor which does the checkout in your setup it could be related. FWIW it's back to working now.
We use TeamCity for CI builds, before that Jenkins. Only accessible from the inside of the network.
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
If you don't want to self-host Gitea/Forgejo, I recommend SourceHut for private repos and Codeberg for public ones. Happy to answer any questions you might have for either based on my experience!
Looks like a terrible source. Like someone ran Claude on the codebase, didn't analyse the results, then vibe coded a blog post. And the dustri.org link doesn't work for me
Shout out to all my SF 5am crew checking if their overnight prs passed CI. Real 597 “member of technical staff” energy. I guess we should expect this, it is a Tuesday!
free service is down again, let's have everyone who uses the service for free complain again!!! (sorry for the sarcastic comment but i find it crazy how people feel entitled when it's free)
EDIT: sorry i meant this rant at the one complaining for the free service not for the paid customers (which is unacceptable)
I've been against self hosting internal tools for a long time mainly because of the devops and other overhead. But AI based devops makes it so easy to spin up whatever you want now that I'm reconsidering that. I use a lot of ansible for several of our deployments. At this point, most of that is managed via codex.
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SAAS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
This is great because I finally set up Actions yesterday for a new project of mine and of course it’s failing today and I was thinking I screwed up the yaml.
If you want an alternative to GitHub Actions, you could self-host Forgejo Actions, but I'm not that happy with the design.
I much prefer Woodpecker CI, which is an open source fork of Drone.io. It supports multiple Git backends like GitHub, Gitea, Forgejo, Gitlab, Bitbucket. It supports running jobs locally, on Docker, and on Kubernetes. And there's autoscalers built in for AWS, Hetzner, Linode, Vultr, and Scaleway. There's a bunch of 3rd party plugins (https://woodpecker-ci.org/plugins) for custom integrations. The UX is also very simple, with OAuth used not only for authentication/authorization but also setting up & accessing repos. The system architecture is great, with separate components that run stateless connected to a database, and a custom plugin is any program that takes environment variables and does stdio. The config file is a good balance of ugly YAML and convenience syntax like shell-style parameter expansion variables.
It probably takes less than 15 minutes to install, set up, and run WoodpeckerCI for a small team, so it's not a big investment to try out or host. With the autoscaling plugins it lets you scale your workload up to whatever size. Honestly you could run it on a laptop since it's written in Go.
The last two projects I built I did the CI/CD manually with a small win32 service that polls git and builds+deploys the main service locally. It's barely 200 lines of code. Not much to go wrong. "dotnet publish" is not difficult to wrap.
The latest language models have enabled this sort of thing for me. I can integrate a mini Jenkins into every project within a 5-10 minute prompting session. This sort of code isn't hard. It's just tedious, and the LLMs absolutely rock at boring repetitive stuff. Having a win32 service start up successfully on the very first try is something I haven't experienced until 2026.
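For anyone curious what the "poll git, then build and deploy" core of such a service might look like, here is a rough Python sketch (not the win32 service described above). The names `remote_head`, `poll_once` and `run_forever` are made up for illustration, and it assumes the remote is called `origin` and the branch is `main`.

```python
import subprocess
import time
from typing import Callable, Optional


def remote_head(repo_dir: str, branch: str = "main") -> str:
    """Fetch the branch from origin and return the commit hash it points at."""
    subprocess.run(["git", "fetch", "origin", branch], cwd=repo_dir, check=True)
    result = subprocess.run(
        ["git", "rev-parse", f"origin/{branch}"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def poll_once(
    last_built: Optional[str],
    get_head: Callable[[], str],
    build_and_deploy: Callable[[str], None],
) -> Optional[str]:
    """Build only when the remote head moved; return the last built commit."""
    head = get_head()
    if head != last_built:
        # If this raises, last_built is not advanced, so the next poll retries.
        build_and_deploy(head)
        return head
    return last_built


def run_forever(repo_dir: str, build_and_deploy: Callable[[str], None],
                interval_seconds: float = 60.0) -> None:
    """Poll in a loop; a real service would add logging and error handling."""
    last_built: Optional[str] = None
    while True:
        last_built = poll_once(last_built, lambda: remote_head(repo_dir),
                               build_and_deploy)
        time.sleep(interval_seconds)
```

The build step is passed in as a callable, so wrapping "dotnet publish" or anything else stays a one-liner on the caller's side.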
That works for relatively simple scenarios. When you have to add deploying SQL changes or updating something in the cloud, you'd have to include a lot more plumbing.
In my world CI/CD and db migrations are 2 different things working together. CI/CD at heart is rather simple for many setups. Migrations need quite a lot of scrutiny, you really don't want to mess up there. But if you run on GitHub Actions with 50/50 uptime, does it matter?
Deploying SQL changes is actually trivial if you are using SQLite.
I agree in a hosted+shared SQL scenario you have to be a little bit more careful with all of this. Arguably, you should have a separate schema management phase in these cases.
But if you are just using SQLite embedded in the service, you can use the user_version pragma to track schema version and perform deterministic migrations (assuming a user didn't manually jack with the file in-between).
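A minimal sketch of the user_version approach, in Python with the standard sqlite3 module. The `MIGRATIONS` list and the `users` table are invented for the example; the only real mechanism here is `PRAGMA user_version`, an integer SQLite stores in the database file.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any pending migrations in order and return the schema version."""
    conn.isolation_level = None  # manage transactions explicitly below
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            # PRAGMA does not accept bound parameters; the value is an int
            # produced by range(), so formatting it in is safe here.
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Calling `migrate` on service startup makes the upgrade idempotent: a database already at the latest version skips the loop entirely.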
If you would like less dependence on GitHub for issues and PRs, please check out GitSocial, it stores everything in git itself, making them portable and offline-first.
What problem is github solving that has led it to become critical infrastructure for so many? Is it that everyone is remote and VPNs are too much of a hassle to give everyone access to a build server? Is it serving as the authoritative auth for development services? Does it provide better compliance reporting? It just isn't apparent to me what github offers that you can't get elsewhere at the same cost and effort. I've been in some pretty large orgs with distributed personnel, but this just hasn't ever been a problem.
GitHub solved the original "code collaboration" problem, and now it's a default easy way to outsource repo management. It also has the most integrations. A lot of companies grew up using GitHub.
GitHub was, once upon a time, quite stable. Things have changed: more features, more usage, and automated agents.
I know what it does, but why is it such a problem that Actions is down? I think you did kind of answer it: "A lot of companies grew up using GitHub," i.e., they are using it as infrastructure by default, not because it does something that otherwise can't be done.
It’s well integrated into massively underpriced agentic coding (and noncoding) workflows, I doubt there’s much more reason than that. The hip thing to do now is hold all your docs in github instead of notion so your agent can traverse them locally
My first time using GH Actions was last week. GH was so flaky that pulling a submodule failed >50% of the time. I had to write a script to retry pulling the submodule in a loop.
I've done some hacky shit in CI scripts, but none made me more mad than that one.
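For reference, the retry-in-a-loop hack looks roughly like this in Python. The `retry` helper and its parameters are made up for the sketch; the git command is the usual `git submodule update --init --recursive`, and the growing delay between attempts is an assumption about what a polite retry would do, not what the script above did.

```python
import subprocess
import time


def retry(action, attempts=5, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure to CI
            sleep(delay * attempt)


def update_submodules():
    """Pull all submodules; raises CalledProcessError on a non-zero exit."""
    subprocess.run(
        ["git", "submodule", "update", "--init", "--recursive"],
        check=True,
    )


if __name__ == "__main__":
    retry(update_submodules)
```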
Is it about funds? Why is GitHub not keeping up with the traffic? I know there's been a mass rush to GitHub recently, especially due to Claude Code steering users toward GitHub, sometimes even persuasively.
Sometimes it is. There are some incredibly brute-force yet simple and elegant patterns that power some of the biggest-scale systems you could think of.
It is relatively easy to scale a collection of simple things to extreme and exhibit complex behavior together. It is a lot harder to scale something complex to extreme. But too many times the latter is the default - designed wrong from the ground up and stuck in scaling hell.
>Copilot: Do you want me to implement consequences for you or babble on and on about what might entirely be a figment of your imagination (Github is up and you're on a 48 hour bender without sleep)
I think this can only happen if there are viable alternatives.
For instance, the UI at setups such as https://git.devuan.org/Daemonratte/gtk2-ng is quite ok-ish, in my opinion. Granted, it is mostly copy/paste from github but that still is about 1000000x better than sourceforge's interface - and gitlab's UI too (I just hate gitlab's UI, they seem to love complexity and a billion features only 0.000001% ever need; GitHub, with all its faults, is for the most part really simple - not everywhere, e. g. GitHub wiki setup sucks, but by and large I think it is simple overall).
Once you get used to it, it's not so bad which is probably true for all functional UIs. I switch between gitlab and GitHub quite a lot and I can't say which one is objectively better. I do like that cross-linking is easier in GitHub but I prefer gitlab ci over GitHub actions. Too bad that gitlab ci runner has removed the command to run ci locally but third party foss solutions are there.
If replacing github wholesale isn't viable, what does the story for replacing GitHub Actions look like currently? I don't remember the pre-Github-Actions days of everyone using CircleCI with a github integration in a negative light. I've noticed that since then a couple of CI providers have sprung up that differentiate themselves with faster build speeds, but I haven't really kept up with that market
I initially thought it was because I ran out of Actions minutes, and was about to upgrade my plan
Lucky I came here before hitting the confirm payment button
The main operating model with git is going to go back to decentralized. Setting up and managing something like https://forgejo.org/ is a way better experience than constant interruptions by a faulty service that can't meet demand.
The open source contribution model as we once knew it is dead; you're not going to accept patches from random agents. The risk is way too high. And you can see that increasingly "AI Slop" makes it difficult to be a maintainer of any semblance of a popular repo.
So what's the value? A durable place to store work? hah.
Discovery? That part of Github has always been shitty.
So that leaves.. Github Actions? The thing that is down every other day and has been the subject of a few ~rug pulls~/attempted price hikes that are almost surely coming back?
This is your periodic reminder that Github is growing at ~14x (1400%!) annually. This would be incredible growth for a young, unprofitable, VC-funded startup, even Uber never achieved more than ~3x AFAIK. For a widely-established company that was already very well known and a market leader in its niche for many years? Absolutely unprecedented.
This is a conservative estimate assuming linear growth, the actual number is likely going to be higher. Much higher.
It's not too hard to grow 14X YoY if you start from a hundred customers. If you have hundreds of millions? Yeah, not so easy.
In my mind there's no doubt Github datacenters can't handle the recent load that came after agentic AI. They just need to get new servers. It's simple as that.
It's crazy to us how Github Actions have these issues but Azure DevOps never has these hiccups for us even though we hear they're on the "same infra". We're happy to stick with DevOps.
Who says Azure devops is on the same infra as GitHub? I mean, sure they're both hosted in Azure data centers, but there's very little else shared between them AFAIK. I used to work for Microsoft and I heard about the grand plans to merge the two but I don't think it ever really happened.
Microsoft is really working hard to kill off GitHub now. That's quite amazing.
We have already seen this over the last few weeks, but now this has become a meme that keeps on giving. GitHub down! GitHub up again. GitHub Down! GitHub ... ...
As an indie hacker I want to see GitHub succeed, but I ditched actions years ago - (shocking) false economy. Spent entire nights pushing to actions over and over only for complaints about weird/niche dependency issues and other oddities - the cycle time's just too slow and the DX is no fun (my pain doesn't even factor in outages; just the feature itself as it's intended to be experienced). I want to spend time talking to users and building features, not debugging weird syntax or dependency issues on a remote machine non-interactively.
So why are Actions so unreliable anyway? Occam's Razor would probably suggest the domain is inherently complex/difficult; but other providers show that reliability is possible. What would Occam's Razor suggest next? Poor management..?
> How do you ensure you didn’t forget to run the tests?
Reasonable concern. In ~10 years of indie development, I haven't forgotten to run tests before pushing to main, ever. So setting up and maintaining complicated machinery to solve a problem that could happen (but never has) doesn't justify taking focus off other more important things, namely building.
The benefit probably increases with team size (I'm a team of 1, so I appreciate the luxury of being able to dodge CI/CD entirely).
I think it comes down to risk tolerance. For an established company that wants to avoid upsetting users at all costs, CI/CD makes sense. But for a nimble 'move fast and break things' startup, it can steal dev time for very little upside.
Say a disaster happens and someone pushes to main without running tests, 9 times out of 10 it will be of ~zero consequence (either the code works first time, it was a cosmetic change that hardly affected users etc).
I know there are horror stories and CI/CD would have prevented some of those, but IME they're just not that common nor severe for small operations, and even when they happen, only a small subset are irreversible/unfixable.
I recently switched from GitHub Actions to Buildkite + self-hosted runners.
Setting it all up would have been tediously annoying eight months ago (Buildkite requires setting up GitHub webhooks for each repo).
Last week I just had codex set up everything, ephemeral vm runners and all, using a couple of low-spec refurb mac minis, Buildkite’s API, a short-lived API token, and migrate my repositories one by one.
So far so good, it’ll pay for itself within two to three months, and following today’s outage I suggested at work that we experiment with the same set up.
[OP] cebert | 6 hours ago
fidotron | 6 hours ago
cpfohl | 6 hours ago
https://news.ycombinator.com/item?id=47237377
Waterluvian | 6 hours ago
cpfohl | 5 hours ago
ramon156 | 6 hours ago
... You're off the hook this time./s
Andrex | 5 hours ago
cpfohl | 5 hours ago
- So many super-heroes/super-villains
folkrav | 5 hours ago
thesdev | 5 hours ago
JsonDemWitOster | 5 hours ago
I vibe coded a script that interacts with both Gitlab and Github via their APIs and I've been using it pretty heavily since this morning. I crossed the streams! Goodness, I didn't know it would be _this_ bad!
zombot | 3 hours ago
swyx | 3 hours ago
spooky action at a distance
nivekney | 6 hours ago
LorenDB | 5 hours ago
bouk | 6 hours ago
sebmellen | 6 hours ago
ketzu | 6 hours ago
On my repo the jobs do not get scheduled on the PRs at all, so I assume that separation wouldn't help for todays issue.
voxic11 | 4 hours ago
anon7000 | 3 hours ago
sofixa | 6 hours ago
hsbauauvhabzb | 5 hours ago
sofixa | 5 hours ago
lazystone | 5 hours ago
PunchyHamster | 4 hours ago
decodebytes | 6 hours ago
We can't be blocked here. Seems silly what we settled on this, but for a long time GitHub had been reliable enough for many years, but things are sliding down the pan as of late.
mystifyingpoi | 5 hours ago
dnnddidiej | 6 hours ago
re-thc | 5 hours ago
Wait until they charge you for self-hosting runners.
Oh wait. They already tried.
the8472 | 5 hours ago
pluc | 4 hours ago
You can now hire me as an overpriced consultant instead of paying Microsoft.
cryo32 | 4 hours ago
Been burned too many times on that one.
999900000999 | 3 hours ago
Move to EC2.
Darn AWS is down.
Alright, run it on a Mac Mini in your basement. Ahh damn, your ISP is having issues. Good thing you have a backup 5G hotspot.
Ohh no, the power is out.
Eventually you have to trust someone else.
GitHub is a tragedy of the Commons. Too many people are using it, and Microsoft isn't willing to handle it correctly.
Feels like a very good business opportunity. Minimum 50k yearly contracts, GitHub with actual uptime. GitPro ?
bee_rider | 3 hours ago
sleight42 | 3 hours ago
999900000999 | 2 hours ago
This is supposed to be Hacker News! Who is coming up with a startup to fill the gap !
cryo32 | 3 hours ago
Aggregate risk is too high.
bouk | 3 hours ago
matt_kantor | 2 hours ago
You should never entirely depend on a third party service to run your tests, either.
Cthulhu_ | 3 hours ago
yoyohello13 | 3 hours ago
Salgat | 3 hours ago
mohsen1 | 6 hours ago
heeton | 6 hours ago
mohsen1 | 5 hours ago
altern8 | 6 hours ago
insanitybit | 6 hours ago
I don't think vibecoding at Github has much to do with it.
altern8 | 6 hours ago
That makes sense. Thank you!
gilrain | 6 hours ago
datsci_est_2015 | 5 hours ago
gilrain | 5 hours ago
I don’t buy the excuse. I want to hitch my wagon to those “mysteriously lucky” competitors. (And have. And haven’t had similar issues to Github, since.)
datsci_est_2015 | 3 hours ago
Tough to say as this is all speculative, though.
porridgeraisin | 3 hours ago
abejfehr | 5 hours ago
necovek | 2 hours ago
vitally3643 | 3 hours ago
Think critically.
ModernMech | 4 hours ago
12_throw_away | 2 hours ago
agentic "ai" is going great
[OP] cebert | 6 hours ago
gilrain | 6 hours ago
rwmj | 5 hours ago
abejfehr | 5 hours ago
llbbdd | 3 hours ago
dawnerd | 3 hours ago
rossant | 3 hours ago
cautiouscat | 6 hours ago
That being said, there was a noticeable trend starting around 2022.[2] They’ve also been doing a big migration to Azure. It’s likely a combination of things.
1: https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...
2: https://www.reddit.com/r/sysadmin/s/LOMPaSv3wY
jampekka | 6 hours ago
https://damrnelson.github.io/github-historical-uptime/
https://news.ycombinator.com/item?id=47591928
chilmers | 5 hours ago
sarchertech | 5 hours ago
Gigachad | 4 hours ago
sarchertech | 4 hours ago
coreyh14444 | 5 hours ago
r0b05 | 5 hours ago
martinald | 5 hours ago
AlienRobot | 5 hours ago
a10c | 6 hours ago
Which certainly made me shit myself, briefly.
grim_io | 6 hours ago
lachieh | 3 hours ago
DonHopkins | 3 hours ago
https://www.youtube.com/watch?v=LGeOee7x5lY
drcongo | 5 hours ago
jaapz | 5 hours ago
echelon | 5 hours ago
Maybe the Github Actions infrastructure isn't run like that.
edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262
TheDong | 5 hours ago
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
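To make "page on rates, not on a single 500" concrete, here is a rough Python sketch of a sliding-window error-rate check. The class name, thresholds and window size are all invented for illustration and are not how any of the services named above actually alert.

```python
from collections import deque


class ErrorRateAlert:
    """Flag when the share of 5xx responses in a time window gets too high."""

    def __init__(self, window_seconds=300.0, max_error_rate=0.01,
                 min_requests=100):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, status: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self._events.append((timestamp, status >= 500))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_requests:
            return False  # too little traffic to judge a rate
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.max_error_rate
```

The `min_requests` floor is what keeps a lone 500 at 3am from paging anyone: with almost no traffic in the window, one failure is not treated as a rate.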
CBLT | 5 hours ago
echelon | 5 hours ago
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
sunrunner | 4 hours ago
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
LPisGood | 4 hours ago
https://youtu.be/zR9PpXWsKFQ
eithed | 3 hours ago
Even if it's "DB in datacenter I tried to save to was hit by meteor" event, you can cater for this not to result in 500 (ie - DB unreachable, retry in a couple of minutes); the question is if you want to.
theta_d | 4 hours ago
alexfoo | an hour ago
Doohickey-d | 5 hours ago
Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
Which makes me think a small amount of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
KPGv2 | 5 hours ago
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
Anon1096 | 4 hours ago
maccard | 4 hours ago
bobthepanda | 3 hours ago
It does require constant tuning and adjustment though.
awithrow | 5 hours ago
hnlmorg | 4 hours ago
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
jordemort | 5 hours ago
swiftcoder | 4 hours ago
compumike | 4 hours ago
If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
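A toy Python version of the "N consecutive failed checks" rule, just to illustrate the idea. `ConsecutiveFailureAlert` is a made-up name, and it assumes checks run at a fixed interval, so N consecutive failures already imply the M-minute span.

```python
class ConsecutiveFailureAlert:
    """Page once when a health check has failed N times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True only when it is time to page."""
        if ok:
            self._failures = 0  # any success resets the streak
            return False
        self._failures += 1
        # Fire exactly when the streak reaches the threshold, not on every
        # later failure, so an ongoing outage does not page repeatedly.
        return self._failures == self.threshold
```

A one-off 500 never reaches the threshold, so it can wait for Monday; a dead DB endpoint trips it after N checks.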
wasmitnetzen | 3 hours ago
bobthepanda | 3 hours ago
lokar | 2 hours ago
alexfoo | an hour ago
As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).
Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.
hvb2 | 4 hours ago
I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.
A 50 year old bank API? Maybe...
rhyperior | 4 hours ago
hnlmorg | 5 hours ago
If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
But effective monitoring is harder than people assume.
dncornholio | 4 hours ago
Isn't that what monitoring actually is? The issue seems to be in their testing, not monitoring.
hnlmorg | 3 hours ago
There are synthetic tests, where you can generate API request calls or even simulate an entire user journey. These allow you to control the user agent and the payloads, and thus you know anything that errors back is an actual error. These are triggered by the observability platform (think like running a cron-job) and thus you're not tied to user activity to see when problems arise.
There are other metrics outside of HTTP response codes too. Think like free RAM, CPU usage, disk space, etc. This is just naming some obvious ones because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally have alerts when an irregular trend is showing that things are likely to fail too. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
Then you have more traditional stuff like logs. This will also be bespoke to the application. But you'd expect errors in logs to get surfaced quickly. Assuming Github have good hygiene in what's being logged.
Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.
(this is just a super high level view of observability too)
lokar | 3 hours ago
You should not alert on cpu, ram, etc
hnlmorg | 3 hours ago
It doesn't "need" that. That's just how most people set it up because it’s an easy sane default that allows for network jitter without inexperienced engineers thinking about different conditions triggering different types of responses.
If you’re measuring internal APIs from an observability solution that has nodes already inside your network enclave, then there is a strong argument for alerting early.
> You should not alert on cpu, ram, etc
That’s not true to say as an absolute statement. As a generalisation, it heavily depends on the system you’re monitoring and how it behaves under pressure.
But in any case, I wasn’t suggesting CPU alerts were the end goal. I said:
> these types of metrics are generally bespoke to the type of application you're monitoring.
Ie you’ll use metrics but those metrics will be highly specific.
The CPU examples were an illustration as to what a “metric” is (it might seem obvious but not everyone is an expert) but the point was HTTP response codes aren't the only types of metrics one should be capturing and watching.
lokar | 2 hours ago
If your requests are fast and cheap, you can probe frequently relative to your goals, but often that's not really possible (think, long SQL queries, or scheduling a container/pod). There you need several datapoints, or possibly fewer augmented with other signals.
hnlmorg | 2 hours ago
Talking about long SQL queries, I quite like throwing CPU alerts on database servers. They'll be a low priority alert (ie no out of hours "pagers") so just something that goes into a slack channel. But they're a good indicator of when developers have poorly optimized SQL, or the DB schema is poorly defined (eg missing indexes), or the DB server itself is poorly sized.
This wouldn't be something you'd expect to need in production and definitely not something you'd rely on as a notice of a production outage. But it is an example of one of those 1% occasions where a CPU alert does add value to the overall observability of the application.
But this also ties into your excellent point about how you'd use CPU and other data points to build a picture of what's happening in your application.
lokar | 2 hours ago
idle CPU is often wasted CPU
re-thc | an hour ago
Who says public status page equals internal monitoring.
They likely know faster than you. Whether they post it publicly is a different issue (hint: SLA penalties, news impacting stock etc)
hnlmorg | 44 minutes ago
Are you sure you’re replying to the right comment?
logifail | 3 hours ago
Is it true that official service status pages are updated automatically?
baby_souffle | 3 hours ago
Depends. Typically no because there’s an art to crafting the actual message around impact… but sometimes yes it is automated
registeredcorn | 2 hours ago
Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at, and can feel like they are "doing something"? Or is it more of a way of confirming what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"
filleduchaos | 2 hours ago
simonjgreen | 5 hours ago
jordemort | 5 hours ago
swiftcoder | 4 hours ago
PunchyHamster | 4 hours ago
simonjgreen | 3 hours ago
re-thc | 5 hours ago
No, it's not. Official updates = potential SLA penalties. Always requires approval.
drcongo | 2 hours ago
chrisjj | an hour ago
There's a threshold. It shows only once 1000 users complain.
/i
dvduval | 5 hours ago
weird-eye-issue | 5 hours ago
superxpro12 | 4 hours ago
denisw | 4 hours ago
weird-eye-issue | 4 hours ago
neya | 3 hours ago
onion2k | 2 hours ago
coldpie | 2 hours ago
ornornor | 2 hours ago
doctorpangloss | 2 hours ago
being a host for git repositories has never been its core competency. neither has its groupware offering.
does it even serve OSS well? a very interesting criterion is, "Have mature or adopted end-user-facing OSS recently merged a large PR from an unallied contributor?" The answer is overwhelmingly no. This is why there is so much innovation in this space.
danudey | an hour ago
fusishch | 2 hours ago
Just set up a Kubernetes deployment and you’re set.
But as others mention, GitHub’s primary strength is collaboration. If you want decentralized, solve this by creating a decentralized collaboration tool on top of fossil and/or git.
For example, how to do pull requests and code reviews?
40four | an hour ago
gopalv | 2 hours ago
The day it broke away and became centralized was when we had a PR + mandatory "Required actions" to merge to main.
marricks | an hour ago
Gosh, it's hard figuring out what changes Lorne made if only we had a system to merge those changes. Enter git
Gosh it's hard figuring out what packages Rachel had to make this work. Enter rubygems/pip/npm
Gosh it's hard figuring out how to sync these changes across a network. Enter github
Gosh it's hard figuring out how to get those packages working on my operating system. Enter docker
Gosh centralizing our distributed version control software system onto one website is getting really unreliable. Enter fossil(?????)
If we go any further, having one computer per business with a sign-up sheet is starting to sound pretty fucking attractive.
corvad | 2 hours ago
PokemonNoGo | 2 hours ago
filleduchaos | 2 hours ago
jrockway | 2 hours ago
mpaco | 2 hours ago
Proudly self-hosting Forgejo since then.
MatthiasPortzel | 2 hours ago
> Our team is currently experiencing an unexpectedly high volume of tickets which has resulted in longer response times than we prefer. We acknowledge the long wait and apologize for the experience.
> Sometimes our abuse detecting systems highlight accounts that need to be manually reviewed. We've cleared the restrictions from your account…
Fully self-hosted IMO can be an overcorrection. The issue isn’t “relying on other people”—it’s relying on GitHub, when they’ve made it clear they don’t care about uptime and they don’t care about support turn-around-time.
iso1631 | 2 hours ago
It would be a pain as I'd have to set up a few integrations again, but github is far lower down the risk scale than the vast majority of SAAS providers
rvz | 6 hours ago
This time today it was caused by friendly fire by the automatic suspension of the GitHub Actions bot which is now a "Ghost" user. Since there is no CEO of GitHub to contact, we are just going to see more [1] of this again.
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self hosted as I said 6 years ago...[2]
[0] https://www.githubstatus.com/incidents/g6ffrm0rfvz9
[1] https://news.ycombinator.com/item?id=48085501
[2] https://news.ycombinator.com/item?id=22867803
throwatdem12311 | 6 hours ago
SideburnsOfDoom | 6 hours ago
Or maybe it's before the GitHub internal devs are online and deploying changes.
baalimago | 6 hours ago
comboy | 6 hours ago
sh-cho | 6 hours ago
Andrex | 5 hours ago
jaapz | 5 hours ago
bobmcnamara | 5 hours ago
kminehart | 6 hours ago
conroydave | 6 hours ago
mdrachuk | 5 hours ago
kminehart | 5 hours ago
asimovDev | 5 hours ago
kylegalbraith | 5 hours ago
[0] https://depot.dev
heeton | 5 hours ago
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Trigger the workflows directly on depot via CLI?
kylegalbraith | 5 hours ago
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream actions control plane. But it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
kevinminehart | 5 hours ago
kylegalbraith | 5 hours ago
[0] https://depot.dev/products/ci
a1o | 5 hours ago
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
ttouch | 5 hours ago
https://www.blacksmith.sh/ and https://runs-on.com/
They also say that they're much cheaper than github
kevinminehart | 5 hours ago
4lun | 5 hours ago
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
kylegalbraith | 5 hours ago
efromvt | 6 hours ago
comboy | 6 hours ago
JsonDemWitOster | 5 hours ago
(Ofc, in a sensible universe, we just brush that off to a JS/Firefox glitch or my ISP.)
And yet, here I am. My code is not compiling, my AI isn't vibing, nonetheless I can't work! Two more hours before I can get off!
sylware | 6 hours ago
matt_kantor | 5 hours ago
sylware | 3 hours ago
I am trying to refrain from my "off topic" rants... but such microsoft github abuse is generating so much hate due to their dominant market position, it is hard.
couAUIA | 6 hours ago
Thanks for pointing out that nobody is using that thing
chocrates | 6 hours ago
Andrex | 5 hours ago
Jesus, that's both horrible and seems within reach.
LorenDB | 5 hours ago
https://mrshu.github.io/github-statuses/
a1o | 5 hours ago
alexfoo | 4 hours ago
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality, most people use only 3 or 4 of the core services most of the time, but since there's no "core services" SLA/uptime figure, the real usability of github for most people is somewhat obscured.
JCTheDenthog | 4 hours ago
Miner49er | 5 hours ago
pistoriusp | 6 hours ago
No, it's not like "act", because it uses the standard Github runner. The difference is that the control plane is an emulation of api.github.com, and because of this we can do all kinds of nice things:
Caching in ~0 ms. Pause on failure, so you can let your AI agent fix it and retry without pushing.
ramon156 | 6 hours ago
Is what it boils down to.
> codex "Fix this pipeline, use `act` to verify your changes"
Xirdus | 5 hours ago
Groxx | 4 hours ago
I like that it exists, but what a freaking mess that it's necessary and so difficult to do.
pistoriusp | 13 minutes ago
pistoriusp | 5 hours ago
I have tried to use act many times, and many times I've failed.
P.S. pause on failure is also helpful for humans, but I'm trying to be realistic about where the future of programming is going...
a1o | 5 hours ago
I started playing with proxmox VMs and containers in them (docker and tart) to see if I can build some local infrastructure to properly solve this…
pistoriusp | 5 hours ago
The jobs run via containers.
skinfaxi | 5 hours ago
pistoriusp | 4 hours ago
gib444 | 6 hours ago
- GitHub
- Hiring budgets
- RAM (/personal computing in general)
- Electricity
- Media/Content
- Truth
rock_artist | 5 hours ago
aa-jv | 5 hours ago
This is why we don't use Github Actions, kids.
Seriously, it's a proprietary build service that puts the keys to the kingdom in someone else's control. Just: No!
Print this status page to PDF so you've got it handy next time someone castigates you for not using Github Actions, folks.
vucetica | 5 hours ago
dsco | 5 hours ago
eatyourpeas | 5 hours ago
maratc | 5 hours ago
carreau | 5 hours ago
hkleppe | 5 hours ago
Mashimo | 5 hours ago
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
liamdoyle | 5 hours ago
I like being able to vote with my (team's) wallet and I'm tired of staying out of convenience
rebolek | 5 hours ago
moonrailgun | 5 hours ago
BrunoBernardino | 5 hours ago
danieloj | 5 hours ago
shwetanshu21 | 5 hours ago
peterspath | 5 hours ago
cryptos | 5 hours ago
gib444 | 4 hours ago
Anyway. Forgejo's response to it: https://floss.social/@forgejo/116494295922963052
adamddev1 | 5 hours ago
hk1337 | 5 hours ago
I'm guessing related to this? The blog post is dated 11 days ago but I just noticed a blue banner on my actions page today.
r0b05 | 5 hours ago
fooster | 5 hours ago
TestUser00 | 5 hours ago
jonathanbull | 5 hours ago
amirhirsch | 5 hours ago
markfsharp | 5 hours ago
hansmayer | 5 hours ago
hmmdog | 5 hours ago
Cupprum | 5 hours ago
tcp_handshaker | 5 hours ago
"Microsoft’s GitHub was positioned to win the AI coding race. Outages got in the way" - https://www.cnbc.com/2026/05/22/microsoft-was-positioned-to-...
parisiansam | 5 hours ago
EDIT: sorry, I meant to aim this rant at the one complaining about the free service, not at the paid customers (for whom this is unacceptable)
nelsonfigueroa | 5 hours ago
robbie-c | 5 hours ago
a1o | 5 hours ago
jmkni | 4 hours ago
PunchyHamster | 4 hours ago
gloosx | 4 hours ago
https://github.com/pricing
devil1432 | 5 hours ago
mattbrewsbytes | 5 hours ago
jillesvangurp | 5 hours ago
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SAAS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
halapro | 4 hours ago
Technically Dropbox is just rsync.
Also https://xkcd.com/1319/ but for maintenance.
Barbing | 2 hours ago
21asdffdsa12 | 5 hours ago
theologan | 5 hours ago
jamie_davenport | 5 hours ago
Perfect timing that we post https://www.jxd.dev/writing/building-plain just as this latest incident started.
stevenhubertron | 5 hours ago
mghackerlady | 5 hours ago
katss | 4 hours ago
trollbridge | 4 hours ago
Something’s wrong when my own infrastructure is more reliable than Microsoft’s.
stuff4ben | 4 hours ago
smilespray | 4 hours ago
dbuckman | 4 hours ago
dbuckman | 4 hours ago
0xbadcafebee | 4 hours ago
I much prefer Woodpecker CI, which is an open source fork of Drone.io. It supports multiple Git backends like GitHub, Gitea, Forgejo, Gitlab, Bitbucket. It supports running jobs locally, on Docker, and on Kubernetes. And there's autoscalers built in for AWS, Hetzner, Linode, Vultr, and Scaleway. There's a bunch of 3rd party plugins (https://woodpecker-ci.org/plugins) for custom integrations. The UX is also very simple, with OAuth used not only for authentication/authorization but also setting up & accessing repos. The system architecture is great, with separate components that run stateless connected to a database, and a custom plugin is any program that takes environment variables and does stdio. The config file is a good balance of ugly YAML and convenience syntax like shell-style parameter expansion variables.
It probably takes less than 15 minutes to install, set up, and run WoodpeckerCI for a small team, so it's not a big investment to try out or host. With the autoscaling plugins it lets you scale your workload up to whatever size. Honestly you could run it on a laptop since it's written in Go.
(to clarify for beginners: the config file docs are found in a section called "workflow syntax" (https://woodpecker-ci.org/docs/usage/workflow-syntax) and variable parameter expansion is buried deep in an environment variables page called "string operations" (https://woodpecker-ci.org/docs/usage/environment#string-oper...). poorly organized docs aside, the system itself works well)
j45 | 4 hours ago
bob1029 | 4 hours ago
The latest language models have enabled this sort of thing for me. I can integrate a mini Jenkins into every project within a 5-10 minute prompting session. This sort of code isn't hard. It's just tedious, and the LLMs absolutely rock at boring repetitive stuff. Having a win32 service start up successfully on the very first try is something I haven't experienced until 2026.
starik36 | 3 hours ago
peheje | 3 hours ago
"Update something in the cloud" <- What do you mean?
Yokohiii | 2 hours ago
That only works on extremely simple setups and has risks. If you have only a single server, you can stall it. Now, how to roll back?
Yokohiii | 3 hours ago
bob1029 | 52 minutes ago
I agree in a hosted+shared SQL scenario you have to be a little bit more careful with all of this. Arguably, you should have a separate schema management phase in these cases.
But if you are just using SQLite embedded in the service, you can use the user_version pragma to track schema version and perform deterministic migrations (assuming a user didn't manually jack with the file in-between).
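A minimal sketch of that user_version approach, using Python's standard sqlite3 module. The table, its columns, the MIGRATIONS list, and the open_db/migrate helpers are hypothetical placeholders, not anything from a real service. The only SQLite feature relied on is PRAGMA user_version, an integer in the database header that SQLite stores but does not interpret.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE jobs ADD COLUMN status TEXT NOT NULL DEFAULT 'queued'",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order; return the resulting schema version.

    Assumes the connection was opened with isolation_level=None so that
    transactions are controlled explicitly with BEGIN/COMMIT/ROLLBACK.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            # The version bump is part of the same transaction as the schema
            # change, so a failed step leaves both untouched.
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return conn.execute("PRAGMA user_version").fetchone()[0]


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and bring its schema up to date before use."""
    conn = sqlite3.connect(path, isolation_level=None)
    migrate(conn)
    return conn
```

Calling open_db at service startup makes the upgrade a no-op once the file is current. It does nothing about the "someone edited the file by hand" case mentioned above.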
dncornholio | 4 hours ago
Self hosted Gitlab with self hosted (or AWS) runners running your pipelines.. We only use Github as a mirror for our public repositories.
delf | 3 hours ago
CSMastermind | 3 hours ago
tom1337 | 3 hours ago
shadowbip | 3 hours ago
ibejoeb | 3 hours ago
Xorlev | 3 hours ago
GitHub was, once upon a time, quite stable. Things have changed: more features, more usage, and automated agents.
ibejoeb | 3 hours ago
repeekad | 3 hours ago
vervas | 3 hours ago
vitally3643 | 3 hours ago
I've done some hacky shit in CI scripts, but none made me more mad than that one.
Onplana | 3 hours ago
discordianfish | 3 hours ago
surdu | 3 hours ago
https://www.githubstatus.com/uptime?page=31
13hunteo | 3 hours ago
Barbing | 2 hours ago
gbear605 | 2 hours ago
surajrmal | 2 hours ago
discordianfish | 2 hours ago
AlienRobot | 2 hours ago
Barbing | 3 hours ago
If Google owned GitHub would they be better positioned to scale?
Aperocky | 2 hours ago
It is relatively easy to scale a collection of simple things to the extreme and have them exhibit complex behavior together. It is a lot harder to scale something complex to the extreme. But too many times the latter is the default - designed wrong from the ground up and stuck in scaling hell.
booleandilemma | 3 hours ago
throwawaypath | 2 hours ago
thepaulmcbride | 3 hours ago
cyanydeez | 3 hours ago
pnvdr | 3 hours ago
shevy-java | 2 hours ago
For instance, the UI at setups such as https://git.devuan.org/Daemonratte/gtk2-ng is quite ok-ish, in my opinion. Granted, it is mostly copy/paste from github but that still is about 1000000x better than sourceforge's interface - and gitlab's UI too (I just hate gitlab's UI, they seem to love complexity and a billion features only 0.000001% ever need; GitHub, with all its faults, is for the most part really simple - not everywhere, e. g. GitHub wiki setup sucks, but by and large I think it is simple overall).
danudey | 2 hours ago
Bnjoroge | 2 hours ago
dilawar | an hour ago
wongarsu | an hour ago
joshuanapoli | an hour ago
thesurlydev | an hour ago
ripitrust | 3 hours ago
nickstinemates | 3 hours ago
The open source contribution model as we once knew it is dead; you're not going to accept patches from random agents. The risk is way too high. And you can see that increasingly "AI Slop" makes it difficult to be a maintainer of any semblance of a popular repo.
So what's the value? A durable place to store work? hah.
Discovery? That part of Github has always been shitty.
So that leaves.. Github Actions? The thing that is down every other day and has been the subject of a few ~rug pulls~/attempted price hikes that are almost surely coming back?
miki123211 | 3 hours ago
This is a conservative estimate assuming linear growth, the actual number is likely going to be higher. Much higher.
It's not too hard to grow 14X YoY if you start from a hundred customers. If you have hundreds of millions? Yeah, not so easy.
[1] https://x.com/kdaigle/status/2040164759836778878
rightbyte | 2 hours ago
bdangubic | 3 hours ago
ceheaaf | 3 hours ago
emirhanerkan | 2 hours ago
qwerpy | 2 hours ago
Reminds me of the occasional “JavaScript developer tries to vibe debug a Linux kernel issue” comments we get here.
hxii | 2 hours ago
With all the recent negativity – how are they not even TRYING to fix the damn thing?
sibidharan | 2 hours ago
ValentineC | 2 hours ago
https://www.reddit.com/r/GithubCopilot/comments/1toa9tf/mode...
bezier-curve | 2 hours ago
[1] https://github.com/ericc-ch/copilot-api
sigbottle | 2 hours ago
mcrittenden | an hour ago
vyrotek | 2 hours ago
sgerenser | 2 hours ago
hartjer | 2 hours ago
shevy-java | 2 hours ago
We have already seen this in the last few weeks, but now this has become a meme that keeps on giving. GitHub down! GitHub up again. GitHub Down! GitHub ... ...
nomilk | 2 hours ago
So why are Actions so unreliable anyway? Occam's Razor would probably suggest the domain is inherently complex/difficult; but other providers show that reliability is possible. What would Occam's Razor suggest next? Poor management..?
frisbee6152 | 2 hours ago
nomilk | 2 hours ago
xixixao | an hour ago
You’d need at least some hash of sources + test results, and check that it matches that (in CI).
And you’d still deal with environment differences.
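A rough sketch of the "hash of sources + test results" idea, under loud assumptions: the function names, the JSON record layout, and the "*.py" file pattern are all made up for illustration. A record produced locally like this is honor-system only, since nothing stops someone from writing tests_passed by hand.

```python
import hashlib
import json
from pathlib import Path


def tree_hash(root: Path, patterns=("*.py",)) -> str:
    """Hash relative paths and contents of matching files, in sorted order."""
    digest = hashlib.sha256()
    files = sorted({p for pat in patterns for p in root.rglob(pat) if p.is_file()})
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        # Hash each file separately so entries have a fixed-size suffix.
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def write_attestation(root: Path, tests_passed: bool, out: Path) -> None:
    """Run locally after the test suite: record what was tested and the result."""
    record = {"tree": tree_hash(root), "tests_passed": tests_passed}
    out.write_text(json.dumps(record))


def verify_attestation(root: Path, attestation: Path) -> bool:
    """Run in CI: accept only if tests passed on exactly these sources."""
    record = json.loads(attestation.read_text())
    return record["tests_passed"] is True and record["tree"] == tree_hash(root)
```

The CI side only recomputes the hash and compares, so it stays cheap. It says nothing about whether the machine that ran the tests resembles production.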
nomilk | an hour ago
Reasonable concern. In ~10 years of indie development, I haven't forgotten to run tests before pushing to main, ever. So setting up and maintaining complicated machinery to solve a problem that could happen (but never has) doesn't justify taking focus off other more important things, namely building.
The benefit probably increases with team size (I'm a team of 1, so I appreciate the luxury of being able to dodge CI/CD entirely).
csomar | 45 minutes ago
nomilk | 28 minutes ago
Say a disaster happens and someone pushes to main without running tests, 9 times out of 10 it will be of ~zero consequence (either the code works first time, it was a cosmetic change that hardly affected users etc).
I know there are horror stories and CI/CD would have prevented some of those, but IME they're just not that common nor severe for small operations, and even when they happen, only a small subset are irreversible/unfixable.
juanre | an hour ago
ElFitz | 2 hours ago
Setting it all up would have been tediously annoying eight months ago (Buildkite requires setting up GitHub webhooks for each repo).
Last week I just had codex set up everything, ephemeral vm runners and all, using a couple of low-spec refurb mac minis, Buildkite’s API, a short-lived API token, and migrate my repositories one by one.
So far so good, it’ll pay for itself within two to three months, and following today’s outage I suggested at work that we experiment with the same setup.
They’re considering it.
paulbjensen | 2 hours ago
timedude | 2 hours ago
gt010 | 2 hours ago
suis_siva | an hour ago
vednig | an hour ago
mattaustin | an hour ago
saltyoldman | an hour ago