Oof, off topic, but the trains were out of service here for my commute last night, so I thought from the headline this meant that somehow all trains everywhere had just stopped working. Glad to see it's just some SaaS product that's down.
You don't think "railway" at least conjures ideas about the company? It's not some random word. Not every company needs to be "helps you ship software quickly inc"
Possible empirical justification: Non-tech and more "typical" orgs (train companies...) don't spend lots of money on slick-sounding one-word .com domains.
Shell was originally very literal though. They sold seashells.
> The "Shell" Transport and Trading Company (the quotation marks were part of the legal name) was a British company, founded in 1897 by Marcus Samuel, 1st Viscount Bearsted, and his brother Samuel Samuel. Their father had owned an antique company in Houndsditch, London, which expanded in 1833 to import and sell seashells, after which the company "Shell" took its name.
IDK, it looks like servers were up, connectivity worked well, and some builds were failing. I wouldn't call that a big issue, and the same thing was happening with Vercel yesterday too, due to their git clone problems, etc.
Joke about train line aside, I think Railway fits right in the spot that Heroku left.
They have a nice UI and support deploying any kind of backend app, as long as it can be built into a Docker container, while many PaaS offerings out there seem to prioritize frontend-only apps.
And they have a free plan, so people can quickly deploy a POC before deciding whether it's worth moving forward.
Does anyone know of another PaaS that comes with a low-cost starter plan like this (aside from paying for a VPS)?
Render.com has a similar value proposition. I've used them and am pretty happy. Railway seems to have more observability bundled in, which I'd like to see in Render.
VPS + Dokploy gives you just as much functionality with an additional performance boost. Hostinger has great prices and a one-click setup. Good for dozens of small projects.
+1 for Dokploy, it's very flexible and lets me set up my sites the way I need: a static landing page, /app routed to the React app, /auth to a separate auth service, etc.
Been building an open-source version of Railway at https://canine.sh. It offers all the same features without the potential for vendor lock-in / price gouging.
Plenty of screenshots and exact step-by-step instructions. Throwing out an "example git repo" with no documentation won't get you any users.
Put yourself in the shoes of a Heroku/Vercel user. DevOps is usually Somebody Else's Problem. They are not going to spend hours debugging Kubernetes, so if you want to sell them a PaaS built on Kubernetes, it has to be foolproof. Coolify is an excellent example: the underlying engineering is average at best (from a pure engineering point of view it's a very heavy app that suffers from frequent memory leaks, and the v5 rewrite has been stuck for 2 years), but the UI/UX has been polished very well.
Yeah, still working through documentation. The goal isn't so much to replace Coolify. It was mostly born out of my last startup, which ran a $20M business with 15 engineers at about 300-1000 qps at peak, with fairly complex query patterns.
I think the single VPS model is just too hard to get working right at that scale.
I think Northflank / enterprise applications would be a better comparison for what Canine is trying to do, rather than Coolify / indie hackers. The goal is not to take away Kubernetes, but to simplify it massively for 90% of use cases while still exposing the full k8s API for any more advanced features.
Yes, have you seen miget.com by any chance? You can start with the free tier, and can have a backend with a database for free (256Mi plan). If you need more, just upgrade. They redefined cloud billing. Worth checking.
I've had about one third of my Railway services affected. I had no notification from Railway, and logging in showed each affected service as 'Online', even though it had been shut down.
I'm pretty annoyed. I am hosting some key sites on Railway. This is not their first outage recently, and one time a couple of months ago was just as I was about to give our company owner a demo of the live product.
First off, super duper sorry. It's sometimes a good/bad thing if I can remember someone's handle... and I specifically remember the support thread where we did have an outage before your demo :| The number one goal for us is to deliver a great product. Number two is that we should never embarrass a user, and outages do exactly that.
We just wrapped up the post mortem, and that'll be published soon; it explains why the dashboard was reporting the state of the application incorrectly. We'd be more than happy to credit you for the impact to keep your business. That said, I totally understand if two outages is too much impact for your services.
This is great. Not ten minutes before this outage, I presented Railway as a viable option for some small-scale hosting of prototypes and non-critical apps, as an alternative to the cloud giants.
It always happens that way. I guarantee some people migrated from Heroku to Railway and bragged about future stability to the team, only to experience this.
This affected a seemingly random set of services across three of my accounts (pro and hobby, depending on whether they're for work or just myself). That ranges from WordPress to static site hosting to a custom Python server. All of the deployments showed as Online, even after receiving a SIGTERM.
While 3% is 'good', that's an awfully wide range of things across multiple accounts for me, so it doesn't feel like 3% ;) Please publish the post mortem. I am a big fan of Railway but have really struggled with the amount of issues recently. You don't want to get GitHub's growing rep. Some people are already requesting I move one key service away, since this is not the first issue.
Finally, can I make a request re communication:
> If you are experiencing issues with your deployment, please attempt a re-deploy.
Why can't Railway restart or redeploy any affected service? This _sounds_ like you're requiring 3% of your users to manually fix the issue. I don't know if that's a communication problem or the actual solution, but I certainly had to do it manually, server by server.
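In the meantime the workaround really is a manual sweep: check each service yourself and redeploy whatever fails. A minimal sketch of that sweep, assuming each service exposes a public URL (the names and URLs below are placeholders, and the redeploy step is left manual since it goes through Railway's dashboard or CLI):

```python
import urllib.request

# Placeholder service URLs; substitute your own deployments.
SERVICES = {
    "wordpress-frontend": "https://blog.example.com/",
    "docs-static-site": "https://docs.example.com/",
    "python-openai-api": "https://api.example.com/health",
}

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Treat any 2xx/3xx response as healthy; anything else (or a timeout) as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

for name, url in SERVICES.items():
    print(f"{name}: {'up' if is_up(url) else 'DOWN - redeploy manually'}")
```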
Totally! People who see the impact will likely see more than, say, 3% of their services impacted. Not all disruption is created equal.
We rolled out a change to update our fraud model, and that uses workload fingerprinting.
Since, in all likelihood, your projects are similarly structured, there will be more impacted workloads if the shape of your workloads was in the "false positive" set.
Will have more information soon but very valid (and astute) feelings!
> We rolled out a change to update our fraud model, and that uses workload fingerprinting
> Since, in all likelihood, your projects are similarly structured...
Thanks for the info. For what it's worth and to inform your retrospective, this included:
* A WordPress frontend, with just a few posts and minimal traffic -- but one that had been posted to LinkedIn yesterday
* A Docusaurus-generated static site. Completely static.
* A Python server whose workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and which, I am strongly skeptical, would have different patterns from any other hosted service that calls OpenAI)
These all seem pretty different to me. Some that _are_ similarly structured (eg a second Python OpenAI-using server) were not killed.
Some things come to mind for your post-mortem:
* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
* I'm speaking only for myself but I cannot understand what these three services have in common, nor how at least 2/3 of them (WordPress, static HTML) could seem anything other than completely normal.
* How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMs to random containers we find out about the hard way seem the exact opposite of sensible handling of supposedly questionable clients.
We have more info coming soon, but I think the best way to frame this is actually to work backwards and then explain how it impacted your services and others'.
So Railway (and other cloud providers) deal with fraud near constantly. The internet is a bad and scary place, and we spend maybe a third to half of our total engineering cycles just on fraud/uptime related work. I don't wanna give any credit to anyone, from script kiddies to hostile nation states, but we (and others) are under near-constant bombardment from crap workloads in the form of traffic, wasteful CPU cycles, or, sometimes more benignly, movie pirating.
Most cloud providers understandably don't like talking about it because, ironically, the more they talk about it, the more the bad actors get a kick from seeing that the chaos they cause works. Thus begins the vicious cycle...
This hopefully answers:
> If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
In our 5-year history, this is the third abuse-related major outage: one being a nation-state DDoS, one being a coordinated denial of service. This is the first one where a false positive took down services automatically. We tune it constantly, so it's not really an issue, except when it is.
So, with that background: we tune our boxes of, let's say, "performance" rules constantly. When we see bad workloads, or bad traffic, we have automated systems that "discourage" that use entirely.
We updated those rules because we detected a new pattern, and when we rolled it out, that's when we nailed the legit users. Since this went through the abuse path, it didn't show on your dash, hence the immediate gaslighting.
Which leads to the other question:
> How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_.
We don't want to tell fraudulent customers whether they are effective or not. In this instance, it was a straight-up logic bug in the heuristics match. But we have done this for our whole existence: black-holing illegitimate traffic, for example, then banning. We did this because some coordinated actors will deploy, get banned with a "reason", and then use backup accounts once they found that whatever they were doing was working. If you know where to look, sometimes they brag on their IRCs/Discords.
Candidly, we don't want to be transparent about this, but after user impact like this it's the least we can do. Zooming out, macro-wise, this is why Discord and other services are leaning towards ID verification... and it's hard for people on the non-service-provider side to appreciate the level of garbage out there on the internet. That said, that is an excuse; we shovel that garbage so that you can do your job, and if we stop you, then that's on us, which we own and will hopefully do better about.
That said, you and others are understandably miffed (understatement); all we can do is work through our actions to rebuild trust.
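To make the described failure mode concrete, here is a toy sketch (purely illustrative, not Railway's actual system) of how a fingerprint-based abuse rule with a logic bug can SIGTERM legitimate workloads through an enforcement path that never updates the user-facing state, which would explain deployments still showing 'Online':

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    fingerprint: set          # e.g. {"outbound:openai", "runtime:python"}
    dashboard_state: str = "Online"

# Hypothetical abuse pattern shipped as a set of fingerprint features.
ABUSE_PATTERN = {"outbound:openai", "high-egress"}

def matches_abuse(w: Workload) -> bool:
    # BUG: flags any overlap with the pattern instead of requiring the full
    # pattern (ABUSE_PATTERN <= w.fingerprint), so ordinary workloads that
    # merely call OpenAI become false positives.
    return bool(w.fingerprint & ABUSE_PATTERN)

def enforce(workloads):
    for w in workloads:
        if matches_abuse(w):
            # Enforcement path kills the container but skips the normal
            # lifecycle update, so the dashboard keeps saying "Online".
            print(f"SIGTERM -> {w.name} (dashboard still shows {w.dashboard_state})")

enforce([
    Workload("legit-openai-backend", {"outbound:openai", "runtime:python"}),
    Workload("static-docs-site", {"runtime:nginx"}),
])
```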
Second complete outage on Railway in 2 months for us (there was also a total outage on December 16th), plus stuck builds and other minor issues in the months before that.
Looking to move. It's a bit of a hassle to set up Coolify and Hetzner, but I have lost all trust.
Affected by the outage since about 6:15 AM PT this morning. We're still down as of 9:00 AM PT.
Our existing containers were in a failure state and are now in a partial failure state. Containers are running, but the underlying storage/database is offline.
Many questions on their forum are similar to our situation: people wondering if they should restart their containers to get things working again, worried about whether they should do anything, whether they risk losing data if they do, or whether they should just give everything more time.
I'm glad Railway updated their status page, but more details need to be posted so everyone knows what to do now.
Everyone has outages; it's the way of life and technology. Communication with your customers always makes it less painful, and people remember good communication, not the outage. Railway, let's start hearing more communication. The forum is having problems as well. Thanks.
Heard. Being transparent, the delay on ack is usually us trying to determine and correlate the issue. We have a post mortem going out, but we note that the first report was in our system 10 minutes before it was acked, during which the platform team was trying to see which layer the impact was at.
That said, this is maybe concern #1 for the support team: we want the delta between a report and a detected customer outage to be as small as possible. The way it usually works is that the platform alarms and pages go first, and then the platform engineer will usually page a support engineer to run communications.
Usually the priority is to have the platform engineer focus on triaging the issue and then offload the communication workload to our support team so that we can accurately state what is going on. We have a new comms clustering system rolling out so that if we get 5 reports with similar content, it pages the support team as well. (We will roll this out after we have communicated with affected customers first.)
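The report-clustering idea is simple enough to sketch; here is a toy version (again, illustrative only, not Railway's actual system) that groups similar incoming reports by token overlap and pages once a cluster reaches five:

```python
PAGE_THRESHOLD = 5   # page support once this many similar reports arrive
clusters = []        # each cluster is a list of report strings

def similar(a: str, b: str, cutoff: float = 0.3) -> bool:
    """Order-insensitive Jaccard similarity over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb)) > cutoff

def ingest(report: str) -> None:
    for cluster in clusters:
        if similar(report, cluster[0]):
            cluster.append(report)
            if len(cluster) == PAGE_THRESHOLD:
                print(f"PAGE support team: {len(cluster)} similar reports, e.g. {cluster[0]!r}")
            return
    clusters.append([report])

for r in [
    "service shows online but received SIGTERM",
    "my service shows online but got SIGTERM",
    "builds are stuck on queued",
    "service got SIGTERM but dashboard shows online",
    "deployment shows online yet service received SIGTERM",
    "my service received a SIGTERM but still shows online",
]:
    ingest(r)    # the fifth similar report triggers the page
```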
In situations like this, please dedicate at least one team member to respond as quickly as possible to the Railway Help Station posts. That's where your customers are going for communication and support.
This is the hidden cost of the move to PaaS. You gain DevOps velocity but take on a massive single point of failure. Once the global control plane goes down, your entire risk profile is basically out of your hands.
We weren’t affected, but as a startup I’ll take a minor outage over getting stonewalled by GCP/Azure/AWS any day. Railway has consistently been responsive and actually understands the problem you’re describing. With the big three, unless you’re spending serious money or paying for premium support, you often just get links to docs instead of real help.
Repeating "~3% impacted" three times? Damage control. We got wrecked: DB SIGTERM'd, app dead for hours before they even posted a status update. 3% is a 100% outage when it's your stuff: broken dashboards and zero warning.
colesantiago | 6 hours ago
I am assuming that a domain like railway.com should be about trains.
Why does every tech company have to name itself after a one-word .com domain when what it does is unrelated to, and obscured by, its own name?
Does every tech company think it is Apple and has to register every word in the dictionary and redefine it as a technology company?
Really bad name for a company.
blibble | 6 hours ago
could be called "entire" (https://entire.io/)
normie3000 | 5 hours ago
Netflix?
lbrito | 5 hours ago
Lotus
Jaguar
Caterpillar
Shell
It's a human thing.
caseyohara | 4 hours ago
> The "Shell" Transport and Trading Company (the quotation marks were part of the legal name) was a British company, founded in 1897 by Marcus Samuel, 1st Viscount Bearsted, and his brother Samuel Samuel. Their father had owned an antique company in Houndsditch, London, which expanded in 1833 to import and sell seashells, after which the company "Shell" took its name.
https://en.wikipedia.org/wiki/Shell_plc
ZoneZealot | 6 hours ago
Of course every service will have outages; it's just funny to see it so soon after saying:
> We're nuts for studying failure at the company [...]
(albeit a different 'failure' context)
tonyhb | 5 hours ago
IOW, doesn't look as bad as the title suggests?
Onavo | 5 hours ago
You want docs like this:
https://coolify.io/docs/applications/ci-cd/github/setup-app
https://coolify.io/docs/applications/build-packs/dockerfile
https://coolify.io/docs/applications/build-packs/overview
imiric | 5 hours ago
Heh.
Looks like a great product, although maybe mention some honest reasons to not use it, instead of the passive-aggressive marketing ones.
justjake | 5 hours ago
That said, we treat this extremely seriously!
Any downtime is unacceptable, and we'll have a post mortem up in the next couple of hours.
jsheard | 5 hours ago
https://blog.railway.com/p/data-center-build-part-one
vintagedave | 5 hours ago
Here's a sample log entry:
> 2026-02-11T14:35:11.916787622Z [err] 2026/02/11 14:35:03 [notice] 1#1: signal 15 (SIGTERM) received, exiting
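For context, that log line is just the process acknowledging signal 15: the platform told the container to stop, and a well-behaved service exits cleanly on it. A minimal, framework-agnostic sketch of a graceful SIGTERM handler in Python:

```python
import signal
import sys
import time

def handle_sigterm(signum, frame):
    # Finish/flush in-flight work here, then exit cleanly. This is what
    # produces "signal 15 (SIGTERM) received, exiting" style log lines.
    print("SIGTERM received, shutting down gracefully", flush=True)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:       # stand-in for the real server loop
    time.sleep(1)
```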
jpcompartir | 5 hours ago
This won't change my decision, but it is still impeccable timing
justjake | 5 hours ago
We'll have a post mortem for this one as we always write post mortems for anything that affects users
Our initial investigation reveals this affects <3% of instances
Apologies from myself + the Team. Any amount of downtime is completely unacceptable
You may monitor this incident here: https://status.railway.com/cmli5y9xt056zsdts5ngslbmp
iJohnDoe | 3 hours ago
Since there haven't been any responses on the official support forum, maybe this will help someone.
I did a backup of our deployment first and did a Restart (not a Redeploy). Our service came back up thankfully.
Obviously do your own safety check about persistent volumes and databases first.
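For the "safety check" part: if the underlying database is Postgres, one low-effort option before touching anything is a logical dump. A minimal sketch, assuming a DATABASE_URL connection string and pg_dump installed locally (adjust for whatever datastore you actually run):

```python
import datetime
import os
import subprocess

# Assumes a Postgres connection string, e.g. postgres://user:pass@host:5432/db
url = os.environ["DATABASE_URL"]
out = f"backup-{datetime.datetime.now():%Y%m%d-%H%M%S}.dump"

# Custom-format dump (-Fc) can be restored selectively with pg_restore.
subprocess.run(["pg_dump", "-Fc", "-f", out, url], check=True)
print(f"wrote {out}; safe to try a Restart now")
```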