Appreciate the feedback. We got feedback previously that things were "too technical" and didn't acknowledge the incident from what users actually saw.
I've gone ahead and re-added the surrogate keys statement to the press release. Thank you for the feedback, and if there are other things that you believe can be better, please let me know!
> Why were they making CDN changes in prod? With their 100M funding recently they could afford a separate env to test CDN changes. Did their engineering team even properly understand surrogate keys to feel confident to roll out a change in prod? I don't think they're beating the AI allegations to figure out CDN configs, a human would not be this confident to test surrogate keys in prod.
We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The larger issue, as mentioned in the blogpost, is that we didn't have a mechanism to do a staged release.
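(For readers wondering what surrogate keys are: roughly, the origin tags each response with one or more keys, and the CDN lets you later purge every cached object carrying a given key in one call. The sketch below uses a Fastly-style API purely as an illustration; the thread doesn't say which CDN or exact configuration Railway uses. The failure modes — a key that's missing, too broad, or applied to the wrong responses — tend to show up only under real production traffic, which is part of why a staged rollout matters here.)

```python
# Illustrative only: Fastly-style surrogate keys, not Railway's actual setup.
import requests

FASTLY_API = "https://api.fastly.com"
SERVICE_ID = "example-service-id"  # placeholder
API_TOKEN = "example-api-token"    # placeholder

def tag_response(body: bytes, user_id: str, deployment_id: str) -> dict:
    """The origin attaches space-separated surrogate keys; the CDN indexes
    cached objects by them so related content can be purged together."""
    return {
        "body": body,
        "headers": {
            "Cache-Control": "public, max-age=3600",
            "Surrogate-Key": f"user-{user_id} deploy-{deployment_id}",
        },
    }

def purge_by_key(key: str) -> None:
    """Evict every cached object tagged with `key` in one API call."""
    resp = requests.post(
        f"{FASTLY_API}/service/{SERVICE_ID}/purge/{key}",
        headers={"Fastly-Key": API_TOKEN},
        timeout=10,
    )
    resp.raise_for_status()

# e.g. after a redeploy: purge_by_key("deploy-abc123")
```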
> During and post-incident, the comms has been terrible. Initial blog post buried the lede (and didn't even have Incident Report in the title). They only updated this after negative feedback from their customers. I still get the impression they're trying to minimise this, it's pretty dodgy. As other comments mentioned, the post is vague.
Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.
> They didn't immediately notify customers about the security incident (people learned from their users). They apparently have emailed affected customers only, many hours later. Some people that were affected still haven't been emailed, and they seem to be radio silent lately.
We notified customers even before we did a wide release, as is our process for anything security related. You create as much space for direct disclosure as possible, and then follow up with a public disclosure.
> Their founder on twitter keeps using their growth as an excuse for their shoddy engineering, especially lately. Their uptime for what's supposed to be a serious production platform is abysmal, they've clearly prioritised pushing features over reliability https://status.railway.com/ and the issues I've outlined here have little to do with growth, and more to do with company culture.
Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!
> Their forum is also getting heated, customers have lost revenue, had medical data leaked etc., with no proper followup from the railway team
There are team members in that thread linked, are you certain you linked the right thread? Happy to have a look at anything you believe we're missing!
I'm sorry, but there's a lot of spin here. Basically you guys handled this terribly, and your reliability has tanked recently, hence why customers that need reliability in production are leaving or have already migrated.
> We went deep on them, tested them prior, and then when rubber met road in production we ran into cases we didn't see in testing. The larger issue, as mentioned in the blogpost, is that we didn't have a mechanism to do a staged release.
Honestly for a production-grade _platform_ company, that also does compliance (SOC2/3, HIPAA etc.), not having a staged release is negligent, and how you guys are handling this is a huge red flag. I've done such changes myself in production envs, for deployments that don't have the stakes you guys have. I'm normally more sympathetic on incidents, but the lack of transparency thus far from railway leaves me doubting more than anything.
> Our initial post definitely could have been more clear, and we revised it the moment we got customer feedback to do so.
Please read the room, there's still a lot of confusion about the blog post in this thread (https://news.ycombinator.com/item?id=47582295). The technical detail isn't there; we only know about the surrogate keys from the status incident (https://status.railway.com/incident/X0Q39H56), which is not linked in the post. The blog post reads like PR compared to the initial incident status report, and the resolved timestamp does not match which is sloppy. Your little edit to the title only made it from a bad post to a slightly less bad post.
> We notified customers even before we did a wide release, as is our process for anything security related. You create as much space for direct disclosure as possible, and then follow up with a public disclosure.
Emailing only affected users isn't working out, because affected people aren't yet emailed (I know one personally). Just check the post on your own forum (https://station.railway.com/questions/data-getting-cached-or... did you actually read it?) and see the list of people affected still not emailed, and left on read. You guys should email everyone; this is a security incident, not a service interruption. There's a lot of lost trust from your customers now, i.e., if you guys can't figure out who to email, what else are you doing wrong?
> Do you have any specifics here? We're scaling the system at 100x YoY growth right now, working 24/7 to scale the entire thing. Again, all ears on if you have specific crits as we're always open to receiving feedback on how we can do things better!
Again, it's not an excuse if you're a _platform_ company that customers pay a lot of money to be reliable. You can't just keep saying you're open to feedback and being transparent as vanity. There's plenty of feedback on here, your twitter, your forum, and feedback is people are telling you to focus on reliability, because railway keeps breaking their deployments. If you don't care about reliability and prefer to scale with features, be honest about it. Railway's poor uptime does not lie.
> There are team members in that thread linked, are you certain you linked the right thread? Happy to have a look at anything you believe we're missing!
Did you read the thread? Yes, only _one_ employee commented 5 hours after my HN comment. Still almost everyone left on read, unanswered questions, etc.
By the way, that's only one forum post; there are many that are just ignored, including one where a user mentioned they're reporting railway to the ICO for a GDPR breach, rightfully.
> Honestly for a production-grade _platform_ company, that also does compliance (SOC2/3, HIPAA etc.), not having a staged release is negligent, and how you guys are handling this is a huge red flag. I've done such changes myself in production envs, for deployments that don't have the stakes you guys have. I'm normally more sympathetic on incidents, but the lack of transparency thus far from railway leaves me doubting more than anything.
We do indeed have a staging environment as mentioned previously. The issue arose in the rollout to production as mentioned previously.
> The blog post reads like PR compared to the initial incident status report, and the resolved timestamp does not match which is sloppy.
I've gone ahead and added the surrogate key mention into the post mortem. We initially got in trouble for having it be too technically centric and not enough on the user impact. It's a delicate balance; apologies. As I mention, we are open to critical feedback here.
> Emailing only affected users isn't working out, because affected people aren't yet emailed (I know one personally). Just check the post on your own forum (https://station.railway.com/questions/data-getting-cached-or... did you actually read it?) and see the list of people affected still not emailed, and left on read.
We have people working directly in that thread. For anybody who believes they were affected but not reached out to, we're working directly with them. We do take this very seriously. If you know someone here, please have them reach out either there or directly to me at jake@railway.com
> Again, it's not an excuse if you're a _platform_ company that customers pay a lot of money to be reliable. You can't just keep saying you're open to feedback and being transparent as vanity.
In the directly linked tweet I've mentioned that we're focusing on scaling the current system vs adding new features. We absolutely do need to do better on reliability, and my point is "Is there a specific poor engineering practice you're seeing here, or is it just based on reliability?" Either is a fine crit; we just want to make sure all our bases are covered.
> Did you read the thread? Yes, only _one_ employee commented 5 hours after my HN comment. Still almost everyone left on read, unanswered questions, etc.
Indeed I've read the thread, and we have people working it (you can see as of 8 hours ago).
> We do indeed have a staging environment as mentioned previously. The issue arose in the rollout to production as mentioned previously.
You may have misunderstood; I said staged release, i.e., I'm referencing the rollout.
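(For anyone skimming past the distinction: a staging environment is a separate place to test a change, while a staged release controls how quickly the change reaches production traffic. A minimal sketch of the latter, with made-up stage sizes and a placeholder health check, might look like the following; it isn't a description of any rollout tooling Railway has or lacks.)

```python
# Hypothetical sketch of a staged (progressive) rollout; stage sizes,
# wait time, and the health check are all made up for illustration.
import time

def healthy() -> bool:
    """Placeholder: in practice, query real signals (error rates,
    cache-hit ratio, support volume, ...)."""
    return True

def staged_rollout(apply_to_fraction) -> None:
    """Apply a config change to progressively larger slices of traffic,
    rolling back as soon as a stage looks unhealthy."""
    for fraction in (0.01, 0.05, 0.25, 1.0):
        apply_to_fraction(fraction)
        time.sleep(600)  # let real traffic exercise the change
        if not healthy():
            apply_to_fraction(0.0)  # roll back
            raise RuntimeError(f"rollout halted at {fraction:.0%}")
```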
> I've gone ahead and added the surrogate key mention into the post mortem. We initially got in trouble for having it be too technically centric and not enough on the user impact. It's a delicate balance; apologies. As I mention, we are open to critical feedback here.
You can do both. If you have different audiences, have two separate posts and mutually link to redirect audiences. Ask your sec staff instead of relying on paying customers to give post-hoc feedback on your dodgy disclosure practices. If I have to ping a platform company to correct and clarify info about their security disclosure, I'm out.
It appears that your company experienced an incident during which a blog entry was made available in which readers became informed about certain information about a server condition that resulted in certain users receiving a barrage of indirect clauses etc. etc. etc.
Be more direct. Be concise. This blog post sounds like a cagey customer service CYA response. It defeats the purpose of publishing a blog post showing that you’re mature, aware, accountable, and transparent.
The problem is that these visible errors make us wonder what other errors in the post are less visible. Fixing them doesn’t fix the process that led to them.
A lot of people are confident enough in their ability to spot AI infra that they are willing to dismiss a firsthand source on this, and I admit I have no idea why. There isn't any upside to making this claim, and anyway, I assure you that people need no help at all from AI to make these kinds of mistakes.
Their reply doesn't make much sense; they're supposedly SOC 2 compliant. How are they compliant while letting a single engineer push out a change like that?
I'm sure Claude didn't literally ship the feature itself with no oversight, but I also find it hard to believe that their approach to adopting AI didn't factor in at all. Even just like, the mental overhead of moving faster and adopting AI code with less stringent review leading to an increase in codebase complexity could cause it. Couple that with an AI hallucinating an answer to the engineer who shipped this change, I'm not sure why people are so quick to discount this as a potential source of the issue. Surely none of us want our infra to become less secure and reliable, and so part of preventing that from happening is being honest about the challenges of integrating AI into our development processes.
> I'm not sure why people are so quick to discount [AI] as a potential source of the issue.
Because (per the link above) the CEO said that (1) it was their fault, and (2) it had nothing to do with AI.
I understand that on this forum statements like this are inevitably greeted with some amount of skepticism, but right now I'm seeing no particular reason to disbelieve Jake, and the reason that "if they did use AI they'd deny it" should frankly not be considered good enough to fly around here. Like probably everyone in this comment section I'm open to evidence that they used AI to slop-incident themselves, but until we can reach that standard let's please calm down and focus on what we actually know to be true.
During this whole incident, Railway have made a wide range of misleading and straight out false claims to cover themselves, so them saying it wasn't AI is pretty much meaningless
So on the one hand you have a direct statement from the source that the cause of this incident is humans. On the other hand, while we all agree there is no specific evidence that AI caused the issue, the guy who made that statement, like, really loves AI.
In my life I have gone back and forth on the idea that 12 angry men is a kind of facile representation of how people think and what kinds of evidence really form the basis of a reasonable society. This comment section is doing a really good job of stretching my resolve to believe we are getting at least better.
Come on man, their CEO is a massive vibe coding proponent and his company spent $300,000 on Claude this month. But yeah, I'm sure Claude had nothing to do with any of it. I bet they don't use it to write any code.
This affected a seemingly random set of services across three of my accounts (pro and hobby, depending on if this is for work or just myself.) That ranges from Wordpress to static site hosting to a custom Python server. All of the deployments showed as Online, even after receiving a SIGTERM.
While 3% is 'good', that's an awfully wide range of things across multiple accounts for me, so it doesn't feel like 3% ;) Please publish the post mortem. I am a big fan of Railway but have really struggled with the amount of issues recently. You don't want to get Github's growing rep. Some people are already requesting I move one key service away, since this is not the first issue.
Finally, can I make a request re communication:
> If you are experiencing issues with your deployment, please attempt a re-deploy.
Why can't Railway restart or redeploy any affected service? This _sounds_ like you're requiring 3% of your users to manually fix the issue. I don't know if that's a communication problem or the actual solution, but I certainly had to do it manually, server by server.
Totally! People who see the impact will likely see more impacted than, say, 3% of their services. Not all disruption is created equal.
We rolled out a change to update our fraud model, and that uses workload fingerprinting
Since, in all likelihood, your projects are similarly structured, there will be more impacted workloads if the shape of your workloads was in the "false positive" set.
Will have more information soon but very valid (and astute) feelings!
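(To make the "similarly structured projects" point concrete, here's a toy illustration of workload fingerprinting and rule matching in general. The signals, thresholds, and scoring below are invented; nothing here describes Railway's actual model. The takeaway is just that if your projects share a fingerprint that lands near a newly added "bad" pattern, they tend to get caught together.)

```python
# Toy illustration only; the real signals and thresholds are not public.
from dataclasses import dataclass

@dataclass
class Fingerprint:
    cpu_burstiness: float         # 0..1
    egress_per_request_kb: float
    unique_dest_ips: int

def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Crude closeness score between two workload shapes (higher = more alike)."""
    return 1.0 / (
        1.0
        + abs(a.cpu_burstiness - b.cpu_burstiness)
        + abs(a.egress_per_request_kb - b.egress_per_request_kb) / 100.0
        + abs(a.unique_dest_ips - b.unique_dest_ips) / 50.0
    )

def flagged(workload: Fingerprint, bad_patterns: list[Fingerprint],
            threshold: float = 0.8) -> bool:
    """A new 'bad' pattern that happens to resemble a common legitimate
    shape will flag every similarly shaped workload at once."""
    return any(similarity(workload, p) >= threshold for p in bad_patterns)
```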
> We rolled out a change to update our fraud model, and that uses workload fingerprinting
> Since, in all likelihood, your projects are similarly structured...
Thanks for the info. For what it's worth and to inform your retrospective, this included:
* A Wordpress frontend, with just a few posts, minimal traffic -- but one that had been posted to LinkedIn yesterday
* A Docusaurus-generated static site. Completely static.
* A Python server whose workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and I am strongly skeptical it would have patterns any different from any other hosted service that calls OpenAI).
These all seem pretty different to me. Some that _are_ similarly structured (eg a second Python OpenAI-using server) were not killed.
Some things come to mind for your post-mortem:
* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
* I'm speaking only for myself but I cannot understand what these three services have in common, nor how at least 2/3 of them (Wordpress, static HTML) could seem anything other than completely normal.
* How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMS to random containers we find out about the hard way seems the exact opposite of sensible handling of supposedly questionable clients.
We have more info coming soon, but I think the best way to frame this is actually to work backwards and then explain how it impacted your services and others.
So Railway (and other cloud providers) deal with fraud near constantly. The internet is a bad and scary place, and we spend maybe a third to half of our total engineering cycles just on fraud/uptime-related work. I don't wanna give any credit to anyone from script kiddies to hostile nation states, but we (and others) are under near-constant bombardment from crap workloads in the form of traffic, or not-great CPU cycles, or sometimes, more benignly, movie pirating.
Most cloud providers understandably don't like talking about it because, ironically, the more they talk about it, the more the bad actors get a kick from seeing the chaos that their work causes. Begin the vicious cycle...
This hopefully answers:
> If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
In our 5-year history, this is the third abuse-related major outage: one being a nation-state DDoS, one being coordinated denial. This is the first one where a false positive took down services automatically. We tune it constantly, so it's not really an issue, except when it is.
So, with that background: we tune our boxes of, let's say, "performance" rules constantly. When we see bad workloads or bad traffic, we have automated systems that "discourage" that use entirely.
When we updated those rules because we detected a new pattern and then rolled it out, that's when we nailed legit users. Since this went through the abuse path, it didn't show on your dash, hence the immediate gaslighting.
Which leads to the other question:
> How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_.
We don't want to tell fraudulent customers whether they are being effective or not. In this instance, it was a straight-up logic bug on the heuristics match (a toy sketch of that failure mode is below). But we have done this for our whole existence: black-holing illegitimate traffic, for example, then banning. We did this because some coordinated actors will deploy, get banned with a "reason", and then move to backup accounts once they've found that whatever they were doing was working. If you know where to look, sometimes they brag on their IRCs/Discords.
Candidly, we'd rather not be transparent about this, but with user impact like this, it's the least we can do. Zooming out, macro-wise, this is why Discord and other services are leaning towards ID verification, and it's hard for people on the non-service-provider side to appreciate the level of garbage out there on the internet. That said, that's an excuse: we shovel that garbage so that you can do your job, and if we stop you, then that's on us, which we own and will hopefully do better about.
That said, you and others are understandably miffed (understatement); all we can do is work to rebuild trust through our actions.
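(Purely as an illustration of the "logic bug on the heuristics match" failure mode described above, and not Railway's actual code: a single flipped comparison in a rule update is enough to route legitimate workloads through the abuse path, which by design surfaces nothing on the dashboard.)

```python
# Entirely hypothetical: one inverted condition in a heuristics rule.
def matches_abuse_rule(score: float, threshold: float = 0.9) -> bool:
    # Intended: flag only workloads scoring ABOVE the threshold.
    # The bug: the comparison is flipped, so nearly everything else matches.
    return score < threshold  # should be: score >= threshold

def enforce(workloads: dict[str, float]) -> list[str]:
    """Select which workloads the abuse path would act on; in the scenario
    above, that action was a quiet termination with no dashboard event."""
    return [name for name, score in workloads.items() if matches_abuse_rule(score)]

# enforce({"wordpress-blog": 0.02, "static-docs": 0.01, "crypto-miner": 0.97})
# -> ["wordpress-blog", "static-docs"]  (the legitimate ones, due to the flipped check)
```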
Many questions on their forum are similar to our situation: people wondering if they should restart their containers to get things working again, worried about whether they should do anything, whether they risk losing data if they do, or whether they should just give everything more time.
Second complete outage on railway in 2 months for us (there was also a total outage on December 16th), and many stuck builds and other minor issues in the months before that.
Looking to move. It's a bit of a hassle to set up Coolify and Hetzner, but I have lost all trust.
> And my hosting provider is saying, "you are not allowed to push out your urgent fix, because we see that your app contains a far less urgent problem." There is no button that says "I understand, proceed anyway." Railway knows best.
We rolled this out quickly because of the React/NextJS CVE. I think this is actually a really good suggestion and we can look into it! Thank you for the thoughtful blogpost, and I'm sorry we let you down. We will work hard to re-earn your trust.
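(The "I understand, proceed anyway" idea is straightforward to picture. The sketch below is a generic, hypothetical deploy gate and says nothing about how Railway's scanning actually works or what they might ship.)

```python
# Hypothetical sketch of a CVE deploy gate with an explicit user override.
from dataclasses import dataclass

@dataclass
class Finding:
    cve_id: str
    severity: str  # "low" | "medium" | "high" | "critical"

def gate_deploy(findings: list[Finding], acknowledged: set[str]) -> None:
    """Block the deploy on known CVEs unless each one has been explicitly
    acknowledged by the user ("I understand, proceed anyway")."""
    blocking = [f for f in findings if f.cve_id not in acknowledged]
    if blocking:
        ids = ", ".join(f.cve_id for f in blocking)
        raise PermissionError(
            f"deploy blocked by {ids}; acknowledge them explicitly to proceed"
        )

# gate_deploy(scan_results, acknowledged={"CVE-2025-00000"})  # placeholder IDs
```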
Bingo. Nix doesn't give you a generalizable-across-languages-and-ecosystems way of specifying specific versions without blowing up your package size, unless you hand Nix to your users (which we didn't want to do).
Maybe we were holding it wrong, but, we ultimately made the call to move away for that reason (and more)
Hey y'all! 3 years ago we built Nixpacks. However, we ran into some pretty large pains using Nix for dependency resolution
So today we're rolling out Railpack, the successor. It results in:
- Up to 75% smaller images
- Up to 5x faster builds
- Higher cache hit ratio
The goal is to provide a seamless alternative for the Dockerfile frontend. Railpack will automatically find your dependencies, you can add any additional ones, and it'll auto-construct the build pipelines for you.