How (and why) we rewrote our production C++ frontend infrastructure in Rust


“Should we convert our _____ code to Rust?” is a question that comes up a lot. And a lot of the time the right answer is a pretty firm, “No.” So I thought it might be beneficial (and hopefully interesting) to walk through a case where the code involved was incredibly you-cannot-fuck-this-up business-critical and when we asked that question about it, we came back with a yes. Here’s what it was, how we got to that answer, and what we did about it.

First off, when we say “frontend” we don’t mean what most of the world (and most of our members) think of as frontend. Our frontend consists of the servers that sit out in front of member sites to perform caching, proxying, routing, access control, and TLS.

Now, when people think about that stuff, if they think about it at all, they generally think of it as the Apache part of their hosting. Which is… correct but incomplete. Yes, Apache is running on the frontend servers. But it takes no less than four custom Apache modules to make our service work. One of those modules takes essentially all of the decision-making about incoming requests out of Apache and passes it over to a custom server process (“nfsncore”) written in C++. When you add custom IP access controls in the UI, it’s nfsncore that applies them to incoming requests. When you set up a bunch of proxies to route to various custom daemons on your site, it’s nfsncore that figures out which one incoming requests should use. It handles redirecting to the right alias. It handles wildcard aliases (for the I think two people using them). It does your Strict-Transport-Security. It handles maintenance mode. It handles offline sites. It catches ACME requests needed to refresh your TLS certificates. It does a bunch of checks to get rid of stupid, broken requests.

As far as our service is concerned, nfsncore is kind of a big deal. No matter what you’re hosting with us, no matter what your tech stack is, nfsncore touches every request. Lots of parts of our service can take people down if things go wrong. This is the part that can take everybody down if things go wrong.

But, oops, I made a mistake. I should have said: One of the modules takes essentially all of the decision-making about incoming requests out of Apache and passes it over to a custom server process (“nfsncore”) written in Rust. As of yesterday, the C++ version is no longer running on any servers.

What would possess you to do that?

This is the first question that probably comes to mind for most people. Have we never heard, "If it ain't broke, don't fix it"? Was it not bad enough that we reinvented the wheel in the first place? We had to reimplement that reinvented wheel? Those are valid questions. If you're going to undertake a full rewrite of a production application, you'd better have good answers. I can give a lot of good answers here. Rust is a great language that provides top-notch safety for exactly this sort of usage. Rust is very fast. The Rust ecosystem is incredibly strong, and Cargo makes it easy to avoid reinventing the wheel. C++, by contrast, has no central ecosystem; Boost is as close as you get, and it can be pretty abstract and not always pleasant to work with. Also, our C++ code is getting very old and does some things in ways that make it hard to extend with new features. But, if I'm being completely honest, there is another factor.

When you’re routing URLs, hostnames are case insensitive. You handle that by converting them to lowercase. Here’s how you do that in C++:

std::transform( _stHost.begin(), _stHost.end(), _stHost.begin(), [](unsigned char c) { return std::tolower(c); } );

Here’s how you do that in Rust:

host = host.to_lowercase();

If it happens that you don’t know C++ or Rust, is one of those easier for you to guess what it does than the other?

Yes, the C++ syntax is very robust and flexible and (in a sense) elegant in its composability. And it’s doing the conversion in-place, which is very slightly more memory efficient than Rust’s built-in. But, come on, it’s 2026, and “lowercase a string” is still too much to ask from the language’s standard library? I have to do it one character at a time? Using two templates and a lambda function?
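To be fair on the in-place point: hostnames are ASCII by the time they’re on the wire (IDNA/punycode handles non-ASCII names), so Rust can match the in-place, no-allocation behavior too. A minimal sketch:

```rust
fn main() {
    // Hostnames on the wire are ASCII (non-ASCII names arrive
    // punycode-encoded), so the ASCII-only, in-place conversion
    // is safe here and avoids allocating a new String:
    let mut host = String::from("Example.COM");
    host.make_ascii_lowercase();
    assert_eq!(host, "example.com");

    // The fully Unicode-aware version allocates a new String:
    let lowered = "Example.COM".to_lowercase();
    assert_eq!(lowered, "example.com");
}
```

Either way, it’s one obvious method call instead of an algorithm, a pair of iterators, and a lambda.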

That may seem like a bit of a silly complaint. Which is because it’s a bit of a silly complaint. But the kernel of truth in it is that a whole lot of C++ is like that. C++ and its standard library take a very lowest-common-denominator approach. We can’t have nice things because we also have to share the playground with mainframes that still use EBCDIC and some 4-bit embedded microcontroller only found in Raytheon-built missiles.

The bottom line is that C++ has caused more than a few situations where we wanted to do something or add a feature and it’s just like… that’s a cool idea, but it’s just not worth the uphill battle against the language. And it got to the point where any change carried the risk of unforeseen consequences.

This software was a good fit for conversion because we’ve been treating C++ a little like Rust the whole time; if you use RAII and smart pointers and get really fussy about const, you’re well along the path toward memory safety. It’s just optional and you have to never make a mistake. But years and years of sticking pretty hard to that meant the famed Rust borrow checker, for example, wasn’t the same obstacle for us that it is for some teams.

Also, the codebase just isn’t that big. It’s less than 10% of the size of the PHP codebase for our member interface, for example. The complexity in this code isn’t in the length of it. It’s in the long-accrued knowledge that you have to do this to proxy requests here or they won’t connect. If you’re going to redirect http to https, you have to do these things in exactly that order or browsers will complain.

The purpose of the project was for the Rust version to match the C++ version perfectly at the time of conversion, so this post isn’t about any cool new features. But, hey, some of the shackles are off now. Stay tuned!

The Conversion Process: Stealing all your furniture and replacing it with exact duplicates

We knew from the start that we wanted to make the transition as seamless as possible, and that doing so would be rough. So we came up with a pretty detailed plan. And we actually managed to stick pretty close to it.

1. C++ Unit Tests. We added hundreds of unit tests to the C++ code. We had tests before but a lot of them were more along the lines of integration tests. Adding unit tests let us define the behavior for all kinds of cases at every level of the software.

2. Rust coding. We wrote the initial Rust code, using the C++ as a guide. Each C++ library (there are seven that make up the app) gets a Rust crate equivalent. Every single C++ unit test gets a Rust equivalent, on top of whatever additional unit tests we need to get thorough coverage. During this step we also wrote a bunch of little Rust command line tools that interact with the rest of our systems. Like a little thing that lets us look up an alias on one of the frontends and dump all the routing information. That shows that the code is working. And it’s a handy little helper to have when the alternative has been logging into the database and winging it on a five-line SELECT to try to get the information you need.

3. Interoperability testing. Apache interacts with nfsncore through one of our custom Apache modules using IPC. There are three different implementations of that IPC. There’s a client-side implementation in C, using the Apache Portable Runtime, used by that module; it didn’t change (much) during this migration. And the C++ and Rust versions each have both client and server implementations. So we developed another set of tests focused on making sure that each client worked with each server.

4. Functional testing. We have an entire harness for running the nfsncore Apache client module outside of Apache, so we have a whole set of tests that go along with that to make sure that the C++ and Rust versions both work with it and produce the same results.

5. Rust fuzz testing. Fuzz testing was very accessible in Rust in a way it never was for our C++ code. It takes a number of known inputs and tries random mutations of them to see if the software breaks. We ran a few hundred million mutations.

At this point, we’ve tested everything, and we’re ready to go to production, right? Oh, no, dear reader, we’re just getting warmed up!

6. Replay testing. We wrote a client that could parse our log files from the live servers and replay them into the Rust version and make sure it got the same results using live data. (Allowing for a few variations, like a site that was available initially but happened to get disabled before we ran the test.) This is a great test but imperfect because the result from nfsncore sometimes differs from the HTTP status code because nfsncore’s code embeds some additional info to tell Apache (for example) which error page to return for a particular 503 error: offline site, maintenance mode, or service outage. And if nfsncore says “Success, get the content from the member site,” the member site might still return 404, which is what ends up in the logs. So the results frequently didn’t match the logs, but that didn’t indicate an actual problem. That made this testing less helpful than we hoped, and we probably wouldn’t pursue that tactic in the future.

7. Proxy testing. We wrote a proxy in Rust that would take input from the Apache module and run both the C++ and Rust versions of nfsncore side-by-side in real time and send all incoming requests to both of them, returning the C++ result to Apache and reporting any discrepancies. We deployed this on one server per day, starting with servers reserved for beta sites, until we hit 50% of our frontend servers. This is where we found a few fun bugs. Not only in the Rust version, but also in the original C++ version. Bugs we had faithfully reproduced. Real edge case stuff, mostly related to broken clients. So we fixed those in both versions and tried again. Once we could go three days without any discrepancies, except for a couple of incredibly fine slices where a member would happen to enable their site in the microsecond window between when the two versions asked for its status, we were ready to proceed.

8. Statistical analysis. Because of our load balancing, requests are distributed fairly evenly between servers at a given location. So although the individual accesses that pass through a given server are unique, in aggregate they are pretty similar. We sampled 150 million requests at random from the 50% of servers that were running the proxy code and another 150 million from the 50% of servers that were running the completely untouched original version of the C++ code. We analyzed the distributions of both request latency and HTTP status codes in both samples. We did that seven times. Once each day for a week. Latency was functionally the same (no bucket >0.1% different). Status codes were also <0.1% different in most cases. Where there were discrepancies (as high as 0.5% different), we investigated and confirmed that they were related to the edge cases we fixed in proxy testing.

9. Staged deployment. At this point, we felt the Rust version was ready to go. But you can’t be too sure. So we resumed rolling out the proxy-C++-Rust trio one server per day until we hit 100% deployment. Then we went through one server per day and reversed it, making the Rust version authoritative instead of C++. Once that reached 100% deployment, we went through one server per day and removed the proxy and C++ version, leaving the Rust version running by itself in full production. And we reached 100% production of the Rust version yesterday.
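The heart of the step-7 proxy is easy to picture: ask both backends, answer with the C++ result, and log any disagreement. Here’s a rough sketch in Rust — every name in it (Decision, ask_cpp, ask_rust, handle) is hypothetical, invented for illustration; the real IPC clients and protocol are internal:

```rust
// Hypothetical sketch of the side-by-side comparison proxy.
// All names below are invented for illustration.

#[derive(Debug, PartialEq)]
struct Decision {
    status: u16,
    route: String,
}

// Stand-ins for the real IPC clients to the two nfsncore backends.
fn ask_cpp(request: &str) -> Decision {
    Decision { status: 200, route: format!("backend:{request}") }
}

fn ask_rust(request: &str) -> Decision {
    Decision { status: 200, route: format!("backend:{request}") }
}

fn handle(request: &str) -> Decision {
    let cpp = ask_cpp(request);
    let rust = ask_rust(request);
    if cpp != rust {
        // In production this would be reported to telemetry, not stderr.
        eprintln!("discrepancy for {request:?}: {cpp:?} vs {rust:?}");
    }
    cpp // during this phase, the C++ answer stays authoritative
}

fn main() {
    let d = handle("GET /index.html HTTP/1.1");
    println!("status: {}", d.status);
}
```

The useful property of this shape is that the Rust version gets exercised by full production traffic while having zero ability to affect what members actually see.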

The Rust version does everything at roughly the same speed the C++ version did. It’s a couple of percent slower, but both versions can handle full-bore production at around 10% utilization, so there’s plenty of headroom.

The Rust version does have a few improvements, but alas, they’re for us, not for you. It reports stats directly into our main telemetry system instead of writing them to stderr. It reports errors with stack traces through the same pipeline that our member interface uses, so we don’t have to rely entirely on remote monitoring to tell us a frontend server is malfunctioning. Implementing those things in C++ was never going to happen: they weren’t worth the effort, and they weren’t worth the risk. That math is very different now.

We are really happy with this change. And the most successful IT projects are usually the ones nobody notices.* (Unless, of course, one writes a long blog post about it.) Now I’m not saying the past couple of months haven’t had any speedbumps. Just recently we had to YOLO some kernel patches out there on pretty short notice, which is always a bit disruptive. But if anything weird did happen to you in the past couple of months, it wasn’t from this. I hope this project demonstrates, both to us and to our members, that despite how we typically behave, we are capable of acting like adults when the rubber meets the road.**

Since part of the reason we claim this was worth doing was the potential for future enhancements that Rust offered that C++ didn’t, all we have to do now is deliver on that. No problem! Let me just…

503 Service Unavailable

Please check back later.

*OK, technically, at least one person noticed, because we briefly made a mistake and let code slip onto one server that wasn’t as beta as we thought.

**Just as long as we’re all on the same page that it is an act.
