fast-servers: an interesting pattern

25 points by lorddimwit a day ago on lobsters | 16 comments

fanf | a day ago

I’m worried I’ve misunderstood something, because this doesn’t make sense to me.

As I understand it, the idea is to implement the IO state machines using a thread for each state, and to implement state transitions (e.g. from request to response) by passing an fd between threads. Threads are pinned to CPUs.

I can see a number of problems:

  • What if the number of states in the state machine doesn’t match the number of cores? I guess if you have a simple state machine and lots of cores you can run multiple copies of a state’s thread. But what if you have a complicated or dynamic state machine?

  • What if there’s an imbalance between states? e.g. reading a request requires less work than writing a response. Won’t some cores be overloaded and some starved?

  • It’s unfriendly to CPU caches: a CPU that has a request fresh in its cache discards that work, handing off to another CPU which has to re-load the request into its own cache.

  • The kernel and network cards are designed to be able to map network connections to CPUs efficiently, with hardware assistance. This thread hand-off design wrecks any kernel+hardware socket/CPU affinity.

For a really high-end example of the latter, see Drew Gallatin’s 2021 presentation on the Netflix Open Connect Appliance in which he talks about balancing socket affinity to avoid oversaturating the server’s internal interconnect bandwidth.

It seems to me that this design is over-fitted to servers that have very simple request/response state machines, and it prioritizes simplicity of the IO dispatch loop rather than whole-system performance.

geocar | 15 hours ago

It seems to me that this design is over-fitted to servers that have very simple request/response state machines

A lot of things are like that. Probably more than you are thinking of.

But what if you have a complicated

Ideally you won't do that: complicated things aren't fast for other reasons.

or dynamic state machine?

What I do is implement an interpreter as a simple state machine. There are a few good designs for this: Arthur Whitney's is excellent because it is small, but Erlang's BEAM is also very easy to implement as a state machine, and it's much better studied. Some Smalltalk implementations are built this way as well.

I do not mean I FFI over to Python.

I’m worried I’ve misunderstood something, because this doesn’t make sense to me.

It’s unfriendly to CPU caches: a CPU that has a request fresh in its cache discards that work, handing off to another CPU which has to re-load the request into its own cache.

You're incorrect about the cache effects.

Having one core do some kernelspace work, then some userspace work, and then return to kernelspace will stall more than having one core do the kernel work and another core do the user work.

That is, if one core has written some data to a memory address and another core then reads that address, the reader does not actually wait for the memory controller: the two cores share a cache layer, and the coherence protocol serves the line from there.

Caching is incredibly difficult to reason about, so you should benchmark not speculate.

hyperpape | 11 hours ago

Caching is incredibly difficult to reason about, so you should benchmark not speculate.

Good advice, but your article doesn't really do that either. I assume you've personally run some tests, but you don't share them.

geocar | 9 hours ago

That's unfair. I believe the article does instruct the reader to benchmark and experiment, using those words exactly. There is also an implementation on my GitHub of a server that uses this method, with benchmarks included: https://github.com/geocar/dash

hyperpape | 7 hours ago

I think it's a completely fair criticism. Note that I didn't even say "you don't really do this" I said "your article doesn't really do this."

I don't know you, you don't know me. If you publish an article making a claim, I'm unlikely to go to your GitHub to find evidence for that claim[0], any more than I'd expect you to do the same if I wrote a post where I made testable claims and didn't share tests of them (which I know I must've done; I'm not perfect).

If you link to your benchmark from your article, then you'll have addressed the criticism.

[0] From what I can tell, I'm much more likely to do this than most people, because I'm weird and curious, but I still won't do it for most posts I encounter on Lobsters or HN or wherever else.

kornel | a day ago

The canonical version presented could be slow depending on details – either it's single-threaded, or it forks (slow), or it tries to dispatch each event to a different thread (with a single-thread bottleneck and lock contention).

But the proposed solution doesn't seem optimal either:

  • You're unlikely to have every state need exactly as much CPU power as the core it's pinned to provides. Depending on how big the CPU is, your accept core will either be underutilised or a bottleneck.

  • Moving work between threads will hit plenty of locks in things that don't seem to care about threads but actually do – the kernel and many libraries keep thread-local caches, and need work stealing, locking, or data movement when you access them from another thread. Allocating on one thread and freeing on another can be a slow path in allocators. State changes may be slow if the threads cross NUMA nodes.

If you have mostly balanced workloads, then thread-per-core servers (basically running N single-threaded servers) can be very efficient. If they're unbalanced, then work stealing helps: you still try to mostly run N single-threaded servers, but with an option to redistribute work when there are idle threads.

geocar | 14 hours ago

But the proposed solution doesn't seem optimal either:

io_uring is basically designed for this pattern, so I think everyone who is moving towards io_uring disagrees with you.

If I had had access to Linux kernels from the future, I would have written it almost exactly the same, but using io_uring.

You're unlikely to have every state need exactly as much CPU power as the core it's pinned to provides

If you create your program randomly, sure, but the application can be tuned for exactly that, so the likelihood can be steered, and I explained exactly what to tune once you've measured it.

Moving work between threads will hit plenty of locks in things that don't seem to care about threads but they actually do – kernel and many libraries have thread-local caches and need work-stealing or locking and moving data when you access it from another thread. Allocating on one thread and freeing on another can be a slow path in allocators. State changes may be slow if the threads cross NUMA nodes.

Sure. Slow things can be slow. It's good advice to avoid using a lot of libraries if you don't understand them, and to avoid allocation strategies using libraries that are not designed for it. But so what? Just don't do that!

What I find interesting here are (if I understand your reasoning correctly) the differences in the overall approach:

  • @kornel argues for a design that is robust even when parts of the system don't cooperate.
  • @geocar argues that the design should not include such uncooperative parts.

I think these are two quite opposite approaches to system design. Funnily enough, this question has just come up for me in a different project (should we use Docker to make the third-party software easier to manage, or should we limit ourselves to third-party software that can be managed easily without Docker?).

From my gut feeling I lean towards kornel's approach. Speculating, I would guess this is because so far I have rarely had full control over the entire system, so I could not avoid such uncooperative components (e.g. users and managers demand features, team members prefer to use third-party libraries, or problems come up that I did not foresee when designing the system). So I accept that to build more featureful systems, I have to add complexity, and then I have to add even more complexity on top to manage the initially added complexity :-) and then I have to limit this sprawl somehow to prevent complexity from spiraling out of control.

Admittedly this approach does not sound very good, but I don't see how else to deal with the messiness of the world (users; team members; and my own fallibility).

geocar's approach sounds tempting, though. I wonder how I would need to shape my environment (team and project structure) to support this approach?

geocar | 9 hours ago

That's basically it.

kornel | 8 hours ago

io_uring is basically designed for this pattern

io_uring is well suited to thread-per-core architectures, but that usually means a whole server per core to avoid cross-thread handoff costs.

It's good advice to avoid using a lot of libraries if you don't understand them, and to avoid allocation strategies using libraries that are not designed for it. But so what? Just don't do that!

It's not from a lack of understanding, but the opposite: knowing that moving sockets and their associated data across threads has overheads in the kernel, in every high-performance allocator implementation (they keep thread-local freelists to avoid locking), and even in hardware (cache coherence, NUMA). That's not a case of some random library being unfit, but of the software architecture going against the grain of the hardware and of the design patterns that are aware of the hardware's limitations.

I just don't see how the proposed architecture is useful. You can make it less inefficient by tuning it for a specific hardware + workload combo, but other architectures can better utilize CPUs with less tuning (thread-per-core adapts balanced workloads to any hardware, work-stealing adapts to variable workloads, pinned-state adapts to nothing).

Discussion about fast servers needs context on what environment they're designed for. Fast in a 2-vCPU VM has different constraints than fast on a Threadripper with PCIe slots full of network cards with offload. But I don't see how pinned-thread states are fast at either of these ends.

[OP] lorddimwit | a day ago

It’s been ten years since first submission, and there were no comments. Lobsters has grown a lot, might be a good opportunity to get more comments now.

einacio | a day ago

Is there any server implementing this pattern, that we know of?

geocar | 16 hours ago

Yes, a few: https://github.com/geocar/dash uses this technique as written.

io_uring is almost exactly the same interface, so pretty much all io_uring servers work this way (although some emulate the epoll/kqueue-type interface to the application for weird reasons)

marginalia | 9 hours ago

io_uring has a similar interface, but the similarities end there. On the kernel side, io_uring executes multiple operations concurrently (which is why you can get completions out of order), whereas the design in the article executes operations sequentially (at least as far as I'm able to parse the Arthur Whitney-esque C code).

[OP] lorddimwit | a day ago

Nothing public to my knowledge. I've played around with it some, and it's relatively pleasant to use.

marginalia | 9 hours ago

I don't really see any arguments for why this would be better. You're basically bottlenecked to handling only a few requests at a time.

And the "old way" of doing things is missing a thread pool. Both fork() and pthread_create() are pretty slow in this context, and can easily be abused to crash a server in a DDoS.

The design in the article is, as far as I understand, kind of shooting itself in the foot by using syscalls for thread communication (a syscall has ~250 ns of overhead), when something like a lock-free ring buffer can do the same thing with ~10 ns of latency.