I'm not surprised Chen's patch was rejected; that's an extremely niche use case not worth supporting. With my shell developer hat on, I agree with the closing "developers would likely welcome a native implementation that isn't (unlike the current implementation) hiding fork() and exec() under the covers".
It has been for decades at this point.
thiago's blog post, which introduced me to the topic over a decade ago (and is still one of the best explainers), points out that posix_spawn was introduced in POSIX.1-2001: https://web.archive.org/web/20120718152158/http://www.maciei...
Maybe tangentially related, but I always think it's silly that every Linux process has its own copy of the same libgcc_s.so.1 loaded into memory, even though the raw binary for the library is exactly the same, so you end up with like 800 copies of libgcc_s.so.1 in memory.
I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?
Shared libraries (and mmapped files in general) are deduplicated; it's nowhere near as bad as you think. The kernel loads a .so into memory once and then maps that memory into every process that mmaps it.
Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.
How do you think position independent code can call functions from other .so's without being patched with their addresses?
They can't, so even PIC code still has to have a relocation table that gets patched. It's in a different page than the code though, so code does still get reused.
There's a part of the .so ELF file (the Global Offset Table aka GOT) that has to be modified with all the addresses of the functions being imported, which of course vary from process to process.
If not patching, what exactly would you call modifying part of the file?
And the GOT is just a big table of pointers like any other table of pointers your application manipulates as it runs.
This isn't meant as a reductive take, but instead that there is a difference between something completely describable in C, like the contents of the .got section, and something like a .reloc section that actually has to understand the generated assembly in order to build the relocation table to load and link the executable. Both are linking, but I've saved "patching" for more brain-surgery-esque techniques. Like on MIPS, the jump instruction immediate holds 26 bits of the absolute address of the target (bits 2 through 27), so you're going through and modifying all of the jump instructions if you load it somewhere other than where it was linked.
Those mappings by default all go to the same shared memory.
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
Typically libgcc_s.so is loaded by the dynamic linker, which uses an mmap call to map the binary into the address space.
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
In Linux, when a shared lib is loaded by multiple processes, it's loaded once and not duplicated in RAM. Only if a process modifies a memory page will that page be duplicated. (Hope I have explained that correctly)
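To make the "only duplicated when modified" part concrete, here is a minimal sketch in Python. It is not how the dynamic linker is written; it only uses two mappings of one temporary file inside a single process to stand in for two processes mapping the same library. The function name `private_mapping_demo` is made up for illustration.

```python
import mmap
import tempfile

def private_mapping_demo():
    """Map one file twice and show that a write to a private mapping stays private."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"shared page contents")
        f.flush()
        # Two independent mappings of the same file contents.
        shared_view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # ACCESS_COPY is a private, copy-on-write mapping.
        private_view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
        # The write lands in a private copy of the page; the file and the
        # other mapping keep the original bytes.
        private_view[0:6] = b"PATCHD"
        result = (bytes(shared_view[:6]), bytes(private_view[:6]))
        shared_view.close()
        private_view.close()
        return result
```

Calling `private_mapping_demo()` returns `(b"shared", b"PATCHD")`: the untouched mapping still sees the file's bytes, and only the written page diverged.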
I have a rule for myself. If I think something is silly or stupid, I assume I don't understand it. I usually find I do not understand it, and it no longer seems silly when I do understand it.
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.
I'm guessing that a big part of the problem with moving away from fork() in general is that each new process needs a copy of the parent process' environment anyway, right?
I'm a bit naive, but I don't think that's necessarily a requirement.
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time on).
But also UID, groups, controlling TTY, process group, capabilities, pipes, shared memory, etc. and the file descriptors while maybe not inherently needed are how a lot of Unix plumbing works.
The LWN article is incorrect in saying that it "must copy the entire process state (including memory) for the child process". There are some kernel structures and page tables that need to be initialized, plus you need a new stack, but it's not nearly as dramatic as implied. Most of the parent's memory is "incorporated by reference", so to speak.
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
Fork becomes more and more expensive the higher the RSS of the process, roughly 1 ms per 1 GB of process size with 4 KB pages. Given that modern servers can easily support 1-2 TB of RAM, the fork() part can easily take several hundred milliseconds, blocking everything in the meantime. So for larger programs you kinda have to have a "fork helper" process if you need to execute external programs for some reason.
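If you want to check the scaling claim on your own machine, a rough sketch like the following can help. It is only an approximation: it times `os.fork()` from Python, so interpreter overhead is included, and `time_fork` and its `resident_bytes` parameter are names invented here.

```python
import os
import time

def time_fork(resident_bytes):
    """Return seconds spent in os.fork() with roughly `resident_bytes` of touched memory."""
    ballast = bytearray(resident_bytes)
    # Write one byte per 4 KiB so the pages are actually resident, not just reserved.
    for offset in range(0, resident_bytes, 4096):
        ballast[offset] = 1
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: leave immediately, skipping interpreter cleanup
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return elapsed
```

Comparing `time_fork(0)` with `time_fork(256 << 20)` over a few runs gives a feel for how the cost grows with resident memory. The numbers depend heavily on kernel, hardware, and page size, so none are quoted here.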
The kernel does not copy every page, but it does have to copy all of the VMAs. Setting memory to COW (which can involve changing a lot of page-table-entries) is not free either. I guess I could have mentioned copy-on-write explicitly, but I do not believe that what I wrote was incorrect.
A lot of times you actively don't want the parent environment or any of the memory or file descriptors. And then you have to actively do work to fix all that stuff up after the fork.
This seems unnecessary to me. In the example, the core of git should be a library you can link so you don't need to run the binary. That would be better in every way.
Node, Python, PowerShell, and the rest do (almost) just that. launchd and systemd famously strived to remove as much shell from the start up process as possible because it was harming boot times and introducing unpredictability.
CPython doesn't usually create subprocesses unless specifically asked to, it loads Python modules and native extensions into its process. The former is similar (you're still extending an existing process with new code, just interpreted), the latter is literally dlopen(), so loading dynamic libraries.
A lot of other Python implementations don't have the ability to spin up new processes at all too.
I still don't really get this point. It's just two different things, spawning processes and running libraries. Seems like you're comparing apples and oranges to me.
Because it comes with a lot of overhead and, unless for some reason you really need every one of those processes to have its own address space, set of privileges, file descriptors, etc., there's no point in wasting resources repeatedly setting those up only to tear them down milliseconds later. Running the same workloads in an nginx-style process pool usually works better.
I see what you mean now. I agree, a sustained workload of creating many processes very quickly is probably not a great idea. But it's also useful to be able to spawn that process pool (and any number of other use cases like that) efficiently.
But when you use a process, you get tons of things for free, the subtask is invoked in parallel, you get isolation and you can control execution for free. Unless you are already writing a multithreaded program or already accept passing objects in memory, using a process is actually easier to write than using a library.
If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I am essentially reinventing a small scheduler, when I already get this from the OS for free. Also note that any crash in the third-party code will crash the whole program, and the third-party code has access to the whole address space. With invoking a process you also have a standardized API implemented by the OS.
I'm not sure what you mean by inventing a sync mechanism, all languages come with one. Same with a scheduler, either your language runtime or the OS (or both) will deal with scheduling.
Launching git repeatedly was probably not the best example. But it's hard to think of good examples where launching processes repeatedly is the most performant thing to do, probably because launching processes has been expensive and everyone has learned to do something else (libraries, zygotes, etc). Maybe a different question is: if launching processes were cheap, is there something we would implement as processes instead of libraries?
I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:
I just ran into this recently, where I had an obscure bug caused by needing to close more file descriptors in the forked process. "I want a clone of the current process" is just way less common in my experience than "I want a completely new process". It feels crazy that we don't have a way to directly express the latter thing, and can only approximate it by cloning and then fixing things up in post.
A thing that makes that complicated is that while you want that conceptually, you don't want that in reality. For instance, if the spawning process is in a container of some sort and it spawned a process that "shares nothing with the process that spawned it", the spawned process would no longer be in that container, because the state of "being in the container" is one of the things it shares with the parent process.
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
I also explicitly said this wasn't unsolvable. My point isn't about technical implementations or code, my point is that the casual "I want to share nothing about the parent process" thought in sanderj's mind, and presumably a lot of others', is much more ill-defined than they realize. There's a lot more state that a process has than what file descriptors are open in a modern system.
Moreover, as things like "in which container is this running" demonstrate, those are also not "create a process that has nothing to do with this process", because, again, there's a lot more to "having to do with this process" than "what file descriptors are open".
Also, as the name might have been a clue, Linux has posix_spawn: https://linux.die.net/man/3/posix_spawn. It also has a thing called "clone": https://www.man7.org/linux/man-pages/man2/clone.2.html Nor do I claim this paragraph is an entire overview of all the ways of starting a process in Linux. If you want to understand what I mean by "lots of details in a modern OS", your assignment is to carefully read the entire "clone" man page, and you'll start to see what I mean, though I'm not sure even that is all the state associated with a process nowadays.
Linux posix_spawn is a wrapper around clone and exec. There is no primitive on Linux to create an entirely blank process. This is adequately discussed in the linked LWN post.
Other operating systems either have parallel APIs to fork (e.g. the posix_spawn syscall on macOS) or do not provide fork at all (Windows).
You seem to persist in reading into my words claims that aren't there and then excitedly debunking them. I feel I'm extraneous to this process, though, so I think I'll let you carry on arguing with the guy in your head on your own terms. It's more fun for both of us.
It's not a casual thought. I recognize that there are lots of details, there always are, we're talking about computers :)
I don't think it is necessary (or the best implementation) to clone the parent process, in order to maintain important properties like the process tree / container state, etc. I recognize that it's a sorta neat hack, "well if we just start by cloning the parent, then we don't have to figure out what state to include!", but that just pushes the details to the child process needing to figure out what to exclude, which IMO is a worse default.
Yes, stipulated. And it's true that we should have a primitive for spawning a completely new process, because that's what we usually want. I agree that the details are both non-trivial and soluble.
But you generally want to communicate with that process, so you do need to set up e.g. file descriptors and stuff, which needs information from the parent process to be passed.
Most programming languages abstract this out to be able to connect or drop the 3 standard pipes. Typically this is the only thing that can be shared anyway unless the other program is specifically shared and expects other file handles to be available, in which case fork might be the right system call anyway.
Yes, you do want to pass in some stuff. But by default you get every single open file descriptor and a copy of every single stack that any threads use for execution.
It shares way too much, and there are huge use cases where that is really, really bad.
A variant of exec could take an initial table of file descriptors in the current process that get cloned into the new child. Pipe creation could also get rolled into this mechanism. That should take care of the most obvious leaky bit of fork()/exec(), at least.
There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works, you can get very confusing bugs once another file is opened into that fd number...
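The fd-number reuse hazard is easy to reproduce. A minimal single-threaded sketch (the name `fd_number_reuse` and the "library cached this number" framing are illustrative assumptions):

```python
import os

def fd_number_reuse():
    """Show that a closed descriptor number is handed out again by the next open."""
    read_end, write_end = os.pipe()
    stale = read_end          # pretend some library cached this number
    os.close(read_end)        # ...and some other code closed the descriptor
    # POSIX allocates the lowest free descriptor number, so this unrelated
    # open ends up with the number the "library" still believes is its pipe.
    unrelated = os.open(os.devnull, os.O_RDONLY)
    reused = (unrelated == stale)
    os.close(unrelated)
    os.close(write_end)
    return reused
```

In a single-threaded program this returns `True`: any later read or write through the stale number would silently hit the wrong file.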
This is an oft-overlooked point. An obvious place to look for improving fork+execve is to see whether posix_spawn can be given more efficient kernel mechanisms to be based upon.
And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.
posix_spawn addresses the need from userspace. Under the hood, it's still doing more or less a fork/exec, with the baggage that comes with it. A syscall would be nicer.
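For readers who have not used it, this is roughly what the userspace side of posix_spawn looks like, sketched through Python's `os.posix_spawn` wrapper (available since Python 3.8). The helper name `spawn_and_capture` is made up; the file-action tuples are the documented Python form of the `posix_spawn_file_actions_*` calls.

```python
import os
import sys

def spawn_and_capture(argv):
    """Run argv via os.posix_spawn and return (exit_status, captured_stdout)."""
    read_end, write_end = os.pipe()
    # File actions are the declarative stand-in for "code between fork and exec".
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_end, 1),  # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, read_end),     # child does not need the read side
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_end)                       # parent keeps only the read side
    chunks = []
    while chunk := os.read(read_end, 4096):
        chunks.append(chunk)
    os.close(read_end)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), b"".join(chunks)

if __name__ == "__main__":
    print(spawn_and_capture([sys.executable, "-c", "print('hi')"]))
```

Whether this goes through fork/clone plus exec or a dedicated system call is up to the platform's libc and kernel; the calling code looks the same either way.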
There are a lot of slightly different fork-exec-like things in the concept space and it's hard to imagine one approach satisfying them all. IMO it would be interesting to take an approach analogous-ish to sched_ext_ops where you built the rough flow chart of a combined fork-exec, but with hooks built to enable ebpf to change behavior or skip the bits these sophisticated users don't want/need.
Fork/exec is great if you actually want the traditional copy of your process for some reason.
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.
Relatively rarely, but in some performance sensitive use cases. Mine happens to be fuzzers, where a very cheap fork-like primitive would be a really big win.
Android and Chrome both benefit greatly from fork exec as part of their zygote model iirc. It substantially reduces the memory cost and latency of spawning new apps and tabs.
The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs. Every attempt to replace it with a combined call that I have seen so far seemed fundamentally poorer because it needs to add all configuration options as parameters to the call and then do this in a way that you can extend later and that does not become a mess.
I have the entirely opposite opinion. IMO a big mistake of the UNIXy model is that so much state is preserved across the creation of a process. For example, there are APIs to have a specific thing be fd number 4 so you can run a program and have it find that thing at fd 4. This is weird.
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
Is it weirder than being able to pass a variable precisely into argument 4? You do need to pass information to a subprocess and there needs to be some agreement on what means what. Sure, maybe you could use names instead of fds, but that sounds needlessly complicated.
That’s like saying you could use positions to specify function argument access (as in assembly) instead of variable names. File descriptors being numbers that are likely array indexes into a file handle table seems like a leaky abstraction. Having a namespace that a parent process shares with its children seems like a much cleaner design.
A way to pass a defined list of handles to a subprocess (or a friendly other process) makes sense. Having that mechanism be direct inheritance of those handles with the same numbering as the source is obnoxious.
You're simply failing to grasp the value of the simplicity, compatibility, and portability of POSIX/*nix. Inventing yet another way to create a process would be complex and break things. It's a-la-carte to enable configuration after fork of the new CoW or non-CoW process but before exec (unless vfork or similar were used). This is the model.
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
"The reasonable man adapts himself to POSIX: the unreasonable one persists in trying to adapt the POSIX to himself. Therefore all progress depends on the unreasonable man."
Having fd 4 mean something specific is no weirder than having fds 0,1, and 2 mean something specific, which is probably never going to change. At some point you just gotta embrace the Unix.
Actually, there is a native fork. There had to be, as POSIX personality support was a part of the Windows NT 3.1 design. What there wasn't was a Win32 form of fork. The Native API for Windows NT allowed it quite straightforwardly.
Well, a lot of the power of the UNIX shell comes from this and I see this as a major advantage over Windows. So no, I do not think Windows got it right.
Any kind of replacement should aim for the same conceptual simplicity and power. Sadly, I fear that people driving development nowadays are more interested in building unbreakable walled gardens for advertisement or app stores, or trying to squeeze out some small gain when used on the cloud. I am more interested in general computing on the user side.
A lot of features of UNIX shells are built around pipe and dup and the fork + exec model. One can certainly implement it differently, but it is - like UNIX in general - very nice and elegant.
Help me out here, please. Off the top of my head, the exec command is dependent on exec, except that a spawn + wait implementation would be a mostly okay substitute.
Pipes and redirections don’t need fork + exec. Neither do subshells.
If you use pipe() you get two ends in the same process, then you fork and child and parent can communicate. This is how a Unix shell sets up pipes and it is rather elegant.
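A sketch of that shell-style plumbing, written with Python's thin wrappers over pipe(), fork(), dup2() and exec. This is a simplified illustration, not how any particular shell is implemented; `run_pipeline` is a made-up name, and the second pipe exists only so the function can hand the final output back to its caller.

```python
import os
import sys

def run_pipeline(producer_argv, consumer_argv):
    """Run `producer | consumer` with pipe/fork/dup2/exec and return the consumer's stdout."""
    link_r, link_w = os.pipe()   # connects producer stdout to consumer stdin
    out_r, out_w = os.pipe()     # lets this function capture the final output
    stages = [(producer_argv, None, link_w), (consumer_argv, link_r, out_w)]
    pids = []
    for argv, stdin_fd, stdout_fd in stages:
        pid = os.fork()
        if pid == 0:
            try:
                # Child: rewire stdin/stdout with the ordinary APIs, then exec.
                if stdin_fd is not None:
                    os.dup2(stdin_fd, 0)
                os.dup2(stdout_fd, 1)
                # Python creates pipe fds close-on-exec, so the originals vanish here.
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)    # reached only if something above failed
        pids.append(pid)
    for fd in (link_r, link_w, out_w):
        os.close(fd)             # parent must drop its copies or EOF never arrives
    chunks = []
    while chunk := os.read(out_r, 4096):
        chunks.append(chunk)
    os.close(out_r)
    for pid in pids:
        os.waitpid(pid, 0)
    return b"".join(chunks)

if __name__ == "__main__":
    upper = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    print(run_pipeline([sys.executable, "-c", "print('hello')"],
                       [sys.executable, "-c", upper]))
```

All the per-child configuration happens in the few lines between `os.fork()` and `os.execvp()`, which is the property being praised here.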
Doing the same thing on Windows, I create the pipe and get two ends in the same process. Then I'd call CreateProcess and indicate I want the pipe's handle (fd) inherited to the child, and I'd use a prearranged way to tell the child what the fd value is it should use.
Possibly the most common way to tell the child the value is by setting it as a CLI arg in CreateProcess.
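The same "inherit exactly this handle and tell the child its value on the command line" shape can be sketched on POSIX with Python's `subprocess`, whose `pass_fds` parameter keeps only the listed descriptors (plus 0, 1, 2) open in the child. `CHILD_SOURCE` and `pass_one_fd` are names invented for this example.

```python
import os
import subprocess
import sys

# The child learns which descriptor to use from its first argument.
CHILD_SOURCE = "import os, sys; os.write(int(sys.argv[1]), b'hello from child')"

def pass_one_fd():
    """Give a child exactly one extra descriptor and tell it the number via argv."""
    read_end, write_end = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(write_end)],
        pass_fds=(write_end,),   # only this fd survives into the child
    )
    os.close(write_end)          # parent keeps just the read side
    data = os.read(read_end, 1024)
    os.close(read_end)
    proc.wait()
    return data
```

`pass_one_fd()` returns `b"hello from child"`. Under the hood this is still fork/exec-style process creation; only the interface is selective.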
Which special facilities are you referring to? If it's the ability to selectively inherit handles (fds) to a new process, Linux's lack of this "special facility" is nothing to be proud of.
How do you selectively pass on fds without having a global impact on your process?
I don't think it is a hack. I think it is a nice and clean API and the hate is largely irrational. I think one could improve usability for multi-threaded programs though.
Yeah. The right way to eliminate fork() is to make the usual APIs that modify process state take an explicit process handle, so the same APIs can be used to set up an empty process. They can also be composed in other ways, eg for IPC or debugging.
That's mostly papering over the design mistake that most syscalls don't accept a target pid. Otherwise you could just create a suspended process, configure it with syscalls that explicitly take a target pid, and start it.
Maybe, I am not saying fork() + exec() model couldn't be improved, but most people saying it is "terrible" and it needs to die seem to go on to propose something substantially worse.
I agree. I think the current way is very nice to use (in C). I think the best way would be to have something similar to vfork() but not bound by POSIX rules. Then make the normal POSIX APIs (close, setuid, etc.) act like the Rust “builder” pattern. Possibly giving them a prefix for explicitness. That way the “fill out a giant structure” people could have their wish and the people that just want a faster POSIX experience don’t have to learn an entirely new concept and API surface. It would be future extensible that way, too (just add more prefixed calls to the builder).
The new system calls described in the article have an extensible declarative command interface built into them to do things like close or duplicate file descriptors. Not opposed to it but it definitely stood out to me.
Calling that elegant is a path dependence of the history of fork+exec.
In an alternative world where fork+exec never existed, a lot of those "usual APIs" would probably have had an explicit pid argument to them that let you modify process configuration from a different process. (This is how Fuchsia works, e.g.). There's a lot of benefit to this world: the most obvious is that you don't have to magic up some IPC system just to report configuration errors, but there's actually a good amount of utility in being able to have a manager process that is tweaking attributes of its children (e.g., debuggers would love it).
Or you could call ptrace_syscall (that doesn't currently exist) on your child processes that you are tracing because you'd always be tracing them by default, or get an io_uring for the child process, or...
The value is not needing to change every other syscall and not needing to write new ones with a pid argument (besides which, what when you want to change it to a pidfd argument? then you add pidfd_syscall instead of duplicating every syscall again)
It should be spawn, configure, exec. Configure can be done if the process starts with a ptrace attachment and no threads, so you can force it to do syscalls. Linux doesn't even have a concept of "process with no threads", so it'd probably have to have a dummy thread.
The flip side of this is that you have to be aware of the entire state of the process, including everything done in libraries, in order to correctly start a new process.
Quick, what's the highest numbered open file descriptor in your program?
This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?
I kinda disagree, though I do see the usefulness here. While fork/exec can be useful in some cases, it'd be honestly pretty neat if the APIs took a pidfd argument (maybe with 0 meaning current process). The only problem is setuid/setgid binaries, I suppose, but maybe this case is better handled by special-casing `exec`.
For example
pidfd_t ps = spawn(); // creates a process stopped (kernel does this anyway by default)
setuid(ps, 33);
capset(ps, ...);
socket(ps, ...);
mmap(ps, ...);
process_vm_writev(ps, ...);
exec(ps, ...);
signal(ps, SIGCONT);
// error handling elided
I guess this is me being a little bit critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to" but...
It also makes things like thread safety even reasonably doable with fork. I do agree though that stuff like CreateProcess which take in a gazillion parameters don't really make for the greatest of userspace APIs
Maybe, a few people proposed this. It is a lot better than a single spawn call.
But how often would one actually need this? And what are the semantics? Refer arguments (e.g. file descriptors) to the current process or the other one? How are cross-permissions handled? It seems a lot of complexity...
Someone proposed a ptrace_syscall which could achieve the same thing.
Well, the idea is that it'd probably be close to the default API for spawning processes (and could even be the bedrock for posix_spawn and friends in libc (and potentially even "simple" fork cases[1])). fork/clone would be the special case
In most cases, most programs don't need special setup. Something like `ptrace_syscall` would also work for this and would be probably the way to do it with the backwards compat limitations of nowadays
ptrace-ability seems to be how permissions for this sort of thing are generally handled (see also procfs, process_vm_writev, ptrace, etc). The complication is a little bit around setuid programs, but you could either special-case execve to imply SIGCONT for setuid binaries or have execve always imply a SIGCONT
[1]: Probably would be rare for a compiler to optimize it though
I agree with it, although fork is still expensive, as they mention. There is clone with some flags, although that does not really solve it.
I think one problem is that it is already how it is; making an entirely new operating system (that is not Linux, not GNU, and not POSIX) would solve it, but that is not the case here, so it would need to be done as it is.
One possibility would be a new function that creates a new empty child process, but the parent process specifies what system calls the child process executes, and can stop if specifying that exec or exit is (successfully) called by the child process, or if the parent process gives it the program memory to execute directly instead of using a file (since that use is also useful). The new function can still have some of the clone flags available. (I don't actually know how much better it would work.)
There are other possibilities as well.
The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.
> The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs.
Unfortunately, the opposite is true, when the parent process is multi-threaded. In the child process, only one thread exists (the thread returning from fork()), but the memory is an exact copy of the parent's. As a result, the child may inherit locks (resident in memory) that are in acquired state, but have no owner threads -- the threads that are responsible for eventually releasing those locks in the child's copy of the process memory do not exist in the child. If the single thread in the child process (returning from fork()) attempts to take such a lock (before exec), it deadlocks. This is why POSIX says that only async-signal-safe functions may be called in a child process, between fork and exec. And then, for example, "malloc" is not such a function (at least per POSIX), so the fork-to-exec environment in the child process is an extremely uncomfortable one. You've got to preallocate everything in the parent, can't report errors to stderr, etc.
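The orphaned-lock situation can be reproduced in a few lines. This sketch uses Python threads, so it shows the same mechanism at the level of `threading.Lock`, not glibc's malloc lock; the function name is made up, and the child uses a timeout so that it reports the problem instead of hanging forever. Recent Python versions also print a DeprecationWarning when `os.fork()` is called with other threads running, for exactly this reason.

```python
import os
import threading

def lock_is_stuck_in_child():
    """Fork while another thread holds a lock; report whether the child is unable to take it."""
    lock = threading.Lock()
    held, release = threading.Event(), threading.Event()

    def holder():
        with lock:
            held.set()
            release.wait()   # keep the lock held across the fork

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()

    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Child: the holder thread does not exist here, but the copied
            # lock is still marked as held, so nobody will ever release it.
            acquired = lock.acquire(timeout=0.5)
            os.write(write_end, b"1" if acquired else b"0")
        finally:
            os._exit(0)
    os.close(write_end)
    child_acquired = os.read(read_end, 1) == b"1"
    os.close(read_end)
    os.waitpid(pid, 0)
    release.set()
    worker.join()
    return not child_acquired
```

`lock_is_stuck_in_child()` returns `True`. Without the timeout, the child would block forever, which is the deadlock described above.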
> fork() is a relatively expensive system call; it must copy the entire process state (including memory) for the child process. Many optimizations have been made over the years, but a fork is still a fundamentally costly operation. To make things worse, a fork() call is often immediately followed by an exec(), which will discard all of that memory that was so carefully copied for the child.
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
It says state. Copy on write still means it's O(number of page table entries) even if you don't copy the contents. It's a well known issue that forking a program with large virtual memory size is slow.
This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
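Some back-of-the-envelope arithmetic for the size of those structures, as a small sketch. The constants are assumptions: 4 KiB base pages and 8-byte entries, as on x86-64. It counts only last-level entries and ignores upper-level tables, VMAs, and anything the kernel can avoid copying, so treat it as a rough upper bound for anonymous memory.

```python
PAGE_SIZE = 4096   # bytes per base page (assumed: 4 KiB pages)
PTE_SIZE = 8       # bytes per page-table entry (assumed: 64-bit entries)

def leaf_page_table_cost(resident_bytes, page_size=PAGE_SIZE):
    """Return (entries, bytes) of last-level page-table entries mapping resident_bytes."""
    entries = -(-resident_bytes // page_size)   # ceiling division
    return entries, entries * PTE_SIZE

GIB = 1 << 30
```

`leaf_page_table_cost(GIB)` gives 262,144 entries, or 2 MiB of page tables per GiB of resident memory. With 2 MiB huge pages (`page_size=2 << 20`) the same GiB needs only 512 entries, which is why page size matters so much for fork cost.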
Even with copy-on-write, fork() still has to pay the setup cost for COW. If the parent process has a lot of busy threads (e.g. Java), you can end up doing a lot of unnecessary COW before exec() fires.
vfork() does NOT stop the world in many / most implementations. The ones that do stop the world do it because someone misunderstood the whole "vfork() stops the parent process" -- yes, it stops the parent process in a pre-threads world, but it doesn't have to stop any other threads but the one that called vfork(). Indeed, many implementations don't do that.
(Someone once tried to make NetBSD's vfork() stop the world because that's what the pre-threading man page said it does. I did my utter best to keep that from happening at the time, and it didn't then. Hopefully no one tried again later.)
Redis is the kind of process where this matters a lot, and while fork() doesn't copy the memory, it still needs to copy the page table. For a process holding tens of GBs of RAM, fork() can take a long time, and there's one every time Redis dumps its .rdb file or rewrites its binary log ("AOF").
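The reason fork() is attractive for this kind of dump is that the child gets a frozen point-in-time view for free. A toy sketch of the idea, which is not Redis's actual code: the function name is invented, and it writes the snapshot to a pipe where a real server would write a file.

```python
import json
import os

def snapshot_while_mutating(store):
    """Fork a child that serializes `store` as of the fork, while the parent keeps writing."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            # Child: sees the store exactly as it was at fork time.
            os.write(write_end, json.dumps(store, sort_keys=True).encode())
        finally:
            os._exit(0)
    os.close(write_end)
    store["written_after_fork"] = True   # parent mutation, invisible to the child
    chunks = []
    while chunk := os.read(read_end, 4096):
        chunks.append(chunk)
    os.close(read_end)
    os.waitpid(pid, 0)
    return json.loads(b"".join(chunks))
```

The returned snapshot never contains `written_after_fork`, even though the parent set it while the child may still have been serializing. The price is the page-table copy at fork time discussed here.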
On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages, it seems like it would be a key consideration here.
No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
The zygote pattern[1] is a great optimization to deal with the cost of forking, but IMHO, being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.
I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.
[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.
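For anyone who hasn't seen this variant of the pattern, here is a minimal sketch in Python. It assumes a Unix-like system; the `Spawner` class, the `_serve` helper and the one-JSON-line-per-request protocol are made up for illustration and are not taken from any particular application.

```python
import json
import os
import socket
import subprocess
import sys


def _serve(sock):
    """Spawner loop: read one JSON argv per line, run it, reply with one JSON line."""
    reader = sock.makefile("r")
    writer = sock.makefile("w")
    for line in reader:  # ends at EOF, i.e. when the main process closes its end
        argv = json.loads(line)
        done = subprocess.run(argv, capture_output=True, text=True)
        writer.write(json.dumps({"rc": done.returncode, "out": done.stdout}) + "\n")
        writer.flush()


class Spawner:
    """Hypothetical helper: fork a small process early and let it do later spawns."""

    def __init__(self):
        parent_sock, child_sock = socket.socketpair()
        self.pid = os.fork()  # do this at startup, while the process is still small
        if self.pid == 0:
            parent_sock.close()
            try:
                _serve(child_sock)
            finally:
                os._exit(0)  # never fall back into the parent's code path
        child_sock.close()
        self._sock = parent_sock
        self._reader = parent_sock.makefile("r")
        self._writer = parent_sock.makefile("w")

    def run(self, argv):
        """Ask the spawner to run argv; returns {'rc': int, 'out': str}."""
        self._writer.write(json.dumps(argv) + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())

    def close(self):
        self._reader.close()
        self._writer.close()
        self._sock.close()  # the spawner sees EOF and exits
        os.waitpid(self.pid, 0)


if __name__ == "__main__":
    spawner = Spawner()  # before opening a million fds or allocating GBs
    print(spawner.run([sys.executable, "-c", "print('hello from a child')"]))
    spawner.close()
```

The main process never forks again after startup, so its later size and fd count don't affect spawn cost. The price is the extra IPC hop and the need to create the spawner before anything expensive happens.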
You're referring to something else, and maybe I'm using the term "zygote" incorrectly.
In all uses of zygotes that I have seen, here's what's really happening:
- `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.
- To make this even faster, you have a pool of pre-forked processes sit around.
- Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.
So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).
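A rough sketch of that flavor, assuming a Unix-like system. `expensive_init` and `fork_worker` are placeholder names, and the dictionary stands in for whatever costly start-up state a real program would build.

```python
import os


def expensive_init():
    """Stand-in for costly start-up work (loading config, warming caches, ...)."""
    return {i: i * i for i in range(100_000)}


def fork_worker(table, key):
    """Fork a child that already has `table` in memory and let it answer one lookup."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: no re-initialization needed, the table came along with the fork.
        os.close(read_fd)
        try:
            os.write(write_fd, str(table[key]).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as f:
        answer = f.read()
    os.waitpid(pid, 0)
    return int(answer)


if __name__ == "__main__":
    table = expensive_init()       # paid once, in the zygote
    print(fork_worker(table, 12))  # each worker starts from the initialized state
```

One caveat for this particular language: CPython updates reference counts when objects are merely read, which dirties pages, so the CoW sharing is weaker here than it would be in C.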
Oh I see. I guess your zygotes have developed more than mine. I think Google may have coined or at least popularized the term zygote for this in Chrome and Android, Chrome documentation [1] says:
> A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.
I think reading the first sentence and stopping covers my zygote, but adding the second sentence covers yours. So I think we're both right!
I think both paths are useful. If your children need time to startup and become ready, spawn one that does start up work, and then it (pre)forks at the ready state to have processes ready to handle requests (your zygote). This does require a traditional fork() to avoid duplication of work.
But if forking is expensive at runtime because you have a million FDs open and a whole lot of memory allocations, spawn spawners before you start doing work (my zygote). This could be unnecessary with an inexpensive way to spawn a new process from a process that has lots of resources in use.
Of course, you can also use my zygotes to spawn your zygotes. Zygoteception.
I quite like the idea. I’m using OpenBSD on an oldish laptop, and fork-exec is expensive enough that it conflicts with the usb subsystem. Isochronous transfers have a 1ms realtime requirement and it seems that the fork-exec system calls hold the giant lock long enough to mess with it (audio stutters).
I’ve not bothered to profile it, but it seems that processes that have a lot of mapped pages are the issue (firefox, emacs, …). In the emacs case, the issue occurs when the main process tries to fork-exec; if I start a shell session (with shell-mode or term-mode), it works fine.
> Oh I see. I guess your zygotes have developed more than mine. I think Google may have coined or at least popularized the term zygote for this in Chrome and Android, Chrome documentation [1] says:
Google may have popularized the term, but this approach was already in use by KDE developers in the KDE 2.x timeframe, where it was used as part of a system called kdeinit.
In this scheme, launching KDE apps from a KDE desktop could bypass much of the startup cost of dynamic linking by forking from a long-running kdeinit process (with kdeinit itself deliberately linked to all large dependency libs like Qt and kdelibs), dynamically loading the application logic (stored as a .so) and then launching the app.
This was more to save startup time due to how long it took to dynamically resolve a multitude of C++-based symbols back then, all the common logic came before the app's own main() would ever be called. But it did also save a bit of memory as well.
Adding on to the sibling: what argument to clone allows me to set the fds of the child? AFAIK, you either share the FD table with the parent, or get a copy of it. If the parent has 1 million FDs open and the child doesn't want most of those, dealing with that has real costs. Many applications that tend to have large numbers of FDs and also fork/exec will mitigate the cost by spawning a process during startup that they can then use to spawn processes during runtime without doing it from the main process; this is a nice mitigation, but it shows a missing interface.
The paper explicitly covers this: various memory CoW/snapshot mechanisms are probably faster and safer than the zygote pattern. As it stands, getting the zygote pattern correct and safe is something you have to plan for upfront. You can’t retrofit it, which is why the paper mentions it has poor composability. Also, the advantages of the zygote pattern can be overstated: the memory sharing benefit is minimal because the fork has to happen so early, and modern OSes already transparently CoW duplicate pages in the background.
I recommend at least skimming the paper as it covers this. But essentially you can’t just inject a call at a random point in code to start being a zygote. It’s something you have to plan up front as to the exact point you’re going to fork and that you’re going to do it at the start of program before any threads have started or any files are open and before any locks have been acquired. It’s basically all the challenges of invoking fork at arbitrary points in time.
The reason to do a zygote in the first place could be solved with alternative special APIs that are safer and harder to misuse. But we have fork so there’s not as big of a demand despite the warts.
Yes, the zygote pattern makes it easy to turn fork() into a bottleneck - it requires a lot more discipline and low level tricks (linker scripts, compiler-specific extensions, custom sections, low level dependencies on pagesize that get "fun" on ARM servers).
If you don't, you might wake up with fork() causing latency issues.
You can create threads in the zygote. It doesn't "break down", but sure, there's a bit more work.
My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable, preferably lazily (they resume when they are actually needed). So, the zygotes are sitting with those threads suspended. When they become active, they can do work immediately. They might lazily resume those threads as needed.
There are other idioms for this too.
> Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives
Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.
Windows NT was never designed with pre-386 machines in mind. That was the territory of the old DOS+Windows. Windows NT from the get-go was for machines with page-based virtual memory.
This is not true. NT never had fork, was always based on the assumption of an MMU and Dave Cutler was a well known fork hater in the 80s long before this paper came out and made it cool to be so. By the time Windows 95 was out, the baseline was 386 with an MMU. CreateThread was initially designed for NT in 1993 though (which didn’t support pre-386 CPUs).
As mentioned elsewhere on this page, Windows NT had fork from the start. Vide NtCreateProcess and what happens if an image file is not explicitly supplied.
You haven't read the doco. I did point to some. The image file is supplied (or not) via the section object.
Think it through. Windows NT supported fork from the start in its POSIX subsystem, that subsystem was layered on top of the Native API, and this is the Native API mechanism that the POSIX subsystem employed. Although it took until Gary Nebbett for someone to publicly show how, even though people knew informally back in 1993.
NT was designed to be platform-agnostic, and its original target was the DEC Alpha. Its process model owes nothing to pre-386 CPUs. The WinAPI CreateProcess function is a layer atop NtCreateProcess, so that is where the pre-386 heritage lives. But even the WinAPI process model changed significantly with 32-bit Windows.
Windows NT was developed on various different CPUs before the Alpha was a thing. When it was released in 1993, it was released for three CPUs: IA-32, MIPS, and Alpha.
A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.
Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.
Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher level abstractions like pthreads.
You're right that POSIX semantics get tangled when using threads.
Well, Windows before NT isn't the same design as Windows 16 bit, it only shares the name for all practical purposes, and has more influence from OS/2 than Windows 16 bit.
Which is why I took the effort to explicitly refer to Windows NT on my comment, already expecting some traditional answers from UNIX folks.
Also, due to historical reasons, POSIX threads are the outcome of every UNIX going its own way implementing threads and finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.
> whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.
How are those not simply child processes? I don't understand your use of the word 'threads' here.
Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.
Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
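You can see a bit of this from user space. The sketch below assumes Linux, where `threading.get_native_id()` reports the kernel's per-task ID and `os.getpid()` reports the thread group ID; on other platforms the native ID means something else. `observe_ids` is just an illustrative name.

```python
import os
import threading


def observe_ids():
    """Return (pid, native thread id) as seen by the caller and by a second thread."""
    seen = {}

    def worker():
        seen["worker"] = (os.getpid(), threading.get_native_id())

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    seen["caller"] = (os.getpid(), threading.get_native_id())
    return seen


if __name__ == "__main__":
    ids = observe_ids()
    # Same getpid() (thread group ID), different kernel task IDs.
    print("caller:", ids["caller"])
    print("worker:", ids["worker"])
```

Both threads report the same pid, but each has its own kernel-level ID, which fits the description of a "process" as a group of tasks that share an address space.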
That's actually less accurate, not more. It's a post-hoc revision that conflates Unix with Linux.
The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.
The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.
Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.
Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.
PaulDavisThe1st is saying that the Unix pattern of forking a process (and not calling exec) was an early form of multi-threading (or multi-processing), but unlike threads in NT and later pthreads, they didn't share memory and communication between them required some form of IPC.
Yep, absolutely correct. It was true at the lowest level (the semantics of fork) and it was true at the app/platform design level: in Windows you used threads inside a process, on Unix you used multiple communicating processes.
This obviously changed as pthreads came into being, and at this point, I suspect that the typical use for threads-sharing-memory and threads-not-sharing-memory is the same on most platforms.
A reminder that the task_t data structure describes threads and processes not just in Linux, but earlier Unixen also.
The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.
True, but on Windows the approach is then to use COM servers, which have a faster IPC model than the UNIX fork/exec model or calling into CreateProcess all the time, and can even serve multiple clients, depending on how the apartment space is configured.
Windows has a richer set of IPC stuff than POSIX, especially since it has a microkernel-like design.
If you are going to say it is everything on the same memory space anyway, it isn't.
Optional on Windows 10, and enforced on Windows 11, Hyper-V is always running, and several components including kernel and driver modules are sandboxed into their little worlds.
Several additional sandboxing changes were announced at BUILD.
I would say that pipes and shared memory are the IPC mechanisms? Controlling the state of the exec'd process's file descriptors would count as a way to set up interprocess communication, but once that's done, it's the pipe or SHM that does the actual communication.
The problem with POSIX IPC is that passing file descriptors between processes (other than parent passing to child via fork) is hard. Yes, SCM_RIGHTS can do it, but it is quite error prone and rarely done.
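For reference, this is roughly what SCM_RIGHTS passing looks like when a wrapper hides the ancillary-data bookkeeping. The sketch assumes Python 3.9+ on a Unix-like system, where `socket.send_fds` and `socket.recv_fds` wrap the underlying sendmsg/recvmsg calls; `pass_fd_demo` is an illustrative name. The pipe is created after the fork on purpose, so the child cannot have inherited it.

```python
import os
import socket


def pass_fd_demo():
    """Send a pipe's write end to an already-running child and read what it writes."""
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        parent_sock.close()
        try:
            # recv_fds returns (data, fds, msg_flags, address)
            _, fds, _, _ = socket.recv_fds(child_sock, 1024, 1)
            os.write(fds[0], b"written by child")
        finally:
            os._exit(0)
    child_sock.close()
    read_fd, write_fd = os.pipe()  # created after fork: not inherited by the child
    socket.send_fds(parent_sock, [b"x"], [write_fd])  # at least one data byte is sent
    os.close(write_fd)
    os.waitpid(pid, 0)
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    parent_sock.close()
    return data


if __name__ == "__main__":
    print(pass_fd_demo())
```

In C the same thing needs a hand-built `msghdr` with a correctly sized and aligned control buffer, which is where most of the error-proneness comes from.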
That's like comparing apples and oranges. When tooling is tied to a platform, you're adding in the entire platform to the comparison.
Mozilla implemented an alternative to COM, called XPCOM. XP here means cross platform. Perhaps you could compare against that to take the platform out of the equation.
POSIX threads having problems with signals is, imho, mostly the problem with signals in general. They are pretty poorly designed: https://lwn.net/Articles/414618/
That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.
Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.
Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.
Okay but people don't claim that copying the instruction pointer (a single machine register) is the reason for any speed difference. They claim it's due to the memory sharing. And that's easily disproven since you can share pages, just like on Linux, simply by passing null for the section handle, yet there's still a performance difference.
Why does it matter which prefix I used? They both point to the same routine so my point applies either way.
It's a completely uncontroversial fact that NT does implement fork(). Turn to page 183 of Helen Custer's "Inside Windows NT" and you will read about it.
I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++
The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.
Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.
Didn't he just say that fork turns out to be comparatively faster than the non-fork samples we get? I.e., Linux spawns processes faster than Microsoft's kernels?
We don't have any broadly used non-fork samples. Windows, macOS, and Linux all have fork. So the presence of fork can't be the reason for the performance difference.
If you pass null for the section handle, it shares pages with the calling process, thus implementing a forking model. Or at least the parts of a forking model that some people erroneously believe are responsible for performance differences.
Didn't I just say that "the problem with fork isn't really that it's slow"? It's all the other OS design choices it forces on you if you want it to be fast.
CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.
At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.
CoW is probably a good idea regardless, yeah. Overcommit is more questionable. Regardless, both ought to be argued based on their own merits. It's unfortunate that both are necessary as a consequence of fork().
I don't think fork() mandates overcommit. OpenBSD doesn't seem to even allow overcommit or have an OOM killer, memory allocations that exceed available capacity fail immediately even if the memory is not touched.
Let's say you have 1GB RAM. You're running a program that occupies 600 MB. Now this program wants to launch a second small program that occupies 1 MB.
You're doing fork + exec.
If you're overcommitting, fork will not reserve another 600 MB, and exec immediately after fork will cause total system usage to be 601 MB.
If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.
> If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.
> Let's say you have 1GB RAM. You're running a program that occupies 600 MB. Now this program wants to launch a second small program that occupies 1 MB.
> You're doing fork + exec.
This is the clear problem: you don't want another process that's a duplicate of the current one; that's just a detail of what you actually want: a 1 MB process. Right now it's a badly leaky detail which you're forced to work around.
> The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1
It may not be slow, but for the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and does typically unnecessary checks of whether the refcounts became zero). A combined fork/exec syscall can avoid that work.
On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.
That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.
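posix_spawn is one existing answer to the parameter explosion: instead of a flag per combination, the caller hands over a list of file actions that are applied to the child in order. A minimal sketch using Python's `os.posix_spawn` wrapper, assuming a Unix-like system; `spawn_capture` is an illustrative name, and how the C library implements the call underneath is not shown here.

```python
import os
import sys


def spawn_capture(argv):
    """Run argv via posix_spawn with stdout redirected into a pipe.

    Returns (exit_status, captured_stdout_bytes).
    """
    read_fd, write_fd = os.pipe()  # Python creates these close-on-exec by default
    file_actions = [
        # Applied in the child before the new image runs:
        # make the pipe's write end the child's stdout (fd 1).
        (os.POSIX_SPAWN_DUP2, write_fd, 1),
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)  # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as f:
        output = f.read()
    _, status = os.waitpid(pid, 0)
    exit_status = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_status, output


if __name__ == "__main__":
    print(spawn_capture([sys.executable, "-c", "print('spawned')"]))
```

The list-of-actions shape keeps the call signature small, at the cost of only supporting the operations someone thought to define an action for.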
I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.
Another possible design is instead of forking the current process, you create a new empty process, then the parent calls syscalls to set up the new process, and eventually call exec on the child process. That does mean you either need new syscalls for that, or adapt existing syscalls to take a pidfd as an argument. That also solves some other problems with fork/exec where the default is to inherit a lot of things you probably don't want. With this, you can opt in to inheritance instead of having to opt out.
Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fork), but has its own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...
io_uring taught us that if syscalls are expensive, queue them up in a buffer with one syscall to transfer the thread to the os to process it. So, queue up the new process mutations in a buffer with a single syscall to process all of them in a batch. This model should have replaced repetitive syscalls across the kernel years ago.
> you create a new empty process, then the parent calls syscalls to set up the new process ...
That does seem like a much better design to me. But I wonder if that was considered way back at the dawn of computing and rejected for good reason?
> I think this is technically possible on linux, but there isn't a readily available interface for it.
Yes there is, see `man clone`. POSIX and glibc are quite different from the kernel in this regard. AFAIK under linux there are just threads of execution that might or might not share various namespaces and memory mappings. That said, the kernel does place a few artificial restrictions on what combinations are allowed in order to (as I understand it) guard against the unintended exercise of entirely untested combinations that serve no known practical purpose.
The practical problem is that if you start doing as you please with the various namespaces and mappings you quickly become incompatible with glibc and by extension most likely the majority of the dynamic libraries available on your system.
anarazel's comment focuses entirely on performance, indicating that they have an impression that the discussion about why fork is bad is about performance. I'm not entirely sure where this impression came from, as it's not mentioned in rom1v's quote nor a point in the linked paper, "A fork() in the road".
With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.
vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.
vfork() helps a LOT. The restrictions on what you can do on the child-side of vfork() are pretty much the same ones as for fork() + you must not do anything to damage the stack frame of the vfork() caller (i.e., you can't return).
In addition to what you said: forking from a process running on multiple cores is slow, since you have to mark all pages as read-only and shoot this out to all cores. TLB synchronization is super expensive. Unix originally didn't support threads (want concurrency? just fork!) but with modern multicore that's clearly unsustainable.
This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.
No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.
This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"
Don’t pretty much all OSes implement process startup in userspace? On macOS, the kernel creates a process with an image of dyld and points it at dyld_start, which actually takes care of parsing the Mach-O header. I assumed ld.so does the same job on Linux.
Nope, the kernel can load static ELF binaries. ld.so is only needed for dynamically linked binaries, and in fact many Go applications (for example, as they're statically linked) ship as containers with nothing but the single binary.
You can do this on macOS too: if you're willing to break all forward/backward compatibility and make direct syscalls, you can have a purely static binary. Without the LC_LOAD_DYLINKER command on the mach-o binary the kernel should just jump to the entrypoint based on LC_UNIXTHREAD. (This may no longer work on arm machines though if they actually trap on direct syscalls not through libSystem, similar to the BSDs)
Yes, it can all be done in userspace. When the "fork in the road" paper came up a while back someone linked to an example. https://grugq.github.io/docs/ul_exec.txt
I think fork() is more of a PDP-7 mistake than a PDP-11 mistake.
On the original UNIX system, memory was so limited that the only sane partitioning was to write the running program's memory image to disk, then reuse the running image as the child. An immediate consequence is the UNIX I/O model, where disk I/O is always synchronous (can't swap processes while waiting for disk I/O because swapping processes requires disk I/O). Anyway, as soon as the UNIX group got a PDP-11, the model broke down, because they had enough memory for multiple processes, but fork() didn't allow them to run concurrently, because their first PDP-11 didn't have an MMU. So they whined until they got one with an MMU instead of fixing their broken design.
> It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).
I think GP was saying that in QNX the spawning process was responsible for dynamically linking its child process before running it. With Linux, I think it's the spawned process taking care of its own dynamic linking.
On QNX the process spawning is done by sending a message to the userspace process manager, which creates a new process table entry and queues up its initial thread. When its initial thread gets a timeslice its entry point may be the dynamic loader (as specified in the PT_INTERP segment) which then does all the dynamic linking as the spawned process or it might be some other entry point like with a statically-linked executable.
So on QNX, the spawned process does all the dynamic linking. The spawning process just sends an asynchronous message to the process manager and then gets on with things in a very deterministic manner as befitting a hard realtime system.
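Whether a given ELF file asks for a dynamic loader at all can be checked by looking for a PT_INTERP program header. Here is a small sketch that reads it straight from the file; `elf_interpreter` is an illustrative name, the offsets follow the standard ELF32/ELF64 header layouts, and error handling for truncated files is left out.

```python
import struct
import sys

PT_INTERP = 3  # program header type that names the requested loader


def elf_interpreter(path):
    """Return the PT_INTERP path of an ELF file, or None if absent or not ELF."""
    with open(path, "rb") as f:
        ident = f.read(16)
        if ident[:4] != b"\x7fELF":
            return None
        is64 = ident[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
        endian = "<" if ident[5] == 1 else ">"  # EI_DATA: 1 = little endian
        if is64:
            f.seek(0x20)
            (phoff,) = struct.unpack(endian + "Q", f.read(8))
            f.seek(0x36)
        else:
            f.seek(0x1C)
            (phoff,) = struct.unpack(endian + "I", f.read(4))
            f.seek(0x2A)
        phentsize, phnum = struct.unpack(endian + "HH", f.read(4))
        for i in range(phnum):
            f.seek(phoff + i * phentsize)
            ph = f.read(phentsize)
            (p_type,) = struct.unpack_from(endian + "I", ph, 0)
            if p_type != PT_INTERP:
                continue
            if is64:
                (p_offset,) = struct.unpack_from(endian + "Q", ph, 8)
                (p_filesz,) = struct.unpack_from(endian + "Q", ph, 32)
            else:
                (p_offset,) = struct.unpack_from(endian + "I", ph, 4)
                (p_filesz,) = struct.unpack_from(endian + "I", ph, 16)
            f.seek(p_offset)
            return f.read(p_filesz).rstrip(b"\0").decode()
    return None


if __name__ == "__main__":
    print(elf_interpreter(sys.executable))
```

A statically linked executable has no such header and the function returns None, which is the case where the entry point is the program itself rather than the dynamic loader.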
It's a fairly widespread idea for architectures that try to move things out of kernel mode. The Hurd does program image file loading in userspace, too, in its exec server(s).
The tricky part is setting up the initial process. The way out for that is static linking and re-use of the fact that the operating system kernel loader has to understand and be able to load (at least a small subset of) program image file formats too.
But why is having a pair of separate independent operations, fork and exec, required to achieve this? A single fexec call could be implemented to work in the way you describe, no?
It can also mean that neither the hardware side nor the software side is static; both change over time. That means that their demands and what they allow also change over time. This leads to the insight that what was perhaps a good idea on 70s hardware/software is not necessarily a good, or even ok, idea 50 years later on modern hardware executing OSes and programs that have been kept up to date.
These discussions were definitely had back in the 20th century too. The spawn model versus the fork+execve model has been an on-going debate since the time of MS/PC/DR-DOS.
It is a weirdly common misconception that fork() is cheap... it is O(N) on the size of the process, and it always has been.
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.
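To put rough numbers on it, here is a back-of-the-envelope calculation. It assumes 8-byte page table entries (as on x86-64), counts only the leaf level, and uses 25 GiB as a stand-in for the ~25GB Redis figure quoted elsewhere in the thread; what a given kernel actually copies at fork time may differ.

```python
GIB = 1024 ** 3


def leaf_entries(resident_bytes, page_size):
    """Leaf page-table entries needed to map resident_bytes, rounded up."""
    return -(-resident_bytes // page_size)


def leaf_table_bytes(resident_bytes, page_size, entry_size=8):
    """Bytes of leaf page-table entries, assuming entry_size bytes per entry."""
    return leaf_entries(resident_bytes, page_size) * entry_size


resident = 25 * GIB
small = leaf_table_bytes(resident, 4 * 1024)         # 4 KiB pages
huge = leaf_table_bytes(resident, 2 * 1024 * 1024)   # 2 MiB huge pages

print(leaf_entries(resident, 4 * 1024), "entries with 4 KiB pages")
print(small // 2**20, "MiB of leaf entries with 4 KiB pages")
print(huge // 2**10, "KiB of leaf entries with 2 MiB pages")
```

Under these assumptions that is about 6.5 million entries (50 MiB) with 4 KiB pages versus 12,800 entries (100 KiB) with 2 MiB pages, which is why page size matters so much for fork latency on large processes.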
The whole approach of using fork seems to be unnatural for me. In many cases (even in the majority of them) it's not needed to inherit the whole structure of the parent process, but to start a given executable. Windows does this better with its CreateProcessW interface.
Fork always seemed conceptually terrible even when I first learned about it. If you want to do one thing (start a process) you should not have to use a mysterious incantation that does a different unrelated thing (forks your process) in order to do it.
I am curious about what the best way to handle the example in the article of one process spawning many git subprocesses is. Surely it just doesn't make sense to repeatedly start git from scratch in the course of a long-running parent operation. What's the low cost abstraction for the same result, though?
libgit2 exists. You could imagine communicating with some gitd over a pipe/socket but I don't know why that would be a good idea. Short of that you have to spawn processes.
Yeah, as someone who originally came from Windows, the fork+exec model never made sense to me. Now I know it's just a historical quirk, but for some reason there are still people who pretend that fork+exec is actually a good thing...
Fork is conceptually simple. Without bringing in any other layers, you start a process with the one thing known to exist: yourself.
Otherwise you need multiple steps to create a process, fill it with something to run, and arrange for it to execute. Or like Win32 you permanently smush them together with other layers, like filesystems and object loaders and linkers.
The only thing I want to inherit from the parent process is its cwd and environment variables, even those are often overridden. The rest can easily be passed explicitly through other channels like pipes or command line arguments.
Back to the example from the article. It makes no sense that a git subprocess forked from a web server needs to have any process state inherited from the web server.
Yes, exactly. Cloning, as a process creation primitive, is the one thing that doesn't need to be concerned with other stuff.
> … a git-subprocess forked from a web server …
That's pulling in a whole load of assumptions that are distinct from process creation. You can have processes in an environment that has no concept of file system or persistent storage at all.
I guess that way of thinking makes sense if you have a certain model of what a process is, in terms of the data structures and runtime state etc. But, tbh, I think of processes as glorified function calls, which happen to have that stuff involved as an implementation detail. And if spawning a process is supposed to act like a function call, then of course it should not inherit state. You should call the function you want to call, not call yourself with an instruction to switch over to it instead.
It's not conceptually simple. No other object creation API works by copying an existing thing and then modifying it. You don't create a new file by copying an existing one and then modifying it. You don't create a new window by copying an existing one and modifying it.
Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.
> Clone-and-modify is almost universal in version control systems.
It's closer to copy-on-write. Also, it actually makes sense there because in 99.999% of cases a commit actually is a modified copy of its parent. That isn't true for process spawning.
When core counts start needing more than 9 bits to be represented and RAM is in terabytes, many of the old assumptions need to change. Schedulers need to be implemented in userspace, RAM needs to be allocated in GB, not in 4k, I/O needs to require fewer round trips between kernel and user space, and NICs need to do a lot more work before the data reaches the CPU.
Does it need to be the same OS? Most consumer devices are in the low 16GB range for memory, with some outliers at 64 and 128 GB. 32 cores are still in the realm of specialized devices.
Yes, we’re not the ones paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting Formula 1 car parts onto a Camry.
Our software is littered with the consequences of these kinds of assumptions, and they have an impact on consumer use cases.
x86 still runs in real mode on boot despite dropping the PC BIOS.
Lots of software still assumes a 4kb page size, to the point where migrating Android to 16kb is an ongoing multi-year effort involving far too many people. And this is an OS for phones, which you might assume would lack the memory to benefit from a larger page size.
And one of the most popular consumer CPUs for enthusiasts, the Ryzen X3D chips, broke assumptions in both Linux and Windows schedulers that all cores have access to the same amount of L3 cache.
I would probably not assume the kinds of hardware limitations that we have now will persist into the useful lifetime of current software. Splitting the OS into "consumer" and "enterprise" variants is one of those moves that would bake in a ton of assumptions and make things messier in the future.
It’s all about contracts. It’s fine to define assumptions and build software on top of those. It’s also fine to break those and adjust the software. The trap is trying to steer towards a universal solution (Yagni is the cure there) or trying to slip something in that does not respect the contracts (hence bugs).
UEFI could have supported something like ELF and done away with real mode. Intel and AMD could have just introduced a new line of CPUs and everyone could have transitioned to that (with maybe shims to soften the change). But everyone is all about backwards compatibility and compile once, run for eternity.
I’m using Emacs and various cli tools and while threads are nice to have, they can easily ramp up the complexity of a program beyond what is necessary. I much prefer the boilerplate of setting up a thread pool and tasks queue, rather than dealing with all the await/async syntactic sugar.
I thought this was all fixed with special modes of clone that are optimized and don't actually copy anything (i.e., it creates a new deficient process that can pretty much only exec)?
Kind of. Those exist, but because Linux’s formal ABI is syscalls and not libraries that combine them in known-safe ways, the clone speedups that make fork faster are a confusing and fragile API for low-level programmers to use.
That, and even those clone-without-pagetable-copy improvements leave a lot of slowness on the table. Being able to skip even disable-able functionality intended for fork would simplify code. Also, for programs that launch the same subprocess many times, a better API might allow caching away some of the pre-entrypoint initialization of exec.
The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.
If you contrast that with win32, where you optionally pack a bunch of initial values into a struct, win32 is a much more narrow, less pleasant, less freeform interface, where it is harder to introduce more features.
But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.
> The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.
What do you mean underestimated? You can do anything between fork and exec; there are no limitations.
You're talking about libc design choices, not constraints imposed by the kernel. To the kernel, a post-fork pre-exec process is just any old process. GP was suggesting post-fork processes were constrained in the syscalls they could invoke; they are not.
I did not say they are constrained in what syscalls they can make, as if some nanny at the syscall entry point will punish you for doing wrong. I said that it interacts poorly with threads due to inherent race conditions. See the other comment.
Literally nothing in that comment mentions or discusses threads.
> I did not say they are constrained in what syscalls they can make
You wrote: "The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things."
Those are all syscalls. You can also invoke any of the other ~hundreds of syscalls linux exposes, not only dup2, setpgid, and a "few" others.
That's not true. Just one example, if you do anything with threads you are pretty screwed. For example if another thread holds a mutex at the time of fork(2), and you also want that mutex.
You can create threads in forked children before exec. Nothing in the kernel prevents you from invoking clone().
You're talking about libc (glibc) implementation details now; userspace programs running on the Linux kernel do not have to be implemented in C or use glibc's primitives. Your earlier comment I initially replied to was talking about kernel syscalls. Forked processes are free to invoke any syscall they want, not just dup2 or a handful of others.
I'm not talking about glibc implementation details. I'm talking about how mixing fork(2) with threads creates harmful race conditions.
The forked child has only 1 thread in its process. If the parent's threads are holding a lock or are in the middle of mutating a shared data structure, you're fucked, because those threads are no longer running in your child's copy of the address space and will not finish their work. This issue is fundamental to how threads work and what fork(2) does.
Again, you're talking about userspace now. Not kernel-imposed constraints. A userspace program is always free to deadlock itself; fork doesn't change that.
I never said it was a kernel imposed constraint. It remains unsafe behavior, and frankly you'd be stupid to ignore it if you want to write a stable multi threaded program. In colloquial shorthand, you can't do it.
Signal safety is not the same as this, but similar. I believe posix specifies what is signal-unsafe to be overly broad. But the unsafety isn't an illusion -- it's an emergent property from something being a bad idea given the primitives at work, there are broad categories of bugs that are easy to introduce due to the way it works. So for signals, posix declares a bunch of ill advised things to be undefined, and with good reason. This is an analogous scenario.
This means if the program is multi threaded, you cannot rely on calling malloc in the child, because at the time of the fork another thread could have happened to be inside malloc doing manipulations on the global heap.
Which means, practically speaking, "don't allocate memory between fork and exec".
If you want to be overly literal as you have been, you can call mmap and it will give you new pages, but who is really doing that? Not the random shared library code you might want to call into. Hell, even a lot of libc calls malloc.
Which means it's not safe to do a random library call between fork and exec.
See where I'm going with this? That's if your program is multi threaded. If it isn't, these things are most likely fine.
posix_spawn is emulated on Linux, but it is a native syscall on macOS (and possibly other OSes?). As discussed in the linked article, there is interest in changing Linux to adopt this model, where posix_spawn is its own fundamental primitive.
Yeah, I think it is a reasonable transition path or implementation detail for some systems to implement it in userland atop fork(2), and others to natively spawn a new process without copying the old address space.
The problem with replacing exec/fork is that you usually want to configure the new process: for example, set up signal handlers, close or open FDs, switch namespaces, set up seccomp, adjust permissions. And all the system calls to do it apply only to the current process, so you need something to replace them. The proposal in the article was to create a new API for this.
My idea is that we could make a new syscall, for example "spawn", that creates a new empty process, loads some lightweight "loader" into it, and passes arbitrary configuration data. The loader configures the process and exec()'s the main program. This avoids forking the memory and keeps the existing APIs, but still requires forking file descriptors and other things.
I liked the other proposal where you can create a blank process and then force it to make syscalls, ending with execve. That doesn't require a bunch of special data structures to hold the syscalls you want to do.
If fork and exec can exhibit persistent and algebraic behavior (beyond its CoW nature) that would not only be more useful but more interesting to use, for example using it for doing lazy evaluation
Huh, LWN has moved to (sometimes) requiring a click to proceed past the subscription pitch to the actual article. I feel like this may have an inverse effect (insistent begging to the point of inserting additional obstacles = angry/insulted users that are less likely to pay).
It's an experiment. Compared to the text-obscuring popovers that are prevalent elsewhere on the net, it seems pretty low-key; as far as I know, this is the first complaint I've seen. I don't know if we will continue experimenting with those or not...better ideas for getting people to subscribe to the site would be more than welcome.
This isn’t moving beyond fork and exec at all. It’s adding a complicated API for a marginal gain for a niche use case, and ignoring the actual big bottleneck of fork
I've always liked the Mach approach. You've got a few primitives:
- address space
- memory objects
- threads
Mix and match. A Task (process) is not a primitive, but a composite object combining address space with one or more threads. How you fill the address space with actual memory objects is up to you. Map from disk or COW your own address space...have fun!
ComputerGuru | a day ago
smj-edison | a day ago
sanderjd | a day ago
Chu4eeno | 21 hours ago
sanderjd | 21 hours ago
hparadiz | a day ago
201984 | a day ago
sjmulder | a day ago
ptspts | a day ago
monocasa | a day ago
201984 | a day ago
monocasa | a day ago
201984 | a day ago
If not patching, what exactly would you call modifying part of the file?
monocasa | a day ago
This isn't meant as a reductive take, but instead that there is a difference between completely describable in C like the contents of the .got section, and something like a .reloc section that actually has to understand the generated assembly in order to build the relocation table to load and link the executable. Both are linking, but I've saved "patching" for more brain surgery esque techniques. Like on mips, the jump instruction immediate is the bottom 26 bits of the absolute address of the target, so you're going through and modifying all of the jump instructions if you load it to somewhere it wasn't linked at.
t-3 | a day ago
monocasa | a day ago
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
saidinesh5 | a day ago
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...
mlaretallack | a day ago
BoingBoomTschak | a day ago
1718627440 | a day ago
johnthescott | a day ago
sirsinsalot | a day ago
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.
Sophira | a day ago
lokar | a day ago
dijit | a day ago
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time).
sjmulder | a day ago
lanstin | a day ago
zerobees | a day ago
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
nasretdinov | a day ago
corbet | a day ago
sanderjd | a day ago
ktpsns | a day ago
lokar | a day ago
sanderjd | a day ago
aerzen | a day ago
1718627440 | a day ago
lokar | a day ago
MBCook | a day ago
m132 | 22 hours ago
sanderjd | 21 hours ago
m132 | 8 hours ago
A lot of other Python implementations don't have the ability to spin up new processes at all too.
sanderjd | 6 hours ago
lokar | 7 hours ago
Bash as a programming language is just a bad idea.
sanderjd | a day ago
lokar | 7 hours ago
s/it/not/
pizlonator | a day ago
lokar | 7 hours ago
lokar | a day ago
kllrnohj | a day ago
sanderjd | a day ago
m132 | 22 hours ago
sanderjd | 21 hours ago
1718627440 | a day ago
If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I am essentially reinventing a small scheduler, when I already get this from the OS for free. Also, now any crash in the third-party code will crash the whole program, and the third-party code has access to the whole address space. With invoking a process you also have a standardized API implemented by the OS.
lokar | a day ago
omoikane | a day ago
I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:
https://dechifro.org/dcraw/#:~:text=Why%20don%27t%20you%20im...
sanderjd | a day ago
dnw | a day ago
sanderjd | a day ago
jerf | a day ago
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
dcrazy | a day ago
jerf | a day ago
I also explicitly said this wasn't unsolvable. My point isn't about technical implementations or code, my point is that the casual "I want to share nothing about the parent process" thought in sanderjd's mind, and presumably a lot of others', is much more ill-defined than they realize. There's a lot more state that a process has than what file descriptors are open in a modern system.
Moreover, as things like "in which container is this running" demonstrate, those are also not "create a process that has nothing to do with this process", because, again, there's a lot more to "having to do with this process" than "what file descriptors are open".
Also, as the name might have been a clue, Linux has posix_spawn: https://linux.die.net/man/3/posix_spawn. It also has a thing called "clone": https://www.man7.org/linux/man-pages/man2/clone.2.html Nor do I claim this paragraph is an entire overview of all the ways of starting a process in Linux. If you want to understand what I mean by "lots of details in a modern OS", your assignment is to carefully read the entire "clone" man page, and you'll start to see what I mean, though I'm not sure even that is all the state associated with a process nowadays.
dcrazy | a day ago
Other operating systems either have parallel APIs to fork (e.g. the posix_spawn syscall on macOS) or do not provide fork at all (Windows).
jerf | 8 hours ago
sanderjd | a day ago
I don't think it is necessary (or the best implementation) to clone the parent process, in order to maintain important properties like the process tree / container state, etc. I recognize that it's a sorta neat hack, "well if we just start by cloning the parent, then we don't have to figure out what state to include!", but that just pushes the details to the child process needing to figure out what to exclude, which IMO is a worse default.
sanderjd | a day ago
JoBrad | a day ago
wongarsu | a day ago
1718627440 | a day ago
jonhohle | a day ago
sanderjd | a day ago
stefan_ | a day ago
yxhuvud | a day ago
It shares way too much, and has huge use cases where it is really, really bad.
sanderjd | a day ago
gmueckl | 19 hours ago
stabbles | a day ago
anarazel | a day ago
JdeBP | a day ago
anarazel | a day ago
sanderjd | a day ago
7jjjjjjj | a day ago
Isn't that what posix_spawn is for?
yxhuvud | a day ago
JdeBP | a day ago
And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.
* https://blog.netbsd.org/tnf/entry/posix_spawn_syscall_added
dcrazy | a day ago
JdeBP | a day ago
toast0 | a day ago
debatem1 | a day ago
MBCook | a day ago
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.
debatem1 | a day ago
surajrmal | 19 hours ago
uecker | a day ago
amluto | a day ago
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
1718627440 | a day ago
jonhohle | a day ago
amluto | a day ago
burnt-resistor | a day ago
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
wvenable | a day ago
― George Bernard Shaw, probably.
__david__ | a day ago
JdeBP | a day ago
* https://jdebp.uk/FGA/bernstein-on-ttys/cttys.html
Interestingly, on MS/PC/DR-DOS file descriptor 3 was stdaux, and file descriptor 4 was stdprn.
chasil | a day ago
The Windows approach may be correct, but it suffers in performance from the POSIX perspective.
I have heard that WSL1 improves this.
amluto | a day ago
Windows does not historically depend on fork(), so there was no native fork(), so Cygwin kludged it up.
JdeBP | a day ago
jkrejcha | 23 hours ago
Though actually iirc werfault uses NtCreateUserProcess() to clone processes when writing out crash dumps to this day
uecker | a day ago
Any kind of replacement should aim for the same conceptual simplicity and power. Sadly, I fear that people driving development nowadays are more interested in building unbreakable walled gardens for advertisement or app stores, or trying to squeeze out some small gain when used in the cloud. I am more interested in general computing on the user side.
dcrazy | a day ago
uecker | a day ago
amluto | 22 hours ago
Pipes and redirections don’t need fork + exec. Neither do subshells.
uecker | 16 hours ago
dwattttt | 9 hours ago
Possibly the most common way to tell the child the value is by setting it as a CLI arg in CreateProcess.
uecker | 8 hours ago
dwattttt | 58 minutes ago
How do you selectively pass on fds without having a global impact on your process?
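On the POSIX side, one way to sketch an answer is Python's `subprocess` with `pass_fds`: descriptors are created non-inheritable, the one you want is whitelisted per spawn, and its number is handed to the child as a command-line argument. The function name and the inline child program below are invented for the example.

```python
import os
import subprocess
import sys

def child_reads_extra_fd(message: bytes) -> bytes:
    """Pass exactly one extra pipe end to a child and tell it the fd
    number via argv. Returns whatever the child read from that fd."""
    r, w = os.pipe()  # both ends are non-inheritable by default in Python
    child_code = (
        "import os, sys; "
        "sys.stdout.buffer.write(os.read(int(sys.argv[1]), 1024))"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code, str(r)],
        pass_fds=(r,),            # only this descriptor survives into the child
        stdout=subprocess.PIPE,
    )
    os.close(r)                   # the parent no longer needs the read end
    os.write(w, message)
    os.close(w)
    out, _ = proc.communicate()
    return out
```

The write end `w` is never visible to the child, and the parent's descriptors keep their non-inheritable flag, so other spawns happening elsewhere in the program are unaffected.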
IshKebab | 14 hours ago
uecker | 7 hours ago
IshKebab | 7 hours ago
uecker | 7 hours ago
IshKebab | 6 hours ago
I dunno, that's the best I can do for now. Maybe you can do better?
uecker | 5 hours ago
fanf2 | a day ago
garaetjjte | a day ago
uecker | a day ago
trumpdong | a day ago
__david__ | a day ago
cryptonector | 17 hours ago
POSIX says nothing much about vfork() anymore. It was a mistake removing it. Zealots failed to understand that vfork() >> fork(). https://news.ycombinator.com/item?id=30502392
PaulDavisThe1st | a day ago
matheusmoreira | a day ago
jcranmer | a day ago
In an alternative world where fork+exec never existed, a lot of those "usual APIs" would probably have had an explicit pid argument to them that let you modify process configuration from a different process. (This is how Fuchsia works, e.g.). There's a lot of benefit to this world: the most obvious is that you don't have to magic up some IPC system just to report configuration errors, but there's actually a good amount of utility in being able to have a manager process that is tweaking attributes of its children (e.g., debuggers would love it).
uecker | a day ago
trumpdong | a day ago
uecker | a day ago
But frankly, I am not really seeing the value.
trumpdong | a day ago
uecker | 11 hours ago
trumpdong | 7 hours ago
trumpdong | a day ago
pjc50 | a day ago
Quick, what's the highest numbered open file descriptor in your program?
This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?
jkrejcha | a day ago
For example
I guess this is me being a bit critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to", but... it also makes things like thread safety reasonably doable with fork. I do agree, though, that APIs like CreateProcess, which take a gazillion parameters, don't really make for the greatest of userspace APIs.
uecker | a day ago
But how often would one actually need this? And what are the semantics? Refer arguments (e.g. file descriptors) to the current process or the other one? How are cross-permissions handled? It seems a lot of complexity...
Someone proposed a ptrace_syscall which could achieve the same thing.
jkrejcha | 23 hours ago
Well, the idea is that it'd probably be close to the default API for spawning processes (and could even be the bedrock for posix_spawn and friends in libc (and potentially even "simple" fork cases[1])). fork/clone would be the special case
In most cases, most programs don't need special setup. Something like `ptrace_syscall` would also work for this and would be probably the way to do it with the backwards compat limitations of nowadays
ptrace-ability seems to be generally how permissions for this sort of thing are handled in general (see also procfs, process_vm_writev, ptrace, etc). The complication is a little bit around setuid programs but either you could special case execve to imply SIGCONT for setuid or have execve also imply a SIGCONT as well
[1]: Probably would be rare for a compiler to optimize it though
zzo38computer | 21 hours ago
I think one problem is that it is already how it is; making an entirely new operating system (that is not Linux, not GNU, and not POSIX) would solve it, but that is not the case here, so it would need to be done as it is.
One possibility would be a new function that creates a new empty child process, but the parent process specifies what system calls the child process executes, and can stop if specifying that exec or exit is (successfully) called by the child process, or if the parent process gives it the program memory to execute directly instead of using a file (since that use is also useful). The new function can still have some of the clone flags available. (I don't actually know how much better it would work.)
There are other possibilities as well.
The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.
fonheponho | 8 hours ago
Unfortunately, the opposite is true, when the parent process is multi-threaded. In the child process, only one thread exists (the thread returning from fork()), but the memory is an exact copy of the parent's. As a result, the child may inherit locks (resident in memory) that are in acquired state, but have no owner threads -- the threads that are responsible for eventually releasing those locks in the child's copy of the process memory do not exist in the child. If the single thread in the child process (returning from fork()) attempts to take such a lock (before exec), it deadlocks. This is why POSIX says that only async-signal-safe functions may be called in a child process, between fork and exec. And then, for example, "malloc" is not such a function (at least per POSIX), so the fork-to-exec environment in the child process is an extremely uncomfortable one. You've got to preallocate everything in the parent, can't report errors to stderr, etc.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/f...
https://pubs.opengroup.org/onlinepubs/9799919799/functions/V...
The fork(2) Linux manual page spells out the same restriction.
https://man7.org/linux/man-pages/man2/fork.2.html
https://man7.org/linux/man-pages/man7/signal-safety.7.html
"pthread_atfork" exists, but is effectively unusable.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/p...
uecker | 8 hours ago
mrkeen | a day ago
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
FooBarWidget | a day ago
mort96 | a day ago
m00x | a day ago
I guess it depends on how sensitive your application is to main thread pauses.
trumpdong | a day ago
Joker_vD | 22 hours ago
tempest_ | 21 hours ago
tux3 | a day ago
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
cls59 | a day ago
josefx | a day ago
j16sdiz | 20 hours ago
> Attempts (such as vfork()) have been made over the years to optimize for this case, but the pattern still is more expensive than it could be.
Basically vfork does a "stop the world".
cryptonector | 17 hours ago
vfork() does NOT stop the world in many / most implementations. The ones that do stop the world do it because someone misunderstood the whole "vfork() stops the parent process" -- yes, it stops the parent process in a pre-threads world, but it doesn't have to stop any other threads but the one that called vfork(). Indeed, many implementations don't do that.
(Someone once tried to make NetBSD's vfork() stop the world because that's what the pre-threading man page said it does. I did my utter best to keep that from happening at the time, and it didn't then. Hopefully no one tried again later.)
epcoa | a day ago
For the intended audience of such a paper this is base knowledge.
thamer | a day ago
Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...
On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages, it seems like it would be a key consideration here.
No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.
rom1v | a day ago
> ABSTRACT
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
pizlonator | a day ago
Hard to come up with an optimization that is equally efficient and elegant
toast0 | a day ago
I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.
[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.
pizlonator | a day ago
In all uses of zygotes that I have seen, here's what's really happening:
- `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.
- To make this even faster, you have a pool of pre-forked processes sit around.
- Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.
So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).
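The pre-fork variant described in the list above can be sketched roughly like this. `expensive_init` and `prefork_workers` are hypothetical names, and the lookup table stands in for whatever costly start-up work the real process does.

```python
import os


def expensive_init() -> dict:
    # Stand-in for costly start-up work (assumption): build a table once.
    return {n: n * n for n in range(100_000)}


def prefork_workers(count: int) -> list:
    """Initialize once, then fork `count` children that reuse the result.

    Each child reports one value from the inherited table through a pipe.
    """
    table = expensive_init()  # paid once, before any fork
    results = []
    for i in range(count):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: the table is already in memory, inherited copy-on-write.
            os.close(read_fd)
            os.write(write_fd, str(table[i + 2]).encode())
            os._exit(0)
        os.close(write_fd)
        results.append(int(os.read(read_fd, 64)))
        os.close(read_fd)
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    print(prefork_workers(3))
```

None of the children re-run the initialization; they just read what the parent built before forking.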
toast0 | a day ago
> A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.
I think reading the first sentence and stopping covers my zygote, but adding the second sentence covers yours. So I think we're both right!
I think both paths are useful. If your children need time to startup and become ready, spawn one that does start up work, and then it (pre)forks at the ready state to have processes ready to handle requests (your zygote). This does require a traditional fork() to avoid duplication of work.
But if forking is expensive at runtime because you have a million FDs open and a whole lot of memory allocations, spawn spawners before you start doing work (my zygote). This could be unnecessary with an inexpensive way to spawn a new process from a process that has lots of resources in use.
Of course, you can also use my zygotes to spawn your zygotes. Zygoteception.
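For the "spawn spawners before you start doing work" flavour, here is a minimal sketch. It assumes a line-oriented JSON protocol over two pipes, and the names `start_spawner`, `spawn_via` and `stop_spawner` are invented for illustration. The helper is forked while the application is still small and does all later process creation on its behalf.

```python
import json
import os
import subprocess
import sys


def start_spawner():
    """Fork a small helper early; it launches programs on request."""
    req_r, req_w = os.pipe()
    rep_r, rep_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(req_w)
            os.close(rep_r)
            with os.fdopen(req_r, "r") as requests, os.fdopen(rep_w, "w") as replies:
                for line in requests:  # one JSON argv per line
                    done = subprocess.run(json.loads(line), capture_output=True, text=True)
                    replies.write(json.dumps({"rc": done.returncode, "out": done.stdout}) + "\n")
                    replies.flush()
        finally:
            os._exit(0)  # never fall back into the parent's code path
    os.close(req_r)
    os.close(rep_w)
    return pid, os.fdopen(req_w, "w"), os.fdopen(rep_r, "r")


def spawn_via(spawner, argv):
    """Ask the helper to run argv and return its exit code and stdout."""
    _, to_spawner, from_spawner = spawner
    to_spawner.write(json.dumps(argv) + "\n")
    to_spawner.flush()
    return json.loads(from_spawner.readline())


def stop_spawner(spawner):
    pid, to_spawner, from_spawner = spawner
    to_spawner.close()  # EOF ends the helper's request loop
    from_spawner.close()
    os.waitpid(pid, 0)


if __name__ == "__main__":
    helper = start_spawner()
    print(spawn_via(helper, [sys.executable, "-c", "print('hi')"]))
    stop_spawner(helper)
```

The main process can grow as large as it likes afterwards; the fork cost is paid by the small helper instead.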
[1] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...
skydhash | a day ago
I’ve not bothered to profile it, but it seems that processes with a lot of mapped pages are the issue (firefox, emacs,…). In the emacs case, the issue arises when the main process tries to fork-exec; if I start a shell session (with shell-mode or term-mode), it works fine.
mpyne | a day ago
Google may have popularized the term, but this approach was already in use by KDE developers in the KDE 2.x timeframe, where it was used as part of a system called kdeinit.
In this scheme, launching KDE apps from a KDE desktop could bypass much of the startup cost of dynamic linking by forking from a long-running kdeinit process (with kdeinit itself deliberately linked to all large dependency libs like Qt and kdelibs), dynamically loading the application logic (stored as a .so) and then launching the app.
This was more to save startup time due to how long it took to dynamically resolve a multitude of C++-based symbols back then, all the common logic came before the app's own main() would ever be called. But it did also save a bit of memory as well.
PaulDavisThe1st | a day ago
It's called clone(2)
trumpdong | a day ago
eggnet | 7 hours ago
toast0 | a day ago
vlovich123 | a day ago
loeg | a day ago
vlovich123 | 22 hours ago
The reason to do a zygote in the first place could be solved with alternative special APIs that are safer and harder to misuse. But we have fork so there’s not as big of a demand despite the warts.
loeg | 2 hours ago
p_l | a day ago
Yes, the zygote pattern makes it easy to turn fork() into a bottleneck - it requires a lot more discipline and low level tricks (linker scripts, compiler-specific extensions, custom sections, low level dependencies on pagesize that get "fun" on ARM servers).
If you don't, you might wake up with fork() causing latency issues.
cyberax | 17 hours ago
Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
pizlonator | 5 hours ago
My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable, preferably lazily (they resume when they are actually needed). So, the zygotes are sitting with those threads suspended. When they become active, they can do work immediately. They might lazily resume those threads as needed.
There are other idioms for this too.
> Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives
anarazel | a day ago
I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.
pjmlp | a day ago
Traditionally Windows applications that create processes all the time come from UNIX heritage.
Contrary to UNIX, Windows NT was designed with threads first mentality, from the get go.
On UNIX they were added after the fact, and to this day there are gotchas mixing POSIX threads with signals, fork and exec.
zozbot234 | a day ago
JdeBP | a day ago
* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...
pstuart | a day ago
epcoa | a day ago
JdeBP | a day ago
* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...
dcrazy | a day ago
JdeBP | a day ago
Think it through. Windows NT supported fork from the start in its POSIX subsystem, that subsystem was layered on top of the Native API, and this is the Native API mechanism that the POSIX subsystem employed. Although it took until Gary Nebbett for someone to publicly show how, even though people knew informally back in 1993.
keitmo | a day ago
pjmlp | a day ago
Misread on purpose to make a point?
dcrazy | a day ago
peterfirefly | 22 hours ago
https://en.wikipedia.org/wiki/Windows_NT#Development
Windows NT was developed on various different CPUs before the Alpha was a thing. When it was released in 1993, it was released for three CPUs: IA-32, MIPS, and Alpha.
dcrazy | 19 hours ago
Raymond also says elsewhere that most WinNT engineers did development on i386, but doesn’t explicitly say what time period he is describing: https://devblogs.microsoft.com/oldnewthing/20250513-00/?p=11...
PaulDavisThe1st | a day ago
Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.
Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher level abstractions like pthreads.
You're right that POSIX semantics get tangled when using threads.
pjmlp | a day ago
Which is why I took the effort to explicitly refer to Windows NT in my comment, already expecting some traditional answers from UNIX folks.
Also, for historical reasons, POSIX threads are the outcome of every UNIX going its own way implementing threads and finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.
snozolli | a day ago
How are those not simply child processes? I don't understand your use of the word 'threads' here.
Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.
pjmlp | a day ago
The unit of execution is the thread.
On the UNIX world it depends on which UNIX you are talking about.
Linux has a similar model to Windows NT nowadays, hence clone() as key primitive.
Other UNIXes have different approaches.
PaulDavisThe1st | 23 hours ago
trumpdong | a day ago
Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
JdeBP | a day ago
The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.
The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.
Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.
Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.
thayne | a day ago
PaulDavisThe1st | 23 hours ago
This obviously changed as pthreads came into being, and at this point, I suspect that the typical use for threads-sharing-memory and threads-not-sharing-memory is the same on most platforms.
A reminder that the task_t data structure describes threads and processes not just in Linux, but earlier Unixen also.
knome | a day ago
pjmlp | a day ago
sunshowers | a day ago
pjmlp | a day ago
mort96 | a day ago
pjmlp | a day ago
Windows has a richer set of IPC stuff than POSIX, especially since it has a microkernel-like design.
If you are going to say it is everything on the same memory space anyway, it isn't.
Optional on Windows 10, and enforced on Windows 11, Hyper-V is always running, and several components including kernel and driver modules are sandboxed into their little worlds.
Several additional sandboxing changes were announced at BUILD.
mort96 | a day ago
pjmlp | a day ago
This is how a http server back in the day would share the request context for the child process to reply back.
mort96 | a day ago
tliltocatl | 7 hours ago
yencabulator | 59 minutes ago
dcrazy | a day ago
.NET tried this with app domains, which are now deprecated.
pjmlp | a day ago
Also App Domains are partially back in .NET Core, isolation features aren't there, but code unloading is, via AssemblyLoadContext.
dcrazy | 23 hours ago
tosti | 2 hours ago
Mozilla implemented an alternative to COM, called XPCOM. XP here means cross platform. Perhaps you could compare against that to take the platform out of the equation.
nine_k | a day ago
nvme0n1p1 | a day ago
dcrazy | a day ago
Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.
nvme0n1p1 | a day ago
Why does it matter which prefix I used? They both point to the same routine so my point applies either way.
netbsdusers | 10 hours ago
aseipp | a day ago
mort96 | a day ago
Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.
theK | a day ago
nvme0n1p1 | a day ago
(Windows's fork is called ZwCreateProcess)
dcrazy | a day ago
nvme0n1p1 | a day ago
Someone | a day ago
I don’t know how they implemented it, though. Under the hood, it could do the equivalent of a fork/exec pair.
plorkyeran | a day ago
dcrazy | a day ago
mort96 | a day ago
theK | a day ago
marcosdumay | a day ago
At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.
mort96 | a day ago
mpyne | a day ago
vbezhenar | a day ago
You're doing fork + exec.
If you're overcommitting, fork will not reserve another 600 MB, and exec immediately after fork will cause total system usage to be 601 MB.
If you're not overcommitting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.
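The arithmetic, spelled out as a deliberately simplified model (real kernel accounting has more knobs than a single on/off flag; the function name and the 1 MB child are assumptions chosen to match the 601 MB figure):

```python
def peak_commit_mb(parent_mb: int, child_mb: int, overcommit: bool) -> int:
    """Peak memory the system must account for across fork + exec (toy model)."""
    if overcommit:
        # Copy-on-write pages are not charged at fork time; after exec only
        # the new program's own footprint is added.
        return parent_mb + child_mb
    # Strict accounting charges a full second copy of the parent at fork time.
    return parent_mb * 2


if __name__ == "__main__":
    total_ram_mb = 1024
    for overcommit in (True, False):
        peak = peak_commit_mb(600, 1, overcommit)
        verdict = "fits" if peak <= total_ram_mb else "fork fails"
        print(f"overcommit={overcommit}: peak {peak} MB -> {verdict}")
```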
agwa | 21 hours ago
cylemons | 20 hours ago
Does this accounting apply to vfork as well?
cryptonector | 17 hours ago
dwattttt | 16 hours ago
> You're doing fork + exec.
This is the clear problem: you don't want another process that's a duplicate of the current one, that's just a detail of what you actually want: a 1mb process. Right now it's a badly leaky detail which you're forced to work around.
dapperdrake | a day ago
Only being half facetious here. Maybe you or someone else really has a better take.
mort96 | a day ago
Someone | a day ago
It may not be slow, but for the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and typically does unnecessary checks of whether the refcounts became zero). A combined fork/exec syscall can avoid that work.
On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.
That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.
I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.
thayne | a day ago
Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fork), but has its own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...
dcrazy | a day ago
infogulch | 18 hours ago
thayne | 18 hours ago
fc417fc802 | 18 hours ago
That does seem like a much better design to me. But I wonder if that was considered way back at the dawn of computing and rejected for good reason?
> I think this is technically possible on linux, but there isn't a readily available interface for it.
Yes there is, see `man clone`. POSIX and glibc are quite different from the kernel in this regard. AFAIK under linux there are just threads of execution that might or might not share various namespaces and memory mappings. That said, the kernel does place a few artificial restrictions on what combinations are allowed in order to (as I understand it) guard against the unintended exercise of entirely untested combinations that serve no known practical purpose.
The practical problem is that if you start doing as you please with the various namespaces and mappings you quickly become incompatible with glibc and by extension most likely the majority of the dynamic libraries available on your system.
cryptonector | 17 hours ago
Though I want a posix_spawn-as-a-system-call approach as well / instead of that.
foresto | a day ago
Did someone suggest that it was?
mort96 | a day ago
adgjlsfhk1 | a day ago
thayne | a day ago
vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.
cryptonector | 17 hours ago
tliltocatl | a day ago
emmelaich | 17 hours ago
To avoid the problems, see roc's comment under the article. Esp use of a zygote process.
netbsdusers | 10 hours ago
omoikane | a day ago
https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)
[OP] jwilk | a day ago
aseipp | a day ago
Animats | a day ago
No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.
This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
lukan | a day ago
"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"
(But thanks for the good explanation)
dcrazy | a day ago
purkka | a day ago
dcrazy | a day ago
loeg | a day ago
bregma | 10 hours ago
krackers | 16 hours ago
fc417fc802 | 18 hours ago
not_a_bijection | a day ago
duped | a day ago
aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).
bluepuma77 | a day ago
Well, it seems we are back in an era with really expensive memory.
afiori | 22 hours ago
BobbyTables2 | 19 hours ago
“An era of really expensive memory”. That sounds familiar…
vanviegen | 14 hours ago
bregma | 10 hours ago
So on QNX, the spawned process does all the dynamic linking. The spawning process just sends an asynchronous message to the process manager and then gets on with things in a very deterministic manner as befitting a hard realtime system.
cryptonector | 16 hours ago
JdeBP | 14 hours ago
The tricky part is setting up the initial process. The way out for that is static linking and re-use of the fact that the operating system kernel loader has to understand and be able to load (at least a small subset of) program image file formats too.
cryptonector | 16 hours ago
> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.
Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().
derriz | 12 hours ago
up2isomorphism | 23 hours ago
cryptonector | 17 hours ago
burnt-resistor | a day ago
Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.
yxhuvud | a day ago
mike_hock | a day ago
I.e. a year that starts with 20, not 19.
JdeBP | a day ago
jcalvinowens | a day ago
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
themafia | 16 hours ago
This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.
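To put rough numbers on that, a back-of-the-envelope sketch. It assumes 8-byte entries as on x86-64, counts only last-level entries, and reuses the ~25GB figure quoted earlier in the thread; the function names are made up.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
PTE_BYTES = 8  # size of one page table entry on x86-64


def leaf_entries(resident_bytes: int, page_bytes: int) -> int:
    """Last-level page table entries needed to map `resident_bytes`."""
    return -(-resident_bytes // page_bytes)  # ceiling division


def leaf_table_bytes(resident_bytes: int, page_bytes: int) -> int:
    """Bytes of last-level page tables that fork() has to duplicate."""
    return leaf_entries(resident_bytes, page_bytes) * PTE_BYTES


if __name__ == "__main__":
    for label, page in (("4 KiB pages", 4 * KIB), ("2 MiB huge pages", 2 * MIB)):
        entries = leaf_entries(25 * GIB, page)
        table_kib = leaf_table_bytes(25 * GIB, page) // KIB
        print(f"{label}: {entries} entries, {table_kib} KiB of leaf tables")
```

With 4 KiB pages that is about 6.5 million entries (50 MiB of tables) to copy; with 2 MiB pages it drops to 12,800 entries.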
IshKebab | 14 hours ago
Panzerschrek | a day ago
ajkjk | a day ago
I am curious about what the best way to handle the example in the article of one process spawning many git subprocesses is. Surely it just doesn't make sense to repeatedly start git from scratch in the course of a long-running parent operation. What's the low cost abstraction for the same result, though?
wmf | a day ago
trumpdong | a day ago
spacechild1 | a day ago
kps | a day ago
Otherwise you need multiple steps to create a process, fill it with something to run, and arrange for it to execute. Or like Win32 you permanently smush them together with other layers, like filesystems and object loaders and linkers.
Too | 17 hours ago
The only thing I want to inherit from the parent process is its cwd and environment variables, even those are often overridden. The rest can easily be passed explicitly through other channels like pipes or command line arguments.
Back to the example from the article. It makes no sense that a git subprocess forked from a web server needs to have any process state inherited from the web server.
kps | 6 hours ago
Yes, exactly. Cloning, as a process creation primitive, is the one thing that doesn't need to be concerned with other stuff.
> … a git-subprocess forked from a web server …
That's pulling in a whole load of assumptions that are distinct from process creation. You can have processes in an environment that has no concept of file system or persistent storage at all.
ajkjk | 17 hours ago
fluffybucktsnek | 13 hours ago
IshKebab | 14 hours ago
Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.
kps | 6 hours ago
Clone-and-modify is pretty common in CAD.
> You don't create a new file by copying an existing one and then modifying it.
Clone-and-modify is almost universal in version control systems.
IshKebab | 6 hours ago
It's closer to copy-on-write. Also, it actually makes sense there because in 99.999% of cases a commit actually is a modified copy of its parent. That isn't true for process spawning.
ggm | a day ago
I do use threaded code. It's significantly harder to write and reason about. (45 years in to a CS career, ageing out)
You have to be clever to do better than clever people. Clever people bootstrapped me into fork()/exec() and I know my limits.
redleader55 | a day ago
skydhash | a day ago
Yes, we’re not the one paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting formula 1 car parts onto a camry.
tadfisher | a day ago
x86 still runs in real mode on boot despite dropping the PC BIOS.
Lots of software still assumes a 4kb page size, to the point where migrating Android to 16kb is an ongoing multi-year effort involving far too many people. And this is an OS for phones, which you might assume would lack the memory to benefit from a larger page size.
And one of the most popular consumer CPUs for enthusiasts, the Ryzen X3D chips, broke assumptions in both Linux and Windows schedulers that all cores have access to the same amount of L3 cache.
I would probably not assume the kinds of hardware limitations that we have now will persist into the useful lifetime of current software. Splitting the OS into "consumer" and "enterprise" variants is one of those moves that would bake in a ton of assumptions and make things messier in the future.
skydhash | a day ago
UEFI could have supported something like ELF and done away with real mode. Intel and AMD could have just introduced a new line of CPUs and everyone could have transitioned to that (with maybe shims to soften the change). But everyone is all about backwards compatibility and compile once, runs for eternity.
skydhash | a day ago
a-dub | a day ago
zbentley | 3 hours ago
That, and even those clone-without-pagetable-copy improvements leave a lot of slowness on the table. Being able to skip even disable-able functionality intended for fork would simplify code. Also, for programs that launch the same subprocess many times, a better API might allow caching away some of the pre-entrypoint initialization of exec.
asveikau | a day ago
If you contrast that with win32, where you optionally pack a bunch of initial values into a struct, win32 is a much more narrow, less pleasant, less freeform interface, where it is harder to introduce more features.
But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.
loeg | a day ago
What do you mean underestimated? You can do anything between fork and exec; there are no limitations.
dcrazy | a day ago
loeg | a day ago
asveikau | a day ago
loeg | a day ago
No, you absolutely did not: https://news.ycombinator.com/item?id=48427396
Literally nothing in that comment mentions or discusses threads.
> I did not say they are constrained in what syscalls they can make
You wrote: "The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things."
Those are all syscalls. You can also invoke any of the other ~hundreds of syscalls linux exposes, not only dup2, setpgid, and a "few" others.
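For reference, the dup2/setpgid case from the quoted comment looks roughly like this in a single-threaded program. `run_with_redirect` is a hypothetical helper, and the child command is only there to report whether it ended up leading its own process group.

```python
import os
import sys


def run_with_redirect(argv):
    """fork, adjust the child (new process group, stdout -> pipe), then exec.

    Returns everything the child wrote to stdout.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)          # child becomes leader of a new process group
            os.dup2(write_fd, 1)      # stdout now points at the pipe
            os.close(read_fd)
            os.close(write_fd)
            os.execvp(argv[0], argv)  # only returns on failure
        finally:
            os._exit(127)
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks)


if __name__ == "__main__":
    check = "import os; print(os.getpgrp() == os.getpid())"
    print(run_with_redirect([sys.executable, "-c", check]))
```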
asveikau | a day ago
loeg | a day ago
You're talking about libc (glibc) implementation details now; userspace programs running on the Linux kernel do not have to be implemented in C or use glibc's primitives. Your earlier comment I initially replied to was talking about kernel syscalls. Forked processes are free to invoke any syscall they want, not just dup2 or a handful of others.
asveikau | a day ago
The forked child has only 1 thread in its process. If the parent's threads are holding a lock or are in the middle of mutating a shared data structure, you're fucked, because those threads are no longer running in your child's copy of the address space and will not finish their work. This issue is fundamental to how threads work and what fork(2) does.
loeg | a day ago
asveikau | a day ago
Signal safety is not the same as this, but similar. I believe posix specifies what is signal-unsafe to be overly broad. But the unsafety isn't an illusion -- it's an emergent property from something being a bad idea given the primitives at work, there are broad categories of bugs that are easy to introduce due to the way it works. So for signals, posix declares a bunch of ill advised things to be undefined, and with good reason. This is an analogous scenario.
asveikau | 23 hours ago
This means if the program is multi threaded, you cannot rely on calling malloc in the child, because at the time of the fork another thread could have happened to be inside malloc doing manipulations on the global heap.
Which means, practically speaking, "don't allocate memory between fork and exec".
If you want to be overly literal as you have been, you can call mmap and it will give you new pages, but who is really doing that? Not the random shared library code you might want to call into. Hell, even a lot of libc calls malloc.
Which means it's not safe to do a random library call between fork and exec.
See where I'm going with this? That's if your program is multi threaded. If it isn't, these things are most likely fine.
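A small demonstration of the hazard, as a sketch: it uses an ordinary `threading.Lock` as a stand-in for malloc's internal lock (that substitution is my assumption, not how any libc is implemented). Another thread holds the lock at fork time, and the child, which has only one thread, finds it still held with nobody left to release it.

```python
import os
import threading
import warnings


def child_sees_lock_held() -> bool:
    """Fork while another thread holds a lock; report whether the
    child finds that lock still held."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            holding.set()   # tell the main thread the lock is taken
            release.wait()  # keep holding it until told otherwise

    thread = threading.Thread(target=holder)
    thread.start()
    holding.wait()

    with warnings.catch_warnings():
        # Newer Python versions warn about fork() in a multithreaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the holder thread does not exist here, so a blocking
        # acquire would hang forever. Probe without blocking instead.
        got_it = lock.acquire(blocking=False)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("child saw lock held:", child_sees_lock_held())
```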
peterfirefly | 21 hours ago
dcrazy | a day ago
asveikau | a day ago
codedokode | a day ago
My idea is that we could make a new syscall, for example "spawn", that creates a new empty process, loads some lightweight "loader" into it, and passes arbitrary configuration data. The loader configures the process and exec()'s the main program. This avoids forking the memory and keeps existing APIs, but still requires forking file descriptors and other things.
nyrikki | a day ago
(Sorry if you weren't joking) but yes, posix_spawn() has been a thing and in glibc fork is just an alias to clone()
Not exactly that OP idea, but fork/exec is legacy really.
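For anyone who hasn't used it, a minimal posix_spawn sketch via Python's `os.posix_spawn` wrapper. The helper name `spawn_and_capture` is made up; the file actions replace what you would otherwise do by hand between fork and exec.

```python
import os
import sys


def spawn_and_capture(argv):
    """Start argv with posix_spawn, redirecting its stdout into a pipe."""
    read_fd, write_fd = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),  # child's stdout -> pipe
        (os.POSIX_SPAWN_CLOSE, read_fd),     # child does not need the read end
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks)


if __name__ == "__main__":
    print(spawn_and_capture([sys.executable, "-c", "print('hello')"]))
```

No user code runs in the child before the new program starts, which is the main difference from the fork-then-exec version.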
trumpdong | a day ago
stevefan1999 | 23 hours ago
LoganDark | 22 hours ago
corbet | 21 minutes ago
foo-bar-baz529 | 18 hours ago
mpweiher | 13 hours ago
- address space
- memory objects
- threads
Mix and match. A Task (process) is not a primitive, but a composite object combining address space with one or more threads. How you fill the address space with actual memory objects is up to you. Map from disk or COW your own address space...have fun!
https://developer.apple.com/library/archive/documentation/Da...
tus666 | 10 hours ago
medoc | 9 hours ago
high_byte | 5 hours ago