Why does C have the best file API?

14 points by robalex 6 hours ago on lobsters | 14 comments

thasso | 5 hours ago

mmap(2), the API in question, is a system call; it’s not specific to C in any way. I’m sure the memory model of some other languages makes it more difficult to use mmap(2) in this way, but, still, there exist plenty of languages where you can write practically the same code. I believe it’s somewhat misguided to make this a C thing.

calvin | 5 hours ago

There's a lot of (misguided and inaccurate) conflation of the Unix APIs (i.e. open, read, mmap) with the C standard library APIs (fopen, fread, but no mmap equivalent).
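To make the distinction concrete, here is a minimal sketch that reads the same file both ways: once through the C standard library (fopen/fread) and once through the POSIX calls (open/fstat/mmap). The function names sum_with_stdio and sum_with_mmap are made up for illustration, and the code assumes a POSIX system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* C standard library route: buffered stream I/O via fopen/fread. */
static long sum_with_stdio(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    unsigned char buf[4096];
    long sum = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) sum += buf[i];
    fclose(f);
    return sum;
}

/* POSIX route: open + fstat + mmap, then treat the file as a byte array. */
static long sum_with_mmap(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  /* mmap rejects length 0 */
    unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED) return -1;
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++) sum += p[i];
    munmap(p, (size_t)st.st_size);
    return sum;
}
```

Only the first function is "C" in the ISO sense; the second compiles as C but relies entirely on POSIX headers, which is why the same calls are reachable from other languages.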

silentbicycle | 5 hours ago

Indeed. I just used mmap extensively in Rust code last week.

krtab | 4 hours ago

Rust is one of the languages where getting mmap right is the hardest imo, as any change to the underlying file will end up being UB in almost all cases.

valpackett | an hour ago

As if it's not UB everywhere else…?

It is pretty hard to decipher what you mean under the curtness of your reply. That being said, yes, Rust references have many more invariants than C pointers. Writing to a file you have already opened in C is UB only if it causes a data race, i.e. when concurrently reading the file. In Rust, merely having a &[u8] to the mmap is UB if it is written to concurrently,

volatile

silentbicycle | 43 minutes ago

Luckily, that shouldn't be an issue in my use case, because by design the mmap refers to constant data on disk. Updates switch to a new mmap handle, with a distinct file path. I'm paranoid about that kind of thing because several years back I hit a bug (in C): I discovered that since dlopen only identifies files by path, if you replace the file (or symlink, IIRC) you can end up with some of the old file getting paged out and eventually replaced with the new file's contents at the same offset. With that dynamically linked code it led to jumping into the middle of functions with stale register contents and extremely strange errors. The dlopen behavior is noted in the man page, but I probably figured it'd resolve the file path and then save some kind of inode / file handle internally.

invlpg | 3 hours ago

It's not even a particularly good API, as there's no sensible way to handle read/write failures other than a signal handler.

kornel | 5 hours ago

or the worst API: https://db.cs.cmu.edu/mmap-cidr2022/

Your I/O error handling is now SIGBUS. You lose control over when I/O happens, and your performance and latency is at the mercy of complex memory swap machinery and TLB management.

Yeah, mmap needs to be used with care.

The counterargument is that it is possible to mitigate the downsides of mmap without the complexity described by Crotty/Leis/Pavlo. In short, use mmap for the read path and page cache, and use write(), fsync(), etc. for the write path.
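A minimal sketch of that split, assuming a POSIX system. The names append_durably and map_for_reading are hypothetical, and a real system would also need to decide when to remap after the file grows.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write path: explicit write() + fsync(), so I/O errors come back
 * as ordinary return codes instead of signals. */
static int append_durably(const char *path, const void *data, size_t len) {
    assert(path != NULL && data != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted: retry */
            close(fd);
            return -1;
        }
        p += n;                             /* handle short writes */
        len -= (size_t)n;
    }
    int rc = fsync(fd);                     /* flush to stable storage */
    if (close(fd) < 0) rc = -1;
    return rc;
}

/* Read path: a read-only mapping backed by the page cache.
 * Returns NULL on error or for an empty file. */
static const void *map_for_reading(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

The caller releases the view with munmap once it is done. Reads through the mapping can still fault, but write errors now surface where they can be handled.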

ThinkChaos | 3 hours ago

The true MVP for file APIs is preadv and pwritev: vectored access to files with a base position and an array of sources/targets.
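A short sketch of what that buys you, assuming a system that provides preadv/pwritev (Linux and the BSDs do; they are not in POSIX). The struct record layout and the two helper names are invented for the example.

```c
#define _DEFAULT_SOURCE   /* exposes preadv/pwritev on glibc; harmless elsewhere */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical record: two separate buffers stored back to back on disk. */
struct record {
    char header[8];
    char payload[16];
};

/* Gather both buffers into one write at an explicit offset.
 * Returns the byte count written, or -1 on error. */
static ssize_t write_record_at(int fd, off_t pos, const struct record *r) {
    assert(r != NULL);
    struct iovec iov[2] = {
        { .iov_base = (void *)r->header,  .iov_len = sizeof r->header  },
        { .iov_base = (void *)r->payload, .iov_len = sizeof r->payload },
    };
    return pwritev(fd, iov, 2, pos);
}

/* Scatter one read at an explicit offset into both buffers.
 * Returns the byte count read, or -1 on error. */
static ssize_t read_record_at(int fd, off_t pos, struct record *r) {
    assert(r != NULL);
    struct iovec iov[2] = {
        { .iov_base = r->header,  .iov_len = sizeof r->header  },
        { .iov_base = r->payload, .iov_len = sizeof r->payload },
    };
    return preadv(fd, iov, 2, pos);
}
```

Neither call touches the file descriptor's current offset, so several threads can share one descriptor without seeking. Callers still need to check for short reads and writes by comparing the return value with the total length.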

cblake | 2 hours ago

It's not like it's "built in" to the stdlib (but then C's stdlib is notoriously slim), but you can easily roll together a FileArray abstraction in Nim. As others have pointed out several times in several ways in this very thread, the safest route is to write via regular writes and then read via the mmap. So, it depends on how "read heavy" vs. "write heavy" your workload is. Presumably there are Apache Arrow wrappers in many PLangs (though Arrow, it seems, does not let you do a whole struct, only the parallel scalars route?). So, maybe the author is just under-exposed?