io_uring, libaio performance across Linux kernels and an unexpected IOMMU trap

65 points by tanelpoder a day ago on hackernews | 16 comments

Evgeniy Ivanov

In Linux, there are two interfaces for efficient asynchronous disk I/O: traditional AIO (libaio) and the newer io_uring (liburing). It is well known that io_uring outperforms AIO. A number of papers demonstrate this, and kernel developers often mention that io_uring performance improved significantly after the initial releases. However, it is surprisingly hard to find concrete numbers showing how this performance evolved across kernel versions. Most benchmarks compare APIs on a single kernel, leaving the historical picture unclear.

At YDB we continuously look for ways to improve database performance. Our production servers typically run Linux 5.4, 5.15, or 6.6, and we are moving towards 6.12, which made us curious: how much does the kernel version itself affect async I/O performance? So we decided to measure it ourselves using fio. Here are the results for random 4K writes:

From the figure above, the key findings are:

  • io_uring can be 2x faster than libaio
  • the most performant io_uring configuration is 1.4x faster on newer kernels compared to older kernels
  • there is an unexpected performance degradation for libaio and non-SQPOLL io_uring between kernels 5.4 and 5.15

io_uring not only beats libaio, but its performance also improves noticeably on newer kernels. However, there are some subtle pitfalls. Along the way, we investigated what looked like a kernel regression and found that the real cause was Intel IOMMU being enabled by default between releases. Before discovering this, we were alarmed by an approximately 30% drop in IOPS in both libaio and io_uring on newer kernels.

This is the short version of the story. The details are below.

Setup

We conducted our experiments on a bare metal machine with the following configuration:

  • Two Intel Xeon Gold 6338 processors (32 cores each, hyper-threading enabled, 128 logical cores total)
  • 512 GiB RAM
  • NVMe Intel P4610 (SSDPE2KE032T8O) 3.2TB disk (4K LBA format)
  • Ubuntu 20.04.3 LTS with Linux kernel version 5.4.161c
  • Ubuntu 22.04.5 LTS with Linux kernel 5.15.0-164, 6.6.80-060680, 6.18.18-061818, and 7.0-rc3
  • intel_iommu=off kernel boot parameter (as used in the final setup presented in Results section)
  • NVMe driver configured with options nvme poll_queues=16 (again, in a final setup)
  • fio-3.41

Linux kernel 5.4.161c is our custom in-house kernel. It includes distribution patches, but we do not expect any of them to affect asynchronous I/O performance. Kernel 5.15.0-164 is from Ubuntu. The rest are Ubuntu mainline kernels.

The NVMe device is attached to NUMA node 0. All benchmarks were executed in a cgroup pinned to CPU cores 0–16 on NUMA node 0, with memory allocation bound to the same NUMA node. CPU governor was set to performance.
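The CPU side of this pinning can be sketched in Python via Linux's `sched_setaffinity`; this is a simplified stand-in for the cgroup setup we actually used, and the function name is ours, not from the benchmark script.

```python
import os

def pin_to_cores(cores):
    """Pin the current process (and any children started afterwards)
    to the given CPU cores via Linux's sched_setaffinity.

    The actual benchmark used a cgroup pinned to cores 0-16 with memory
    bound to NUMA node 0; memory binding needs numactl or cpuset.mems,
    which has no Python-stdlib equivalent, so only CPU affinity is
    covered here.
    """
    os.sched_setaffinity(0, set(cores))
    return os.sched_getaffinity(0)

# e.g. pin_to_cores(range(17)) before launching fio on cores 0-16
```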

To compare libaio and io_uring, we used a script based on the fio benchmarking tool. The script runs a series of fio commands while varying iodepth and switching between different I/O engines. We measure raw block device performance.

An example of the base fio command used in the experiments is shown below:

sudo fio --name=write_latency_test --filename=<DEVICE> --filesize=2500G  \
--time_based --ramp_time=10s --runtime=1m \
--rw=randwrite \
--clocksource=cpu \
--direct=1 --verify=0 --randrepeat=0 --randseed=17 \
--iodepth=16 --iodepth_batch_submit=1 --iodepth_batch_complete_max=1 \
--bs=4K \
--lat_percentiles=1 --percentile_list=10:50:90:95:99:99.9 \
--output-format=json --output=<DIR/FNAME.json> \
--ioengine=<ENGINE> <ENGINE ARGS>

The following engine and engine args are used:

  • --ioengine=libaio
  • --ioengine=io_uring
  • --ioengine=io_uring --hipri=1 (IOPOLL)
  • --ioengine=io_uring --sqthread_poll (SQPOLL)
  • --ioengine=io_uring --sqthread_poll --hipri=1 (IOPOLL + SQPOLL)

Additional io_uring optimizations (e.g., registered buffers) are intentionally out of scope; we compare only equivalent configurations across kernels.
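The sweep over engines and queue depths can be sketched as follows. The mode names, iodepth values, and helper are illustrative, not the actual script; fixed fio flags from the base command are elided.

```python
import itertools

# One entry per tested configuration, mirroring the list above.
ENGINES = {
    "libaio":       ["--ioengine=libaio"],
    "uring":        ["--ioengine=io_uring"],
    "uring-iopoll": ["--ioengine=io_uring", "--hipri=1"],
    "uring-sqpoll": ["--ioengine=io_uring", "--sqthread_poll"],
    "uring-both":   ["--ioengine=io_uring", "--sqthread_poll", "--hipri=1"],
}
IODEPTHS = [1, 2, 4, 8, 16, 32, 64, 128]  # assumed sweep, for illustration

def fio_cmd(device, engine, iodepth, out):
    # Mirrors the base fio command shown above (abridged).
    return ["fio", f"--filename={device}", "--rw=randwrite", "--bs=4K",
            "--direct=1", "--time_based", "--ramp_time=10s", "--runtime=1m",
            f"--iodepth={iodepth}", "--output-format=json",
            f"--output={out}", *ENGINES[engine]]

matrix = list(itertools.product(ENGINES, IODEPTHS))
```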

The goal of the benchmark is to compare I/O mechanisms, so before each run we refresh the device state. This is done using blkdiscard, which resets the NVMe device and brings it back to its best possible performance state.

The workload uses random 4K writes, which stress the I/O submission path in the kernel and are commonly used to evaluate asynchronous I/O performance.

For each engine + engine parameters + iodepth combination:

  • the test is executed 10 times
  • runs are randomized (with a fixed seed) to reduce ordering effects
  • we report the run with median IOPS

To avoid thermal effects and device-side throttling:

  • each run is followed by a 10-second cooldown
  • every hour the script performs a 5-minute cooldown

Each run starts with a 10-second ramp-up followed by a 1-minute measurement window. This duration is typically sufficient to compare I/O engines while avoiding NVMe performance drops caused by internal garbage collection or background maintenance.
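Putting the protocol together, the per-combination loop looks roughly like this. `run_fio` is a placeholder for launching fio and returning the measured IOPS, and the hourly 5-minute cooldown is omitted for brevity; with an even number of runs we take the lower of the two middle values so that a real run is reported.

```python
import random
import statistics
import time

def run_protocol(combos, run_fio, repeats=10, seed=17, cooldown_s=10):
    """Run each (engine, iodepth) combo `repeats` times in randomized
    order and keep the run with median IOPS per combination."""
    rng = random.Random(seed)                 # fixed seed: reproducible order
    schedule = [c for c in combos for _ in range(repeats)]
    rng.shuffle(schedule)                     # reduce ordering effects
    results = {}
    for combo in schedule:
        results.setdefault(combo, []).append(run_fio(combo))
        time.sleep(cooldown_s)                # per-run cooldown
    # report the median-IOPS run for each combination
    return {c: statistics.median_low(runs) for c, runs in results.items()}
```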

For each run we collect both IOPS and latency statistics (including percentiles). Maximum IOPS is only part of the picture: it is also important to examine latency at low I/O depths.
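Extracting those numbers from fio's JSON output can be sketched as below. This assumes the `--output-format=json` layout with `--lat_percentiles=1` (write stats under `jobs[i]["write"]`, total-latency percentiles under `lat_ns`); exact key names may vary slightly across fio versions.

```python
import json

def parse_fio_json(text):
    """Pull write IOPS and latency percentiles (in microseconds) out of
    a fio JSON report produced with --lat_percentiles=1."""
    job = json.loads(text)["jobs"][0]
    wr = job["write"]
    # fio reports latencies in nanoseconds; convert to microseconds
    pct = {k: v / 1000 for k, v in wr["lat_ns"]["percentile"].items()}
    return wr["iops"], pct
```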

All benchmarks were executed on the same hardware, with the same fio version and configuration, unless stated otherwise, changing only the kernel version and I/O engine.

Results

First, let’s look again at the maximum IOPS achieved across different kernels and I/O mechanisms.

Several observations stand out:

  1. io_uring improves significantly on newer kernels. Now we can also quantify the difference: the fastest io_uring configuration is about 1.4x faster on newer kernels compared to older kernels.
  2. Even the non-polled, non-batched io_uring mode outperforms libaio on all tested kernels.
  3. There is a noticeable regression between kernels 5.4 and 5.15 for both libaio and non-polled io_uring. This suggests that improvements in polling may partially mask this regression.

To complete the picture, let’s focus on the io_uring vs. libaio comparison on Linux 6.18.18, the latest stable kernel we use. First, here is the maximum IOPS figure with only the 6.18.18 results shown:

Next, let’s examine how IOPS scale with increasing iodepth. The figures below show the minimum and maximum values as whiskers, with the median shown as a point.

Below is a zoomed view for the iodepth range 1–8:

io_uring consistently outperforms libaio across the entire iodepth range.

Finally, let’s examine latency behavior. The figure below shows latency percentiles as iodepth increases:

Below is a zoomed view for the iodepth range 1–16:

And here are latency distributions for queue depths 1, 4, and 16:

Again, io_uring consistently delivers lower latency than libaio across the entire range.

A Database Developer’s Unexpected Journey into the Linux Kernel (and IOMMU)

As we mentioned earlier, our investigation was not as straightforward as it might seem. Let us share a bit of that story with a happy ending.

We started with Linux kernels 5.4, 5.15, and 6.6. For completeness, we also tested 7.0-rc3. That’s where we first noticed a 30% IOPS drop in both libaio and io_uring. At first, this did not seem too surprising: after all, this was an rc kernel. There was also a known regression in 7.0-rc2 that could have carried over into rc3. To verify this, we moved back to 6.18.18 and saw the same significant performance drop.

At that point, our results looked like this:

We observed a minor degradation between 5.4 and 5.15 (libaio and non-polled io_uring only) and a more severe drop somewhere between 6.6.15 and 6.6.20.

The differences between mainline 6.6.15 and 6.6.20 are minimal: small configuration changes and a slightly different compiler version. Unfortunately, we initially underestimated the impact of configuration changes and instead tried rebuilding 6.6.20 with an older GCC.

As a final, somewhat desperate step, we disabled intel_iommu, and that turned out to be the root cause of the degradation. Further investigation showed that CONFIG_INTEL_IOMMU_DEFAULT_ON had been enabled in Ubuntu starting from certain 5.15.x kernels. In mainline Ubuntu kernels, it appears to have been toggled: disabled at some point and then re-enabled between 6.6.15 and 6.6.20.
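Checking which way this option is set for a given kernel is straightforward, since Ubuntu ships the build config in /boot; a minimal sketch (the function name is ours):

```python
def iommu_default_on(config_text):
    """Return True if this kernel was built with Intel IOMMU enabled by
    default, i.e. CONFIG_INTEL_IOMMU_DEFAULT_ON=y in its config."""
    return any(line.strip() == "CONFIG_INTEL_IOMMU_DEFAULT_ON=y"
               for line in config_text.splitlines())

# On Ubuntu the config ships next to the kernel image, e.g.:
#   iommu_default_on(open(f"/boot/config-{os.uname().release}").read())
```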

An IOMMU (Input–Output Memory Management Unit) translates device memory accesses and enforces isolation, improving security but sometimes introducing additional overhead in I/O-intensive workloads. It is needed to protect the system by preventing devices from accessing arbitrary memory, which is especially important for virtualization and untrusted peripherals.

In our experience, most database management systems run in trusted environments where IOMMU is not strictly required. If PCI passthrough or device isolation is needed and disabling IOMMU is not an option, the iommu=pt mode can be used. In this configuration, address translation is effectively bypassed for most devices, reducing overhead while still keeping IOMMU enabled for isolation where necessary.
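A rough way to classify a machine's effective setting from its boot parameters is sketched below. This is deliberately simplified: the real behavior also depends on CONFIG_INTEL_IOMMU_DEFAULT_ON and firmware, and this only inspects the parameters discussed above.

```python
def iommu_mode(cmdline):
    """Classify the IOMMU setting implied by a kernel command line
    (e.g. the contents of /proc/cmdline). Simplified sketch."""
    params = cmdline.split()
    if "intel_iommu=off" in params:
        return "off"
    if "iommu=pt" in params:
        return "passthrough"   # translation bypassed for most devices
    if "intel_iommu=on" in params:
        return "on"
    return "kernel default"    # falls back to the build-time config

# On a live system: iommu_mode(open("/proc/cmdline").read())
```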

We did not find any recent, systematic evaluations of IOMMU overhead in modern setups. However, there are a couple of notable observations. In one report, the authors describe IOMMU becoming “a kernel-level bottleneck” in a Ceph cluster, with disabling it resulting in “a substantial performance boost”.

Another example is the well-known VLDB paper “What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines”. To fully utilize multiple NVMe devices, the authors disabled IOMMU as part of their setup.

One upside of running into this issue ourselves is that we were able to measure its impact directly. Below are results comparing IOMMU enabled vs. disabled on Linux 6.6.80:

From a security perspective, enabling IOMMU by default is a reasonable decision. But for database developers who usually run their DBMS in trusted environments, it can look very much like a regression. Blaming the OS? We’re in good company :)

Jokes aside, kudos to the kernel developers — the progress in asynchronous I/O, especially io_uring, is impressive.

IOPOLL caveats

There is one more interesting caveat we would like to highlight. Starting with Linux 6.8, fio configured to use io_uring with IOPOLL (--hipri) began failing with the following error:

fio: io_u error on file /dev/nvme0n1p2: Operation not supported: write offset=62033248256, buflen=4096
fio command failed for mode=uring-iopoll iodepth=8 rw=write run_index=4

It turned out that a specific NVMe driver configuration is required to use IOPOLL. The steps are:

# make the NVMe driver allocate 16 polled queues at module load
echo 'options nvme poll_queues=16' | sudo tee /etc/modprobe.d/nvme-poll.conf
sudo update-initramfs -u -k all
# alternatively, rebuild the initramfs for a single kernel only:
#sudo update-initramfs -u -k 6.6.80-060680-generic
#sudo update-initramfs -u -k `uname -r`
# enable polling on the block queue
echo 1 | sudo tee /sys/block/nvme0n1/queue/io_poll
sudo shutdown -r now
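After the reboot, the configuration can be sanity-checked before running IOPOLL benchmarks; a small sketch (the helper is ours, reading the sysfs paths from the steps above):

```python
def poll_ready(poll_queues_text, io_poll_text):
    """Check that polled queues are allocated and io_poll is enabled.

    The arguments are the contents of
    /sys/module/nvme/parameters/poll_queues and
    /sys/block/<dev>/queue/io_poll respectively.
    """
    return int(poll_queues_text) > 0 and int(io_poll_text) == 1

# e.g. poll_ready(open("/sys/module/nvme/parameters/poll_queues").read(),
#                 open("/sys/block/nvme0n1/queue/io_poll").read())
```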

Before discovering this, we had already run a number of IOPOLL experiments on earlier kernels. In practice, we observed only a small difference between “regular” io_uring and IOPOLL, and only modest gains when combining SQPOLL with IOPOLL.

As explained here by Jens Axboe: “If you don’t have any poll queues, preadv2 with IOCB_HIPRI will be IRQ based, not polled. io_uring just tells you this up front with -EOPNOTSUPP”. This is something to keep in mind when working with IOPOLL on older kernels or default configurations. Because of that, we had to rerun the benchmarks on all older kernels.

Conclusion

In this post, we compared the performance of libaio and io_uring across several Linux kernel versions using random 4K write workloads. The experiments lead to three main conclusions.

First, io_uring performance improves significantly on newer kernels. The fastest io_uring configuration on Linux 6.6 is about 1.4x faster than on older kernels. For systems running older kernels, upgrading the kernel alone may yield substantial storage performance improvements, even without application changes.

Second, io_uring consistently outperforms libaio. Even the non-polled, non-batched io_uring delivers better performance than libaio across all tested kernels, both in terms of IOPS and latency. In the best configurations, io_uring achieves up to 2x higher IOPS than libaio. In our experiments, io_uring already pushes the tested NVMe device close to its limits. On faster storage hardware, its advantage over libaio could become even more pronounced.

Third, we observed a performance regression between kernels 5.4 and 5.15 for both libaio and non-polled io_uring. Since the effect appears across multiple I/O interfaces, it likely originates in the block layer or NVMe driver rather than in the I/O APIs themselves. Improvements in polled io_uring modes may partially mask this regression for these modes.

At the same time, configuration matters. In particular, IOMMU can significantly affect I/O performance and should be explicitly considered when comparing results across kernel versions. In our case, Intel IOMMU was turned on by default between kernel releases, leading to a severe performance regression: up to a 30% drop in IOPS.

Finally, in addition to fio, we validated these findings using one of the YDB components and observed very similar results, suggesting that the conclusions are applicable to real-world workloads. These insights are also relevant to other database systems that rely on modern Linux I/O stacks, including systems such as PostgreSQL, where io_uring support is actively evolving.