The Virtual Void A blog by Johannes Rudolph

Experiments with io_uring

This is a quick writeup of my experiments with io_uring.

What is io_uring?

io_uring is a generic, consistent API that allows to (batch) submit Linux kernel IO operations in an asynchronous way ultimately requiring few or zero actual syscalls. It uses a pair of buffer rings for submission and completion data shared between user and kernel space (but that’s just an implementation detail).

To deconstruct the above sentence:

  • generic: io_uring is generic because it supports a growing list of IO-related syscalls
  • consistent: The submission and completion mechanism is the same for different kinds of APIs. Submission entries must by default fit into the 64-byte submission entry, completions into a 32-byte completion entry. For some operations the size of a single entry must be doubled which can be configured during creation of the ring.
  • batch submit: Any number of calls (up to the configured size of the submission ring) can be submitted in one go.
  • asynchronous: The API itself is asynchronous in that submission of an operation will not always lead to immediate completion. Instead, some operations may only be completed later. By default, submission are directly executed inline (in the same thread) during submission if possible. Otherwise, operations may only be completed later.
  • few or zero syscalls: Submissions and completions are serialized into shared memory and can (in theory) be accessed by the kernel and the user application concurrently. The default behavior requires a single io_uring_enter call to make the kernel aware of the latest submissions. An optional SQPOLL mode allows the kernel to run a dedicated thread to poll submission from the application in which ultimately no syscalls are required any more (though, then kernel and application need to (busy-) poll the queues).
  • buffer rings: Application and kernel communicate using shared memory and a submission/completion protocol. In theory, this can be faster than syscall based IO submission.

Why io_uring?

syscalls can become a bottleneck in applications that require lots of IO. Batch submission and the potential polling mode means you can trade off different latency and throughput requirements.

Until io_uring came along, Linux did not have a proper asynchronous interface for doing buffered 1 file IO. In general, io_uring can hide an internally blocking operation inside its own blocking thread pool just providing an asynchronous interface to an actually blocking operation (similar as e.g. Java NIO asynchronous file operations were implemented). Adding more (internal or not) thread pools somewhat defeat the purpose of asynchronous execution (which should be as light-weight as possible). So, later an asynchronous buffered read mode was proposed and added to the kernel (and io_uring) for file-systems that support it.

Registering buffers and file descriptors

Using user-space buffers and file descriptors in the kernel comes with some overhead. Therefore, it is possible to preregister buffers and descriptors using the io_uring_register syscall.

In many server applications that mutiplex handling of multiple connections with a single thread, the connection is first polled to be ready for reading, then a buffer is allocated, and finally the data is read. With io_uring it is possible to avoid the polling step completely and read data asynchronously directly. However, that means that buffers need to be provided upfront for every waiting read call (which in a web server would be all idle connections). To avoid a per-read (per-connection) read buffer for every outstanding read, with io_uring buffers can also be provided more dynamically using a special operation, IORING_OP_PROVIDE_BUFFERS.

You can then use IOSQE_BUFFER_SELECT for submissions to have io_uring select a buffer from the given set when needed.

A downside of this method of providing buffers is that the kernel may choose buffers arbitrarily. After use, the application has to provide them again. While the original IORING_OP_PROVIDE_BUFFERS call can use a contiguous amount of memory to provide multiple buffers at once, spent buffers will fragment that memory over time. The application will either have to

  • provide them back to io_uring in a one-by-one fashion after usage (spending one SQE per buffer after usage), or
  • keep track of which buffers have been redeemed, try to coalesce their memory regions and provide them later with fewer SQEs

In later kernels, it is possible to use IORING_REGISTER_PBUF_RING to create yet another ring just for buffer management, so that providing buffers can happen more efficiently without spending SQEs just for providing single small buffers.

Execution Modes

As shown in the Missing Manuals - io_uring worker pool, an IO operation can go through three different paths when submitted:

  • Be executed directly inline during the execution of io_uring_enter
  • Is not ready but can be polled for readiness. In that case, it is automatically polled. When the operation is ready, it is retried and completed.
  • If it only can be executed in a blocking fashion, it is enqueued to a run queue to the blocking kernel worker thread pool and executed and completed when done.

Inline execution

Simple networking applications (like an echo server or a simple HTTP server) will spent a significant portion of CPU time in the kernel doing TCP work. Whether you use regular syscalls or io_uring, by default much of the TCP processing time will be spent on a user thread (especially on the write side).2

The test application is a simple HTTP server which fakes request processing:

  • Has a simple (single-threaded) event processing loop which blocks on completion events from io_uring (setting the min_complete argument of the io_uring_enter syscall to 1).
  • Accepts any amount of incoming connections.
  • Tries to read an incoming HTTP request (but does not validate it at all) and sends out a response (it will never have to wait on network writes).

Here is a flame graph of simple HTTP server processing using io_uring (click for interactive version):

Flamegraph of inline execution of socket read/write

The blueish and green frames on the left side are the application. Since this application does not do any actual processing, it is only a very small share of total processing time.

The big red humps are frames executing in the kernel. All of those frames are on top of the io_uring_enter call, i.e. executing while doing the submission call. The biggest share of that call is the big hump in the middle which is the actual TCP sendmsg implementation. You will see the same frames on a blocking server as well.

The small part on the left seems to be the overhead of actual submitting events using io_uring.

The small hump on the right is part of the TCP recvmsg implementation.

Async submission

There’s an alternative mode that can be selected using the IOSQE_ASYNC flag per SQE where the task is not executed inline at all but queued for io_uring’s internal worker queue.

One problem when used together with IOSQE_BUFFER_SELECT shows up in the flamegraph (click through for interactive flame graph):

Flamegraph of buffer selection contention with async flag

io_uring will spawn a massive amount of worker threads and all these threads contend on acquiring a pre-provided buffer which will slow everything down.

I filed an issue about that and it was confirmed that using this usage pattern in particular but more generally IOSQE_ASYNC is not recommended. In any case, you should probably stay away from using IOSQE_BUFFER_SELECT when you expect the requests to end up on the internal worker pool.

Inline vs. Async

It may be surprising that using an asynchronous interface may not automatically lead to asynchronous execution or more parallelism. For application servers that are single-threaded, there is the option to force spreading the internal kernel load to the internal worker queue (if you have cores to spare) by using IOSQE_ASYNC. However, as seen, this comes with extra overhead and potential extra scalability issues.

For CPU-bound workloads the inline execution behavior seems useful as it will provide the fastest execution path. For low-latency applications the time spent in the kernel might be an unwanted interruption but for lowest-latency you will anyway need spare CPU resources to avoid any extra delays. In that case, distributing work over multiple worker threads each owning a separate ring (and maybe polling) should give the best solution and best control over how resources are spent.

Usage from the JVM

To support io_uring from the JVM, only a few native calls need to be implemented:

  • Access to the io_uring_setup, io_uring_enter, and io_uring_register syscalls (e.g. by supporting all syscalls by linking to libc’s syscall symbol)
  • Memory mapping the io_uring data structures into memory
  • A way to manipulate the native (non-JVM-heap) memory of the rings and IO buffers.

The JVM’s standard way of interfacing with the native world is using JNI. We only need to provide simple wrappers for three syscalls with simple interfaces so this is simple to do. Alternatively, using a simple JNA declaration like this also works:

import com.sun.jna.Library;
import com.sun.jna.Pointer;
import com.sun.jna.Structure;

public interface LibC extends Library {
  int SYSCALL_IO_URING_SETUP = 425;
  int SYSCALL_IO_URING_ENTER = 426;
  int SYSCALL_IO_URING_REGISTER = 427;
  int syscall(int number, Object... args);
  Pointer mmap(Pointer addr, long length, int prot, int flags, int fd, long offset);
}

Using JNA will introduce some overhead over JNI, but it seems relatively fine compared to all the other work.

After mapping a few of io_uring’s internal data structures, creating a uring can look like this:

val params = new IoUringParams // not in the fast-path, use JNA mapped data structured
val uringFd = libC.syscall(LibC.SYSCALL_IO_URING_SETUP, sizeOfRing, params)

val length = params.sq_off.array + params.sq_entries * 4

val sqRingPointer = libC.mmap(new Pointer(0), length, LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQ_RING)
val cqRingPointer = libC.mmap(new Pointer(0), params.cq_off.cqes + params.cq_entries * 16 /* sizeOf(IoUringCqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_CQ_RING)
val sqePointer = libC.mmap(new Pointer(0), params.sq_entries * 64 /* sizeOf(IoUringSqe) */ , LibC.PROT_READ | LibC.PROT_WRITE, LibC.MAP_SHARED | LibC.MAP_POPULATE, uringFd, LibC.IORING_OFF_SQES)

// FIXME: check all result values

// convert JNA pointers to DirectByteBuffer
val sqRingPointerByteBuffer = sqRingPointer.getByteBuffer(0, length)
val cqRingPointerByteBuffer = cqRingPointer.getByteBuffer(0, params.cq_off.cqes + params.cq_entries * 16)
val sqes = sqePointer.getByteBuffer(0, 64 * params.sq_entries)

From here on, all datastructures are manipulated using the DirectByteBuffers whose operations (hopefully) are intrinsified and therefore cheap. JNA’s getByteBuffer is expensive, though, so make sure to create buffers spanning the whole interesting region and reuse them later. For example, to access some parameters, you can use code like this:

def sqHead() = sqRingPointerByteBuffer.getInt(params.sq_off.head)
def sqTail() = sqRingPointerByteBuffer.getInt(params.sq_off.tail)
val sqMask = sqRingPointerByteBuffer.getInt(params.sq_off.ring_mask)

def cqHead() = cqRingPointerByteBuffer.getInt(params.cq_off.head)
def cqTail() = cqRingPointerByteBuffer.getInt(params.cq_off.tail)
val cqMask = cqRingPointerByteBuffer.getInt(params.cq_off.ring_mask)

Here is one way of representing an operation with a case class (which is not quite optimal because of the extra allocation):

case class ReadOp(
    flags:             Byte,
    fd:                Int,
    destinationOffset: Long,
    targetBufferAddr:  Long,
    targetBufferSize:  Int,
    userData:          Long,
    bufferGroup:       Short = 0 // FIXME: buffer selection should be abstracted
) extends Op(IORING_OP_READ) {
  override def prepareSQE(buffer: ByteBuffer): Unit = {
    buffer.put(opId)
    buffer.put(flags)
    buffer.putShort(0 /* ioprio */ )
    buffer.putInt(fd)
    buffer.putLong(destinationOffset)
    buffer.putLong(targetBufferAddr)
    buffer.putInt(targetBufferSize)
    buffer.putInt(0)
    buffer.putLong(userData)
    buffer.putShort(bufferGroup)
    buffer.putShort(0)
    buffer.putInt(0)
    buffer.putLong(0)
    buffer.putLong(0)
  }
}

An operation can be submitted like this:

def submitGlobal(op: Op): Unit = {
  // FIXME: check against overflow
  val curTail = sqTail()
  val index = curTail & sqMask
  sqes.clear()
  sqes.position(64 * index)
  op.prepareSQE(sqes)
  sqRingPointerByteBuffer.putInt(params.sq_off.array + 4 * index, index)
  // FIXME: write barrier needed?
  sqRingPointerByteBuffer.putInt(params.sq_off.tail, curTail + 1)
  // FIXME: write barrier needed?
}

That is, we first figure out the tail of the ring to write to, then prepare an operation using the above method, put the index into the SQE array, and finally increase the tail index. (I’m not sure if the barriers are needed in a simple application, but if multiple threads access the ring, you definitely should add the barriers, e.g. using Unsafe).

Finally, submitting all queued SQEs to the kernel is done by invoking io_uring_enter (the only necessary syscall in the hot path):

val toSubmit = sqTail() - sqHead()
libC.syscall(LibC.SYSCALL_IO_URING_ENTER, uringFd, toSubmit, 1, IORING_ENTER_GETEVENTS, 0)

After the call returns, you can check for completions:

var cqIdx = cqHead()
val last = cqTail()

while (cqIdx < last) {
  val idx = cqIdx & cqMask
  // accessing cqe data via JNA is more expensive but also possible:
  // val cqe = new IoUringCqe(cqRingPointer.share(params.cq_off.cqes + idx * 16))
  // debug(cqe.toString)
  val off = params.cq_off.cqes + idx * 16
  val userData = cqRingPointerByteBuffer.getLong(off)
  val result = cqRingPointerByteBuffer.getInt(off + 8)
  val flags = cqRingPointerByteBuffer.getInt(off + 12)
  handleCompletion(userData, result, flags)
  cqIdx += 1
}
// give back all handled entries (FIXME: memory barriers)
cqRingPointerByteBuffer.putInt(params.cq_off.head, last)

Here we loop through all new CQEs, extract their data (again using the previously created DirectByteBuffers) and defer handling to a completion method.

A final problem is creating or accessing file descriptors from the JVM. It is possible to extract file descriptors from Java RandomAccessFile and Socket using Java reflection from their internals.

In newer kernel versions, more syscalls are available and it might be possible to open files and sockets using io_uring.

The full source code can be found at https://github.com/jrudolph/scala-io-uring. It contains more of infrastructure to provide a more idiomatic API, however, everything should be considered experimental (checking return values, anyone?).

Much more

There’s, of course, a lot more to io_uring not covered here:

  • More flags like IOSQE_IO_LINK or IOSQE_IO_DRAIN
  • Full set of supported ops
  • Cancelling requests

References

  1. Buffered means going through the page cache. I.e. when calling read the file is mapped into memory and then copied into user buffers. 

  2. If you ever wondered where the 5 times difference between the “Json” and the “Plaintext” tests in the Techempower Benchmark comes from, that’s the reason. The json example uses regular HTTP requests where only a single request can be ongoing per TCP connection. In that case you always require at least one TCP/IP packet per request and response. The plaintext benchmark uses HTTP pipelining, which ultimately allows to transmit multiple HTTP requests and responses in a single TCP/IP packet which decreases the in-kernel CPU overhead of TCP massively.