io_uring_setup(2)

io_uring_setup(2) Linux Programmer’s Manual io_uring_setup(2)

NAME

io_uring_setup - setup a context for performing asynchronous I/O

SYNOPSIS

#include <liburing.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

The io_uring_setup(2) system call sets up a submission queue (SQ) and

completion queue (CQ) with at least entries entries, and returns a file

descriptor which can be used to perform subsequent operations on the

io_uring instance. The submission and completion queues are shared be‐

tween userspace and the kernel, which eliminates the need to copy data

when initiating and completing I/O.

p points to a struct io_uring_params, which the application uses to pass options to the kernel and which the kernel uses to convey information about the ring buffers.

struct io_uring_params {

__u32 sq_entries;

__u32 cq_entries;

__u32 flags;

__u32 sq_thread_cpu;

__u32 sq_thread_idle;

__u32 features;

__u32 wq_fd;

__u32 resv[3];

struct io_sqring_offsets sq_off;

struct io_cqring_offsets cq_off;

};
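
As a minimal sketch of how the structure is typically prepared, the helper below zeroes p (so that flags and resv are 0) and invokes the system call through syscall(2). The name ring_setup is hypothetical, and converting a failure into a negative errno value is an assumption made here to match the convention described under RETURN VALUE.

```c
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical helper: returns a ring fd, or a negative errno value. */
static int ring_setup(unsigned entries, struct io_uring_params *p)
{
    /* Zero everything so that flags, resv[] and the other inputs are 0. */
    memset(p, 0, sizeof(*p));

    long ret = syscall(__NR_io_uring_setup, entries, p);
    if (ret < 0)
        return -errno;  /* raw syscall(2) reports failures via errno */
    return (int)ret;
}
```

On success, p->sq_entries and p->cq_entries report the sizes the kernel actually allocated, and the returned descriptor should eventually be passed to close(2).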

The flags, sq_thread_cpu, and sq_thread_idle fields are used to config‐

ure the io_uring instance. flags is a bit mask of 0 or more of the fol‐

lowing values ORed together:

IORING_SETUP_IOPOLL

Perform busy‐waiting for an I/O completion, as opposed to getting

notifications via an asynchronous IRQ (Interrupt Request). The

file system (if any) and block device must support polling in or‐

der for this to work. Busy‐waiting provides lower latency, but

may consume more CPU resources than interrupt driven I/O. Cur‐

rently, this feature is usable only on a file descriptor opened

using the O_DIRECT flag. When a read or write is submitted to a

polled context, the application must poll for completions on the

CQ ring by calling io_uring_enter(2). It is illegal to mix and

match polled and non‐polled I/O on an io_uring instance.

This is only applicable for storage devices for now, and the

storage device must be configured for polling. How to do that de‐

pends on the device type in question. For NVMe devices, the nvme

driver must be loaded with the poll_queues parameter set to the

desired number of polling queues. The polling queues will be

shared appropriately between the CPUs in the system, if the num‐

ber is less than the number of online CPU threads.

IORING_SETUP_SQPOLL

When this flag is specified, a kernel thread is created to per‐

form submission queue polling. An io_uring instance configured

in this way enables an application to issue I/O without ever con‐

text switching into the kernel. By using the submission queue to

fill in new submission queue entries and watching for completions

on the completion queue, the application can submit and reap I/Os

without doing a single system call.

If the kernel thread is idle for more than sq_thread_idle mil‐

liseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the flags

field of the struct io_sq_ring. When this happens, the applica‐

tion must call io_uring_enter(2) to wake the kernel thread. If

I/O is kept busy, the kernel thread will never sleep. An appli‐

cation making use of this feature will need to guard the io_ur‐

ing_enter(2) call with the following code sequence:

/*
 * Ensure that the wakeup flag is read after the tail pointer
 * has been written. It’s important to use memory load acquire
 * semantics for the flags read, as otherwise the application
 * and the kernel might not agree on the consistency of the
 * wakeup flag.
 */
unsigned flags = atomic_load_relaxed(sq_ring->flags);
if (flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

where sq_ring is a submission queue ring setup using the struct

io_sqring_offsets described below.

Note that when using a ring set up with IORING_SETUP_SQPOLL, you normally never call the io_uring_enter(2) system call directly. That is usually taken care of by liburing’s io_uring_submit(3) function, which automatically determines whether polling mode is in use and calls io_uring_enter(2) only when your program needs it.

Before version 5.11 of the Linux kernel, to successfully use this feature, the application must register a set of files to be used for I/O through io_uring_register(2) using the IORING_REGISTER_FILES opcode. Failure to do so will result in submitted I/O being errored with EBADF. Whether this registration is still required can be detected through the IORING_FEAT_SQPOLL_NONFIXED feature flag. In version 5.11 and later, it is no longer necessary to register files to use this feature. 5.11 also allows using this as non-root, if the user has the CAP_SYS_NICE capability. In 5.13 this requirement was also relaxed, and no special privileges are needed for SQPOLL in newer kernels. Certain stable kernels older than 5.13 may also support unprivileged SQPOLL.
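
The wakeup check described for IORING_SETUP_SQPOLL can be sketched with C11 atomics. This is one conservative ordering, not necessarily what any particular library does: the tail is published with release semantics, a full fence follows, and only then is the flags word read. The helper name, the local SQ_NEED_WAKEUP macro, and treating the mapped ring words as _Atomic unsigned are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Mirrors the value of IORING_SQ_NEED_WAKEUP in <linux/io_uring.h>. */
#define SQ_NEED_WAKEUP (1U << 0)

/*
 * Publish a new SQ tail, then report whether the poll thread has gone
 * idle and must be woken with io_uring_enter(2).
 */
static bool publish_tail_needs_wakeup(_Atomic unsigned *sq_tail,
                                      _Atomic unsigned *sq_flags,
                                      unsigned new_tail)
{
    /* Make the filled-in SQEs visible before the new tail value. */
    atomic_store_explicit(sq_tail, new_tail, memory_order_release);

    /* Full fence: the tail store must not move after the flags load. */
    atomic_thread_fence(memory_order_seq_cst);

    return (atomic_load_explicit(sq_flags, memory_order_relaxed)
            & SQ_NEED_WAKEUP) != 0;
}
```

When the helper returns true, the caller would issue io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP; otherwise no system call is needed.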

IORING_SETUP_SQ_AFF

If this flag is specified, then the poll thread will be bound to the CPU set in the sq_thread_cpu field of the struct io_uring_params. This flag is only meaningful when IORING_SETUP_SQPOLL is specified. When the cgroup setting cpuset.cpus changes (typically in a container environment), the CPU to which the thread is bound may change as well.
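
A short sketch of how the three configuration fields fit together for a pinned poll thread follows. The helper name init_sqpoll_params is hypothetical; the fields and flags are the ones described above.

```c
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: request a poll thread pinned to `cpu`. */
static void init_sqpoll_params(struct io_uring_params *p,
                               unsigned cpu, unsigned idle_ms)
{
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p->sq_thread_cpu = cpu;       /* only honoured with IORING_SETUP_SQ_AFF */
    p->sq_thread_idle = idle_ms;  /* milliseconds before the thread sleeps */
}
```

The resulting structure would then be passed to io_uring_setup(2) together with the desired number of entries.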

IORING_SETUP_CQSIZE

Create the completion queue with struct io_uring_params.cq_en‐

tries entries. The value must be greater than entries, and may

be rounded up to the next power‐of‐two.

IORING_SETUP_CLAMP

If this flag is specified, and if entries exceeds IORING_MAX_EN‐

TRIES, then entries will be clamped at IORING_MAX_ENTRIES. If

the flag IORING_SETUP_CQSIZE is set, and if the value of struct

io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES, then it

will be clamped at IORING_MAX_CQ_ENTRIES.
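
The effect of clamping followed by power-of-two rounding can be illustrated with plain arithmetic. This is only a sketch of the behaviour described above: the limit values below are assumptions (the real IORING_MAX_ENTRIES and IORING_MAX_CQ_ENTRIES are kernel-internal and may differ), and both function names are hypothetical.

```c
#include <stdint.h>

/* Assumed limits; the real values are kernel-internal and may differ. */
#define EXAMPLE_MAX_ENTRIES    32768u
#define EXAMPLE_MAX_CQ_ENTRIES (2u * EXAMPLE_MAX_ENTRIES)

/* Round v up to the next power of two (v must be in 1..2^31). */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t r = 1;
    while (r < v)
        r <<= 1;
    return r;
}

/* Clamp to max first, then round up, as with IORING_SETUP_CLAMP. */
static uint32_t effective_entries(uint32_t requested, uint32_t max)
{
    if (requested > max)
        requested = max;
    return round_up_pow2(requested);
}
```

Under these assumptions a request for 1000 entries yields a ring of 1024, and an oversized request is reduced to the assumed maximum rather than rejected.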

IORING_SETUP_ATTACH_WQ

This flag should be set in conjunction with struct io_ur‐

ing_params.wq_fd being set to an existing io_uring ring file de‐

scriptor. When set, the io_uring instance being created will

share the asynchronous worker thread backend of the specified

io_uring ring, rather than create a new separate thread pool.

IORING_SETUP_R_DISABLED

If this flag is specified, the io_uring ring starts in a disabled

state. In this state, restrictions can be registered, but sub‐

missions are not allowed. See io_uring_register(2) for details

on how to enable the ring. Available since 5.10.

IORING_SETUP_SUBMIT_ALL

Normally io_uring stops submitting a batch of requests if one of these requests results in an error. This can cause fewer requests to be submitted than expected if a request ends in error while being submitted. If the ring is created with this flag, io_uring_enter(2) will continue submitting requests even if it encounters an error submitting a request. CQEs are still posted for errored requests regardless of whether or not this flag is set at ring creation time; the only difference is whether the submit sequence is halted or continued when an error is observed. Available since 5.18.

IORING_SETUP_COOP_TASKRUN

By default, io_uring will interrupt a task running in userspace when a completion event comes in. This is to ensure that completions run in a timely manner. For a lot of use cases, this is overkill and can reduce performance because of the inter-processor interrupt used to do this, the kernel/user transition, the needless interruption of the task’s userspace activities, and reduced batching if completions come in at a rapid rate. Most applications don’t need the forceful interruption, as the events are processed at any kernel/user transition. The exception is setups where the application uses multiple threads operating on the same ring, where the thread waiting on completions isn’t the one that submitted them. For most other use cases, setting this flag will improve performance. Available since 5.19.

IORING_SETUP_TASKRUN_FLAG

Used in conjunction with IORING_SETUP_COOP_TASKRUN, this provides

a flag, IORING_SQ_TASKRUN, which is set in the SQ ring flags

whenever completions are pending that should be processed. libur‐

ing will check for this flag even when doing io_uring_peek_cqe(3)

and enter the kernel to process them, and applications can do the

same. This makes IORING_SETUP_TASKRUN_FLAG safe to use even when

applications rely on a peek style operation on the CQ ring to see

if anything might be pending to reap. Available since 5.19.

IORING_SETUP_SQE128

If set, io_uring will use 128-byte SQEs rather than the normal 64-byte sized variant. This is a requirement for using certain request types; as of 5.19, only the IORING_OP_URING_CMD passthrough command for NVMe passthrough needs this. Available since 5.19.

IORING_SETUP_CQE32

If set, io_uring will use 32-byte CQEs rather than the normal 16-byte sized variant. This is a requirement for using certain request types; as of 5.19, only the IORING_OP_URING_CMD passthrough command for NVMe passthrough needs this. Available since 5.19.
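
Because IORING_SETUP_SQE128 and IORING_SETUP_CQE32 change the entry sizes, any length computed from an entry count has to take the flags into account. The two helpers below are a hypothetical sketch of that calculation, with fallback flag definitions for header versions that predate these flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Fallbacks mirroring <linux/io_uring.h>, for older headers. */
#ifndef IORING_SETUP_SQE128
#define IORING_SETUP_SQE128 (1U << 10)
#endif
#ifndef IORING_SETUP_CQE32
#define IORING_SETUP_CQE32  (1U << 11)
#endif

/* Size in bytes of one submission queue entry for the given flags. */
static size_t sqe_bytes(uint32_t flags)
{
    return (flags & IORING_SETUP_SQE128) ? 128 : 64;
}

/* Size in bytes of one completion queue entry for the given flags. */
static size_t cqe_bytes(uint32_t flags)
{
    return (flags & IORING_SETUP_CQE32) ? 32 : 16;
}
```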

IORING_SETUP_SINGLE_ISSUER

A hint to the kernel that only a single task (or thread) will submit requests, which is used for internal optimisations. The submission task is either the task that created the ring, or if IORING_SETUP_R_DISABLED is specified then it is the task that enables the ring through io_uring_register(2). The kernel enforces this rule, failing requests with -EEXIST if the restriction is violated. Note that when IORING_SETUP_SQPOLL is set, the polling task is considered to be doing all submissions on behalf of userspace, so it always complies with the rule regardless of how many userspace tasks call io_uring_enter(2). Available since 6.0.

IORING_SETUP_DEFER_TASKRUN

By default, io_uring will process all outstanding work at the end

of any system call or thread interrupt. This can delay the appli‐

cation from making other progress. Setting this flag will hint

to io_uring that it should defer work until an io_uring_enter(2)

call with the IORING_ENTER_GETEVENTS flag set. This allows the

application to request work to run just before it wants to

process completions. This flag requires the IORING_SETUP_SIN‐

GLE_ISSUER flag to be set, and also enforces that the call to

io_uring_enter(2) is called from the same thread that submitted

requests. Note that if this flag is set then it is the applica‐

tion’s responsibility to periodically trigger work (for example

via any of the CQE waiting functions) or else completions may not

be delivered. Available since 6.1.

IORING_SETUP_NO_MMAP

By default, io_uring allocates kernel memory that callers must subsequently mmap(2). If this flag is set, io_uring instead uses caller-allocated buffers; p->cq_off.user_addr must point to the memory for the sq/cq rings, and p->sq_off.user_addr must point to the memory for the sqes. Each allocation must be contiguous memory. Typically, callers should allocate this memory by using mmap(2) to allocate a huge page. If this flag is set, a subsequent attempt to mmap(2) the io_uring file descriptor will fail. Available since 6.5.

IORING_SETUP_REGISTERED_FD_ONLY

If this flag is set, io_uring will register the ring file de‐

scriptor, and return the registered descriptor index, without

ever allocating an unregistered file descriptor. The caller will

need to use IORING_REGISTER_USE_REGISTERED_RING when calling

io_uring_register(2). This flag only makes sense when used alongside IORING_SETUP_NO_MMAP, which also needs to be set.

Available since 6.5.

IORING_SETUP_NO_SQARRAY

If this flag is set, entries in the submission queue will be submitted in order, wrapping around to the first entry after reaching the end of the queue. In other words, there will be no more indirection via the array of submission entries, and the queue will be indexed directly by the submission queue tail and the range of indices represented by it modulo queue size. Consequently, the user should not map the array of submission queue entries, and the corresponding offset in struct io_sqring_offsets will be set to zero. Available since 6.6.

If no flags are specified, the io_uring instance is set up for interrupt-driven I/O. I/O may be submitted using io_uring_enter(2) and can be reaped by polling the completion queue.
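
Several of the flags above are only valid in combination with another flag. The sketch below checks just the dependencies stated in this page (it is not a complete model of the kernel’s validation) so that a caller can catch mistakes before making the system call. The function name is hypothetical, and the fallback definitions exist only for header versions that lack the newer flags.

```c
#include <errno.h>
#include <stdint.h>

/* Fallbacks mirroring <linux/io_uring.h>, for older headers. */
#ifndef IORING_SETUP_SQPOLL
#define IORING_SETUP_SQPOLL             (1U << 1)
#endif
#ifndef IORING_SETUP_SQ_AFF
#define IORING_SETUP_SQ_AFF             (1U << 2)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER      (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN      (1U << 13)
#endif
#ifndef IORING_SETUP_NO_MMAP
#define IORING_SETUP_NO_MMAP            (1U << 14)
#endif
#ifndef IORING_SETUP_REGISTERED_FD_ONLY
#define IORING_SETUP_REGISTERED_FD_ONLY (1U << 15)
#endif

/* Returns 0 if the documented flag dependencies hold, else -EINVAL. */
static int check_setup_flags(uint32_t flags)
{
    if ((flags & IORING_SETUP_SQ_AFF) && !(flags & IORING_SETUP_SQPOLL))
        return -EINVAL;
    if ((flags & IORING_SETUP_DEFER_TASKRUN) &&
        !(flags & IORING_SETUP_SINGLE_ISSUER))
        return -EINVAL;
    if ((flags & IORING_SETUP_REGISTERED_FD_ONLY) &&
        !(flags & IORING_SETUP_NO_MMAP))
        return -EINVAL;
    return 0;
}
```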

The resv array must be initialized to zero.

features is filled in by the kernel and is a bit mask specifying the features supported by the running kernel version:

IORING_FEAT_SINGLE_MMAP

If this flag is set, the two SQ and CQ rings can be mapped with a

single mmap(2) call. The SQEs must still be allocated separately.

This brings the necessary mmap(2) calls down from three to two.

Available since kernel 5.4.

IORING_FEAT_NODROP

If this flag is set, io_uring supports almost never dropping completion events. A dropped event can only occur if the kernel runs out of memory, in which case you have worse problems than a lost event; your application and others will likely get OOM killed anyway. If a completion event occurs and the CQ ring is full, the kernel stores the event internally until the CQ ring has room for more entries. In earlier kernels, if this overflow condition is entered, attempting to submit more I/O fails with the -EBUSY error value if the kernel can’t flush the overflowed events to the CQ ring. If this happens, the application must reap events from the CQ ring and attempt the submit again. If the kernel has no free memory to store the event internally, the loss is visible as an increase in the overflow value on the cqring; additionally, io_uring_enter(2) will return -EBADR the next time it would otherwise sleep waiting for completions (since kernel 5.19). Available since kernel 5.5.

IORING_FEAT_SUBMIT_STABLE

If this flag is set, applications can be certain that any data

for async offload has been consumed when the kernel has consumed

the SQE. Available since kernel 5.5.

IORING_FEAT_RW_CUR_POS

If this flag is set, applications can specify offset == -1 with IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and IORING_OP_{READ,WRITE} to mean the current file position, which behaves like preadv2(2) and pwritev2(2) with offset == -1. The request will use (and update) the current file position. This comes with the caveat that if the application has multiple reads or writes in flight, the end result will not be as expected. This is similar to threads sharing a file descriptor and doing I/O using the current file position. Available since kernel 5.6.

IORING_FEAT_CUR_PERSONALITY

If this flag is set, then io_uring guarantees that both sync and

async execution of a request assumes the credentials of the task

that called io_uring_enter(2) to queue the requests. If this flag

isn’t set, then requests are issued with the credentials of the

task that originally registered the io_uring. If only one task is

using a ring, then this flag doesn’t matter as the credentials

will always be the same. Note that this is only the default behavior: tasks can still register different personalities through io_uring_register(2) with IORING_REGISTER_PERSONALITY and specify the personality to use in the sqe. Available since kernel 5.6.

IORING_FEAT_FAST_POLL

If this flag is set, then io_uring supports using an internal poll mechanism to drive data/space readiness. This means that requests that cannot read or write data to a file no longer need to be punted to an async thread for handling; instead, they will begin operation when the file is ready. This is similar to doing poll + read/write in userspace, but eliminates the need to do so. If this flag is set, requests waiting on space/data consume far fewer resources doing so, as they are not blocking a thread. Available since kernel 5.7.

IORING_FEAT_POLL_32BITS

If this flag is set, the IORING_OP_POLL_ADD command accepts the

full 32‐bit range of epoll based flags. Most notably EPOLLEXCLU‐

SIVE which allows exclusive (waking single waiters) behavior.

Available since kernel 5.9.

IORING_FEAT_SQPOLL_NONFIXED

If this flag is set, the IORING_SETUP_SQPOLL feature no longer

requires the use of fixed files. Any normal file descriptor can

be used for IO commands without needing registration. Available

since kernel 5.11.

IORING_FEAT_EXT_ARG

If this flag is set, then the io_uring_enter(2) system call supports passing in an extended argument instead of just the sigset_t of earlier kernels. This extended argument is of type struct io_uring_getevents_arg and allows the caller to pass in both a sigset_t and a timeout argument for waiting on events. The struct layout is as follows:

struct io_uring_getevents_arg {

__u64 sigmask;

__u32 sigmask_sz;

__u32 pad;

__u64 ts;

};

and a pointer to this struct must be passed in if IORING_EN‐

TER_EXT_ARG is set in the flags for the enter system call. Avail‐

able since kernel 5.11.

IORING_FEAT_NATIVE_WORKERS

If this flag is set, io_uring is using native workers for its

async helpers. Previous kernels used kernel threads that assumed

the identity of the original io_uring owning task, but later ker‐

nels will actively create what looks more like regular process

threads instead. Available since kernel 5.12.

IORING_FEAT_RSRC_TAGS

If this flag is set, then io_uring supports a variety of features

related to fixed files and buffers. In particular, it indicates

that registered buffers can be updated in‐place, whereas before

the full set would have to be unregistered first. Available since

kernel 5.13.

IORING_FEAT_CQE_SKIP

If this flag is set, then io_uring supports setting

IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no

CQE should be generated for this SQE if it executes normally. If

an error happens processing the SQE, a CQE with the appropriate

error value will still be generated. Available since kernel 5.17.

IORING_FEAT_LINKED_FILE

If this flag is set, then io_uring supports sane assignment of files for SQEs that have dependencies. For example, if a chain of SQEs is submitted with IOSQE_IO_LINK, then kernels without this flag will prepare the file for each link upfront. If a previous link opens a file with a known index, e.g. if direct descriptors are used with open or accept, then file assignment needs to happen after execution of that SQE. If this flag is set, then the kernel will defer file assignment until execution of a given request is started. Available since kernel 5.17.

IORING_FEAT_REG_REG_RING

If this flag is set, then io_uring supports calling io_uring_reg‐

ister(2) using a registered ring fd, via IORING_REGISTER_USE_REG‐

ISTERED_RING. Available since kernel 6.3.
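
Since features is a plain bit mask, an application can test individual bits after io_uring_setup(2) returns. The following sketch shows one such test and derives the number of mmap(2) calls implied by IORING_FEAT_SINGLE_MMAP as described above. The helper names are hypothetical, and the fallback definition is only there for older headers.

```c
#include <stdint.h>

/* Fallback mirroring <linux/io_uring.h>, for older headers. */
#ifndef IORING_FEAT_SINGLE_MMAP
#define IORING_FEAT_SINGLE_MMAP (1U << 0)
#endif

/* Non-zero if every bit in `bit` is set in the features mask. */
static int has_feature(uint32_t features, uint32_t bit)
{
    return (features & bit) == bit;
}

/* Number of mmap(2) calls needed to map both rings and the SQE array. */
static int ring_mmap_calls(uint32_t features)
{
    return has_feature(features, IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}
```

In practice the features argument would be p->features as filled in by the kernel.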

The rest of the fields in the struct io_uring_params are filled in by

the kernel, and provide the information necessary to memory map the sub‐

mission queue, completion queue, and the array of submission queue en‐

tries. sq_entries specifies the number of submission queue entries al‐

located. sq_off describes the offsets of various ring buffer fields:

struct io_sqring_offsets {

__u32 head;

__u32 tail;

__u32 ring_mask;

__u32 ring_entries;

__u32 flags;

__u32 dropped;

__u32 array;

__u32 resv1;

__u64 user_addr;

};

Taken together, sq_entries and sq_off provide all of the information

necessary for accessing the submission queue ring buffer and the submis‐

sion queue entry array. The submission queue can be mapped with a call

like:

ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
           ring_fd, IORING_OFF_SQ_RING);

where sq_off is the io_sqring_offsets structure, and ring_fd is the file

descriptor returned from io_uring_setup(2). The addition of sq_off.ar‐

ray to the length of the region accounts for the fact that the ring is

located at the end of the data structure. As an example, the ring

buffer head pointer can be accessed by adding sq_off.head to the address

returned from mmap(2):

head = ptr + sq_off.head;
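
The same offset arithmetic applies to every field of the SQ ring. As a sketch, the helpers below compute the mapping length used above and turn a mapped base address into typed pointers. struct sq_view and both function names are hypothetical conveniences, not part of the kernel interface.

```c
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical bundle of typed pointers into a mapped SQ ring. */
struct sq_view {
    unsigned *head;
    unsigned *tail;
    unsigned *ring_mask;
    unsigned *flags;
    unsigned *array;
};

/* Length of the SQ ring mapping, as in the mmap(2) call above. */
static size_t sq_ring_bytes(const struct io_uring_params *p)
{
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Resolve each field by adding its offset to the mapping's base. */
static struct sq_view sq_view_init(void *ptr,
                                   const struct io_sqring_offsets *off)
{
    char *base = ptr;
    struct sq_view v = {
        .head      = (unsigned *)(base + off->head),
        .tail      = (unsigned *)(base + off->tail),
        .ring_mask = (unsigned *)(base + off->ring_mask),
        .flags     = (unsigned *)(base + off->flags),
        .array     = (unsigned *)(base + off->array),
    };
    return v;
}
```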

The flags field is used by the kernel to communicate state information

to the application. Currently, it is used to inform the application

when a call to io_uring_enter(2) is necessary. See the documentation

for the IORING_SETUP_SQPOLL flag above. The dropped member is incre‐

mented for each invalid submission queue entry encountered in the ring

buffer.

The head and tail track the ring buffer state. The tail is incremented

by the application when submitting new I/O, and the head is incremented

by the kernel when the I/O has been successfully submitted. Determining

the index of the head or tail into the ring is accomplished by applying

a mask:

index = tail & ring_mask;
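
A small worked example of the masking may help. It assumes that the ring size is a power of two with ring_mask equal to the size minus one, and that head and tail are free-running 32-bit counters that are only reduced to an index when the ring is accessed. The function names are hypothetical.

```c
#include <stdint.h>

/* Slot in the ring addressed by a free-running head or tail counter. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/*
 * Number of entries between head and tail. Unsigned subtraction gives
 * the right answer even after the counters wrap past 2^32.
 */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

With eight entries (ring_mask of 7), a tail value of 9 refers to slot 1, and the counters themselves never need to be reset.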

The array of submission queue entries is mapped with:

sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                 ring_fd, IORING_OFF_SQES);

The completion queue is described by cq_entries and cq_off shown here:

struct io_cqring_offsets {

__u32 head;

__u32 tail;

__u32 ring_mask;

__u32 ring_entries;

__u32 overflow;

__u32 cqes;

__u32 flags;

__u32 resv1;

__u64 user_addr;

};

The completion queue is simpler, since the entries are not separated

from the queue itself, and can be mapped with:

ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
           ring_fd, IORING_OFF_CQ_RING);
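
Once the completion queue is mapped, completions are reaped by walking from head to tail. The following is a sketch under stated assumptions rather than a reference implementation: struct cq_view and cq_reap are hypothetical, the ring words are assumed to be accessed as _Atomic unsigned, and the default 16-byte CQE layout is assumed (no IORING_SETUP_CQE32).

```c
#include <stdatomic.h>
#include <linux/io_uring.h>

/* Hypothetical bundle of pointers into a mapped CQ ring. */
struct cq_view {
    _Atomic unsigned *head;      /* advanced by the application */
    _Atomic unsigned *tail;      /* advanced by the kernel */
    unsigned ring_mask;
    struct io_uring_cqe *cqes;
};

/* Copy up to `max` pending CQEs into `out`; returns how many were copied. */
static unsigned cq_reap(struct cq_view *cq, struct io_uring_cqe *out,
                        unsigned max)
{
    unsigned head = atomic_load_explicit(cq->head, memory_order_relaxed);
    /* Acquire pairs with the kernel's publication of new entries. */
    unsigned tail = atomic_load_explicit(cq->tail, memory_order_acquire);
    unsigned n = 0;

    while (head != tail && n < max) {
        out[n++] = cq->cqes[head & cq->ring_mask];
        head++;
    }

    /* Release: the copies above finish before the slots are handed back. */
    atomic_store_explicit(cq->head, head, memory_order_release);
    return n;
}
```

Copying entries out before advancing head keeps the example simple; an application that processes entries in place would advance head only after it has finished with them.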

Closing the file descriptor returned by io_uring_setup(2) will free all

resources associated with the io_uring context. Note that this may hap‐

pen asynchronously within the kernel, so it is not guaranteed that re‐

sources are freed immediately.

RETURN VALUE

io_uring_setup(2) returns a new file descriptor on success. The appli‐

cation may then provide the file descriptor in a subsequent mmap(2) call

to map the submission and completion queues, or to the io_uring_regis‐

ter(2) or io_uring_enter(2) system calls.

On error, a negative error code is returned. The caller should not rely on the errno variable.

ERRORS

EFAULT params is outside your accessible address space.

EINVAL The resv array contains non-zero data, p.flags contains an unsupported flag, entries is out of bounds, IORING_SETUP_SQ_AFF was specified but IORING_SETUP_SQPOLL was not, IORING_SETUP_CQSIZE was specified but io_uring_params.cq_entries was invalid, or IORING_SETUP_REGISTERED_FD_ONLY was specified but IORING_SETUP_NO_MMAP was not.

EMFILE The per‐process limit on the number of open file descriptors has

been reached (see the description of RLIMIT_NOFILE in getr‐

limit(2)).

ENFILE The system‐wide limit on the total number of open files has been

reached.

ENOMEM Insufficient kernel resources are available.

EPERM IORING_SETUP_SQPOLL was specified, but the effective user ID of

the caller did not have sufficient privileges.

EPERM /proc/sys/kernel/io_uring_disabled has the value 2, or it has the

value 1 and the calling process does not hold the CAP_SYS_ADMIN

capability or is not a member of /proc/sys/kernel/io_uring_group.
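
When diagnosing an EPERM result, it can be useful to inspect the sysctl mentioned above. The sketch below reads it with stdio. The function name is hypothetical, and returning -1 when the file is absent or unreadable (as on kernels that do not provide this sysctl) is a choice made for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the value of
 * /proc/sys/kernel/io_uring_disabled, or -1 if it cannot be read.
 */
static int read_io_uring_disabled(void)
{
    FILE *f = fopen("/proc/sys/kernel/io_uring_disabled", "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```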

SEE ALSO

io_uring_register(2), io_uring_enter(2)

Linux 2019‐01‐29 io_uring_setup(2)
