io_uring_setup(2)
SECTION: 2 - System calls
io_uring_setup(2) Linux Programmer's Manual io_uring_setup(2)
NAME
io_uring_setup - setup a context for performing asynchronous I/O
SYNOPSIS
#include <liburing.h>
int io_uring_setup(u32 entries, struct io_uring_params *p);
DESCRIPTION
The io_uring_setup(2) system call sets up a submission queue (SQ) and
completion queue (CQ) with at least entries entries, and returns a file
descriptor which can be used to perform subsequent operations on the
io_uring instance. The submission and completion queues are shared
between userspace and the kernel, which eliminates the need to copy data
when initiating and completing I/O.
The p argument (a struct io_uring_params) is used by the application to
pass options to the kernel, and by the kernel to convey information
about the ring buffers.
struct io_uring_params {
__u32 sq_entries;
__u32 cq_entries;
__u32 flags;
__u32 sq_thread_cpu;
__u32 sq_thread_idle;
__u32 features;
__u32 wq_fd;
__u32 resv[3];
struct io_sqring_offsets sq_off;
struct io_cqring_offsets cq_off;
};
The flags, sq_thread_cpu, and sq_thread_idle fields are used to configure
the io_uring instance. flags is a bit mask of 0 or more of the following
values ORed together:
IORING_SETUP_IOPOLL
Perform busy-waiting for an I/O completion, as opposed to getting
notifications via an asynchronous IRQ (Interrupt Request). The
file system (if any) and block device must support polling in order
for this to work. Busy-waiting provides lower latency, but
may consume more CPU resources than interrupt-driven I/O. Currently,
this feature is usable only on a file descriptor opened
using the O_DIRECT flag. When a read or write is submitted to a
polled context, the application must poll for completions on the
CQ ring by calling io_uring_enter(2). It is illegal to mix and
match polled and non-polled I/O on an io_uring instance.
This is only applicable for storage devices for now, and the
storage device must be configured for polling. How to do that depends
on the device type in question. For NVMe devices, the nvme
driver must be loaded with the poll_queues parameter set to the
desired number of polling queues. The polling queues will be
shared appropriately between the CPUs in the system, if the number
is less than the number of online CPU threads.
IORING_SETUP_SQPOLL
When this flag is specified, a kernel thread is created to perform
submission queue polling. An io_uring instance configured
in this way enables an application to issue I/O without ever context
switching into the kernel. By using the submission queue to
fill in new submission queue entries and watching for completions
on the completion queue, the application can submit and reap I/Os
without doing a single system call.
If the kernel thread is idle for more than sq_thread_idle milliseconds,
it will set the IORING_SQ_NEED_WAKEUP bit in the flags
field of the struct io_sq_ring. When this happens, the application
must call io_uring_enter(2) to wake the kernel thread. If
I/O is kept busy, the kernel thread will never sleep. An application
making use of this feature will need to guard the
io_uring_enter(2) call with the following code sequence:
/*
* Ensure that the wakeup flag is read after the tail pointer
* has been written. It's important to use memory load acquire
* semantics for the flags read, as otherwise the application
* and the kernel might not agree on the consistency of the
* wakeup flag.
*/
unsigned flags = atomic_load_relaxed(sq_ring->flags);
if (flags & IORING_SQ_NEED_WAKEUP)
io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
where sq_ring is a submission queue ring set up using the struct
io_sqring_offsets described below.
Note that, when using a ring set up with
IORING_SETUP_SQPOLL, you never directly call the io_uring_enter(2)
system call. That is usually taken care of by liburing's
io_uring_submit(3) function. It automatically determines if you
are using polling mode or not and deals with when your program
needs to call io_uring_enter(2) without you having to bother
about it.
Before version 5.11 of the Linux kernel, to successfully use this
feature, the application must register a set of files to be used for IO
through io_uring_register(2) using the IORING_REGISTER_FILES opcode.
Failure to do so will result in submitted IO being errored
with EBADF. The presence of this feature can be detected by the
IORING_FEAT_SQPOLL_NONFIXED feature flag. In version 5.11 and
later, it is no longer necessary to register files to use this
feature. 5.11 also allows using this as non-root, if the user has
the CAP_SYS_NICE capability. In 5.13 this requirement was also
relaxed, and no special privileges are needed for SQPOLL in newer
kernels. Certain stable kernels older than 5.13 may also support
unprivileged SQPOLL.
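The wakeup check described for IORING_SETUP_SQPOLL can be sketched in plain C. This is a minimal illustration, not liburing's implementation: the helper name publish_tail_and_check_wakeup is hypothetical, the pointers are assumed to refer to the tail and flags words of an already mapped SQ ring, and the GCC/Clang __atomic builtins with a full fence are a conservative assumption about how to order the tail store before the flags load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: publish a new SQ tail and report whether the SQPOLL
 * thread has gone idle and needs io_uring_enter(2) with
 * IORING_ENTER_SQ_WAKEUP. sq_tail and sq_flags are assumed to point into
 * the mapped SQ ring at sq_off.tail and sq_off.flags.
 */
static bool publish_tail_and_check_wakeup(unsigned *sq_tail, unsigned new_tail,
                                          const unsigned *sq_flags)
{
    assert(sq_tail != NULL && sq_flags != NULL);

    /* Make the queued SQEs visible before the tail that announces them. */
    __atomic_store_n(sq_tail, new_tail, __ATOMIC_RELEASE);

    /* Full barrier: the flags load must not be reordered before the store. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    unsigned flags = __atomic_load_n(sq_flags, __ATOMIC_RELAXED);
    return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

When the helper returns true, the caller would issue the io_uring_enter(2) call shown earlier; when it returns false, no system call is needed for that submission.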
IORING_SETUP_SQ_AFF
If this flag is specified, then the poll thread will be bound to
the CPU set in the sq_thread_cpu field of the struct
io_uring_params. This flag is only meaningful when
IORING_SETUP_SQPOLL is specified. When the cgroup setting cpuset.cpus
changes (typically in a container environment), the bound CPU set
may be changed as well.
IORING_SETUP_CQSIZE
Create the completion queue with struct io_uring_params.cq_entries
entries. The value must be greater than entries, and may
be rounded up to the next power-of-two.
IORING_SETUP_CLAMP
If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES,
then entries will be clamped at IORING_MAX_ENTRIES. If
the flag IORING_SETUP_CQSIZE is set, and if the value of struct
io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES, then it
will be clamped at IORING_MAX_CQ_ENTRIES.
IORING_SETUP_ATTACH_WQ
This flag should be set in conjunction with struct
io_uring_params.wq_fd being set to an existing io_uring ring file
descriptor. When set, the io_uring instance being created will
share the asynchronous worker thread backend of the specified
io_uring ring, rather than create a new separate thread pool.
IORING_SETUP_R_DISABLED
If this flag is specified, the io_uring ring starts in a disabled
state. In this state, restrictions can be registered, but submissions
are not allowed. See io_uring_register(2) for details
on how to enable the ring. Available since 5.10.
IORING_SETUP_SUBMIT_ALL
Normally io_uring stops submitting a batch of requests if one of
these requests results in an error. This can cause submission of
fewer requests than expected, if a request ends in error while being
submitted. If the ring is created with this flag,
io_uring_enter(2) will continue submitting requests even if it encounters
an error submitting a request. CQEs are still posted for errored
requests regardless of whether or not this flag is set at
ring creation time; the only difference is whether the submit sequence
is halted or continued when an error is observed. Available since
5.18.
IORING_SETUP_COOP_TASKRUN
By default, io_uring will interrupt a task running in userspace
when a completion event comes in. This is to ensure that completions
run in a timely manner. For a lot of use cases, this is
overkill and can cause reduced performance from the inter-processor
interrupt used to do this, the kernel/user transition,
the needless interruption of the task's userspace activities, and
reduced batching if completions come in at a rapid rate. Most
applications don't need the forceful interruption, as the events
are processed at any kernel/user transition. The exceptions are
setups where the application uses multiple threads operating on
the same ring, where the application waiting on completions isn't
the one that submitted them. For most other use cases, setting
this flag will improve performance. Available since 5.19.
IORING_SETUP_TASKRUN_FLAG
Used in conjunction with IORING_SETUP_COOP_TASKRUN, this provides
a flag, IORING_SQ_TASKRUN, which is set in the SQ ring flags
whenever completions are pending that should be processed. liburing
will check for this flag even when doing io_uring_peek_cqe(3)
and enter the kernel to process them, and applications can do the
same. This makes IORING_SETUP_TASKRUN_FLAG safe to use even when
applications rely on a peek style operation on the CQ ring to see
if anything might be pending to reap. Available since 5.19.
IORING_SETUP_SQE128
If set, io_uring will use 128-byte SQEs rather than the normal
64-byte sized variant. This is a requirement for using certain
request types; as of 5.19, only the IORING_OP_URING_CMD
passthrough command for NVMe passthrough needs this. Available
since 5.19.
IORING_SETUP_CQE32
If set, io_uring will use 32-byte CQEs rather than the normal
16-byte sized variant. This is a requirement for using certain
request types; as of 5.19, only the IORING_OP_URING_CMD
passthrough command for NVMe passthrough needs this. Available
since 5.19.
IORING_SETUP_SINGLE_ISSUER
A hint to the kernel that only a single task (or thread) will
submit requests, which is used for internal optimisations. The
submission task is either the task that created the ring, or if
IORING_SETUP_R_DISABLED is specified then it is the task that enables
the ring through io_uring_register(2). The kernel enforces
this rule, failing requests with -EEXIST if the restriction is
violated. Note that when IORING_SETUP_SQPOLL is set it is considered
that the polling task is doing all submissions on behalf
of userspace, so it always complies with the rule regardless
of how many userspace tasks call io_uring_enter(2). Available
since 6.0.
IORING_SETUP_DEFER_TASKRUN
By default, io_uring will process all outstanding work at the end
of any system call or thread interrupt. This can delay the application
from making other progress. Setting this flag will hint
to io_uring that it should defer work until an io_uring_enter(2)
call with the IORING_ENTER_GETEVENTS flag set. This allows the
application to request work to run just before it wants to
process completions. This flag requires the IORING_SETUP_SINGLE_ISSUER
flag to be set, and also enforces that the call to
io_uring_enter(2) is made from the same thread that submitted
requests. Note that if this flag is set then it is the application's
responsibility to periodically trigger work (for example
via any of the CQE waiting functions) or else completions may not
be delivered. Available since 6.1.
IORING_SETUP_NO_MMAP
By default, io_uring allocates kernel memory that callers must
subsequently mmap(2). If this flag is set, io_uring instead uses
caller-allocated buffers; p->cq_off.user_addr must point to the
memory for the sq/cq rings, and p->sq_off.user_addr must point to
the memory for the sqes. Each allocation must be contiguous memory.
Typically, callers should allocate this memory by using
mmap(2) to allocate a huge page. If this flag is set, a subsequent
attempt to mmap(2) the io_uring file descriptor will fail.
Available since 6.5.
IORING_SETUP_REGISTERED_FD_ONLY
If this flag is set, io_uring will register the ring file
descriptor, and return the registered descriptor index, without
ever allocating an unregistered file descriptor. The caller will
need to use IORING_REGISTER_USE_REGISTERED_RING when calling
io_uring_register(2). This flag only makes sense when used
alongside IORING_SETUP_NO_MMAP, which also needs to be set.
Available since 6.5.
IORING_SETUP_NO_SQARRAY
If this flag is set, entries in the submission queue will be submitted
in order, wrapping around to the first entry after reaching
the end of the queue. In other words, there will be no more
indirection via the array of submission entries, and the queue
will be indexed directly by the submission queue tail and the
range of indices represented by it modulo queue size. Consequently,
the user should not map the array of submission queue
entries, and the corresponding offset in struct io_sqring_offsets
will be set to zero. Available since 6.6.
If no flags are specified, the io_uring instance is set up for
interrupt-driven I/O. I/O may be submitted using io_uring_enter(2) and can be
reaped by polling the completion queue.
The resv array must be initialized to zero.
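As an illustration of how the fields and flags above fit together, the following sketch zeroes the parameter structure, requests an explicitly sized and clamped completion queue, and invokes the system call. It is a minimal example under stated assumptions: the helper name setup_ring is hypothetical, and it uses the raw syscall(2) interface with __NR_io_uring_setup rather than the liburing wrapper named in the SYNOPSIS, so failures are converted from errno into a negative return value by hand.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper (not part of any library): returns the ring file
 * descriptor on success, or a negative errno value on failure.
 */
static int setup_ring(unsigned entries, unsigned cq_entries,
                      struct io_uring_params *p)
{
    assert(p != NULL);

    /* Zeroing the structure also clears resv[], which must be zero. */
    memset(p, 0, sizeof(*p));

    /* Ask for a specific CQ size and let the kernel clamp oversized values. */
    p->flags = IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP;
    p->cq_entries = cq_entries;

    long fd = syscall(__NR_io_uring_setup, entries, p);
    if (fd < 0)
        return -errno;   /* the raw syscall(2) wrapper reports errors via errno */
    return (int)fd;
}
```

On success, p->sq_entries and p->cq_entries hold the sizes the kernel actually allocated, which may be larger than requested because of rounding.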
features is filled in by the kernel and specifies the features
supported by the current kernel version.
IORING_FEAT_SINGLE_MMAP
If this flag is set, the two SQ and CQ rings can be mapped with a
single mmap(2) call. The SQEs must still be allocated separately.
This brings the necessary mmap(2) calls down from three to two.
Available since kernel 5.4.
IORING_FEAT_NODROP
If this flag is set, io_uring supports almost never dropping completion
events. A dropped event can only occur if the kernel
runs out of memory, in which case you have worse problems than a
lost event. Your application and others will likely get OOM
killed anyway. If a completion event occurs and the CQ ring is
full, the kernel stores the event internally until such a time
that the CQ ring has room for more entries. In earlier kernels,
if this overflow condition is entered, attempting to submit more
IO would fail with the -EBUSY error value, if it can't flush the
overflowed events to the CQ ring. If this happens, the application
must reap events from the CQ ring and attempt the submit again.
If the kernel has no free memory to store the event internally it
will be visible by an increase in the overflow value on the
cqring. Available since kernel 5.5. Additionally,
io_uring_enter(2) will return -EBADR the next time it would otherwise sleep
waiting for completions (since kernel 5.19).
IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data
for async offload has been consumed when the kernel has consumed
the SQE. Available since kernel 5.5.
IORING_FEAT_RW_CUR_POS
If this flag is set, applications can specify offset == -1 with
IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
IORING_OP_{READ,WRITE} to mean current file position, which behaves
like preadv2(2) and pwritev2(2) with offset == -1. It'll
use (and update) the current file position. This obviously comes
with the caveat that if the application has multiple reads or
writes in flight, then the end result will not be as expected.
This is similar to threads sharing a file descriptor and doing IO
using the current file position. Available since kernel 5.6.
IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and
async execution of a request assumes the credentials of the task
that called io_uring_enter(2) to queue the requests. If this flag
isn't set, then requests are issued with the credentials of the
task that originally registered the io_uring. If only one task is
using a ring, then this flag doesn't matter as the credentials
will always be the same. Note that this is the default behavior;
tasks can still register different personalities through
io_uring_register(2) with IORING_REGISTER_PERSONALITY and specify the
personality to use in the sqe. Available since kernel 5.6.
IORING_FEAT_FAST_POLL
If this flag is set, then io_uring supports using an internal
poll mechanism to drive data/space readiness. This means that requests
that cannot read or write data to a file no longer need to
be punted to an async thread for handling; instead they will begin
operation when the file is ready. This is similar to doing
poll + read/write in userspace, but eliminates the need to do so.
If this flag is set, requests waiting on space/data consume far
fewer resources doing so as they are not blocking a thread.
Available since kernel 5.7.
IORING_FEAT_POLL_32BITS
If this flag is set, the IORING_OP_POLL_ADD command accepts the
full 32-bit range of epoll based flags. Most notably EPOLLEXCLUSIVE
which allows exclusive (waking single waiters) behavior.
Available since kernel 5.9.
IORING_FEAT_SQPOLL_NONFIXED
If this flag is set, the IORING_SETUP_SQPOLL feature no longer
requires the use of fixed files. Any normal file descriptor can
be used for IO commands without needing registration. Available
since kernel 5.11.
IORING_FEAT_EXT_ARG
If this flag is set, then the io_uring_enter(2) system call supports
passing in an extended argument instead of just the
sigset_t of earlier kernels. This extended argument is of type
struct io_uring_getevents_arg and allows the caller to pass in
both a sigset_t and a timeout argument for waiting on events. The
struct layout is as follows:
struct io_uring_getevents_arg {
__u64 sigmask;
__u32 sigmask_sz;
__u32 pad;
__u64 ts;
};
and a pointer to this struct must be passed in if IORING_ENTER_EXT_ARG
is set in the flags for the enter system call. Available
since kernel 5.11.
IORING_FEAT_NATIVE_WORKERS
If this flag is set, io_uring is using native workers for its
async helpers. Previous kernels used kernel threads that assumed
the identity of the original io_uring owning task, but later kernels
will actively create what looks more like regular process
threads instead. Available since kernel 5.12.
IORING_FEAT_RSRC_TAGS
If this flag is set, then io_uring supports a variety of features
related to fixed files and buffers. In particular, it indicates
that registered buffers can be updated in-place, whereas before
the full set would have to be unregistered first. Available since
kernel 5.13.
IORING_FEAT_CQE_SKIP
If this flag is set, then io_uring supports setting
IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no
CQE should be generated for this SQE if it executes normally. If
an error happens processing the SQE, a CQE with the appropriate
error value will still be generated. Available since kernel 5.17.
IORING_FEAT_LINKED_FILE
If this flag is set, then io_uring supports sane assignment of
files for SQEs that have dependencies. For example, if a chain of
SQEs is submitted with IOSQE_IO_LINK, then kernels without this
flag will prepare the file for each link upfront. If a previous
link opens a file with a known index, e.g. if direct descriptors
are used with open or accept, then file assignment needs to happen
post execution of that SQE. If this flag is set, then the
kernel will defer file assignment until execution of a given request
is started. Available since kernel 5.17.
IORING_FEAT_REG_REG_RING
If this flag is set, then io_uring supports calling io_uring_register(2)
using a registered ring fd, via IORING_REGISTER_USE_REGISTERED_RING.
Available since kernel 6.3.
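Because features is a plain bit mask, an application can branch on individual capabilities after io_uring_setup(2) returns. The sketch below shows two such checks; the helper names ring_mmap_calls and cq_overflow_is_buffered are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: number of mmap(2) calls needed to map the SQ ring,
 * the CQ ring, and the SQE array on the running kernel.
 */
static int ring_mmap_calls(const struct io_uring_params *p)
{
    assert(p != NULL);
    return (p->features & IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/*
 * Hypothetical helper: true if the kernel keeps completions internally when
 * the CQ ring is full instead of dropping them straight away.
 */
static bool cq_overflow_is_buffered(const struct io_uring_params *p)
{
    assert(p != NULL);
    return (p->features & IORING_FEAT_NODROP) != 0;
}
```

Testing a bit that the running kernel does not know about is harmless: the kernel leaves unknown bits clear, so the check evaluates to false.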
The rest of the fields in the struct io_uring_params are filled in by
the kernel, and provide the information necessary to memory map the
submission queue, completion queue, and the array of submission queue
entries. sq_entries specifies the number of submission queue entries
allocated. sq_off describes the offsets of various ring buffer fields:
struct io_sqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 flags;
__u32 dropped;
__u32 array;
__u32 resv1;
__u64 user_addr;
};
Taken together, sq_entries and sq_off provide all of the information
necessary for accessing the submission queue ring buffer and the submission
queue entry array. The submission queue can be mapped with a call
like:
ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
ring_fd, IORING_OFF_SQ_RING);
where sq_off is the io_sqring_offsets structure, and ring_fd is the file
descriptor returned from io_uring_setup(2). The addition of sq_off.array
to the length of the region accounts for the fact that the ring is
located at the end of the data structure. As an example, the ring
buffer head pointer can be accessed by adding sq_off.head to the address
returned from mmap(2):
head = ptr + sq_off.head;
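The length calculation and the offset arithmetic above can be wrapped in two small helpers. This is only a sketch: the names sq_ring_map_len and sq_ring_field are hypothetical, and the code assumes the params structure has already been filled in by a successful io_uring_setup(2) call. The cast to char * avoids arithmetic on a void pointer, which standard C does not define.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical helper: length to pass to mmap(2) for the SQ ring. */
static size_t sq_ring_map_len(const struct io_uring_params *p)
{
    assert(p != NULL);
    /* The index array sits at the end: sq_entries 32-bit slots from sq_off.array. */
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/*
 * Hypothetical helper: address of a 32-bit ring field, given the address
 * returned by mmap(2) and one of the offsets from struct io_sqring_offsets.
 */
static unsigned *sq_ring_field(void *ring_base, __u32 offset)
{
    assert(ring_base != NULL);
    return (unsigned *)((char *)ring_base + offset);
}
```

For example, sq_ring_field(ptr, p.sq_off.tail) yields the tail pointer in the same way that the head pointer is derived above.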
The flags field is used by the kernel to communicate state information
to the application. Currently, it is used to inform the application
when a call to io_uring_enter(2) is necessary. See the documentation
for the IORING_SETUP_SQPOLL flag above. The dropped member is incremented
for each invalid submission queue entry encountered in the ring
buffer.
The head and tail track the ring buffer state. The tail is incremented
by the application when submitting new I/O, and the head is incremented
by the kernel when the I/O has been successfully submitted. Determining
the index of the head or tail into the ring is accomplished by applying
a mask:
index = tail & ring_mask;
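Since the head and tail are free-running 32-bit counters that are only masked when used as an index, the number of pending entries and the remaining space follow from unsigned subtraction. The helpers below are a hypothetical sketch of that arithmetic; they operate on plain values and do not touch a real ring.

```c
#include <assert.h>

/* Slot in the ring for a free-running counter; ring_mask is ring_entries - 1. */
static unsigned ring_index(unsigned counter, unsigned ring_mask)
{
    return counter & ring_mask;
}

/* Entries queued by the application that the kernel has not yet consumed. */
static unsigned sq_pending(unsigned head, unsigned tail)
{
    /* Unsigned subtraction stays correct when the counters wrap around. */
    return tail - head;
}

/* Free slots left in a ring of ring_entries (a power of two) entries. */
static unsigned sq_space_left(unsigned head, unsigned tail,
                              unsigned ring_entries)
{
    assert(ring_entries != 0 && (ring_entries & (ring_entries - 1)) == 0);
    return ring_entries - sq_pending(head, tail);
}
```

With an 8-entry ring (ring_mask of 7), a tail of 9 maps to slot 1, and a head of 5 with a tail of 7 leaves 6 free slots.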
The array of submission queue entries is mapped with:
sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
ring_fd, IORING_OFF_SQES);
The completion queue is described by cq_entries and cq_off shown here:
struct io_cqring_offsets {
__u32 head;
__u32 tail;
__u32 ring_mask;
__u32 ring_entries;
__u32 overflow;
__u32 cqes;
__u32 flags;
__u32 resv1;
__u64 user_addr;
};
The completion queue is simpler, since the entries are not separated
from the queue itself, and can be mapped with:
ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
IORING_OFF_CQ_RING);
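Once the completion queue is mapped, reaping completions means reading entries between head and tail and then advancing head. The following sketch shows one way to do that; the helper name reap_cqes is hypothetical, the pointers are assumed to have been derived from cq_off.head, cq_off.tail, and cq_off.cqes, the ring is assumed to use 16-byte CQEs (no IORING_SETUP_CQE32), and the GCC/Clang __atomic builtins are an assumption about the toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: copy up to max available CQEs into out and advance
 * the CQ head. Returns the number of CQEs copied.
 */
static unsigned reap_cqes(unsigned *khead, const unsigned *ktail,
                          unsigned ring_mask, const struct io_uring_cqe *cqes,
                          struct io_uring_cqe *out, unsigned max)
{
    assert(khead != NULL && ktail != NULL && cqes != NULL && out != NULL);

    unsigned head = *khead;   /* only the application writes the CQ head */
    /* Acquire: read the tail before reading the entries it publishes. */
    unsigned tail = __atomic_load_n(ktail, __ATOMIC_ACQUIRE);
    unsigned n = 0;

    while (head != tail && n < max) {
        memcpy(&out[n], &cqes[head & ring_mask], sizeof(out[n]));
        head++;
        n++;
    }

    /* Release: the kernel may reuse these slots once the new head is visible. */
    __atomic_store_n(khead, head, __ATOMIC_RELEASE);
    return n;
}
```

Each copied entry carries the user_data value of the originating submission and the result in res, so the caller can match completions to requests after the slots have been handed back.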
Closing the file descriptor returned by io_uring_setup(2) will free all
resources associated with the io_uring context. Note that this may happen
asynchronously within the kernel, so it is not guaranteed that resources
are freed immediately.
RETURN VALUE
io_uring_setup(2) returns a new file descriptor on success. The application
may then provide the file descriptor in a subsequent mmap(2) call
to map the submission and completion queues, or to the
io_uring_register(2) or io_uring_enter(2) system calls.
On error, a negative error code is returned. The caller should not rely
on the errno variable.
ERRORS
EFAULT p points outside your accessible address space.
EINVAL The resv array contains non-zero data, p.flags contains an unsupported
flag, entries is out of bounds, IORING_SETUP_SQ_AFF was
specified but IORING_SETUP_SQPOLL was not, IORING_SETUP_CQSIZE
was specified but io_uring_params.cq_entries was invalid, or
IORING_SETUP_REGISTERED_FD_ONLY was specified but
IORING_SETUP_NO_MMAP was not.
EMFILE The per-process limit on the number of open file descriptors has
been reached (see the description of RLIMIT_NOFILE in
getrlimit(2)).
ENFILE The system-wide limit on the total number of open files has been
reached.
ENOMEM Insufficient kernel resources are available.
EPERM IORING_SETUP_SQPOLL was specified, but the effective user ID of
the caller did not have sufficient privileges.
EPERM /proc/sys/kernel/io_uring_disabled has the value 2, or it has the
value 1 and the calling process does not hold the CAP_SYS_ADMIN
capability or is not a member of /proc/sys/kernel/io_uring_group.
SEE ALSO
io_uring_register(2), io_uring_enter(2)
Linux 2019-01-29 io_uring_setup(2)