2023-10-16 14:02:40

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 00/11] io_uring: Initial support for {s,g}etsockopt commands

This patchset adds support for getsockopt (SOCKET_URING_OP_GETSOCKOPT)
and setsockopt (SOCKET_URING_OP_SETSOCKOPT) in io_uring commands.
SOCKET_URING_OP_SETSOCKOPT implements generic case, covering all levels
and optnames. SOCKET_URING_OP_GETSOCKOPT is limited, for now, to
SOL_SOCKET level, which seems to be the most common level parameter for
get/setsockopt(2).

In order to keep the implementation (and tests) simple, some refactors
were done prior to the changes, as follows:

Patches 1-2: Make BPF cgroup filters sockptr aware

Patches 3-4: Remove the core {s,g}etsockopt() core function from
__sys_{g,s}etsockopt, so, the code could be reused by other callers, such as
io_uring.

Patch 5: Pass compat mode to the file/socket callbacks

Patch 6-7: Move io_uring helpers from io_uring_zerocopy_tx to a generic
io_uring headers. This simplify the test case (last patch). Also copy the
io_uring UAPI to the tests directory.

Patch 8: Protect io_uring_cmd_sock() to not be called if CONFIG_NET is
disabled.

These changes were tested with a new test[1] in liburing, LTP sockopt*
tests, as also with bpf/progs/sockopt test case, which is now adapted to
run using both system calls and io_uring commands.

[1] Link: https://github.com/leitao/liburing/blob/getsock/test/socket-getsetsock-cmd.c

RFC -> V1:
* Copy user memory at io_uring subsystem, and call proto_ops
callbacks using kernel memory
* Implement all the cases for SOCKET_URING_OP_SETSOCKOPT

V1 -> V2
* Implemented the BPF part
* Using user pointers from optval to avoid kmalloc in io_uring part.

V2 -> V3:
* Break down __sys_setsockopt and reuse the core code, avoiding
duplicated code. This removed the requirement to expose
sock_use_custom_sol_socket().
* Added io_uring test to selftests/bpf/sockopt.
* Fixed compat argument, by passing it to the issue_flags.

V3 -> V4:
* Rebase on top of commit 1ded5e5a5931b ("net: annotate data-races around sock->ops")
* Also broke down __sys_setsockopt() to reuse the core function
from io_uring.
* Create a new patch to return -EOPNOTSUPP if CONFIG_NET is
disabled.
* Added two SOL_SOCKET tests in bpf/prog_tests/sockopt.

V4 -> V5:
* Do not use sockptr anymore, by changing the optlen getsock argument
to be a user pointer (instead of a kernel pointer). This change also drop
the limitation on getsockopt from previous versions, and now all
levels are supported.
* Simplified the BPF sockopt test, since there is no more limitation on
the io_uring commands.
* No more changes in the BPF subsystem.
* Moved the optlen field in the SQE struct. It is now a pointer instead
of u32.

V5 -> V6:
* Removed the need for #ifdef CONFIG_NET as suggested by Gabriel
Krisman.
* Changed the variable declaration order to respect the reverse
xmas declaration as suggested by Paolo Abeni.

V6 -> V7:
* Changed the optlen back to a value in the SQE instead of
user-pointer. This is similar to version 4.
[https://lore.kernel.org/all/[email protected]/]
* Imported the io_uring.h into tools/include/uapi/linux to be able to
run the tests in machines without liburing.
[https://lore.kernel.org/all/[email protected]/]

Breno Leitao (11):
bpf: Leverage sockptr_t in BPF getsockopt hook
bpf: Leverage sockptr_t in BPF setsockopt hook
net/socket: Break down __sys_setsockopt
net/socket: Break down __sys_getsockopt
io_uring/cmd: Pass compat mode in issue_flags
tools headers: Grab copy of io_uring.h
selftests/net: Extract uring helpers to be reusable
io_uring/cmd: return -EOPNOTSUPP if net is disabled
io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
selftests/bpf/sockopt: Add io_uring support
Breno Leitao (11):
bpf: Add sockptr support for getsockopt
bpf: Add sockptr support for setsockopt
net/socket: Break down __sys_setsockopt
net/socket: Break down __sys_getsockopt
io_uring/cmd: Pass compat mode in issue_flags
tools headers: Grab copy of io_uring.h
selftests/net: Extract uring helpers to be reusable
io_uring/cmd: return -EOPNOTSUPP if net is disabled
io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
selftests/bpf/sockopt: Add io_uring support

include/linux/bpf-cgroup.h | 9 +-
include/linux/io_uring.h | 1 +
include/net/sock.h | 6 +-
include/uapi/linux/io_uring.h | 8 +
io_uring/uring_cmd.c | 53 ++
kernel/bpf/cgroup.c | 25 +-
net/core/sock.c | 8 -
net/socket.c | 103 ++-
tools/include/io_uring/mini_liburing.h | 282 +++++++
tools/include/uapi/linux/io_uring.h | 757 ++++++++++++++++++
.../selftests/bpf/prog_tests/sockopt.c | 113 ++-
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 268 +------
13 files changed, 1300 insertions(+), 334 deletions(-)
create mode 100644 tools/include/io_uring/mini_liburing.h
create mode 100644 tools/include/uapi/linux/io_uring.h

--
2.34.1


2023-10-16 14:02:41

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 06/11] tools headers: Grab copy of io_uring.h

This file will be used by mini_uring.h and allow tests to run without
the need of installing liburing to run the tests.

This is needed to run io_uring tests in BPF, such as
(tools/testing/selftests/bpf/prog_tests/sockopt.c).

Signed-off-by: Breno Leitao <[email protected]>
---
tools/include/uapi/linux/io_uring.h | 757 ++++++++++++++++++++++++++++
1 file changed, 757 insertions(+)
create mode 100644 tools/include/uapi/linux/io_uring.h

diff --git a/tools/include/uapi/linux/io_uring.h b/tools/include/uapi/linux/io_uring.h
new file mode 100644
index 000000000000..f1c16f817742
--- /dev/null
+++ b/tools/include/uapi/linux/io_uring.h
@@ -0,0 +1,757 @@
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+/*
+ * this file is shared with liburing and that has to autodetect
+ * if linux/time_types.h is available or not, it can
+ * define UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H
+ * if linux/time_types.h is not available
+ */
+#ifndef UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H
+#include <linux/time_types.h>
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+ __u8 opcode; /* type of operation for this sqe */
+ __u8 flags; /* IOSQE_ flags */
+ __u16 ioprio; /* ioprio for the request */
+ __s32 fd; /* file descriptor to do IO on */
+ union {
+ __u64 off; /* offset into file */
+ __u64 addr2;
+ struct {
+ __u32 cmd_op;
+ __u32 __pad1;
+ };
+ };
+ union {
+ __u64 addr; /* pointer to buffer or iovecs */
+ __u64 splice_off_in;
+ struct {
+ __u32 level;
+ __u32 optname;
+ };
+ };
+ __u32 len; /* buffer size or number of iovecs */
+ union {
+ __kernel_rwf_t rw_flags;
+ __u32 fsync_flags;
+ __u16 poll_events; /* compatibility */
+ __u32 poll32_events; /* word-reversed for BE */
+ __u32 sync_range_flags;
+ __u32 msg_flags;
+ __u32 timeout_flags;
+ __u32 accept_flags;
+ __u32 cancel_flags;
+ __u32 open_flags;
+ __u32 statx_flags;
+ __u32 fadvise_advice;
+ __u32 splice_flags;
+ __u32 rename_flags;
+ __u32 unlink_flags;
+ __u32 hardlink_flags;
+ __u32 xattr_flags;
+ __u32 msg_ring_flags;
+ __u32 uring_cmd_flags;
+ __u32 waitid_flags;
+ __u32 futex_flags;
+ };
+ __u64 user_data; /* data to be passed back at completion time */
+ /* pack this to avoid bogus arm OABI complaints */
+ union {
+ /* index into fixed buffers, if used */
+ __u16 buf_index;
+ /* for grouped buffer selection */
+ __u16 buf_group;
+ } __attribute__((packed));
+ /* personality to use, if used */
+ __u16 personality;
+ union {
+ __s32 splice_fd_in;
+ __u32 file_index;
+ __u32 optlen;
+ struct {
+ __u16 addr_len;
+ __u16 __pad3[1];
+ };
+ };
+ union {
+ struct {
+ __u64 addr3;
+ __u64 __pad2[1];
+ };
+ __u64 optval;
+ /*
+ * If the ring is initialized with IORING_SETUP_SQE128, then
+ * this field is used for 80 bytes of arbitrary command data
+ */
+ __u8 cmd[0];
+ };
+};
+
+/*
+ * If sqe->file_index is set to this for opcodes that instantiate a new
+ * direct descriptor (like openat/openat2/accept), then io_uring will allocate
+ * an available direct descriptor instead of having the application pass one
+ * in. The picked direct descriptor will be returned in cqe->res, or -ENFILE
+ * if the space is full.
+ */
+#define IORING_FILE_INDEX_ALLOC (~0U)
+
+enum {
+ IOSQE_FIXED_FILE_BIT,
+ IOSQE_IO_DRAIN_BIT,
+ IOSQE_IO_LINK_BIT,
+ IOSQE_IO_HARDLINK_BIT,
+ IOSQE_ASYNC_BIT,
+ IOSQE_BUFFER_SELECT_BIT,
+ IOSQE_CQE_SKIP_SUCCESS_BIT,
+};
+
+/*
+ * sqe->flags
+ */
+/* use fixed fileset */
+#define IOSQE_FIXED_FILE (1U << IOSQE_FIXED_FILE_BIT)
+/* issue after inflight IO */
+#define IOSQE_IO_DRAIN (1U << IOSQE_IO_DRAIN_BIT)
+/* links next sqe */
+#define IOSQE_IO_LINK (1U << IOSQE_IO_LINK_BIT)
+/* like LINK, but stronger */
+#define IOSQE_IO_HARDLINK (1U << IOSQE_IO_HARDLINK_BIT)
+/* always go async */
+#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
+/* select buffer from sqe->buf_group */
+#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
+/* don't post CQE if request succeeded */
+#define IOSQE_CQE_SKIP_SUCCESS (1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
+
+/*
+ * io_uring_setup() flags
+ */
+#define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */
+#define IORING_SETUP_SQPOLL (1U << 1) /* SQ poll thread */
+#define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */
+#define IORING_SETUP_CQSIZE (1U << 3) /* app defines CQ size */
+#define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */
+#define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */
+#define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */
+#define IORING_SETUP_SUBMIT_ALL (1U << 7) /* continue submit on error */
+/*
+ * Cooperative task running. When requests complete, they often require
+ * forcing the submitter to transition to the kernel to complete. If this
+ * flag is set, work will be done when the task transitions anyway, rather
+ * than force an inter-processor interrupt reschedule. This avoids interrupting
+ * a task running in userspace, and saves an IPI.
+ */
+#define IORING_SETUP_COOP_TASKRUN (1U << 8)
+/*
+ * If COOP_TASKRUN is set, get notified if task work is available for
+ * running and a kernel transition would be needed to run it. This sets
+ * IORING_SQ_TASKRUN in the sq ring flags. Not valid with COOP_TASKRUN.
+ */
+#define IORING_SETUP_TASKRUN_FLAG (1U << 9)
+#define IORING_SETUP_SQE128 (1U << 10) /* SQEs are 128 byte */
+#define IORING_SETUP_CQE32 (1U << 11) /* CQEs are 32 byte */
+/*
+ * Only one task is allowed to submit requests
+ */
+#define IORING_SETUP_SINGLE_ISSUER (1U << 12)
+
+/*
+ * Defer running task work to get events.
+ * Rather than running bits of task work whenever the task transitions
+ * try to do it just before it is needed.
+ */
+#define IORING_SETUP_DEFER_TASKRUN (1U << 13)
+
+/*
+ * Application provides the memory for the rings
+ */
+#define IORING_SETUP_NO_MMAP (1U << 14)
+
+/*
+ * Register the ring fd in itself for use with
+ * IORING_REGISTER_USE_REGISTERED_RING; return a registered fd index rather
+ * than an fd.
+ */
+#define IORING_SETUP_REGISTERED_FD_ONLY (1U << 15)
+
+/*
+ * Removes indirection through the SQ index array.
+ */
+#define IORING_SETUP_NO_SQARRAY (1U << 16)
+
+enum io_uring_op {
+ IORING_OP_NOP,
+ IORING_OP_READV,
+ IORING_OP_WRITEV,
+ IORING_OP_FSYNC,
+ IORING_OP_READ_FIXED,
+ IORING_OP_WRITE_FIXED,
+ IORING_OP_POLL_ADD,
+ IORING_OP_POLL_REMOVE,
+ IORING_OP_SYNC_FILE_RANGE,
+ IORING_OP_SENDMSG,
+ IORING_OP_RECVMSG,
+ IORING_OP_TIMEOUT,
+ IORING_OP_TIMEOUT_REMOVE,
+ IORING_OP_ACCEPT,
+ IORING_OP_ASYNC_CANCEL,
+ IORING_OP_LINK_TIMEOUT,
+ IORING_OP_CONNECT,
+ IORING_OP_FALLOCATE,
+ IORING_OP_OPENAT,
+ IORING_OP_CLOSE,
+ IORING_OP_FILES_UPDATE,
+ IORING_OP_STATX,
+ IORING_OP_READ,
+ IORING_OP_WRITE,
+ IORING_OP_FADVISE,
+ IORING_OP_MADVISE,
+ IORING_OP_SEND,
+ IORING_OP_RECV,
+ IORING_OP_OPENAT2,
+ IORING_OP_EPOLL_CTL,
+ IORING_OP_SPLICE,
+ IORING_OP_PROVIDE_BUFFERS,
+ IORING_OP_REMOVE_BUFFERS,
+ IORING_OP_TEE,
+ IORING_OP_SHUTDOWN,
+ IORING_OP_RENAMEAT,
+ IORING_OP_UNLINKAT,
+ IORING_OP_MKDIRAT,
+ IORING_OP_SYMLINKAT,
+ IORING_OP_LINKAT,
+ IORING_OP_MSG_RING,
+ IORING_OP_FSETXATTR,
+ IORING_OP_SETXATTR,
+ IORING_OP_FGETXATTR,
+ IORING_OP_GETXATTR,
+ IORING_OP_SOCKET,
+ IORING_OP_URING_CMD,
+ IORING_OP_SEND_ZC,
+ IORING_OP_SENDMSG_ZC,
+ IORING_OP_READ_MULTISHOT,
+ IORING_OP_WAITID,
+ IORING_OP_FUTEX_WAIT,
+ IORING_OP_FUTEX_WAKE,
+ IORING_OP_FUTEX_WAITV,
+
+ /* this goes last, obviously */
+ IORING_OP_LAST,
+};
+
+/*
+ * sqe->uring_cmd_flags top 8bits aren't available for userspace
+ * IORING_URING_CMD_FIXED use registered buffer; pass this flag
+ * along with setting sqe->buf_index.
+ */
+#define IORING_URING_CMD_FIXED (1U << 0)
+#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
+
+
+/*
+ * sqe->fsync_flags
+ */
+#define IORING_FSYNC_DATASYNC (1U << 0)
+
+/*
+ * sqe->timeout_flags
+ */
+#define IORING_TIMEOUT_ABS (1U << 0)
+#define IORING_TIMEOUT_UPDATE (1U << 1)
+#define IORING_TIMEOUT_BOOTTIME (1U << 2)
+#define IORING_TIMEOUT_REALTIME (1U << 3)
+#define IORING_LINK_TIMEOUT_UPDATE (1U << 4)
+#define IORING_TIMEOUT_ETIME_SUCCESS (1U << 5)
+#define IORING_TIMEOUT_MULTISHOT (1U << 6)
+#define IORING_TIMEOUT_CLOCK_MASK (IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
+#define IORING_TIMEOUT_UPDATE_MASK (IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)
+/*
+ * sqe->splice_flags
+ * extends splice(2) flags
+ */
+#define SPLICE_F_FD_IN_FIXED (1U << 31) /* the last bit of __u32 */
+
+/*
+ * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
+ * command flags for POLL_ADD are stored in sqe->len.
+ *
+ * IORING_POLL_ADD_MULTI Multishot poll. Sets IORING_CQE_F_MORE if
+ * the poll handler will continue to report
+ * CQEs on behalf of the same SQE.
+ *
+ * IORING_POLL_UPDATE Update existing poll request, matching
+ * sqe->addr as the old user_data field.
+ *
+ * IORING_POLL_LEVEL Level triggered poll.
+ */
+#define IORING_POLL_ADD_MULTI (1U << 0)
+#define IORING_POLL_UPDATE_EVENTS (1U << 1)
+#define IORING_POLL_UPDATE_USER_DATA (1U << 2)
+#define IORING_POLL_ADD_LEVEL (1U << 3)
+
+/*
+ * ASYNC_CANCEL flags.
+ *
+ * IORING_ASYNC_CANCEL_ALL Cancel all requests that match the given key
+ * IORING_ASYNC_CANCEL_FD Key off 'fd' for cancelation rather than the
+ * request 'user_data'
+ * IORING_ASYNC_CANCEL_ANY Match any request
+ * IORING_ASYNC_CANCEL_FD_FIXED 'fd' passed in is a fixed descriptor
+ * IORING_ASYNC_CANCEL_USERDATA Match on user_data, default for no other key
+ * IORING_ASYNC_CANCEL_OP Match request based on opcode
+ */
+#define IORING_ASYNC_CANCEL_ALL (1U << 0)
+#define IORING_ASYNC_CANCEL_FD (1U << 1)
+#define IORING_ASYNC_CANCEL_ANY (1U << 2)
+#define IORING_ASYNC_CANCEL_FD_FIXED (1U << 3)
+#define IORING_ASYNC_CANCEL_USERDATA (1U << 4)
+#define IORING_ASYNC_CANCEL_OP (1U << 5)
+
+/*
+ * send/sendmsg and recv/recvmsg flags (sqe->ioprio)
+ *
+ * IORING_RECVSEND_POLL_FIRST If set, instead of first attempting to send
+ * or receive and arm poll if that yields an
+ * -EAGAIN result, arm poll upfront and skip
+ * the initial transfer attempt.
+ *
+ * IORING_RECV_MULTISHOT Multishot recv. Sets IORING_CQE_F_MORE if
+ * the handler will continue to report
+ * CQEs on behalf of the same SQE.
+ *
+ * IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
+ * the buf_index field.
+ *
+ * IORING_SEND_ZC_REPORT_USAGE
+ * If set, SEND[MSG]_ZC should report
+ * the zerocopy usage in cqe.res
+ * for the IORING_CQE_F_NOTIF cqe.
+ * 0 is reported if zerocopy was actually possible.
+ * IORING_NOTIF_USAGE_ZC_COPIED if data was copied
+ * (at least partially).
+ */
+#define IORING_RECVSEND_POLL_FIRST (1U << 0)
+#define IORING_RECV_MULTISHOT (1U << 1)
+#define IORING_RECVSEND_FIXED_BUF (1U << 2)
+#define IORING_SEND_ZC_REPORT_USAGE (1U << 3)
+
+/*
+ * cqe.res for IORING_CQE_F_NOTIF if
+ * IORING_SEND_ZC_REPORT_USAGE was requested
+ *
+ * It should be treated as a flag, all other
+ * bits of cqe.res should be treated as reserved!
+ */
+#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
+
+/*
+ * accept flags stored in sqe->ioprio
+ */
+#define IORING_ACCEPT_MULTISHOT (1U << 0)
+
+/*
+ * IORING_OP_MSG_RING command types, stored in sqe->addr
+ */
+enum {
+ IORING_MSG_DATA, /* pass sqe->len as 'res' and off as user_data */
+ IORING_MSG_SEND_FD, /* send a registered fd to another ring */
+};
+
+/*
+ * IORING_OP_MSG_RING flags (sqe->msg_ring_flags)
+ *
+ * IORING_MSG_RING_CQE_SKIP Don't post a CQE to the target ring. Not
+ * applicable for IORING_MSG_DATA, obviously.
+ */
+#define IORING_MSG_RING_CQE_SKIP (1U << 0)
+/* Pass through the flags from sqe->file_index to cqe->flags */
+#define IORING_MSG_RING_FLAGS_PASS (1U << 1)
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+ __u64 user_data; /* sqe->data submission passed back */
+ __s32 res; /* result code for this event */
+ __u32 flags;
+
+ /*
+ * If the ring is initialized with IORING_SETUP_CQE32, then this field
+ * contains 16-bytes of padding, doubling the size of the CQE.
+ */
+ __u64 big_cqe[];
+};
+
+/*
+ * cqe->flags
+ *
+ * IORING_CQE_F_BUFFER If set, the upper 16 bits are the buffer ID
+ * IORING_CQE_F_MORE If set, parent SQE will generate more CQE entries
+ * IORING_CQE_F_SOCK_NONEMPTY If set, more data to read after socket recv
+ * IORING_CQE_F_NOTIF Set for notification CQEs. Can be used to distinct
+ * them from sends.
+ */
+#define IORING_CQE_F_BUFFER (1U << 0)
+#define IORING_CQE_F_MORE (1U << 1)
+#define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
+#define IORING_CQE_F_NOTIF (1U << 3)
+
+enum {
+ IORING_CQE_BUFFER_SHIFT = 16,
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING 0ULL
+#define IORING_OFF_CQ_RING 0x8000000ULL
+#define IORING_OFF_SQES 0x10000000ULL
+#define IORING_OFF_PBUF_RING 0x80000000ULL
+#define IORING_OFF_PBUF_SHIFT 16
+#define IORING_OFF_MMAP_MASK 0xf8000000ULL
+
+/*
+ * Filled with the offset for mmap(2)
+ */
+struct io_sqring_offsets {
+ __u32 head;
+ __u32 tail;
+ __u32 ring_mask;
+ __u32 ring_entries;
+ __u32 flags;
+ __u32 dropped;
+ __u32 array;
+ __u32 resv1;
+ __u64 user_addr;
+};
+
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
+#define IORING_SQ_CQ_OVERFLOW (1U << 1) /* CQ ring is overflown */
+#define IORING_SQ_TASKRUN (1U << 2) /* task should enter the kernel */
+
+struct io_cqring_offsets {
+ __u32 head;
+ __u32 tail;
+ __u32 ring_mask;
+ __u32 ring_entries;
+ __u32 overflow;
+ __u32 cqes;
+ __u32 flags;
+ __u32 resv1;
+ __u64 user_addr;
+};
+
+/*
+ * cq_ring->flags
+ */
+
+/* disable eventfd notifications */
+#define IORING_CQ_EVENTFD_DISABLED (1U << 0)
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS (1U << 0)
+#define IORING_ENTER_SQ_WAKEUP (1U << 1)
+#define IORING_ENTER_SQ_WAIT (1U << 2)
+#define IORING_ENTER_EXT_ARG (1U << 3)
+#define IORING_ENTER_REGISTERED_RING (1U << 4)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+ __u32 sq_entries;
+ __u32 cq_entries;
+ __u32 flags;
+ __u32 sq_thread_cpu;
+ __u32 sq_thread_idle;
+ __u32 features;
+ __u32 wq_fd;
+ __u32 resv[3];
+ struct io_sqring_offsets sq_off;
+ struct io_cqring_offsets cq_off;
+};
+
+/*
+ * io_uring_params->features flags
+ */
+#define IORING_FEAT_SINGLE_MMAP (1U << 0)
+#define IORING_FEAT_NODROP (1U << 1)
+#define IORING_FEAT_SUBMIT_STABLE (1U << 2)
+#define IORING_FEAT_RW_CUR_POS (1U << 3)
+#define IORING_FEAT_CUR_PERSONALITY (1U << 4)
+#define IORING_FEAT_FAST_POLL (1U << 5)
+#define IORING_FEAT_POLL_32BITS (1U << 6)
+#define IORING_FEAT_SQPOLL_NONFIXED (1U << 7)
+#define IORING_FEAT_EXT_ARG (1U << 8)
+#define IORING_FEAT_NATIVE_WORKERS (1U << 9)
+#define IORING_FEAT_RSRC_TAGS (1U << 10)
+#define IORING_FEAT_CQE_SKIP (1U << 11)
+#define IORING_FEAT_LINKED_FILE (1U << 12)
+#define IORING_FEAT_REG_REG_RING (1U << 13)
+
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+enum {
+ IORING_REGISTER_BUFFERS = 0,
+ IORING_UNREGISTER_BUFFERS = 1,
+ IORING_REGISTER_FILES = 2,
+ IORING_UNREGISTER_FILES = 3,
+ IORING_REGISTER_EVENTFD = 4,
+ IORING_UNREGISTER_EVENTFD = 5,
+ IORING_REGISTER_FILES_UPDATE = 6,
+ IORING_REGISTER_EVENTFD_ASYNC = 7,
+ IORING_REGISTER_PROBE = 8,
+ IORING_REGISTER_PERSONALITY = 9,
+ IORING_UNREGISTER_PERSONALITY = 10,
+ IORING_REGISTER_RESTRICTIONS = 11,
+ IORING_REGISTER_ENABLE_RINGS = 12,
+
+ /* extended with tagging */
+ IORING_REGISTER_FILES2 = 13,
+ IORING_REGISTER_FILES_UPDATE2 = 14,
+ IORING_REGISTER_BUFFERS2 = 15,
+ IORING_REGISTER_BUFFERS_UPDATE = 16,
+
+ /* set/clear io-wq thread affinities */
+ IORING_REGISTER_IOWQ_AFF = 17,
+ IORING_UNREGISTER_IOWQ_AFF = 18,
+
+ /* set/get max number of io-wq workers */
+ IORING_REGISTER_IOWQ_MAX_WORKERS = 19,
+
+ /* register/unregister io_uring fd with the ring */
+ IORING_REGISTER_RING_FDS = 20,
+ IORING_UNREGISTER_RING_FDS = 21,
+
+ /* register ring based provide buffer group */
+ IORING_REGISTER_PBUF_RING = 22,
+ IORING_UNREGISTER_PBUF_RING = 23,
+
+ /* sync cancelation API */
+ IORING_REGISTER_SYNC_CANCEL = 24,
+
+ /* register a range of fixed file slots for automatic slot allocation */
+ IORING_REGISTER_FILE_ALLOC_RANGE = 25,
+
+ /* this goes last */
+ IORING_REGISTER_LAST,
+
+ /* flag added to the opcode to use a registered ring fd */
+ IORING_REGISTER_USE_REGISTERED_RING = 1U << 31
+};
+
+/* io-wq worker categories */
+enum {
+ IO_WQ_BOUND,
+ IO_WQ_UNBOUND,
+};
+
+/* deprecated, see struct io_uring_rsrc_update */
+struct io_uring_files_update {
+ __u32 offset;
+ __u32 resv;
+ __aligned_u64 /* __s32 * */ fds;
+};
+
+/*
+ * Register a fully sparse file space, rather than pass in an array of all
+ * -1 file descriptors.
+ */
+#define IORING_RSRC_REGISTER_SPARSE (1U << 0)
+
+struct io_uring_rsrc_register {
+ __u32 nr;
+ __u32 flags;
+ __u64 resv2;
+ __aligned_u64 data;
+ __aligned_u64 tags;
+};
+
+struct io_uring_rsrc_update {
+ __u32 offset;
+ __u32 resv;
+ __aligned_u64 data;
+};
+
+struct io_uring_rsrc_update2 {
+ __u32 offset;
+ __u32 resv;
+ __aligned_u64 data;
+ __aligned_u64 tags;
+ __u32 nr;
+ __u32 resv2;
+};
+
+/* Skip updating fd indexes set to this value in the fd table */
+#define IORING_REGISTER_FILES_SKIP (-2)
+
+#define IO_URING_OP_SUPPORTED (1U << 0)
+
+struct io_uring_probe_op {
+ __u8 op;
+ __u8 resv;
+ __u16 flags; /* IO_URING_OP_* flags */
+ __u32 resv2;
+};
+
+struct io_uring_probe {
+ __u8 last_op; /* last opcode supported */
+ __u8 ops_len; /* length of ops[] array below */
+ __u16 resv;
+ __u32 resv2[3];
+ struct io_uring_probe_op ops[];
+};
+
+struct io_uring_restriction {
+ __u16 opcode;
+ union {
+ __u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
+ __u8 sqe_op; /* IORING_RESTRICTION_SQE_OP */
+ __u8 sqe_flags; /* IORING_RESTRICTION_SQE_FLAGS_* */
+ };
+ __u8 resv;
+ __u32 resv2[3];
+};
+
+struct io_uring_buf {
+ __u64 addr;
+ __u32 len;
+ __u16 bid;
+ __u16 resv;
+};
+
+struct io_uring_buf_ring {
+ union {
+ /*
+ * To avoid spilling into more pages than we need to, the
+ * ring tail is overlaid with the io_uring_buf->resv field.
+ */
+ struct {
+ __u64 resv1;
+ __u32 resv2;
+ __u16 resv3;
+ __u16 tail;
+ };
+ __DECLARE_FLEX_ARRAY(struct io_uring_buf, bufs);
+ };
+};
+
+/*
+ * Flags for IORING_REGISTER_PBUF_RING.
+ *
+ * IOU_PBUF_RING_MMAP: If set, kernel will allocate the memory for the ring.
+ * The application must not set a ring_addr in struct
+ * io_uring_buf_reg, instead it must subsequently call
+ * mmap(2) with the offset set as:
+ * IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
+ * to get a virtual mapping for the ring.
+ */
+enum {
+ IOU_PBUF_RING_MMAP = 1,
+};
+
+/* argument for IORING_(UN)REGISTER_PBUF_RING */
+struct io_uring_buf_reg {
+ __u64 ring_addr;
+ __u32 ring_entries;
+ __u16 bgid;
+ __u16 flags;
+ __u64 resv[3];
+};
+
+/*
+ * io_uring_restriction->opcode values
+ */
+enum {
+ /* Allow an io_uring_register(2) opcode */
+ IORING_RESTRICTION_REGISTER_OP = 0,
+
+ /* Allow an sqe opcode */
+ IORING_RESTRICTION_SQE_OP = 1,
+
+ /* Allow sqe flags */
+ IORING_RESTRICTION_SQE_FLAGS_ALLOWED = 2,
+
+ /* Require sqe flags (these flags must be set on each submission) */
+ IORING_RESTRICTION_SQE_FLAGS_REQUIRED = 3,
+
+ IORING_RESTRICTION_LAST
+};
+
+struct io_uring_getevents_arg {
+ __u64 sigmask;
+ __u32 sigmask_sz;
+ __u32 pad;
+ __u64 ts;
+};
+
+/*
+ * Argument for IORING_REGISTER_SYNC_CANCEL
+ */
+struct io_uring_sync_cancel_reg {
+ __u64 addr;
+ __s32 fd;
+ __u32 flags;
+ struct __kernel_timespec timeout;
+ __u8 opcode;
+ __u8 pad[7];
+ __u64 pad2[3];
+};
+
+/*
+ * Argument for IORING_REGISTER_FILE_ALLOC_RANGE
+ * The range is specified as [off, off + len)
+ */
+struct io_uring_file_index_range {
+ __u32 off;
+ __u32 len;
+ __u64 resv;
+};
+
+struct io_uring_recvmsg_out {
+ __u32 namelen;
+ __u32 controllen;
+ __u32 payloadlen;
+ __u32 flags;
+};
+
+/*
+ * Argument for IORING_OP_URING_CMD when file is a socket
+ */
+enum {
+ SOCKET_URING_OP_SIOCINQ = 0,
+ SOCKET_URING_OP_SIOCOUTQ,
+ SOCKET_URING_OP_GETSOCKOPT,
+ SOCKET_URING_OP_SETSOCKOPT,
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
--
2.34.1

2023-10-16 14:03:04

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 07/11] selftests/net: Extract uring helpers to be reusable

Instead of defining basic io_uring functions in the test case, move them
to a common directory, so, other tests can use them.

This simplify the test code and reuse the common liburing
infrastructure. This is basically a copy of what we have in
io_uring_zerocopy_tx with some minor improvements to make checkpatch
happy.

A follow-up test will use the same helpers in a BPF sockopt test.

Signed-off-by: Breno Leitao <[email protected]>
---
tools/include/io_uring/mini_liburing.h | 282 ++++++++++++++++++
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 268 +----------------
3 files changed, 285 insertions(+), 266 deletions(-)
create mode 100644 tools/include/io_uring/mini_liburing.h

diff --git a/tools/include/io_uring/mini_liburing.h b/tools/include/io_uring/mini_liburing.h
new file mode 100644
index 000000000000..9ccb16074eb5
--- /dev/null
+++ b/tools/include/io_uring/mini_liburing.h
@@ -0,0 +1,282 @@
+/* SPDX-License-Identifier: MIT */
+
+#include <linux/io_uring.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+struct io_sq_ring {
+ unsigned int *head;
+ unsigned int *tail;
+ unsigned int *ring_mask;
+ unsigned int *ring_entries;
+ unsigned int *flags;
+ unsigned int *array;
+};
+
+struct io_cq_ring {
+ unsigned int *head;
+ unsigned int *tail;
+ unsigned int *ring_mask;
+ unsigned int *ring_entries;
+ struct io_uring_cqe *cqes;
+};
+
+struct io_uring_sq {
+ unsigned int *khead;
+ unsigned int *ktail;
+ unsigned int *kring_mask;
+ unsigned int *kring_entries;
+ unsigned int *kflags;
+ unsigned int *kdropped;
+ unsigned int *array;
+ struct io_uring_sqe *sqes;
+
+ unsigned int sqe_head;
+ unsigned int sqe_tail;
+
+ size_t ring_sz;
+};
+
+struct io_uring_cq {
+ unsigned int *khead;
+ unsigned int *ktail;
+ unsigned int *kring_mask;
+ unsigned int *kring_entries;
+ unsigned int *koverflow;
+ struct io_uring_cqe *cqes;
+
+ size_t ring_sz;
+};
+
+struct io_uring {
+ struct io_uring_sq sq;
+ struct io_uring_cq cq;
+ int ring_fd;
+};
+
+#if defined(__x86_64) || defined(__i386__)
+#define read_barrier() __asm__ __volatile__("":::"memory")
+#define write_barrier() __asm__ __volatile__("":::"memory")
+#else
+#define read_barrier() __sync_synchronize()
+#define write_barrier() __sync_synchronize()
+#endif
+
+static inline int io_uring_mmap(int fd, struct io_uring_params *p,
+ struct io_uring_sq *sq, struct io_uring_cq *cq)
+{
+ size_t size;
+ void *ptr;
+ int ret;
+
+ sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned int);
+ ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
+ if (ptr == MAP_FAILED)
+ return -errno;
+ sq->khead = ptr + p->sq_off.head;
+ sq->ktail = ptr + p->sq_off.tail;
+ sq->kring_mask = ptr + p->sq_off.ring_mask;
+ sq->kring_entries = ptr + p->sq_off.ring_entries;
+ sq->kflags = ptr + p->sq_off.flags;
+ sq->kdropped = ptr + p->sq_off.dropped;
+ sq->array = ptr + p->sq_off.array;
+
+ size = p->sq_entries * sizeof(struct io_uring_sqe);
+ sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);
+ if (sq->sqes == MAP_FAILED) {
+ ret = -errno;
+err:
+ munmap(sq->khead, sq->ring_sz);
+ return ret;
+ }
+
+ cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
+ ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
+ if (ptr == MAP_FAILED) {
+ ret = -errno;
+ munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
+ goto err;
+ }
+ cq->khead = ptr + p->cq_off.head;
+ cq->ktail = ptr + p->cq_off.tail;
+ cq->kring_mask = ptr + p->cq_off.ring_mask;
+ cq->kring_entries = ptr + p->cq_off.ring_entries;
+ cq->koverflow = ptr + p->cq_off.overflow;
+ cq->cqes = ptr + p->cq_off.cqes;
+ return 0;
+}
+
+static inline int io_uring_setup(unsigned int entries,
+ struct io_uring_params *p)
+{
+ return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static inline int io_uring_enter(int fd, unsigned int to_submit,
+ unsigned int min_complete,
+ unsigned int flags, sigset_t *sig)
+{
+ return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
+ flags, sig, _NSIG / 8);
+}
+
+static inline int io_uring_queue_init(unsigned int entries,
+ struct io_uring *ring,
+ unsigned int flags)
+{
+ struct io_uring_params p;
+ int fd, ret;
+
+ memset(ring, 0, sizeof(*ring));
+ memset(&p, 0, sizeof(p));
+ p.flags = flags;
+
+ fd = io_uring_setup(entries, &p);
+ if (fd < 0)
+ return fd;
+ ret = io_uring_mmap(fd, &p, &ring->sq, &ring->cq);
+ if (!ret)
+ ring->ring_fd = fd;
+ else
+ close(fd);
+ return ret;
+}
+
+/* Get a sqe */
+static inline struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+
+ if (sq->sqe_tail + 1 - sq->sqe_head > *sq->kring_entries)
+ return NULL;
+ return &sq->sqes[sq->sqe_tail++ & *sq->kring_mask];
+}
+
+static inline int io_uring_wait_cqe(struct io_uring *ring,
+ struct io_uring_cqe **cqe_ptr)
+{
+ struct io_uring_cq *cq = &ring->cq;
+ const unsigned int mask = *cq->kring_mask;
+ unsigned int head = *cq->khead;
+ int ret;
+
+ *cqe_ptr = NULL;
+ do {
+ read_barrier();
+ if (head != *cq->ktail) {
+ *cqe_ptr = &cq->cqes[head & mask];
+ break;
+ }
+ ret = io_uring_enter(ring->ring_fd, 0, 1,
+ IORING_ENTER_GETEVENTS, NULL);
+ if (ret < 0)
+ return -errno;
+ } while (1);
+
+ return 0;
+}
+
+static inline int io_uring_submit(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+ const unsigned int mask = *sq->kring_mask;
+ unsigned int ktail, submitted, to_submit;
+ int ret;
+
+ read_barrier();
+ if (*sq->khead != *sq->ktail) {
+ submitted = *sq->kring_entries;
+ goto submit;
+ }
+ if (sq->sqe_head == sq->sqe_tail)
+ return 0;
+
+ ktail = *sq->ktail;
+ to_submit = sq->sqe_tail - sq->sqe_head;
+ for (submitted = 0; submitted < to_submit; submitted++) {
+ read_barrier();
+ sq->array[ktail++ & mask] = sq->sqe_head++ & mask;
+ }
+ if (!submitted)
+ return 0;
+
+ if (*sq->ktail != ktail) {
+ write_barrier();
+ *sq->ktail = ktail;
+ write_barrier();
+ }
+submit:
+ ret = io_uring_enter(ring->ring_fd, submitted, 0,
+ IORING_ENTER_GETEVENTS, NULL);
+ return ret < 0 ? -errno : ret;
+}
+
+static inline void io_uring_queue_exit(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+
+ munmap(sq->sqes, *sq->kring_entries * sizeof(struct io_uring_sqe));
+ munmap(sq->khead, sq->ring_sz);
+ close(ring->ring_fd);
+}
+
+/* Prepare and send the SQE */
+static inline void io_uring_prep_cmd(struct io_uring_sqe *sqe, int op,
+ int sockfd,
+ int level, int optname,
+ const void *optval,
+ int optlen)
+{
+ memset(sqe, 0, sizeof(*sqe));
+ sqe->opcode = (__u8)IORING_OP_URING_CMD;
+ sqe->fd = sockfd;
+ sqe->cmd_op = op;
+
+ sqe->level = level;
+ sqe->optname = optname;
+ sqe->optval = (unsigned long long)optval;
+ sqe->optlen = optlen;
+}
+
+static inline int io_uring_register_buffers(struct io_uring *ring,
+ const struct iovec *iovecs,
+ unsigned int nr_iovecs)
+{
+ int ret;
+
+ ret = syscall(__NR_io_uring_register, ring->ring_fd,
+ IORING_REGISTER_BUFFERS, iovecs, nr_iovecs);
+ return (ret < 0) ? -errno : ret;
+}
+
+static inline void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags)
+{
+ memset(sqe, 0, sizeof(*sqe));
+ sqe->opcode = (__u8)IORING_OP_SEND;
+ sqe->fd = sockfd;
+ sqe->addr = (unsigned long)buf;
+ sqe->len = len;
+ sqe->msg_flags = (__u32)flags;
+}
+
+static inline void io_uring_prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags,
+ unsigned int zc_flags)
+{
+ io_uring_prep_send(sqe, sockfd, buf, len, flags);
+ sqe->opcode = (__u8)IORING_OP_SEND_ZC;
+ sqe->ioprio = zc_flags;
+}
+
+static inline void io_uring_cqe_seen(struct io_uring *ring)
+{
+ *(&ring->cq)->khead += 1;
+ write_barrier();
+}
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 61939a695f95..4adfe2186f39 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -99,6 +99,7 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
$(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
$(OUTPUT)/tcp_inq: LDLIBS += -lpthread
$(OUTPUT)/bind_bhash: LDLIBS += -lpthread
+$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/

# Rules to generate bpf obj nat6to4.o
CLANG ?= clang
diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
index 154287740172..76e604e4810e 100644
--- a/tools/testing/selftests/net/io_uring_zerocopy_tx.c
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
@@ -36,6 +36,8 @@
#include <sys/un.h>
#include <sys/wait.h>

+#include <io_uring/mini_liburing.h>
+
#define NOTIF_TAG 0xfffffffULL
#define NONZC_TAG 0
#define ZC_TAG 1
@@ -60,272 +62,6 @@ static struct sockaddr_storage cfg_dst_addr;

static char payload[IP_MAXPACKET] __attribute__((aligned(4096)));

-struct io_sq_ring {
- unsigned *head;
- unsigned *tail;
- unsigned *ring_mask;
- unsigned *ring_entries;
- unsigned *flags;
- unsigned *array;
-};
-
-struct io_cq_ring {
- unsigned *head;
- unsigned *tail;
- unsigned *ring_mask;
- unsigned *ring_entries;
- struct io_uring_cqe *cqes;
-};
-
-struct io_uring_sq {
- unsigned *khead;
- unsigned *ktail;
- unsigned *kring_mask;
- unsigned *kring_entries;
- unsigned *kflags;
- unsigned *kdropped;
- unsigned *array;
- struct io_uring_sqe *sqes;
-
- unsigned sqe_head;
- unsigned sqe_tail;
-
- size_t ring_sz;
-};
-
-struct io_uring_cq {
- unsigned *khead;
- unsigned *ktail;
- unsigned *kring_mask;
- unsigned *kring_entries;
- unsigned *koverflow;
- struct io_uring_cqe *cqes;
-
- size_t ring_sz;
-};
-
-struct io_uring {
- struct io_uring_sq sq;
- struct io_uring_cq cq;
- int ring_fd;
-};
-
-#ifdef __alpha__
-# ifndef __NR_io_uring_setup
-# define __NR_io_uring_setup 535
-# endif
-# ifndef __NR_io_uring_enter
-# define __NR_io_uring_enter 536
-# endif
-# ifndef __NR_io_uring_register
-# define __NR_io_uring_register 537
-# endif
-#else /* !__alpha__ */
-# ifndef __NR_io_uring_setup
-# define __NR_io_uring_setup 425
-# endif
-# ifndef __NR_io_uring_enter
-# define __NR_io_uring_enter 426
-# endif
-# ifndef __NR_io_uring_register
-# define __NR_io_uring_register 427
-# endif
-#endif
-
-#if defined(__x86_64) || defined(__i386__)
-#define read_barrier() __asm__ __volatile__("":::"memory")
-#define write_barrier() __asm__ __volatile__("":::"memory")
-#else
-
-#define read_barrier() __sync_synchronize()
-#define write_barrier() __sync_synchronize()
-#endif
-
-static int io_uring_setup(unsigned int entries, struct io_uring_params *p)
-{
- return syscall(__NR_io_uring_setup, entries, p);
-}
-
-static int io_uring_enter(int fd, unsigned int to_submit,
- unsigned int min_complete,
- unsigned int flags, sigset_t *sig)
-{
- return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
- flags, sig, _NSIG / 8);
-}
-
-static int io_uring_register_buffers(struct io_uring *ring,
- const struct iovec *iovecs,
- unsigned nr_iovecs)
-{
- int ret;
-
- ret = syscall(__NR_io_uring_register, ring->ring_fd,
- IORING_REGISTER_BUFFERS, iovecs, nr_iovecs);
- return (ret < 0) ? -errno : ret;
-}
-
-static int io_uring_mmap(int fd, struct io_uring_params *p,
- struct io_uring_sq *sq, struct io_uring_cq *cq)
-{
- size_t size;
- void *ptr;
- int ret;
-
- sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
- ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
- if (ptr == MAP_FAILED)
- return -errno;
- sq->khead = ptr + p->sq_off.head;
- sq->ktail = ptr + p->sq_off.tail;
- sq->kring_mask = ptr + p->sq_off.ring_mask;
- sq->kring_entries = ptr + p->sq_off.ring_entries;
- sq->kflags = ptr + p->sq_off.flags;
- sq->kdropped = ptr + p->sq_off.dropped;
- sq->array = ptr + p->sq_off.array;
-
- size = p->sq_entries * sizeof(struct io_uring_sqe);
- sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);
- if (sq->sqes == MAP_FAILED) {
- ret = -errno;
-err:
- munmap(sq->khead, sq->ring_sz);
- return ret;
- }
-
- cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
- ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
- if (ptr == MAP_FAILED) {
- ret = -errno;
- munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
- goto err;
- }
- cq->khead = ptr + p->cq_off.head;
- cq->ktail = ptr + p->cq_off.tail;
- cq->kring_mask = ptr + p->cq_off.ring_mask;
- cq->kring_entries = ptr + p->cq_off.ring_entries;
- cq->koverflow = ptr + p->cq_off.overflow;
- cq->cqes = ptr + p->cq_off.cqes;
- return 0;
-}
-
-static int io_uring_queue_init(unsigned entries, struct io_uring *ring,
- unsigned flags)
-{
- struct io_uring_params p;
- int fd, ret;
-
- memset(ring, 0, sizeof(*ring));
- memset(&p, 0, sizeof(p));
- p.flags = flags;
-
- fd = io_uring_setup(entries, &p);
- if (fd < 0)
- return fd;
- ret = io_uring_mmap(fd, &p, &ring->sq, &ring->cq);
- if (!ret)
- ring->ring_fd = fd;
- else
- close(fd);
- return ret;
-}
-
-static int io_uring_submit(struct io_uring *ring)
-{
- struct io_uring_sq *sq = &ring->sq;
- const unsigned mask = *sq->kring_mask;
- unsigned ktail, submitted, to_submit;
- int ret;
-
- read_barrier();
- if (*sq->khead != *sq->ktail) {
- submitted = *sq->kring_entries;
- goto submit;
- }
- if (sq->sqe_head == sq->sqe_tail)
- return 0;
-
- ktail = *sq->ktail;
- to_submit = sq->sqe_tail - sq->sqe_head;
- for (submitted = 0; submitted < to_submit; submitted++) {
- read_barrier();
- sq->array[ktail++ & mask] = sq->sqe_head++ & mask;
- }
- if (!submitted)
- return 0;
-
- if (*sq->ktail != ktail) {
- write_barrier();
- *sq->ktail = ktail;
- write_barrier();
- }
-submit:
- ret = io_uring_enter(ring->ring_fd, submitted, 0,
- IORING_ENTER_GETEVENTS, NULL);
- return ret < 0 ? -errno : ret;
-}
-
-static inline void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
- const void *buf, size_t len, int flags)
-{
- memset(sqe, 0, sizeof(*sqe));
- sqe->opcode = (__u8) IORING_OP_SEND;
- sqe->fd = sockfd;
- sqe->addr = (unsigned long) buf;
- sqe->len = len;
- sqe->msg_flags = (__u32) flags;
-}
-
-static inline void io_uring_prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
- const void *buf, size_t len, int flags,
- unsigned zc_flags)
-{
- io_uring_prep_send(sqe, sockfd, buf, len, flags);
- sqe->opcode = (__u8) IORING_OP_SEND_ZC;
- sqe->ioprio = zc_flags;
-}
-
-static struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
-{
- struct io_uring_sq *sq = &ring->sq;
-
- if (sq->sqe_tail + 1 - sq->sqe_head > *sq->kring_entries)
- return NULL;
- return &sq->sqes[sq->sqe_tail++ & *sq->kring_mask];
-}
-
-static int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr)
-{
- struct io_uring_cq *cq = &ring->cq;
- const unsigned mask = *cq->kring_mask;
- unsigned head = *cq->khead;
- int ret;
-
- *cqe_ptr = NULL;
- do {
- read_barrier();
- if (head != *cq->ktail) {
- *cqe_ptr = &cq->cqes[head & mask];
- break;
- }
- ret = io_uring_enter(ring->ring_fd, 0, 1,
- IORING_ENTER_GETEVENTS, NULL);
- if (ret < 0)
- return -errno;
- } while (1);
-
- return 0;
-}
-
-static inline void io_uring_cqe_seen(struct io_uring *ring)
-{
- *(&ring->cq)->khead += 1;
- write_barrier();
-}
-
static unsigned long gettimeofday_ms(void)
{
struct timeval tv;
--
2.34.1

2023-10-16 14:03:07

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 05/11] io_uring/cmd: Pass compat mode in issue_flags

Create a new flag to track if the operation is running compat mode.
This basically check the context->compat and pass it to the issue_flags,
so, it could be queried later in the callbacks.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/io_uring.h | 1 +
io_uring/uring_cmd.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index b4391e0a9bc8..aefb73eeeebf 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -23,6 +23,7 @@ enum io_uring_cmd_flags {

/* set when uring wants to cancel a previously issued command */
IO_URING_F_CANCEL = (1 << 11),
+ IO_URING_F_COMPAT = (1 << 12),
};

/* only top 8 bits of sqe->uring_cmd_flags for kernel internal use */
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 00a5e5621a28..4bedd633c08c 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -175,6 +175,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
issue_flags |= IO_URING_F_SQE128;
if (ctx->flags & IORING_SETUP_CQE32)
issue_flags |= IO_URING_F_CQE32;
+ if (ctx->compat)
+ issue_flags |= IO_URING_F_COMPAT;
if (ctx->flags & IORING_SETUP_IOPOLL) {
if (!file->f_op->uring_cmd_iopoll)
return -EOPNOTSUPP;
--
2.34.1

2023-10-16 14:03:09

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 09/11] io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT

Add support for getsockopt command (SOCKET_URING_OP_GETSOCKOPT), where
level is SOL_SOCKET. This is leveraging the sockptr_t infrastructure,
where a sockptr_t is either userspace or kernel space, and handled as
such.

Differently from the getsockopt(2), the optlen field is not a userspace
pointers. In getsockopt(2), userspace provides optlen pointer, which is
overwritten by the kernel. In this implementation, userspace passes a
u32, and the new value is returned in cqe->res. I.e., optlen is not a
pointer.

Important to say that userspace needs to keep the pointer alive until
the CQE is completed.

Signed-off-by: Breno Leitao <[email protected]>
---
include/uapi/linux/io_uring.h | 7 +++++++
io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 92be89a871fc..9628d4f5daba 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -43,6 +43,10 @@ struct io_uring_sqe {
union {
__u64 addr; /* pointer to buffer or iovecs */
__u64 splice_off_in;
+ struct {
+ __u32 level;
+ __u32 optname;
+ };
};
__u32 len; /* buffer size or number of iovecs */
union {
@@ -81,6 +85,7 @@ struct io_uring_sqe {
union {
__s32 splice_fd_in;
__u32 file_index;
+ __u32 optlen;
struct {
__u16 addr_len;
__u16 __pad3[1];
@@ -91,6 +96,7 @@ struct io_uring_sqe {
__u64 addr3;
__u64 __pad2[1];
};
+ __u64 optval;
/*
* If the ring is initialized with IORING_SETUP_SQE128, then
* this field is used for 80 bytes of arbitrary command data
@@ -740,6 +746,7 @@ struct io_uring_recvmsg_out {
enum {
SOCKET_URING_OP_SIOCINQ = 0,
SOCKET_URING_OP_SIOCOUTQ,
+ SOCKET_URING_OP_GETSOCKOPT,
};

#ifdef __cplusplus
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 42694c07d8fd..8b045830b0d9 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -214,6 +214,32 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
}
EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);

+static inline int io_uring_cmd_getsockopt(struct socket *sock,
+ struct io_uring_cmd *cmd,
+ unsigned int issue_flags)
+{
+ bool compat = !!(issue_flags & IO_URING_F_COMPAT);
+ int optlen, optname, level, err;
+ void __user *optval;
+
+ level = READ_ONCE(cmd->sqe->level);
+ if (level != SOL_SOCKET)
+ return -EOPNOTSUPP;
+
+ optval = u64_to_user_ptr(READ_ONCE(cmd->sqe->optval));
+ optname = READ_ONCE(cmd->sqe->optname);
+ optlen = READ_ONCE(cmd->sqe->optlen);
+
+ err = do_sock_getsockopt(sock, compat, level, optname,
+ USER_SOCKPTR(optval),
+ KERNEL_SOCKPTR(&optlen));
+ if (err)
+ return err;
+
+ /* On success, return optlen */
+ return optlen;
+}
+
#if defined(CONFIG_NET)
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
@@ -236,6 +262,8 @@ int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
if (ret)
return ret;
return arg;
+ case SOCKET_URING_OP_GETSOCKOPT:
+ return io_uring_cmd_getsockopt(sock, cmd, issue_flags);
default:
return -EOPNOTSUPP;
}
--
2.34.1

2023-10-16 14:03:18

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 10/11] io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT

Add initial support for SOCKET_URING_OP_SETSOCKOPT. This new command is
similar to setsockopt. This implementation leverages the function
do_sock_setsockopt(), which is shared with the setsockopt() system call
path.

Important to say that userspace needs to keep the pointer's memory alive
until the operation is completed. I.e, the memory could not be
deallocated before the CQE is returned to userspace.

Signed-off-by: Breno Leitao <[email protected]>
---
include/uapi/linux/io_uring.h | 1 +
io_uring/uring_cmd.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9628d4f5daba..f1c16f817742 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -747,6 +747,7 @@ enum {
SOCKET_URING_OP_SIOCINQ = 0,
SOCKET_URING_OP_SIOCOUTQ,
SOCKET_URING_OP_GETSOCKOPT,
+ SOCKET_URING_OP_SETSOCKOPT,
};

#ifdef __cplusplus
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8b045830b0d9..acbc2924ecd2 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -240,6 +240,25 @@ static inline int io_uring_cmd_getsockopt(struct socket *sock,
return optlen;
}

+static inline int io_uring_cmd_setsockopt(struct socket *sock,
+ struct io_uring_cmd *cmd,
+ unsigned int issue_flags)
+{
+ bool compat = !!(issue_flags & IO_URING_F_COMPAT);
+ int optname, optlen, level;
+ void __user *optval;
+ sockptr_t optval_s;
+
+ optval = u64_to_user_ptr(READ_ONCE(cmd->sqe->optval));
+ optname = READ_ONCE(cmd->sqe->optname);
+ optlen = READ_ONCE(cmd->sqe->optlen);
+ level = READ_ONCE(cmd->sqe->level);
+ optval_s = USER_SOCKPTR(optval);
+
+ return do_sock_setsockopt(sock, compat, level, optname, optval_s,
+ optlen);
+}
+
#if defined(CONFIG_NET)
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
@@ -264,6 +283,8 @@ int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
return arg;
case SOCKET_URING_OP_GETSOCKOPT:
return io_uring_cmd_getsockopt(sock, cmd, issue_flags);
+ case SOCKET_URING_OP_SETSOCKOPT:
+ return io_uring_cmd_setsockopt(sock, cmd, issue_flags);
default:
return -EOPNOTSUPP;
}
--
2.34.1

2023-10-16 14:03:22

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 11/11] selftests/bpf/sockopt: Add io_uring support

Expand the sockopt test to use also check for io_uring {g,s}etsockopt
commands operations.

This patch starts by marking each test if they support io_uring support
or not.

Right now, io_uring cmd getsockopt() has a limitation of only
accepting level == SOL_SOCKET, otherwise it returns -EOPNOTSUPP. Since
there aren't any test exercising getsockopt(level == SOL_SOCKET), this
patch changes two tests to use level == SOL_SOCKET, they are
"getsockopt: support smaller ctx->optlen" and "getsockopt: read
ctx->optlen".
There is no limitation for the setsockopt() part.

Later, each test runs using regular {g,s}etsockopt systemcalls, and, if
liburing is supported, execute the same test (again), but calling
liburing {g,s}setsockopt commands.

This patch also changes the level of two tests to use SOL_SOCKET for the
following two tests. This is going to help to exercise the io_uring
subsystem:
* getsockopt: read ctx->optlen
* getsockopt: support smaller ctx->optlen

Signed-off-by: Breno Leitao <[email protected]>
---
.../selftests/bpf/prog_tests/sockopt.c | 113 +++++++++++++++++-
1 file changed, 107 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt.c b/tools/testing/selftests/bpf/prog_tests/sockopt.c
index 9e6a5e3ed4de..5a4491d4edfe 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockopt.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockopt.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <test_progs.h>
+#include <io_uring/mini_liburing.h>
#include "cgroup_helpers.h"

static char bpf_log_buf[4096];
@@ -38,6 +39,7 @@ static struct sockopt_test {
socklen_t get_optlen_ret;

enum sockopt_test_error error;
+ bool io_uring_support;
} tests[] = {

/* ==================== getsockopt ==================== */
@@ -251,7 +253,9 @@ static struct sockopt_test {
.attach_type = BPF_CGROUP_GETSOCKOPT,
.expected_attach_type = BPF_CGROUP_GETSOCKOPT,

+ .get_level = SOL_SOCKET,
.get_optlen = 64,
+ .io_uring_support = true,
},
{
.descr = "getsockopt: deny bigger ctx->optlen",
@@ -276,6 +280,7 @@ static struct sockopt_test {
.get_optlen = 64,

.error = EFAULT_GETSOCKOPT,
+ .io_uring_support = true,
},
{
.descr = "getsockopt: ignore >PAGE_SIZE optlen",
@@ -318,6 +323,7 @@ static struct sockopt_test {
.get_optval = {}, /* the changes are ignored */
.get_optlen = PAGE_SIZE + 1,
.error = EOPNOTSUPP_GETSOCKOPT,
+ .io_uring_support = true,
},
{
.descr = "getsockopt: support smaller ctx->optlen",
@@ -337,8 +343,10 @@ static struct sockopt_test {
.attach_type = BPF_CGROUP_GETSOCKOPT,
.expected_attach_type = BPF_CGROUP_GETSOCKOPT,

+ .get_level = SOL_SOCKET,
.get_optlen = 64,
.get_optlen_ret = 32,
+ .io_uring_support = true,
},
{
.descr = "getsockopt: deny writing to ctx->optval",
@@ -518,6 +526,7 @@ static struct sockopt_test {
.set_level = 123,

.set_optlen = 1,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: allow changing ctx->level",
@@ -572,6 +581,7 @@ static struct sockopt_test {
.set_optname = 123,

.set_optlen = 1,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: allow changing ctx->optname",
@@ -624,6 +634,7 @@ static struct sockopt_test {
.expected_attach_type = BPF_CGROUP_SETSOCKOPT,

.set_optlen = 64,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: ctx->optlen == -1 is ok",
@@ -640,6 +651,7 @@ static struct sockopt_test {
.expected_attach_type = BPF_CGROUP_SETSOCKOPT,

.set_optlen = 64,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: deny ctx->optlen < 0 (except -1)",
@@ -658,6 +670,7 @@ static struct sockopt_test {
.set_optlen = 4,

.error = EFAULT_SETSOCKOPT,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: deny ctx->optlen > input optlen",
@@ -675,6 +688,7 @@ static struct sockopt_test {
.set_optlen = 64,

.error = EFAULT_SETSOCKOPT,
+ .io_uring_support = true,
},
{
.descr = "setsockopt: ignore >PAGE_SIZE optlen",
@@ -940,7 +954,89 @@ static int load_prog(const struct bpf_insn *insns,
return fd;
}

-static int run_test(int cgroup_fd, struct sockopt_test *test)
+/* Core function that handles io_uring ring initialization,
+ * sending SQE with sockopt command and waiting for the CQE.
+ */
+static int uring_sockopt(int op, int fd, int level, int optname,
+ const void *optval, socklen_t optlen)
+{
+ struct io_uring_cqe *cqe;
+ struct io_uring_sqe *sqe;
+ struct io_uring ring;
+ int err;
+
+ err = io_uring_queue_init(1, &ring, 0);
+ if (!ASSERT_OK(err, "io_uring initialization"))
+ return err;
+
+ sqe = io_uring_get_sqe(&ring);
+ if (!ASSERT_NEQ(sqe, NULL, "Get an SQE")) {
+ err = -1;
+ goto fail;
+ }
+
+ io_uring_prep_cmd(sqe, op, fd, level, optname, optval, optlen);
+
+ err = io_uring_submit(&ring);
+ if (!ASSERT_EQ(err, 1, "Submit SQE"))
+ goto fail;
+
+ err = io_uring_wait_cqe(&ring, &cqe);
+ if (!ASSERT_OK(err, "Wait for CQE"))
+ goto fail;
+
+ err = cqe->res;
+
+fail:
+ io_uring_queue_exit(&ring);
+
+ return err;
+}
+
+static int uring_setsockopt(int fd, int level, int optname, const void *optval,
+ socklen_t optlen)
+{
+ return uring_sockopt(SOCKET_URING_OP_SETSOCKOPT, fd, level, optname,
+ optval, optlen);
+}
+
+static int uring_getsockopt(int fd, int level, int optname, void *optval,
+ socklen_t *optlen)
+{
+ int ret = uring_sockopt(SOCKET_URING_OP_GETSOCKOPT, fd, level, optname,
+ optval, *optlen);
+ if (ret < 0)
+ return ret;
+
+ /* Populate optlen back to be compatible with systemcall interface,
+ * and simplify the test.
+ */
+ *optlen = ret;
+
+ return 0;
+}
+
+/* Execute the setsocktopt operation */
+static int call_setsockopt(bool use_io_uring, int fd, int level, int optname,
+ const void *optval, socklen_t optlen)
+{
+ if (use_io_uring)
+ return uring_setsockopt(fd, level, optname, optval, optlen);
+
+ return setsockopt(fd, level, optname, optval, optlen);
+}
+
+/* Execute the getsocktopt operation */
+static int call_getsockopt(bool use_io_uring, int fd, int level, int optname,
+ void *optval, socklen_t *optlen)
+{
+ if (use_io_uring)
+ return uring_getsockopt(fd, level, optname, optval, optlen);
+
+ return getsockopt(fd, level, optname, optval, optlen);
+}
+
+static int run_test(int cgroup_fd, struct sockopt_test *test, bool use_io_uring)
{
int sock_fd, err, prog_fd;
void *optval = NULL;
@@ -980,8 +1076,9 @@ static int run_test(int cgroup_fd, struct sockopt_test *test)
test->set_optlen = num_pages * sysconf(_SC_PAGESIZE) + remainder;
}

- err = setsockopt(sock_fd, test->set_level, test->set_optname,
- test->set_optval, test->set_optlen);
+ err = call_setsockopt(use_io_uring, sock_fd, test->set_level,
+ test->set_optname, test->set_optval,
+ test->set_optlen);
if (err) {
if (errno == EPERM && test->error == EPERM_SETSOCKOPT)
goto close_sock_fd;
@@ -1008,8 +1105,8 @@ static int run_test(int cgroup_fd, struct sockopt_test *test)
socklen_t expected_get_optlen = test->get_optlen_ret ?:
test->get_optlen;

- err = getsockopt(sock_fd, test->get_level, test->get_optname,
- optval, &optlen);
+ err = call_getsockopt(use_io_uring, sock_fd, test->get_level,
+ test->get_optname, optval, &optlen);
if (err) {
if (errno == EOPNOTSUPP && test->error == EOPNOTSUPP_GETSOCKOPT)
goto free_optval;
@@ -1063,7 +1160,11 @@ void test_sockopt(void)
if (!test__start_subtest(tests[i].descr))
continue;

- ASSERT_OK(run_test(cgroup_fd, &tests[i]), tests[i].descr);
+ ASSERT_OK(run_test(cgroup_fd, &tests[i], false),
+ tests[i].descr);
+ if (tests[i].io_uring_support)
+ ASSERT_OK(run_test(cgroup_fd, &tests[i], true),
+ tests[i].descr);
}

close(cgroup_fd);
--
2.34.1

2023-10-16 14:03:52

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v7 08/11] io_uring/cmd: return -EOPNOTSUPP if net is disabled

Protect io_uring_cmd_sock() to be called if CONFIG_NET is not set. If
network is not enabled, but io_uring is, then we want to return
-EOPNOTSUPP for any possible socket operation.

This is helpful because io_uring_cmd_sock() can now call functions that
only exits if CONFIG_NET is enabled without having #ifdef CONFIG_NET
inside the function itself.

Signed-off-by: Breno Leitao <[email protected]>
---
io_uring/uring_cmd.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 4bedd633c08c..42694c07d8fd 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -214,6 +214,7 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
}
EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);

+#if defined(CONFIG_NET)
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
struct socket *sock = cmd->file->private_data;
@@ -240,3 +241,4 @@ int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
}
}
EXPORT_SYMBOL_GPL(io_uring_cmd_sock);
+#endif
--
2.34.1

2023-10-17 12:35:12

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH v7 06/11] tools headers: Grab copy of io_uring.h

Hello Gabriel,

On Mon, Oct 16, 2023 at 02:56:55PM -0400, Gabriel Krisman Bertazi wrote:
> Breno Leitao <[email protected]> writes:
>
> > This file will be used by mini_uring.h and allow tests to run without
> > the need of installing liburing to run the tests.
> >
> > This is needed to run io_uring tests in BPF, such as
> > (tools/testing/selftests/bpf/prog_tests/sockopt.c).
> >
> > Signed-off-by: Breno Leitao <[email protected]>
>
> Can't mini_uring rely on the kernel header like
> selftests/net/io_uring_zerocopy_tx.c does?

Before this patch, io_uring_zerocopy_tx was not relying on "make
headers" headers, as far as I know. I think it was not a problem because
there was no CI running the test, and whoever was running the test was
relying on local io_uring headers.

My patch is, in fact, adding the following flag, which relies on the
headers now on:

+$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/

> I ask because this will be the third copy of these
> definitions that we're gonna need to keep in sync (kernel, liburing and
> here). Given this is only used for selftests, we better avoid the
> duplication.

Right, I don't know why this was the suggested way, but, that is how
people are using it.

I can definitely get rid of the copy and do the same mechanism as
io_uring_zerocopy_tx. This is what I've tested, and it worked fine.

---

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4225f975fce3..9f79a392acc1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -383,6 +383,8 @@ BPF_CFLAGS = -g -Wall -Werror -D__TARGET_ARCH_$(SRCARCH) $(MENDIAN) \
CLANG_CFLAGS = $(CLANG_SYS_INCLUDES) \
-Wno-compare-distinct-pointer-types

+HEADER_CFLAGS = -I$(abspath $(OUTPUT)/../../../../usr/include)
+
$(OUTPUT)/test_l4lb_noinline.o: BPF_CFLAGS += -fno-inline
$(OUTPUT)/test_xdp_noinline.o: BPF_CFLAGS += -fno-inline

@@ -551,7 +553,7 @@ $(TRUNNER_TEST_OBJS): $(TRUNNER_OUTPUT)/%.test.o: \
$(TRUNNER_BPF_SKELS_LINKED) \
$$(BPFOBJ) | $(TRUNNER_OUTPUT)
$$(call msg,TEST-OBJ,$(TRUNNER_BINARY),$$@)
- $(Q)cd $$(@D) && $$(CC) -I. $$(CFLAGS) -c $(CURDIR)/$$< $$(LDLIBS) -o $$(@F)
+ $(Q)cd $$(@D) && $$(CC) -I. $$(CFLAGS) $$(HEADER_CFLAGS) -c $(CURDIR)/$$< $$(LDLIBS) -o $$(@F)

2023-10-19 14:59:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] io_uring: Initial support for {s,g}etsockopt commands

On 10/16/23 7:47 AM, Breno Leitao wrote:
> This patchset adds support for getsockopt (SOCKET_URING_OP_GETSOCKOPT)
> and setsockopt (SOCKET_URING_OP_SETSOCKOPT) in io_uring commands.
> SOCKET_URING_OP_SETSOCKOPT implements generic case, covering all levels
> and optnames. SOCKET_URING_OP_GETSOCKOPT is limited, for now, to
> SOL_SOCKET level, which seems to be the most common level parameter for
> get/setsockopt(2).
>
> In order to keep the implementation (and tests) simple, some refactors
> were done prior to the changes, as follows:

Looks like folks are mostly happy with this now, so the next question is
how to stage it?

--
Jens Axboe

2023-10-19 15:33:22

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] io_uring: Initial support for {s,g}etsockopt commands

On Thu, 19 Oct 2023 08:58:59 -0600 Jens Axboe wrote:
> On 10/16/23 7:47 AM, Breno Leitao wrote:
> > This patchset adds support for getsockopt (SOCKET_URING_OP_GETSOCKOPT)
> > and setsockopt (SOCKET_URING_OP_SETSOCKOPT) in io_uring commands.
> > SOCKET_URING_OP_SETSOCKOPT implements generic case, covering all levels
> > and optnames. SOCKET_URING_OP_GETSOCKOPT is limited, for now, to
> > SOL_SOCKET level, which seems to be the most common level parameter for
> > get/setsockopt(2).
> >
> > In order to keep the implementation (and tests) simple, some refactors
> > were done prior to the changes, as follows:
>
> Looks like folks are mostly happy with this now, so the next question is
> how to stage it?

Would be good to get acks from BPF folks but AFAICT first four patches
apply cleanly for us now. If they apply cleanly for you I reckon you
can take them directly with io-uring. It's -rc7 time, with a bit of
luck we'll get to the merge window without a conflict.

2023-10-19 15:42:28

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] io_uring: Initial support for {s,g}etsockopt commands


On Mon, 16 Oct 2023 06:47:38 -0700, Breno Leitao wrote:
> This patchset adds support for getsockopt (SOCKET_URING_OP_GETSOCKOPT)
> and setsockopt (SOCKET_URING_OP_SETSOCKOPT) in io_uring commands.
> SOCKET_URING_OP_SETSOCKOPT implements generic case, covering all levels
> and optnames. SOCKET_URING_OP_GETSOCKOPT is limited, for now, to
> SOL_SOCKET level, which seems to be the most common level parameter for
> get/setsockopt(2).
>
> [...]

Applied, thanks!

[01/11] bpf: Add sockptr support for getsockopt
commit: 7cb15cc7e081730df3392f136a8789f3d2c3fd66
[02/11] bpf: Add sockptr support for setsockopt
commit: c028f6e54aa180747e384796760eee3bd78e0891
[03/11] net/socket: Break down __sys_setsockopt
commit: e70464dcdcddb5128fe7956bf809683824c64de5
[04/11] net/socket: Break down __sys_getsockopt
commit: 25f82732c8352bd0bec33c5a9989fd46cac5789f
[05/11] io_uring/cmd: Pass compat mode in issue_flags
commit: 66c87d5639f2f80421b3a01f12dcb7718f996093
[06/11] tools headers: Grab copy of io_uring.h
commit: c36507ed1a2c2cb05c4a2aad9acb39ca5d7c12fe
[07/11] selftests/net: Extract uring helpers to be reusable
commit: 11336afdd4141bbbd144b118a8a559b1993dc5d2
[08/11] io_uring/cmd: return -EOPNOTSUPP if net is disabled
commit: d807234143872e460cdf851f1b2bbda2b427f95d
[09/11] io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
commit: c3199f61b896cdef3664dc12729a2beadf322783
[10/11] io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
commit: 43ad652250d24e9496f4cd6a0d670417807ac9a0
[11/11] selftests/bpf/sockopt: Add io_uring support
commit: d9710f1d12a99738ff168e252ab8e9ffdeb90ed5

Best regards,
--
Jens Axboe



2023-10-19 15:42:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] io_uring: Initial support for {s,g}etsockopt commands

On 10/19/23 9:33 AM, Jakub Kicinski wrote:
> On Thu, 19 Oct 2023 08:58:59 -0600 Jens Axboe wrote:
>> On 10/16/23 7:47 AM, Breno Leitao wrote:
>>> This patchset adds support for getsockopt (SOCKET_URING_OP_GETSOCKOPT)
>>> and setsockopt (SOCKET_URING_OP_SETSOCKOPT) in io_uring commands.
>>> SOCKET_URING_OP_SETSOCKOPT implements generic case, covering all levels
>>> and optnames. SOCKET_URING_OP_GETSOCKOPT is limited, for now, to
>>> SOL_SOCKET level, which seems to be the most common level parameter for
>>> get/setsockopt(2).
>>>
>>> In order to keep the implementation (and tests) simple, some refactors
>>> were done prior to the changes, as follows:
>>
>> Looks like folks are mostly happy with this now, so the next question is
>> how to stage it?
>
> Would be good to get acks from BPF folks but AFAICT first four patches

Agree, those are still missing. BPF folks, do patches 1-2 look OK to
you?

> apply cleanly for us now. If they apply cleanly for you I reckon you
> can take them directly with io-uring. It's -rc7 time, with a bit of
> luck we'll get to the merge window without a conflict.

I'll tentatively setup a branch for this just to see if we run into
anything on the merge front. Depending on how the BPF side goes, I can
rebase/collect reviews/whatever as we go.

--
Jens Axboe

2023-10-19 19:51:26

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [PATCH v7 11/11] selftests/bpf/sockopt: Add io_uring support

On 10/16/23 6:47 AM, Breno Leitao wrote:
> Expand the sockopt test to use also check for io_uring {g,s}etsockopt
> commands operations.
>
> This patch starts by marking each test if they support io_uring support
> or not.
>
> Right now, io_uring cmd getsockopt() has a limitation of only
> accepting level == SOL_SOCKET, otherwise it returns -EOPNOTSUPP. Since
> there aren't any test exercising getsockopt(level == SOL_SOCKET), this
> patch changes two tests to use level == SOL_SOCKET, they are
> "getsockopt: support smaller ctx->optlen" and "getsockopt: read
> ctx->optlen".
> There is no limitation for the setsockopt() part.
>
> Later, each test runs using regular {g,s}etsockopt systemcalls, and, if
> liburing is supported, execute the same test (again), but calling
> liburing {g,s}setsockopt commands.
>
> This patch also changes the level of two tests to use SOL_SOCKET for the
> following two tests. This is going to help to exercise the io_uring
> subsystem:
> * getsockopt: read ctx->optlen
> * getsockopt: support smaller ctx->optlen

Acked-by: Martin KaFai Lau <[email protected]>