2021-08-13 16:45:36

by Pavel Begunkov

Subject: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

Add an optional feature to open/accept directly into io_uring's fixed
file table, bypassing the normal file table. Same behaviour as the
snippet below, but in one operation:

sqe = prep_[open,accept](...);
cqe = submit_and_wait(sqe);
// error handling
io_uring_register_files_update(uring_idx, (fd = cqe->res));
// optionally
close((fd = cqe->res));

The idea is pretty old, and was brought up and implemented a year ago
by Josh Triplett, though it hasn't seen the light of day for various reasons.

Tested on basic cases; the tests will be sent out as liburing patches later.

A copy-paste from 2/4 describing the user API and some notes:

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour. If a non-zero value is specified, the request will
behave as described and place the file into fixed file slot
sqe->file_index - 1. A file table should already have been created, and
the slot should be valid and empty, otherwise the operation will fail.
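
As an illustration, a minimal userspace sketch of that flow (assuming a
kernel with this series applied and headers exposing the new
sqe->file_index field; the helper name is made up, and a fixed file
table must already be registered):

#include <liburing.h>
#include <fcntl.h>

static int openat_into_fixed_slot(struct io_uring *ring, const char *path,
				  unsigned slot)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;
	io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
	/* 0 keeps the old behaviour, so the slot index is offset by one */
	sqe->file_index = slot + 1;

	io_uring_submit(ring);
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;		/* 0 on success, negative errno otherwise */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}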

Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Note 2: it's a u16, while in theory the limit for fixed file tables
might get increased in the future. If that ever happens, we can work
around it later, e.g. by using ioprio to hold the upper 16 bits. The
layout for open is already tight enough.

since RFC:
- added attribution
- updated descriptions
- rebased

Pavel Begunkov (4):
net: add accept helper not installing fd
io_uring: openat directly into fixed fd table
io_uring: hand code io_accept() fd installing
io_uring: accept directly into fixed file table

fs/io_uring.c | 114 +++++++++++++++++++++++++++++-----
include/linux/socket.h | 3 +
include/uapi/linux/io_uring.h | 2 +
net/socket.c | 71 +++++++++++----------
4 files changed, 139 insertions(+), 51 deletions(-)

--
2.32.0


2021-08-13 16:45:40

by Pavel Begunkov

Subject: [PATCH v2 1/4] net: add accept helper not installing fd

Introduce and reuse a helper that acts similarly to __sys_accept4_file()
but returns a struct file instead of installing a file descriptor. It
will be used by io_uring.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/socket.h | 3 ++
net/socket.c | 71 ++++++++++++++++++++++--------------------
2 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 0d8e3dcb7f88..d3c1a42a2edd 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -421,6 +421,9 @@ extern int __sys_accept4_file(struct file *file, unsigned file_flags,
struct sockaddr __user *upeer_sockaddr,
int __user *upeer_addrlen, int flags,
unsigned long nofile);
+extern struct file *do_accept(struct file *file, unsigned file_flags,
+ struct sockaddr __user *upeer_sockaddr,
+ int __user *upeer_addrlen, int flags);
extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
int __user *upeer_addrlen, int flags);
extern int __sys_socket(int family, int type, int protocol);
diff --git a/net/socket.c b/net/socket.c
index 0b2dad3bdf7f..532fff5a3684 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1722,32 +1722,22 @@ SYSCALL_DEFINE2(listen, int, fd, int, backlog)
return __sys_listen(fd, backlog);
}

-int __sys_accept4_file(struct file *file, unsigned file_flags,
+struct file *do_accept(struct file *file, unsigned file_flags,
struct sockaddr __user *upeer_sockaddr,
- int __user *upeer_addrlen, int flags,
- unsigned long nofile)
+ int __user *upeer_addrlen, int flags)
{
struct socket *sock, *newsock;
struct file *newfile;
- int err, len, newfd;
+ int err, len;
struct sockaddr_storage address;

- if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
- return -EINVAL;
-
- if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
- flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
-
sock = sock_from_file(file);
- if (!sock) {
- err = -ENOTSOCK;
- goto out;
- }
+ if (!sock)
+ return ERR_PTR(-ENOTSOCK);

- err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
- goto out;
+ return ERR_PTR(-ENFILE);

newsock->type = sock->type;
newsock->ops = sock->ops;
@@ -1758,18 +1748,9 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
*/
__module_get(newsock->ops->owner);

- newfd = __get_unused_fd_flags(flags, nofile);
- if (unlikely(newfd < 0)) {
- err = newfd;
- sock_release(newsock);
- goto out;
- }
newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
- if (IS_ERR(newfile)) {
- err = PTR_ERR(newfile);
- put_unused_fd(newfd);
- goto out;
- }
+ if (IS_ERR(newfile))
+ return newfile;

err = security_socket_accept(sock, newsock);
if (err)
@@ -1794,16 +1775,38 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
}

/* File flags are not inherited via accept() unlike another OSes. */
-
- fd_install(newfd, newfile);
- err = newfd;
-out:
- return err;
+ return newfile;
out_fd:
fput(newfile);
- put_unused_fd(newfd);
- goto out;
+ return ERR_PTR(err);
+}
+
+int __sys_accept4_file(struct file *file, unsigned file_flags,
+ struct sockaddr __user *upeer_sockaddr,
+ int __user *upeer_addrlen, int flags,
+ unsigned long nofile)
+{
+ struct file *newfile;
+ int newfd;

+ if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+ return -EINVAL;
+
+ if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
+ flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
+
+ newfd = __get_unused_fd_flags(flags, nofile);
+ if (unlikely(newfd < 0))
+ return newfd;
+
+ newfile = do_accept(file, file_flags, upeer_sockaddr, upeer_addrlen,
+ flags);
+ if (IS_ERR(newfile)) {
+ put_unused_fd(newfd);
+ return PTR_ERR(newfile);
+ }
+ fd_install(newfd, newfile);
+ return newfd;
}

/*
--
2.32.0

2021-08-13 16:46:33

by Pavel Begunkov

Subject: [PATCH v2 3/4] io_uring: hand code io_accept() fd installing

Make io_accept() handle file descriptor allocation and installation.
A preparation patch for bypassing file tables.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b4f7de5147dc..f92adfbc9a6b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4743,6 +4743,11 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
accept->flags = READ_ONCE(sqe->accept_flags);
accept->nofile = rlimit(RLIMIT_NOFILE);
+
+ if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+ return -EINVAL;
+ if (SOCK_NONBLOCK != O_NONBLOCK && (accept->flags & SOCK_NONBLOCK))
+ accept->flags = (accept->flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
return 0;
}

@@ -4751,20 +4756,28 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
struct io_accept *accept = &req->accept;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
- int ret;
+ struct file *file;
+ int ret, fd;

if (req->file->f_flags & O_NONBLOCK)
req->flags |= REQ_F_NOWAIT;

- ret = __sys_accept4_file(req->file, file_flags, accept->addr,
- accept->addr_len, accept->flags,
- accept->nofile);
- if (ret == -EAGAIN && force_nonblock)
- return -EAGAIN;
- if (ret < 0) {
+ fd = __get_unused_fd_flags(accept->flags, accept->nofile);
+ if (unlikely(fd < 0))
+ return fd;
+
+ file = do_accept(req->file, file_flags, accept->addr, accept->addr_len,
+ accept->flags);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ if (ret == -EAGAIN && force_nonblock)
+ return -EAGAIN;
if (ret == -ERESTARTSYS)
ret = -EINTR;
req_set_fail(req);
+ } else {
+ fd_install(fd, file);
+ ret = fd;
}
__io_req_complete(req, issue_flags, ret, 0);
return 0;
--
2.32.0

2021-08-13 16:49:13

by Pavel Begunkov

Subject: [PATCH v2 4/4] io_uring: accept directly into fixed file table

As done with the open opcodes, allow accept to skip installing the fd
into the process's file table and put it directly into io_uring's fixed
file table. Same restrictions and design as for open.

Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f92adfbc9a6b..0e6189864d12 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4736,14 +4736,17 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
return -EINVAL;
- if (sqe->ioprio || sqe->len || sqe->buf_index)
+ if (sqe->ioprio || sqe->len)
return -EINVAL;

accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
accept->flags = READ_ONCE(sqe->accept_flags);
accept->nofile = rlimit(RLIMIT_NOFILE);
+ req->buf_index = READ_ONCE(sqe->file_index);

+ if (req->buf_index && (accept->flags & SOCK_CLOEXEC))
+ return -EINVAL;
if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
return -EINVAL;
if (SOCK_NONBLOCK != O_NONBLOCK && (accept->flags & SOCK_NONBLOCK))
@@ -4756,28 +4759,34 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
struct io_accept *accept = &req->accept;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
+ bool fixed = !!req->buf_index;
struct file *file;
int ret, fd;

if (req->file->f_flags & O_NONBLOCK)
req->flags |= REQ_F_NOWAIT;

- fd = __get_unused_fd_flags(accept->flags, accept->nofile);
- if (unlikely(fd < 0))
- return fd;
-
+ if (!fixed) {
+ fd = __get_unused_fd_flags(accept->flags, accept->nofile);
+ if (unlikely(fd < 0))
+ return fd;
+ }
file = do_accept(req->file, file_flags, accept->addr, accept->addr_len,
accept->flags);
if (IS_ERR(file)) {
+ if (!fixed)
+ put_unused_fd(fd);
ret = PTR_ERR(file);
if (ret == -EAGAIN && force_nonblock)
return -EAGAIN;
if (ret == -ERESTARTSYS)
ret = -EINTR;
req_set_fail(req);
- } else {
+ } else if (!fixed) {
fd_install(fd, file);
ret = fd;
+ } else {
+ ret = io_install_fixed_file(req, file, issue_flags);
}
__io_req_complete(req, issue_flags, ret, 0);
return 0;
--
2.32.0

2021-08-13 16:49:13

by Pavel Begunkov

Subject: [PATCH v2 2/4] io_uring: openat directly into fixed fd table

Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table.
This patch adds such a capability for IORING_OP_OPENAT and
IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour. If a non-zero value is specified, the request will
behave as described and place the file into fixed file slot
sqe->file_index - 1. A file table should already have been created, and
the slot should be valid and empty, otherwise the operation will fail.

Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Note 2: it's a u16, while in theory the limit for fixed file tables
might get increased in the future. If that ever happens, we can work
around it later, e.g. by using ioprio to hold the upper 16 bits. The
layout for open is already tight enough.

Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 76 ++++++++++++++++++++++++++++++-----
include/uapi/linux/io_uring.h | 2 +
2 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 51c4166f68b5..b4f7de5147dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1060,6 +1060,9 @@ static void io_req_task_queue(struct io_kiocb *req);
static void io_submit_flush_completions(struct io_ring_ctx *ctx);
static int io_req_prep_async(struct io_kiocb *req);

+static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
+ unsigned int issue_flags);
+
static struct kmem_cache *req_cachep;

static const struct file_operations io_uring_fops;
@@ -3803,11 +3806,10 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe

if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
return -EINVAL;
- if (unlikely(sqe->ioprio || sqe->buf_index))
+ if (unlikely(sqe->ioprio))
return -EINVAL;
if (unlikely(req->flags & REQ_F_FIXED_FILE))
return -EBADF;
-
/* open.how should be already initialised */
if (!(req->open.how.flags & O_PATH) && force_o_largefile())
req->open.how.flags |= O_LARGEFILE;
@@ -3820,6 +3822,10 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
req->open.filename = NULL;
return ret;
}
+ req->buf_index = READ_ONCE(sqe->file_index);
+ if (req->buf_index && (req->open.how.flags & O_CLOEXEC))
+ return -EINVAL;
+
req->open.nofile = rlimit(RLIMIT_NOFILE);
req->flags |= REQ_F_NEED_CLEANUP;
return 0;
@@ -3857,8 +3863,8 @@ static int io_openat2(struct io_kiocb *req, unsigned int issue_flags)
{
struct open_flags op;
struct file *file;
- bool nonblock_set;
- bool resolve_nonblock;
+ bool resolve_nonblock, nonblock_set;
+ bool fixed = !!req->buf_index;
int ret;

ret = build_open_flags(&req->open.how, &op);
@@ -3877,9 +3883,11 @@ static int io_openat2(struct io_kiocb *req, unsigned int issue_flags)
op.open_flag |= O_NONBLOCK;
}

- ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
- if (ret < 0)
- goto err;
+ if (!fixed) {
+ ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
+ if (ret < 0)
+ goto err;
+ }

file = do_filp_open(req->open.dfd, req->open.filename, &op);
if (IS_ERR(file)) {
@@ -3888,7 +3896,8 @@ static int io_openat2(struct io_kiocb *req, unsigned int issue_flags)
* marginal gain for something that is now known to be a slower
* path. So just put it, and we'll get a new one when we retry.
*/
- put_unused_fd(ret);
+ if (!fixed)
+ put_unused_fd(ret);

ret = PTR_ERR(file);
/* only retry if RESOLVE_CACHED wasn't already set by application */
@@ -3901,7 +3910,11 @@ static int io_openat2(struct io_kiocb *req, unsigned int issue_flags)
if ((issue_flags & IO_URING_F_NONBLOCK) && !nonblock_set)
file->f_flags &= ~O_NONBLOCK;
fsnotify_open(file);
- fd_install(ret, file);
+
+ if (!fixed)
+ fd_install(ret, file);
+ else
+ ret = io_install_fixed_file(req, file, issue_flags);
err:
putname(req->open.filename);
req->flags &= ~REQ_F_NEED_CLEANUP;
@@ -7835,6 +7848,50 @@ static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file,
#endif
}

+static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ int i = req->buf_index - 1;
+ bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+ struct io_fixed_file *file_slot;
+ int ret = -EBADF;
+
+ if (WARN_ON_ONCE(req->buf_index == 0))
+ goto err;
+
+ io_ring_submit_lock(ctx, !force_nonblock);
+ if (file->f_op == &io_uring_fops)
+ goto err;
+ ret = -ENXIO;
+ if (!ctx->file_data)
+ goto err;
+ ret = -EINVAL;
+ if (i > ctx->nr_user_files)
+ goto err;
+
+ i = array_index_nospec(i, ctx->nr_user_files);
+ file_slot = io_fixed_file_slot(&ctx->file_table, i);
+ ret = -EEXIST;
+ if (file_slot->file_ptr)
+ goto err;
+
+ *io_get_tag_slot(ctx->file_data, i) = 0;
+ io_fixed_file_set(file_slot, file);
+ ret = io_sqe_file_register(ctx, file, i);
+ if (ret) {
+ file_slot->file_ptr = 0;
+ goto err;
+ }
+
+ ret = 0;
+err:
+ io_ring_submit_unlock(ctx, !force_nonblock);
+ if (ret)
+ fput(file);
+ return ret;
+}
+
static int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
struct io_rsrc_node *node, void *rsrc)
{
@@ -10298,6 +10355,7 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(32, __u64, user_data);
BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
BUILD_BUG_SQE_ELEM(40, __u16, buf_group);
+ BUILD_BUG_SQE_ELEM(40, __u16, file_index);
BUILD_BUG_SQE_ELEM(42, __u16, personality);
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 79126d5cd289..f105deb4da1d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -52,6 +52,8 @@ struct io_uring_sqe {
__u16 buf_index;
/* for grouped buffer selection */
__u16 buf_group;
+ /* index into fixed files */
+ __u16 file_index;
} __attribute__((packed));
/* personality to use, if used */
__u16 personality;
--
2.32.0

2021-08-13 19:03:41

by Josh Triplett

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
> Add an optional feature to open/accept directly into io_uring's fixed
> file table bypassing the normal file table. Same behaviour if as the
> snippet below, but in one operation:
>
> sqe = prep_[open,accept](...);
> cqe = submit_and_wait(sqe);
> // error handling
> io_uring_register_files_update(uring_idx, (fd = cqe->res));
> // optionally
> close((fd = cqe->res));
>
> The idea in pretty old, and was brough up and implemented a year ago
> by Josh Triplett, though haven't sought the light for some reasons.

Thank you for working to get this over the finish line!

> Tested on basic cases, will be sent out as liburing patches later.
>
> A copy paste from 2/2 describing user API and some notes:
>
> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If non-zero value is specified, then it will behave
> as described and place the file into a fixed file slot
> sqe->file_index - 1. A file table should be already created, the slot
> should be valid and empty, otherwise the operation will fail.
>
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.
>
> Note 2: it's u16, where in theory the limit for fixed file tables might
> get increased in the future. If would ever happen so, we'll better
> workaround later, e.g. by making ioprio to represent upper bits 16 bits.
> The layout for open is tight already enough.

Rather than using sqe->file_index - 1, which feels like an error-prone
interface, I think it makes sense to use a dedicated flag for this, like
IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
including open, accept, and in the future many other operations such as
memfd_create. (Imagine using a single ring submission to open a memfd,
write a buffer into it, seal it, send it over a UNIX socket, and then
close it.)

The only downside is that you'll need to reject that flag in all
non-open operations. One way to unify that code might be to add a flag
in io_op_def for open-like operations, and then check in common code for
the case of non-open-like operations passing IOSQE_OPEN_FIXED.

Also, rather than using a 16-bit index for the fixed file table and
potentially requiring expansion into a different field in the future,
what about overlapping it with the nofile field in the open and accept
requests? If they're not opening a normal file descriptor, they don't
need nofile. And in the original sqe, you can then overlap it with a
32-bit field like splice_fd_in.

EEXIST seems like the wrong error-code to use if the index is already in
use; open can already return EEXIST if you pass O_EXCL. How about EBADF,
or better yet EBADSLT which is unlikely to be returned for any other
reason?

- Josh Triplett

2021-08-14 12:53:18

by Pavel Begunkov

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/13/21 8:00 PM, Josh Triplett wrote:
> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>> Add an optional feature to open/accept directly into io_uring's fixed
>> file table bypassing the normal file table. Same behaviour if as the
>> snippet below, but in one operation:
>>
>> sqe = prep_[open,accept](...);
>> cqe = submit_and_wait(sqe);
>> // error handling
>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>> // optionally
>> close((fd = cqe->res));
>>
>> The idea in pretty old, and was brough up and implemented a year ago
>> by Josh Triplett, though haven't sought the light for some reasons.
>
> Thank you for working to get this over the finish line!
>
>> Tested on basic cases, will be sent out as liburing patches later.
>>
>> A copy paste from 2/2 describing user API and some notes:
>>
>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>> the old behaviour. If non-zero value is specified, then it will behave
>> as described and place the file into a fixed file slot
>> sqe->file_index - 1. A file table should be already created, the slot
>> should be valid and empty, otherwise the operation will fail.
>>
>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>> accept takes a file, and it already uses the flag with a different
>> meaning.
>>
>> Note 2: it's u16, where in theory the limit for fixed file tables might
>> get increased in the future. If would ever happen so, we'll better
>> workaround later, e.g. by making ioprio to represent upper bits 16 bits.
>> The layout for open is tight already enough.
>
> Rather than using sqe->file_index - 1, which feels like an error-prone
> interface, I think it makes sense to use a dedicated flag for this, like
> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
> including open, accept, and in the future many other operations such as
> memfd_create. (Imagine using a single ring submission to open a memfd,
> write a buffer into it, seal it, send it over a UNIX socket, and then
> close it.)
>
> The only downside is that you'll need to reject that flag in all
> non-open operations. One way to unify that code might be to add a flag
> in io_op_def for open-like operations, and then check in common code for
> the case of non-open-like operations passing IOSQE_OPEN_FIXED.

io_uring is really thin, and so I absolutely don't want any extra
overhead in the generic path, IOW anything affecting
reads/writes/sends/recvs.

The other reason is that there are only 2 bits left in sqe->flags,
and we may use them for something better, considering that it's
only open/accept and not much beyond that.

I agree that it feels error-prone, but at least it can be wrapped
nicely enough in liburing, e.g.

void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
                                 const char *path, int flags,
                                 mode_t mode, int slot_idx);
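
A possible body for such a wrapper, purely as a sketch (the
slot_idx -> file_index + 1 conversion is an assumption about the
proposed ABI, not final liburing code):

static inline void io_uring_prep_openat_direct(struct io_uring_sqe *sqe,
					       int dfd, const char *path,
					       int flags, mode_t mode,
					       int slot_idx)
{
	io_uring_prep_openat(sqe, dfd, path, flags, mode);
	/* sqe->file_index is 1-based; 0 means "install a normal fd" */
	sqe->file_index = slot_idx + 1;
}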


> Also, rather than using a 16-bit index for the fixed file table and
> potentially requiring expansion into a different field in the future,
> what about overlapping it with the nofile field in the open and accept
> requests? If they're not opening a normal file descriptor, they don't
> need nofile. And in the original sqe, you can then overlap it with a
> 32-bit field like splice_fd_in.

There is no nofile in SQEs, though

req->open.nofile = rlimit(RLIMIT_NOFILE);

> EEXIST seems like the wrong error-code to use if the index is already in
> use; open can already return EEXIST if you pass O_EXCL. How about EBADF,
> or better yet EBADSLT which is unlikely to be returned for any other
> reason?

Sure, sounds better indeed!

--
Pavel Begunkov

2021-08-14 23:09:30

by Jens Axboe

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/14/21 6:50 AM, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
>> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>>> Add an optional feature to open/accept directly into io_uring's fixed
>>> file table bypassing the normal file table. Same behaviour if as the
>>> snippet below, but in one operation:
>>>
>>> sqe = prep_[open,accept](...);
>>> cqe = submit_and_wait(sqe);
>>> // error handling
>>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>>> // optionally
>>> close((fd = cqe->res));
>>>
>>> The idea in pretty old, and was brough up and implemented a year ago
>>> by Josh Triplett, though haven't sought the light for some reasons.
>>
>> Thank you for working to get this over the finish line!
>>
>>> Tested on basic cases, will be sent out as liburing patches later.
>>>
>>> A copy paste from 2/2 describing user API and some notes:
>>>
>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>> the old behaviour. If non-zero value is specified, then it will behave
>>> as described and place the file into a fixed file slot
>>> sqe->file_index - 1. A file table should be already created, the slot
>>> should be valid and empty, otherwise the operation will fail.
>>>
>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>> accept takes a file, and it already uses the flag with a different
>>> meaning.
>>>
>>> Note 2: it's u16, where in theory the limit for fixed file tables might
>>> get increased in the future. If would ever happen so, we'll better
>>> workaround later, e.g. by making ioprio to represent upper bits 16 bits.
>>> The layout for open is tight already enough.
>>
>> Rather than using sqe->file_index - 1, which feels like an error-prone
>> interface, I think it makes sense to use a dedicated flag for this, like
>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>> including open, accept, and in the future many other operations such as
>> memfd_create. (Imagine using a single ring submission to open a memfd,
>> write a buffer into it, seal it, send it over a UNIX socket, and then
>> close it.)
>>
>> The only downside is that you'll need to reject that flag in all
>> non-open operations. One way to unify that code might be to add a flag
>> in io_op_def for open-like operations, and then check in common code for
>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>
> io_uring is really thin, and so I absolutely don't want any extra
> overhead in the generic path, IOW anything affecting
> reads/writes/sends/recvs.
>
> The other reason is that there are only 2 bits left in sqe->flags,
> and we may use them for something better, considering that it's
> only open/accept and not much as this.
>
> I agree that it feels error-prone, but at least it can be wrapped
> nicely enough in liburing, e.g.
>
> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
> const char *path, int flags,
> mode_t mode, int slot_idx);
>
>
>> Also, rather than using a 16-bit index for the fixed file table and
>> potentially requiring expansion into a different field in the future,
>> what about overlapping it with the nofile field in the open and accept
>> requests? If they're not opening a normal file descriptor, they don't
>> need nofile. And in the original sqe, you can then overlap it with a
>> 32-bit field like splice_fd_in.
>
> There is no nofile in SQEs, though
>
> req->open.nofile = rlimit(RLIMIT_NOFILE);

What's the plan in terms of limiting the amount of direct descriptors
(for lack of a better word)? That seems like an important aspect that
should get sorted out upfront.

Do we include the regular file table max_fds count for creating a new
direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
but then that also implies that the regular file table should include
the ctx (potentially several) direct descriptors. And the latter is much
worse.

Maybe we have a way to size the direct table, which will consume entries
from the same pool that the regular file table does? That would then
work both ways, and could potentially just be done dynamically similarly
to how we expand the regular file table when we exceed its current size.

Anyway, just throwing a few ideas out there, with the intent to spark a
bit of discussion on this topic. I really like the direct descriptors,
it'll be a lot more efficient for certain use cases.

--
Jens Axboe

2021-08-15 03:34:20

by Josh Triplett

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
> On 8/13/21 8:00 PM, Josh Triplett wrote:
> > Rather than using sqe->file_index - 1, which feels like an error-prone
> > interface, I think it makes sense to use a dedicated flag for this, like
> > IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
> > including open, accept, and in the future many other operations such as
> > memfd_create. (Imagine using a single ring submission to open a memfd,
> > write a buffer into it, seal it, send it over a UNIX socket, and then
> > close it.)
> >
> > The only downside is that you'll need to reject that flag in all
> > non-open operations. One way to unify that code might be to add a flag
> > in io_op_def for open-like operations, and then check in common code for
> > the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>
> io_uring is really thin, and so I absolutely don't want any extra
> overhead in the generic path, IOW anything affecting
> reads/writes/sends/recvs.

There are already several checks for valid flags in io_init_req. For
instance:
	if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
	    !io_op_defs[req->opcode].buffer_select)
		return -EOPNOTSUPP;
It'd be trivial to make io_op_defs have a "valid flags" byte, and one
bitwise op tells you if any invalid flags were passed. *Zero* additional
overhead for other operations.
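
As a standalone illustration of that idea (made-up names, not kernel
code), the whole check boils down to one mask per opcode:

#include <assert.h>

enum { FLAG_FIXED_FILE = 1 << 0, FLAG_BUFFER_SELECT = 1 << 1,
       FLAG_OPEN_FIXED = 1 << 2 };
enum { OP_READ, OP_OPENAT, OP_LAST };

static const unsigned char valid_flags[OP_LAST] = {
	[OP_READ]   = FLAG_FIXED_FILE | FLAG_BUFFER_SELECT,
	[OP_OPENAT] = FLAG_OPEN_FIXED,
};

/* 0 if every flag is supported by the opcode, -1 otherwise */
static int check_sqe_flags(int opcode, unsigned flags)
{
	return (flags & ~(unsigned)valid_flags[opcode]) ? -1 : 0;
}

int main(void)
{
	assert(check_sqe_flags(OP_OPENAT, FLAG_OPEN_FIXED) == 0);
	assert(check_sqe_flags(OP_READ, FLAG_OPEN_FIXED) == -1);
	return 0;
}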

Alternatively, since there are so few operations that open a file
descriptor, you could just add a separate opcode for those few
operations. That still seems preferable to overloading a 16-bit index
field for this.

With this new mechanism, I think we're going to want to support more
than 65535 fixed-file entries. I can easily imagine wanting to handle
hundreds of thousands of files or sockets this way.

> The other reason is that there are only 2 bits left in sqe->flags,
> and we may use them for something better, considering that it's
> only open/accept and not much as this.

pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
memfd_secret.

Of those, I personally would *love* to have at least pipe, socket,
pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
dup-like operation for moving things between the fixed-file table and
file descriptors.

I think this is valuable and versatile enough to merit a flag. It would
also be entirely reasonable to create separate operations for these. But
either way, I don't think this should just be determined by whether a
16-bit index is non-zero.

> I agree that it feels error-prone, but at least it can be wrapped
> nicely enough in liburing, e.g.
>
> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
> const char *path, int flags,
> mode_t mode, int slot_idx);

That wrapper wouldn't be able to handle more than a 16-bit slot index
though.

> > Also, rather than using a 16-bit index for the fixed file table and
> > potentially requiring expansion into a different field in the future,
> > what about overlapping it with the nofile field in the open and accept
> > requests? If they're not opening a normal file descriptor, they don't
> > need nofile. And in the original sqe, you can then overlap it with a
> > 32-bit field like splice_fd_in.
>
> There is no nofile in SQEs, though
>
> req->open.nofile = rlimit(RLIMIT_NOFILE);

nofile isn't needed for opening into the fixed-file table, so it could
be omitted in that case, and another field unioned with it. That would
allow passing a 32-bit fixed-file index into open and accept without
growing the size of their structures. I think, with this new capability,
we're going to want a large number of fixed files available.

In the SQE, you could overlap it with the splice_fd_in field, which
isn't needed by any calls other than splice.

- Josh Triplett

2021-08-15 03:44:02

by Josh Triplett

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.
[...]
> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically similarly
> to how we expand the regular file table when we exceed its current size.

I think we'll want a way to size the direct table regardless, so that
it's pre-allocated and doesn't need to be resized when an index is used.
Then, we could do one of two equally easy things, depending on what
policy we want to set:

- Deduct the full size of the fixed-file table from the allowed number
of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
and you pre-allocate 1000000 entries in the fixed-file table, you can
have no more than 48576 file descriptors open. Stricter, but
potentially problematic: a program *might* expect that it can
dup2(some_fd, nofile - 1) successfully.

- Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
precedent for this: we already use RLIMIT_NOFILE as the maximum number
of file descriptors you can have in flight over UNIX sockets.

I personally would favor the latter; it seems simple and
straightforward.

2021-08-15 10:51:30

by Pavel Begunkov

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/15/21 4:31 AM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
>
> There are already several checks for valid flags in io_init_req. For
> instance:

Yes, it's horrible and I don't want to make it any worse.

> if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
> !io_op_defs[req->opcode].buffer_select)
> return -EOPNOTSUPP;
> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
> bitwise op tells you if any invalid flags were passed. *Zero* additional
> overhead for other operations.

Good point

> Alternatively, since there are so few operations that open a file
> descriptor, you could just add a separate opcode for those few
> operations. That still seems preferable to overloading a 16-bit index
> field for this.

I don't think so

> With this new mechanism, I think we're going to want to support more
> than 65535 fixed-file entries. I can easily imagine wanting to handle
> hundreds of thousands of files or sockets this way.

Maybe. What I'm curious about is that the feature doesn't really
change anything in this regard, yet it seems I haven't heard people
asking for larger tables.

>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may use them for something better, considering that it's
>> only open/accept and not much as this.
>
> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
> memfd_secret.

We could argue for many of those whether they should be in io_uring
at all, and whether there is much benefit in having them async. It would
be another story if the whole ecosystem were io_uring centric, but
that's speculation.

> Of those, I personally would *love* to have at least pipe, socket,
> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
> dup-like operation for moving things between the fixed-file table and
> file descriptors.
>
> I think this is valuable and versatile enough to merit a flag. It would
> also be entirely reasonable to create separate operations for these. But
> either way, I don't think this should just be determined by whether a
> 16-bit index is non-zero.
>
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> const char *path, int flags,
>> mode_t mode, int slot_idx);
>
> That wrapper wouldn't be able to handle more than a 16-bit slot index
> though.

It would. Note, the index is "int" there, so if it doesn't fit
into a u16, we can fail it. And do a conversion if required.

>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>
> nofile isn't needed for opening into the fixed-file table, so it could
> be omitted in that case, and another field unioned with it.

There is no problem placing it internally. Moreover, at the moment
it's uniformly placed inside io_kiocb, but with nofile we'd need
to find a place on a per-op basis.

Not that any of it matters, it's just bike shedding.

> allow passing a 32-bit fixed-file index into open and accept without
> growing the size of their structures. I think, with this new capability,
> we're going to want a large number of fixed files available.
>
> In the SQE, you could overlap it with the splice_fd_in field, which
> isn't needed by any calls other than splice.

But that doesn't mean it won't be used, as has happened with pretty much
every other field in the SQE. So it rather depends on what packing is
wanted. And reusing the almost never used ->buf_index (and potentially
->ioprio) sounds reasonable.

--
Pavel Begunkov

2021-08-15 13:03:34

by Pavel Begunkov

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/15/21 12:03 AM, Jens Axboe wrote:
> On 8/14/21 6:50 AM, Pavel Begunkov wrote:
>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>> On Fri, Aug 13, 2021 at 05:43:09PM +0100, Pavel Begunkov wrote:
>>>> Add an optional feature to open/accept directly into io_uring's fixed
>>>> file table bypassing the normal file table. Same behaviour if as the
>>>> snippet below, but in one operation:
>>>>
>>>> sqe = prep_[open,accept](...);
>>>> cqe = submit_and_wait(sqe);
>>>> // error handling
>>>> io_uring_register_files_update(uring_idx, (fd = cqe->res));
>>>> // optionally
>>>> close((fd = cqe->res));
>>>>
>>>> The idea in pretty old, and was brough up and implemented a year ago
>>>> by Josh Triplett, though haven't sought the light for some reasons.
>>>
>>> Thank you for working to get this over the finish line!
>>>
>>>> Tested on basic cases, will be sent out as liburing patches later.
>>>>
>>>> A copy paste from 2/2 describing user API and some notes:
>>>>
>>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>>> the old behaviour. If non-zero value is specified, then it will behave
>>>> as described and place the file into a fixed file slot
>>>> sqe->file_index - 1. A file table should be already created, the slot
>>>> should be valid and empty, otherwise the operation will fail.
>>>>
>>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>>> accept takes a file, and it already uses the flag with a different
>>>> meaning.
>>>>
>>>> Note 2: it's u16, where in theory the limit for fixed file tables might
>>>> get increased in the future. If would ever happen so, we'll better
>>>> workaround later, e.g. by making ioprio to represent upper bits 16 bits.
>>>> The layout for open is tight already enough.
>>>
>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>> interface, I think it makes sense to use a dedicated flag for this, like
>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>> including open, accept, and in the future many other operations such as
>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>> close it.)
>>>
>>> The only downside is that you'll need to reject that flag in all
>>> non-open operations. One way to unify that code might be to add a flag
>>> in io_op_def for open-like operations, and then check in common code for
>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>
>> io_uring is really thin, and so I absolutely don't want any extra
>> overhead in the generic path, IOW anything affecting
>> reads/writes/sends/recvs.
>>
>> The other reason is that there are only 2 bits left in sqe->flags,
>> and we may use them for something better, considering that it's
>> only open/accept and not much as this.
>>
>> I agree that it feels error-prone, but at least it can be wrapped
>> nicely enough in liburing, e.g.
>>
>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>> const char *path, int flags,
>> mode_t mode, int slot_idx);
>>
>>
>>> Also, rather than using a 16-bit index for the fixed file table and
>>> potentially requiring expansion into a different field in the future,
>>> what about overlapping it with the nofile field in the open and accept
>>> requests? If they're not opening a normal file descriptor, they don't
>>> need nofile. And in the original sqe, you can then overlap it with a
>>> 32-bit field like splice_fd_in.
>>
>> There is no nofile in SQEs, though
>>
>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>
> What's the plan in terms of limiting the amount of direct descriptors
> (for lack of a better word)? That seems like an important aspect that
> should get sorted out upfront.

As was brought up before, I agree that it has to be solved. However, I
don't think it should hold up this feature, as the same problem can
perfectly well be created without it:

fd = open();
io_uring_register(fd);
close(fd);
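
Spelled out with liburing's existing registration helper (a sketch only,
error handling omitted), the ring keeps its own reference, so closing
the fd doesn't release the per-ring slot:

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>

static void register_then_close(struct io_uring *ring, const char *path)
{
	int fd = open(path, O_RDONLY);

	io_uring_register_files(ring, &fd, 1);	/* ring now holds fixed file 0 */
	close(fd);				/* the fixed file stays usable */
}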

> Do we include the regular file table max_fds count for creating a new
> direct descriptor, and limit to RLIMIT_NOFILE? That would seem logical,
> but then that also implies that the regular file table should include
> the ctx (potentially several) direct descriptors. And the latter is much
> worse.

To which object are we binding the counting? To the task that created
the ring? I'd be afraid of the following case then:

fork(NO_FDTABLE_SHARE, callback -> {
ring = create_io_uring();
io_uring_register_fds(&ring);
pass_ring_to_parent(ring);
// e.g. via socket or so.
exit();
});

Restricting based on the user might be a better option, but that's
not without problems either.

Another option, which is too ugly to exist but has to be mentioned,
is to count the number of tasks and io_urings together. Maybe it can
spark some better idea.

Also, do we have anything related in cgroups/namespaces?

> Maybe we have a way to size the direct table, which will consume entries
> from the same pool that the regular file table does? That would then
> work both ways, and could potentially just be done dynamically similarly
> to how we expand the regular file table when we exceed its current size.
>
> Anyway, just throwing a few ideas out there, with the intent to spark a
> bit of discussion on this topic. I really like the direct descriptors,
> it'll be a lot more efficient for certain use cases.
>

--
Pavel Begunkov

2021-08-15 14:27:21

by Pavel Begunkov

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/15/21 11:48 AM, Pavel Begunkov wrote:
> On 8/15/21 4:31 AM, Josh Triplett wrote:
>> On Sat, Aug 14, 2021 at 01:50:24PM +0100, Pavel Begunkov wrote:
>>> On 8/13/21 8:00 PM, Josh Triplett wrote:
>>>> Rather than using sqe->file_index - 1, which feels like an error-prone
>>>> interface, I think it makes sense to use a dedicated flag for this, like
>>>> IOSQE_OPEN_FIXED. That flag could work for any open-like operation,
>>>> including open, accept, and in the future many other operations such as
>>>> memfd_create. (Imagine using a single ring submission to open a memfd,
>>>> write a buffer into it, seal it, send it over a UNIX socket, and then
>>>> close it.)
>>>>
>>>> The only downside is that you'll need to reject that flag in all
>>>> non-open operations. One way to unify that code might be to add a flag
>>>> in io_op_def for open-like operations, and then check in common code for
>>>> the case of non-open-like operations passing IOSQE_OPEN_FIXED.
>>>
>>> io_uring is really thin, and so I absolutely don't want any extra
>>> overhead in the generic path, IOW anything affecting
>>> reads/writes/sends/recvs.
>>
>> There are already several checks for valid flags in io_init_req. For
>> instance:
>
> Yes, it's horrible and I don't want to make it any worse.
>
>> if ((sqe_flags & IOSQE_BUFFER_SELECT) &&
>> !io_op_defs[req->opcode].buffer_select)
>> return -EOPNOTSUPP;
>> It'd be trivial to make io_op_defs have a "valid flags" byte, and one
>> bitwise op tells you if any invalid flags were passed. *Zero* additional
>> overhead for other operations.
>
> Good point
>
>> Alternatively, since there are so few operations that open a file
>> descriptor, you could just add a separate opcode for those few
>> operations. That still seems preferable to overloading a 16-bit index
>> field for this.
>
> I don't think so
>
>> With this new mechanism, I think we're going to want to support more
>> than 65535 fixed-file entries. I can easily imagine wanting to handle
>> hundreds of thousands of files or sockets this way.
>
> May be. What I'm curious about is that the feature doesn't really
> change anything in this regard, but seems I haven't heard people
> asking for larger tables.
>
>>> The other reason is that there are only 2 bits left in sqe->flags,
>>> and we may use them for something better, considering that it's
>>> only open/accept and not much as this.
>>
>> pipe, dup3, socket, socketpair, pidfds (via either pidfd_open or a
>> ring-based spawn mechanism), epoll_create, inotify, fanotify, signalfd,
>> timerfd, eventfd, memfd_create, userfaultfd, open_tree, fsopen, fsmount,
>> memfd_secret.
>
> We could argue for many of those whether they should be in io_uring,
> and whether there are many benefits having them async and so. It would
> have another story if all the ecosystem was io_uring centric, but
> that's speculations.
>
>> Of those, I personally would *love* to have at least pipe, socket,
>> pidfd, memfd_create, and fsopen/fsmount/open_tree, plus some manner of
>> dup-like operation for moving things between the fixed-file table and
>> file descriptors.
>>
>> I think this is valuable and versatile enough to merit a flag. It would
>> also be entirely reasonable to create separate operations for these. But
>> either way, I don't think this should just be determined by whether a
>> 16-bit index is non-zero.
>>
>>> I agree that it feels error-prone, but at least it can be wrapped
>>> nicely enough in liburing, e.g.
>>>
>>> void io_uring_prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
>>> const char *path, int flags,
>>> mode_t mode, int slot_idx);
>>
>> That wrapper wouldn't be able to handle more than a 16-bit slot index
>> though.
>
> It would. Note, the index is "int" there, so if doesn't fit
> into u16, we can fail it. And do conversion if required.
>
>>>> Also, rather than using a 16-bit index for the fixed file table and
>>>> potentially requiring expansion into a different field in the future,
>>>> what about overlapping it with the nofile field in the open and accept
>>>> requests? If they're not opening a normal file descriptor, they don't
>>>> need nofile. And in the original sqe, you can then overlap it with a
>>>> 32-bit field like splice_fd_in.
>>>
>>> There is no nofile in SQEs, though
>>>
>>> req->open.nofile = rlimit(RLIMIT_NOFILE);
>>
>> nofile isn't needed for opening into the fixed-file table, so it could
>> be omitted in that case, and another field unioned with it.
>
> There is no problem to place it internally. Moreover, it's at the
> moment uniformly placed inside io_kiocb, but with nofile we'd need
> to find the place on per-op basis.
>
> Not like any matters, it's just bike shedding.
>
>> allow passing a 32-bit fixed-file index into open and accept without
>> growing the size of their structures. I think, with this new capability,
>> we're going to want a large number of fixed files available.
>>
>> In the SQE, you could overlap it with the splice_fd_in field, which
>> isn't needed by any calls other than splice.
>
> But it doesn't mean it won't be used, as happened with pretty every
> other field in SQE. So, it rather depends on what packing is wanted.
> And reusing almost never used ->buf_index (and potentially ->ioprio),
> sounds reasonable.

Aliasing with ->splice_fd_in does look better indeed (apart from it
inherently not being checked, but meh?). But I still don't think
sqe->flags is a good option, so we still need some way to switch
between modes.

It could be sqe->rw_flags, as was once done with SPLICE_F_FD_IN_FIXED,
but IMHO that's an ugly, hackish way. I still lean towards the
0 vs >0 encoding.

--
Pavel Begunkov

2021-08-15 15:08:17

by Jens Axboe

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/14/21 9:42 PM, Josh Triplett wrote:
> On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
>> What's the plan in terms of limiting the amount of direct descriptors
>> (for lack of a better word)? That seems like an important aspect that
>> should get sorted out upfront.
> [...]
>> Maybe we have a way to size the direct table, which will consume entries
>> from the same pool that the regular file table does? That would then
>> work both ways, and could potentially just be done dynamically similarly
>> to how we expand the regular file table when we exceed its current size.
>
> I think we'll want a way to size the direct table regardless, so that
> it's pre-allocated and doesn't need to be resized when an index is used.

But how do you size it then? I can see this being used into the hundreds
of thousands of fds easily, and right now the table is just an array
(though split into segments, avoiding huge allocs).

> Then, we could do one of two equally easy things, depending on what
> policy we want to set:
>
> - Deduct the full size of the fixed-file table from the allowed number
> of files the process can have open. So, if RLIMIT_NOFILE is 1048576,
> and you pre-allocate 1000000 entries in the fixed-file table, you can
> have no more than 48576 file descriptors open. Stricter, but
> potentially problematic: a program *might* expect that it can
> dup2(some_fd, nofile - 1) successfully.
>
> - Use RLIMIT_NOFILE as the maximum size of the fixed-file table. There's
> precedent for this: we already use RLIMIT_NOFILE as the maximum number
> of file descriptors you can have in flight over UNIX sockets.
>
> I personally would favor the latter; it seems simple and
> straightforward.

I strongly prefer the latter too, and hopefully that's palatable since
the default limits are quite low anyway. And, as you say, it already is
done for inflight fds as well.

--
Jens Axboe

2021-08-15 15:14:24

by Jens Axboe

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/15/21 9:05 AM, Jens Axboe wrote:
> On 8/14/21 9:42 PM, Josh Triplett wrote:
>> On Sat, Aug 14, 2021 at 05:03:44PM -0600, Jens Axboe wrote:
>>> What's the plan in terms of limiting the amount of direct descriptors
>>> (for lack of a better word)? That seems like an important aspect that
>>> should get sorted out upfront.
>> [...]
>>> Maybe we have a way to size the direct table, which will consume entries
>>> from the same pool that the regular file table does? That would then
>>> work both ways, and could potentially just be done dynamically similarly
>>> to how we expand the regular file table when we exceed its current size.
>>
>> I think we'll want a way to size the direct table regardless, so that
>> it's pre-allocated and doesn't need to be resized when an index is used.
>
> But how do you size it then? I can see this being used into the hundreds
> of thousands of fds easily, and right now the table is just an array
> (though split into segments, avoiding huge allocs).

I guess that will just naturally follow from registering an empty set of
a given size initially. So it should not actually be a problem...

--
Jens Axboe

2021-08-16 15:46:42

by Stefan Metzmacher

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

Hi Pavel,

> The behaviour is controlled by setting sqe->file_index, where 0 implies
> the old behaviour. If non-zero value is specified, then it will behave
> as described and place the file into a fixed file slot
> sqe->file_index - 1. A file table should be already created, the slot
> should be valid and empty, otherwise the operation will fail.
>
> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
> accept takes a file, and it already uses the flag with a different
> meaning.

Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
(And mkdirat, linkat, symlinkat when they arrive)
renameat and linkat might be trickier as they take two dirfds, but it
would make the feature more complete and useful.

metze

2021-08-17 09:35:27

by Pavel Begunkov

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
> Hi Pavel,
>
>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>> the old behaviour. If non-zero value is specified, then it will behave
>> as described and place the file into a fixed file slot
>> sqe->file_index - 1. A file table should be already created, the slot
>> should be valid and empty, otherwise the operation will fail.
>>
>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>> accept takes a file, and it already uses the flag with a different
>> meaning.
>
> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
> (And mkdirat, linkat, symlinkat when they arrive)
> renameat and linkat might be trickier as they take two dirfds, but it
> would make the feature more complete and useful.

Good idea. There is nothing blocking it on the io_uring side, but
the fs part may get ugly, e.g. too intrusive. We definitely need
to take a look.

--
Pavel Begunkov

2021-08-17 15:00:27

by Jens Axboe

Subject: Re: [PATCH v2 0/4] open/accept directly into io_uring fixed file table

On 8/17/21 3:33 AM, Pavel Begunkov wrote:
> On 8/16/21 4:45 PM, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>> The behaviour is controlled by setting sqe->file_index, where 0 implies
>>> the old behaviour. If non-zero value is specified, then it will behave
>>> as described and place the file into a fixed file slot
>>> sqe->file_index - 1. A file table should be already created, the slot
>>> should be valid and empty, otherwise the operation will fail.
>>>
>>> Note 1: we can't use IOSQE_FIXED_FILE to switch between modes, because
>>> accept takes a file, and it already uses the flag with a different
>>> meaning.
>>
>> Would it be hard to support IOSQE_FIXED_FILE for the dirfd of openat*, renameat, unlinkat, statx?
>> (And mkdirat, linkat, symlinkat when they arrive)
>> renameat and linkat might be trickier as they take two dirfds, but it
>> would make the feature more complete and useful.
>
> Good idea. There is nothing blocking on the io_uring side, but
> the fs part may get ugly, e.g. too intrusive. We definitely need
> to take a look

Indeed, the io_uring side is trivial, but the VFS interface would
require a lot of manhandling... That's why I didn't add support for
fixed files originally.

--
Jens Axboe