These patches add support for IORING_OP_GETDENTS, which is a new io_uring
opcode that more or less does an lseek(sqe->fd, sqe->off, SEEK_SET)
followed by a getdents64(sqe->fd, (void *)sqe->addr, sqe->len).
A dumb test program for IORING_OP_GETDENTS is available here:
https://krautbox.wantstofly.org/~buytenh/uringfind-v2.c
This test program does something along the lines of what find(1) does:
it scans recursively through a directory tree and prints the names of
all directories and files it encounters along the way -- but using
io_uring. (The io_uring version prints the names of encountered files and
directories in an order that's determined by SQE completion order, which
is somewhat nondeterministic and likely to differ between runs.)
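For reference, submitting one of these requests amounts to filling in an SQE much like for a read. A minimal sketch of the field placement follows (the opcode value and the cut-down `sqe_sketch` struct are illustrative stand-ins, since this opcode is not in mainline headers; see io_getdents_prep() in patch 2 for the fields actually consumed):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode value -- this series adds the opcode after
 * IORING_OP_MKDIRAT; the real value depends on the tree it lands in. */
#define IORING_OP_GETDENTS_SKETCH 37

/* Cut-down stand-in for struct io_uring_sqe, reduced to the fields
 * this opcode reads in io_getdents_prep(). */
struct sqe_sketch {
	uint8_t  opcode;
	int32_t  fd;
	uint64_t off;	/* directory offset to start reading from */
	uint64_t addr;	/* userspace struct linux_dirent64 buffer */
	uint32_t len;	/* buffer size in bytes */
};

static void prep_getdents(struct sqe_sketch *sqe, int dirfd,
			  void *buf, unsigned int buf_len, uint64_t offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_GETDENTS_SKETCH;
	sqe->fd = dirfd;
	sqe->off = offset;	/* 0 on the first call, last d_off afterwards */
	sqe->addr = (uint64_t)(uintptr_t)buf;
	sqe->len = buf_len;
}
```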
On a directory tree with 14-odd million files in it that's on a
six-drive (spinning disk) btrfs raid, find(1) takes:
# echo 3 > /proc/sys/vm/drop_caches
# time find /mnt/repo > /dev/null
real 24m7.815s
user 0m15.015s
sys 0m48.340s
#
And the io_uring version takes:
# echo 3 > /proc/sys/vm/drop_caches
# time ./uringfind /mnt/repo > /dev/null
real 10m29.064s
user 0m4.347s
sys 0m1.677s
#
The fully cached case also shows some speedup. find(1):
# time find /mnt/repo > /dev/null
real 0m5.223s
user 0m1.926s
sys 0m3.268s
#
Versus the io_uring version:
# time ./uringfind /mnt/repo > /dev/null
real 0m3.604s
user 0m2.417s
sys 0m0.793s
#
That said, the point of this patch isn't primarily to enable
lightning-fast find(1) or du(1), but more to complete the set of
filesystem I/O primitives available via io_uring, so that applications
can do all of their filesystem I/O using the same mechanism, without
having to manually punt some of their work out to worker threads -- and
indeed, an object storage backend server that I wrote a while ago can
run with a pure io_uring based event loop with this patch.
Changes since v2 RFC:
- Rebase onto io_uring-2021-02-17 plus a manually applied version of
the mkdirat patch. The latter is needed because userland (liburing)
has already merged the opcode for IORING_OP_MKDIRAT (in commit
"io_uring.h: 5.12 pending kernel sync") while this opcode isn't in
the kernel yet (as of io_uring-2021-02-17), and this means that this
can't be merged until IORING_OP_MKDIRAT is merged.
- Adapt to changes made in "io_uring: replace force_nonblock with flags"
that are in io_uring-2021-02-17.
Changes since v1 RFC:
- Drop the trailing '64' from IORING_OP_GETDENTS64 (suggested by
Matthew Wilcox).
- Instead of requiring that sqe->off be zero, use this field to pass
in a directory offset to start reading from. For the first
IORING_OP_GETDENTS call on a directory, this can be set to zero,
and for subsequent calls, it can be set to the ->d_off field of
the last struct linux_dirent64 returned by the previous call.
Lennert Buytenhek (2):
readdir: split the core of getdents64(2) out into vfs_getdents()
io_uring: add support for IORING_OP_GETDENTS
fs/io_uring.c | 73 ++++++++++++++++++++++++++++++++++++++++++
fs/readdir.c | 25 +++++++++-----
include/linux/fs.h | 4 ++
include/uapi/linux/io_uring.h | 1 +
4 files changed, 95 insertions(+), 8 deletions(-)
IORING_OP_GETDENTS behaves much like getdents64(2) and takes the same
arguments, but with a small twist: it takes an additional offset
argument, and reading from the specified directory starts at the given
offset.
For the first IORING_OP_GETDENTS call on a directory, the offset
parameter can be set to zero, and for subsequent calls, it can be
set to the ->d_off field of the last struct linux_dirent64 returned
by the previous IORING_OP_GETDENTS call.
Internally, if necessary, IORING_OP_GETDENTS will vfs_llseek() to
the right directory position before calling vfs_getdents().
IORING_OP_GETDENTS may or may not update the specified directory's
file offset, and the file offset should not be relied upon to have
any particular value during or after an IORING_OP_GETDENTS call.
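The intended calling pattern maps onto the plain syscalls like this (a synchronous sketch: lseek() stands in for the sqe->off handling, and struct linux_dirent64 is written out by hand since userspace headers don't export it under that name):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
	uint64_t	d_ino;
	int64_t		d_off;		/* offset to pass in to the next call */
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

/* List a directory the way an IORING_OP_GETDENTS loop would: start at
 * offset 0, and resume each call at the ->d_off of the last entry
 * returned by the previous call.  Returns the number of entries seen. */
static int list_dir(const char *path)
{
	char buf[4096];
	int64_t off = 0;
	int fd, n, count = 0;

	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	for (;;) {
		if (lseek(fd, off, SEEK_SET) < 0)	/* the sqe->off part */
			break;

		n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
		if (n <= 0)
			break;

		for (int pos = 0; pos < n; ) {
			struct linux_dirent64 *d = (void *)(buf + pos);

			printf("%s\n", d->d_name);
			off = d->d_off;		/* next call resumes here */
			pos += d->d_reclen;
			count++;
		}
	}

	close(fd);
	return count;
}
```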
Signed-off-by: Lennert Buytenhek <[email protected]>
---
fs/io_uring.c | 73 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 1 +
2 files changed, 74 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 056bd4c90ade..6853bf48369a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -635,6 +635,13 @@ struct io_mkdir {
struct filename *filename;
};
+struct io_getdents {
+ struct file *file;
+ struct linux_dirent64 __user *dirent;
+ unsigned int count;
+ loff_t pos;
+};
+
struct io_completion {
struct file *file;
struct list_head list;
@@ -772,6 +779,7 @@ struct io_kiocb {
struct io_rename rename;
struct io_unlink unlink;
struct io_mkdir mkdir;
+ struct io_getdents getdents;
/* use only after cleaning per-op data, see io_clean_op() */
struct io_completion compl;
};
@@ -1030,6 +1038,11 @@ static const struct io_op_def io_op_defs[] = {
.work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
},
+ [IORING_OP_GETDENTS] = {
+ .needs_file = 1,
+ .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
+ IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
+ },
};
static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
@@ -4677,6 +4690,61 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+static int io_getdents_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+{
+ struct io_getdents *getdents = &req->getdents;
+
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+ return -EINVAL;
+ if (sqe->ioprio || sqe->rw_flags || sqe->buf_index)
+ return -EINVAL;
+
+ getdents->pos = READ_ONCE(sqe->off);
+ getdents->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+ getdents->count = READ_ONCE(sqe->len);
+ return 0;
+}
+
+static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_getdents *getdents = &req->getdents;
+ bool pos_unlock = false;
+ int ret = 0;
+
+ /* getdents always requires a blocking context */
+ if (issue_flags & IO_URING_F_NONBLOCK)
+ return -EAGAIN;
+
+ /* for vfs_llseek and to serialize ->iterate_shared() on this file */
+ if (file_count(req->file) > 1) {
+ pos_unlock = true;
+ mutex_lock(&req->file->f_pos_lock);
+ }
+
+ if (req->file->f_pos != getdents->pos) {
+ loff_t res = vfs_llseek(req->file, getdents->pos, SEEK_SET);
+ if (res < 0)
+ ret = res;
+ }
+
+ if (ret == 0) {
+ ret = vfs_getdents(req->file, getdents->dirent,
+ getdents->count);
+ }
+
+ if (pos_unlock)
+ mutex_unlock(&req->file->f_pos_lock);
+
+ if (ret < 0) {
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+ req_set_fail_links(req);
+ }
+ io_req_complete(req, ret);
+ return 0;
+}
+
#if defined(CONFIG_NET)
static int io_setup_async_msg(struct io_kiocb *req,
struct io_async_msghdr *kmsg)
@@ -6184,6 +6252,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_unlinkat_prep(req, sqe);
case IORING_OP_MKDIRAT:
return io_mkdirat_prep(req, sqe);
+ case IORING_OP_GETDENTS:
+ return io_getdents_prep(req, sqe);
}
printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6428,6 +6498,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_MKDIRAT:
ret = io_mkdirat(req, issue_flags);
break;
+ case IORING_OP_GETDENTS:
+ ret = io_getdents(req, issue_flags);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 890edd850a9e..fe097b1fa332 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -138,6 +138,7 @@ enum {
IORING_OP_RENAMEAT,
IORING_OP_UNLINKAT,
IORING_OP_MKDIRAT,
+ IORING_OP_GETDENTS,
/* this goes last, obviously */
IORING_OP_LAST,
--
2.29.2
So that IORING_OP_GETDENTS may use it, split out the core of the
getdents64(2) syscall into a helper function, vfs_getdents().
vfs_getdents() calls into filesystems' ->iterate{,_shared}() which
expect serialization on struct file, which means that callers of
vfs_getdents() are responsible for either using fdget_pos() or
performing the equivalent serialization by hand.
Cc: Al Viro <[email protected]>
Signed-off-by: Lennert Buytenhek <[email protected]>
---
fs/readdir.c | 25 +++++++++++++++++--------
include/linux/fs.h | 4 ++++
2 files changed, 21 insertions(+), 8 deletions(-)
diff --git a/fs/readdir.c b/fs/readdir.c
index 19434b3c982c..f52167c1eb61 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -348,10 +348,9 @@ static int filldir64(struct dir_context *ctx, const char *name, int namlen,
return -EFAULT;
}
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
- struct linux_dirent64 __user *, dirent, unsigned int, count)
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+ unsigned int count)
{
- struct fd f;
struct getdents_callback64 buf = {
.ctx.actor = filldir64,
.count = count,
@@ -359,11 +358,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
};
int error;
- f = fdget_pos(fd);
- if (!f.file)
- return -EBADF;
-
- error = iterate_dir(f.file, &buf.ctx);
+ error = iterate_dir(file, &buf.ctx);
if (error >= 0)
error = buf.error;
if (buf.prev_reclen) {
@@ -376,6 +371,20 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
else
error = count - buf.count;
}
+ return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+ struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+ struct fd f;
+ int error;
+
+ f = fdget_pos(fd);
+ if (!f.file)
+ return -EBADF;
+
+ error = vfs_getdents(f.file, dirent, count);
fdput_pos(f);
return error;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..114885d3f6c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3109,6 +3109,10 @@ extern const struct inode_operations simple_symlink_inode_operations;
extern int iterate_dir(struct file *, struct dir_context *);
+struct linux_dirent64;
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+ unsigned int count);
+
int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
int flags);
int vfs_fstat(int fd, struct kstat *stat);
--
2.29.2
On 18/02/2021 12:27, Lennert Buytenhek wrote:
> IORING_OP_GETDENTS behaves much like getdents64(2) and takes the same
> arguments, but with a small twist: it takes an additional offset
> argument, and reading from the specified directory starts at the given
> offset.
>
> For the first IORING_OP_GETDENTS call on a directory, the offset
> parameter can be set to zero, and for subsequent calls, it can be
> set to the ->d_off field of the last struct linux_dirent64 returned
> by the previous IORING_OP_GETDENTS call.
>
> Internally, if necessary, IORING_OP_GETDENTS will vfs_llseek() to
> the right directory position before calling vfs_getdents().
>
> IORING_OP_GETDENTS may or may not update the specified directory's
> file offset, and the file offset should not be relied upon having
> any particular value during or after an IORING_OP_GETDENTS call.
>
> Signed-off-by: Lennert Buytenhek <[email protected]>
> ---
> fs/io_uring.c | 73 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/io_uring.h | 1 +
> 2 files changed, 74 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 056bd4c90ade..6853bf48369a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -635,6 +635,13 @@ struct io_mkdir {
> struct filename *filename;
> };
>
[...]
> +static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_getdents *getdents = &req->getdents;
> + bool pos_unlock = false;
> + int ret = 0;
> +
> + /* getdents always requires a blocking context */
> + if (issue_flags & IO_URING_F_NONBLOCK)
> + return -EAGAIN;
> +
> + /* for vfs_llseek and to serialize ->iterate_shared() on this file */
> + if (file_count(req->file) > 1) {
Looks racy, is it safe? E.g. can be concurrently dupped and used, or just
several similar IORING_OP_GETDENTS requests.
> + pos_unlock = true;
> + mutex_lock(&req->file->f_pos_lock);
> + }
> +
> + if (req->file->f_pos != getdents->pos) {
> + loff_t res = vfs_llseek(req->file, getdents->pos, SEEK_SET);
I may be missing the previous discussions, but can this ever become
stateless, like passing an offset? Including readdir.c and beyond.
> + if (res < 0)
> + ret = res;
> + }
> +
> + if (ret == 0) {
> + ret = vfs_getdents(req->file, getdents->dirent,
> + getdents->count);
> + }
> +
> + if (pos_unlock)
> + mutex_unlock(&req->file->f_pos_lock);
> +
> + if (ret < 0) {
> + if (ret == -ERESTARTSYS)
> + ret = -EINTR;
> + req_set_fail_links(req);
> + }
> + io_req_complete(req, ret);
> + return 0;
> +}
[...]
--
Pavel Begunkov
On 19/02/2021 12:05, Pavel Begunkov wrote:
> On 18/02/2021 12:27, Lennert Buytenhek wrote:
>> IORING_OP_GETDENTS behaves much like getdents64(2) and takes the same
>> arguments, but with a small twist: it takes an additional offset
>> argument, and reading from the specified directory starts at the given
>> offset.
>>
>> For the first IORING_OP_GETDENTS call on a directory, the offset
>> parameter can be set to zero, and for subsequent calls, it can be
>> set to the ->d_off field of the last struct linux_dirent64 returned
>> by the previous IORING_OP_GETDENTS call.
>>
>> Internally, if necessary, IORING_OP_GETDENTS will vfs_llseek() to
>> the right directory position before calling vfs_getdents().
>>
>> IORING_OP_GETDENTS may or may not update the specified directory's
>> file offset, and the file offset should not be relied upon having
>> any particular value during or after an IORING_OP_GETDENTS call.
>>
>> Signed-off-by: Lennert Buytenhek <[email protected]>
>> ---
>> fs/io_uring.c | 73 +++++++++++++++++++++++++++++++++++
>> include/uapi/linux/io_uring.h | 1 +
>> 2 files changed, 74 insertions(+)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 056bd4c90ade..6853bf48369a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -635,6 +635,13 @@ struct io_mkdir {
>> struct filename *filename;
>> };
>>
> [...]
>> +static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> + struct io_getdents *getdents = &req->getdents;
>> + bool pos_unlock = false;
>> + int ret = 0;
>> +
>> + /* getdents always requires a blocking context */
>> + if (issue_flags & IO_URING_F_NONBLOCK)
>> + return -EAGAIN;
>> +
>> + /* for vfs_llseek and to serialize ->iterate_shared() on this file */
>> + if (file_count(req->file) > 1) {
>
> Looks racy, is it safe? E.g. can be concurrently dupped and used, or just
> several similar IORING_OP_GETDENTS requests.
>
>> + pos_unlock = true;
>> + mutex_lock(&req->file->f_pos_lock);
>> + }
>> +
>> + if (req->file->f_pos != getdents->pos) {
>> + loff_t res = vfs_llseek(req->file, getdents->pos, SEEK_SET);
>
> I may be missing the previous discussions, but can this ever become
> stateless, like passing an offset? Including readdir.c and beyond.
I mean without those seeks. An emulation would look like rewinding
pos back after vfs_getdents, though might be awful on performance.
>
>> + if (res < 0)
>> + ret = res;
>> + }
>> +
>> + if (ret == 0) {
>> + ret = vfs_getdents(req->file, getdents->dirent,
>> + getdents->count);
>> + }
>> +
>> + if (pos_unlock)
>> + mutex_unlock(&req->file->f_pos_lock);
>> +
>> + if (ret < 0) {
>> + if (ret == -ERESTARTSYS)
>> + ret = -EINTR;
>> + req_set_fail_links(req);
>> + }
>> + io_req_complete(req, ret);
>> + return 0;
>> +}
> [...]
>
--
Pavel Begunkov
On Thu, Feb 18, 2021 at 02:27:55PM +0200, Lennert Buytenhek wrote:
> IORING_OP_GETDENTS may or may not update the specified directory's
> file offset, and the file offset should not be relied upon having
> any particular value during or after an IORING_OP_GETDENTS call.
This doesn't give me the warm fuzzies. What I might suggest
is either passing a parameter to iterate_dir() or breaking out an
iterate_dir_nofpos() to make IORING_OP_GETDENTS more of a READV operation.
ie the equivalent of this:
@@ -37,7 +37,7 @@
} while (0)
-int iterate_dir(struct file *file, struct dir_context *ctx)
+int iterate_dir(struct file *file, struct dir_context *ctx, bool use_fpos)
{
struct inode *inode = file_inode(file);
bool shared = false;
@@ -60,12 +60,14 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
- ctx->pos = file->f_pos;
+ if (use_fpos)
+ ctx->pos = file->f_pos;
if (shared)
res = file->f_op->iterate_shared(file, ctx);
else
res = file->f_op->iterate(file, ctx);
- file->f_pos = ctx->pos;
+ if (use_fpos)
+ file->f_pos = ctx->pos;
fsnotify_access(file);
file_accessed(file);
}
That way there's no need to play with llseek or take a mutex on the
f_pos of the directory.
On Fri, Feb 19, 2021 at 12:05:58PM +0000, Pavel Begunkov wrote:
> > IORING_OP_GETDENTS behaves much like getdents64(2) and takes the same
> > arguments, but with a small twist: it takes an additional offset
> > argument, and reading from the specified directory starts at the given
> > offset.
> >
> > For the first IORING_OP_GETDENTS call on a directory, the offset
> > parameter can be set to zero, and for subsequent calls, it can be
> > set to the ->d_off field of the last struct linux_dirent64 returned
> > by the previous IORING_OP_GETDENTS call.
> >
> > Internally, if necessary, IORING_OP_GETDENTS will vfs_llseek() to
> > the right directory position before calling vfs_getdents().
> >
> > IORING_OP_GETDENTS may or may not update the specified directory's
> > file offset, and the file offset should not be relied upon having
> > any particular value during or after an IORING_OP_GETDENTS call.
> >
> > Signed-off-by: Lennert Buytenhek <[email protected]>
> > ---
> > fs/io_uring.c | 73 +++++++++++++++++++++++++++++++++++
> > include/uapi/linux/io_uring.h | 1 +
> > 2 files changed, 74 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 056bd4c90ade..6853bf48369a 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -635,6 +635,13 @@ struct io_mkdir {
> > struct filename *filename;
> > };
> >
> [...]
> > +static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > +{
> > + struct io_getdents *getdents = &req->getdents;
> > + bool pos_unlock = false;
> > + int ret = 0;
> > +
> > + /* getdents always requires a blocking context */
> > + if (issue_flags & IO_URING_F_NONBLOCK)
> > + return -EAGAIN;
> > +
> > + /* for vfs_llseek and to serialize ->iterate_shared() on this file */
> > + if (file_count(req->file) > 1) {
>
> Looks racy, is it safe? E.g. can be concurrently dupped and used, or
> just several similar IORING_OP_GETDENTS requests.
I thought that it was safe, but I thought about it a bit more, and it
seems that it is unsafe -- if you IORING_REGISTER_FILES to register the
dirfd and then close the dirfd, you'll get a file_count of 1, while you
can submit concurrent operations. So I'll remove the conditional
locking. Thanks!
(If not for IORING_REGISTER_FILES, it seems safe, because then
io_file_get() will hold a(t least one) reference on the file while the
operation is in flight, so then if file_count(req->file) == 1 here,
then it means that the file is no longer referenced by any fdtable,
and nobody else should be able to get a reference to it -- but that's
a bit of a useless optimization.)
(Logic was taken from __fdget_pos, where it is safe for a different
reason, i.e. __fget_light will not bump the refcount iff current->files
is unshared.)
> > + pos_unlock = true;
> > + mutex_lock(&req->file->f_pos_lock);
> > + }
> > +
> > + if (req->file->f_pos != getdents->pos) {
> > + loff_t res = vfs_llseek(req->file, getdents->pos, SEEK_SET);
>
> I may be missing the previous discussions, but can this ever become
> stateless, like passing an offset? Including readdir.c and beyond.
My aim was to only make the minimally required change initially, but
to make that optimization possible in the future (e.g. by reserving the
right to either update or not update the file position) -- but I'll
try doing the optimization now.
> > + if (res < 0)
> > + ret = res;
> > + }
> > +
> > + if (ret == 0) {
> > + ret = vfs_getdents(req->file, getdents->dirent,
> > + getdents->count);
> > + }
> > +
> > + if (pos_unlock)
> > + mutex_unlock(&req->file->f_pos_lock);
> > +
> > + if (ret < 0) {
> > + if (ret == -ERESTARTSYS)
> > + ret = -EINTR;
> > + req_set_fail_links(req);
> > + }
> > + io_req_complete(req, ret);
> > + return 0;
> > +}
> [...]
>
> --
> Pavel Begunkov
On Fri, Feb 19, 2021 at 12:34:03PM +0000, Matthew Wilcox wrote:
> > IORING_OP_GETDENTS may or may not update the specified directory's
> > file offset, and the file offset should not be relied upon having
> > any particular value during or after an IORING_OP_GETDENTS call.
>
> This doesn't give me the warm fuzzies. What I might suggest
> is either passing a parameter to iterate_dir() or breaking out an
> iterate_dir_nofpos() to make IORING_OP_GETDENTS more of a READV operation.
> ie the equivalent of this:
>
> @@ -37,7 +37,7 @@
> } while (0)
>
>
> -int iterate_dir(struct file *file, struct dir_context *ctx)
> +int iterate_dir(struct file *file, struct dir_context *ctx, bool use_fpos)
> {
> struct inode *inode = file_inode(file);
> bool shared = false;
> @@ -60,12 +60,14 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
>
> res = -ENOENT;
> if (!IS_DEADDIR(inode)) {
> - ctx->pos = file->f_pos;
> + if (use_fpos)
> + ctx->pos = file->f_pos;
> if (shared)
> res = file->f_op->iterate_shared(file, ctx);
> else
> res = file->f_op->iterate(file, ctx);
> - file->f_pos = ctx->pos;
> + if (use_fpos)
> + file->f_pos = ctx->pos;
> fsnotify_access(file);
> file_accessed(file);
> }
>
> That way there's no need to play with llseek or take a mutex on the
> f_pos of the directory.
I'll try this!
On Fri, Feb 19, 2021 at 08:07:04PM +0200, Lennert Buytenhek wrote:
> > > IORING_OP_GETDENTS may or may not update the specified directory's
> > > file offset, and the file offset should not be relied upon having
> > > any particular value during or after an IORING_OP_GETDENTS call.
> >
> > This doesn't give me the warm fuzzies. What I might suggest
> > is either passing a parameter to iterate_dir() or breaking out an
> > iterate_dir_nofpos() to make IORING_OP_GETDENTS more of a READV operation.
> > ie the equivalent of this:
> >
> > @@ -37,7 +37,7 @@
> > } while (0)
> >
> >
> > -int iterate_dir(struct file *file, struct dir_context *ctx)
> > +int iterate_dir(struct file *file, struct dir_context *ctx, bool use_fpos)
> > {
> > struct inode *inode = file_inode(file);
> > bool shared = false;
> > @@ -60,12 +60,14 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
> >
> > res = -ENOENT;
> > if (!IS_DEADDIR(inode)) {
> > - ctx->pos = file->f_pos;
> > + if (use_fpos)
> > + ctx->pos = file->f_pos;
> > if (shared)
> > res = file->f_op->iterate_shared(file, ctx);
> > else
> > res = file->f_op->iterate(file, ctx);
> > - file->f_pos = ctx->pos;
> > + if (use_fpos)
> > + file->f_pos = ctx->pos;
> > fsnotify_access(file);
> > file_accessed(file);
> > }
> >
> > That way there's no need to play with llseek or take a mutex on the
> > f_pos of the directory.
>
> I'll try this!
The patch below (on top of v3) does what you suggest, and it removes
the vfs_llseek() call, but there's two issues:
- We still need to take some sort of mutex on the directory, because,
while ->iterate_shared() can be called concurrently on different
struct files that point to the same underlying dir inode, it cannot
be called concurrently on the same struct file. From
Documentation/filesystems/porting.rst:
->iterate_shared() is added; it's a parallel variant of ->iterate().
Exclusion on struct file level is still provided (as well as that
between it and lseek on the same struct file) but if your directory
has been opened several times, you can get these called in parallel.
[...]
- Calling a filesystem's ->iterate_shared() on the same dir with changing
file positions but without calling the directory's ->llseek() in between
to notify the filesystem of changes in the file position seems to violate
an (unstated?) guarantee. It works on my btrfs root fs, since that uses
generic_file_llseek() for directory ->llseek(), but e.g. ceph does:
| static loff_t ceph_dir_llseek(struct file *file, loff_t offset, int whence)
| {
| struct ceph_dir_file_info *dfi = file->private_data;
| struct inode *inode = file->f_mapping->host;
| loff_t retval;
|
| inode_lock(inode);
| retval = -EINVAL;
| switch (whence) {
| case SEEK_CUR:
| offset += file->f_pos;
| case SEEK_SET:
| break;
| case SEEK_END:
| retval = -EOPNOTSUPP;
| default:
| goto out;
| }
|
| if (offset >= 0) {
| if (need_reset_readdir(dfi, offset)) {
| dout("dir_llseek dropping %p content\n", file);
| reset_readdir(dfi);
| } else if (is_hash_order(offset) && offset > file->f_pos) {
| /* for hash offset, we don't know if a forward seek
| * is within same frag */
| dfi->dir_release_count = 0;
| dfi->readdir_cache_idx = -1;
| }
|
| if (offset != file->f_pos) {
| file->f_pos = offset;
| file->f_version = 0;
| dfi->file_info.flags &= ~CEPH_F_ATEND;
| }
| retval = offset;
| }
| out:
| inode_unlock(inode);
| return retval;
| }
So I think we probably can't get rid of the conditional vfs_llseek()
call for now (and we'll probably have to keep taking the dir's
->f_pos_lock) -- what do you think?
(The caveat about that the file pointer may or may not be updated by
IORING_OP_GETDENTS would allow making this optimization in the future,
and for now it would mean that you can't mix getdents64() and
IORING_OP_GETDENTS calls on the same dirfd, which would seem like an
unusual thing to do anyway.)
Thanks!
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3362e812928d..97bf0965de30 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4709,25 +4709,17 @@ static int io_getdents_prep(struct io_kiocb *req,
static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_getdents *getdents = &req->getdents;
- int ret = 0;
+ int ret;
/* getdents always requires a blocking context */
if (issue_flags & IO_URING_F_NONBLOCK)
return -EAGAIN;
- /* for vfs_llseek and to serialize ->iterate_shared() on this file */
+ /* to serialize ->iterate_shared() on this file */
mutex_lock(&req->file->f_pos_lock);
- if (req->file->f_pos != getdents->pos) {
- loff_t res = vfs_llseek(req->file, getdents->pos, SEEK_SET);
- if (res < 0)
- ret = res;
- }
-
- if (ret == 0) {
- ret = vfs_getdents(req->file, getdents->dirent,
- getdents->count);
- }
+ ret = vfs_getdents(req->file, getdents->dirent,
+ getdents->count, &getdents->pos);
mutex_unlock(&req->file->f_pos_lock);
diff --git a/fs/readdir.c b/fs/readdir.c
index f52167c1eb61..ffdc70fe5dcf 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -37,7 +37,7 @@
} while (0)
-int iterate_dir(struct file *file, struct dir_context *ctx)
+int __iterate_dir(struct file *file, struct dir_context *ctx, bool use_fpos)
{
struct inode *inode = file_inode(file);
bool shared = false;
@@ -60,12 +60,14 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
- ctx->pos = file->f_pos;
+ if (use_fpos)
+ ctx->pos = file->f_pos;
if (shared)
res = file->f_op->iterate_shared(file, ctx);
else
res = file->f_op->iterate(file, ctx);
- file->f_pos = ctx->pos;
+ if (use_fpos)
+ file->f_pos = ctx->pos;
fsnotify_access(file);
file_accessed(file);
}
@@ -76,6 +78,11 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
out:
return res;
}
+
+int iterate_dir(struct file *file, struct dir_context *ctx)
+{
+ return __iterate_dir(file, ctx, true);
+}
EXPORT_SYMBOL(iterate_dir);
/*
@@ -349,7 +356,7 @@ static int filldir64(struct dir_context *ctx, const char *name, int namlen,
}
int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
- unsigned int count)
+ unsigned int count, loff_t *f_pos)
{
struct getdents_callback64 buf = {
.ctx.actor = filldir64,
@@ -358,7 +365,13 @@ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
};
int error;
- error = iterate_dir(file, &buf.ctx);
+ if (f_pos == NULL) {
+ error = __iterate_dir(file, &buf.ctx, true);
+ } else {
+ buf.ctx.pos = *f_pos;
+ error = __iterate_dir(file, &buf.ctx, false);
+ }
+
if (error >= 0)
error = buf.error;
if (buf.prev_reclen) {
@@ -384,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
if (!f.file)
return -EBADF;
- error = vfs_getdents(f.file, dirent, count);
+ error = vfs_getdents(f.file, dirent, count, NULL);
fdput_pos(f);
return error;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 114885d3f6c4..7104cd9b26ca 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3107,11 +3107,12 @@ const char *simple_get_link(struct dentry *, struct inode *,
struct delayed_call *);
extern const struct inode_operations simple_symlink_inode_operations;
+extern int __iterate_dir(struct file *, struct dir_context *, bool);
extern int iterate_dir(struct file *, struct dir_context *);
struct linux_dirent64;
int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
- unsigned int count);
+ unsigned int count, loff_t *f_pos);
int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
int flags);
From: Lennert Buytenhek
> Sent: 18 February 2021 12:27
>
> These patches add support for IORING_OP_GETDENTS, which is a new io_uring
> opcode that more or less does an lseek(sqe->fd, sqe->off, SEEK_SET)
> followed by a getdents64(sqe->fd, (void *)sqe->addr, sqe->len).
>
> A dumb test program for IORING_OP_GETDENTS is available here:
>
> https://krautbox.wantstofly.org/~buytenh/uringfind-v2.c
>
> This test program does something along the lines of what find(1) does:
> it scans recursively through a directory tree and prints the names of
> all directories and files it encounters along the way -- but then using
> io_uring. (The io_uring version prints the names of encountered files and
> directories in an order that's determined by SQE completion order, which
> is somewhat nondeterministic and likely to differ between runs.)
>
> On a directory tree with 14-odd million files in it that's on a
> six-drive (spinning disk) btrfs raid, find(1) takes:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time find /mnt/repo > /dev/null
>
> real 24m7.815s
> user 0m15.015s
> sys 0m48.340s
> #
>
> And the io_uring version takes:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # time ./uringfind /mnt/repo > /dev/null
>
> real 10m29.064s
> user 0m4.347s
> sys 0m1.677s
> #
While there may be uses for IORING_OP_GETDENTS, are you sure your
test is comparing like with like?
The underlying work has to be done in either case, so you are
swapping system calls for code complexity.
I suspect that find is actually doing a stat() call on every
directory entry and that your io_uring example is just believing
the 'directory' flag returned in the directory entry for most
modern filesystems.
If you write a program that does openat(), readdir(), close()
for each directory and with a long enough buffer (mostly) do
one readdir() per directory you'll get a much better comparison.
You could even write a program with 2 threads, one does all the
open/readdir/close system calls and the other does the printing
and generating the list of directories to process.
That should get the equivalent overlapping that io_uring gives
without much of the complexity.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 2/20/21 10:44 AM, David Laight wrote:
> From: Lennert Buytenhek
>> Sent: 18 February 2021 12:27
>>
>> These patches add support for IORING_OP_GETDENTS, which is a new io_uring
>> opcode that more or less does an lseek(sqe->fd, sqe->off, SEEK_SET)
>> followed by a getdents64(sqe->fd, (void *)sqe->addr, sqe->len).
>>
>> A dumb test program for IORING_OP_GETDENTS is available here:
>>
>> https://krautbox.wantstofly.org/~buytenh/uringfind-v2.c
>>
>> This test program does something along the lines of what find(1) does:
>> it scans recursively through a directory tree and prints the names of
>> all directories and files it encounters along the way -- but then using
>> io_uring. (The io_uring version prints the names of encountered files and
>> directories in an order that's determined by SQE completion order, which
>> is somewhat nondeterministic and likely to differ between runs.)
>>
>> On a directory tree with 14-odd million files in it that's on a
>> six-drive (spinning disk) btrfs raid, find(1) takes:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time find /mnt/repo > /dev/null
>>
>> real 24m7.815s
>> user 0m15.015s
>> sys 0m48.340s
>> #
>>
>> And the io_uring version takes:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # time ./uringfind /mnt/repo > /dev/null
>>
>> real 10m29.064s
>> user 0m4.347s
>> sys 0m1.677s
>> #
>
> While there may be uses for IORING_OP_GETDENTS, are you sure your
> test is comparing like with like?
> The underlying work has to be done in either case, so you are
> swapping system calls for code complexity.
What complexity?
> I suspect that find is actually doing a stat() call on every
> directory entry and that your io_uring example is just believing
> the 'directory' flag returned in the directory entry for most
> modern filesystems.
While that may be true (find doing stat as well), the runtime is
clearly dominated by IO. Adding a stat on top would be an extra
copy, but no extra IO.
> If you write a program that does openat(), readdir(), close()
> for each directory, with a buffer long enough that it (mostly) takes
> one readdir() per directory, you'll get a much better comparison.
>
> You could even write a program with 2 threads, one does all the
> open/readdir/close system calls and the other does the printing
> and generating the list of directories to process.
> That should get the equivalent overlapping that io_uring gives
> without much of the complexity.
But this is what I take the most offense to - it's _trivial_ to
write that program with io_uring, especially compared to managing
threads. Threads are certainly a more known paradigm at this point,
but an io_uring submit + reap loop is definitely not "much of the
complexity". If you're referring to the kernel change itself, that's
trivial, as the diffstat shows.
--
Jens Axboe
From: Jens Axboe
> Sent: 20 February 2021 18:29
>
> On 2/20/21 10:44 AM, David Laight wrote:
> > From: Lennert Buytenhek
> >> Sent: 18 February 2021 12:27
> >>
> >> [patch description and timing figures snipped]
> >
> > While there may be uses for IORING_OP_GETDENTS, are you sure your
> > test is comparing like with like?
> > The underlying work has to be done in either case, so you are
> > swapping system calls for code complexity.
>
> What complexity?
Even adding commands to a list to execute later is 'complexity'.
As in adding more CPU cycles.
> > I suspect that find is actually doing a stat() call on every
> > directory entry and that your io_uring example is just believing
> > the 'directory' flag returned in the directory entry for most
> > modern filesystems.
>
> While that may be true (find doing stat as well), the runtime is
> clearly dominated by IO. Adding a stat on top would be an extra
> copy, but no extra IO.
I'd expect stat() to require the disk inode be read into memory.
getdents() only requires the data of the directory be read.
So calling stat() requires a lot more IO.
The other thing I just realised is that the 'system time'
output from time is completely meaningless for the io_uring case.
All that processing is done by a kernel thread, and I doubt it
is re-attributed to the user process.
> > If you write a program that does openat(), readdir(), close()
> > for each directory, with a buffer long enough that it (mostly) takes
> > one readdir() per directory, you'll get a much better comparison.
> >
> > You could even write a program with 2 threads, one does all the
> > open/readdir/close system calls and the other does the printing
> > and generating the list of directories to process.
> > That should get the equivalent overlapping that io_uring gives
> > without much of the complexity.
>
> But this is what I take the most offense to - it's _trivial_ to
> write that program with io_uring, especially compared to managing
> threads. Threads are certainly a more known paradigm at this point,
> but an io_uring submit + reap loop is definitely not "much of the
> complexity". If you're referring to the kernel change itself, that's
> trivial, as the diffstat shows.
I've looked at the kernel code in io_uring.c.
Makes me pull my hair out (what's left of it - mostly beard).
Apart from saving system call costs, I don't actually understand
why it isn't a userspace library.
Anyway, I thought the point of io_uring was to attempt to implement
asynchronous IO on a unix system.
If you want async IO you need to go back to the mid 1970s and pick
the ancestors of RSX-11M rather than those of K&R's unix.
That leads you to VMS and then Windows NT.
And yes, I have written code that did async IO under RSX-11M.
David
On 2/21/21 12:38 PM, David Laight wrote:
> From: Jens Axboe
>> Sent: 20 February 2021 18:29
>>
>> On 2/20/21 10:44 AM, David Laight wrote:
>>> From: Lennert Buytenhek
>>>> Sent: 18 February 2021 12:27
>>>>
>>>> [patch description and timing figures snipped]
>>>
>>> While there may be uses for IORING_OP_GETDENTS, are you sure your
>>> test is comparing like with like?
>>> The underlying work has to be done in either case, so you are
>>> swapping system calls for code complexity.
>>
>> What complexity?
>
> Even adding commands to a list to execute later is 'complexity'.
> As in adding more CPU cycles.
That's a pretty blanket statement. If the list is heavily shared, and
hence contended, yes that's generally true. But it isn't.
>>> I suspect that find is actually doing a stat() call on every
>>> directory entry and that your io_uring example is just believing
>>> the 'directory' flag returned in the directory entry for most
>>> modern filesystems.
>>
>> While that may be true (find doing stat as well), the runtime is
>> clearly dominated by IO. Adding a stat on top would be an extra
>> copy, but no extra IO.
>
> I'd expect stat() to require the disk inode be read into memory.
> getdents() only requires the data of the directory be read.
> So calling stat() requires a lot more IO.
I actually went and checked instead of guessing, and find isn't doing a
stat by default on the files.
> The other thing I just realised is that the 'system time'
> output from time is completely meaningless for the io_uring case.
> All that processing is done by a kernel thread, and I doubt it
> is re-attributed to the user process.
For sure, you can't directly compare the sys times. But the actual
runtime is of course directly comparable. This particular example is on
btrfs, which heavily offloads to threads as well. So the find case doesn't show
you the full picture either. Note that once the reworked worker scheme
is adopted, sys time will be directly attributed to the original task
and it will be included (for io_uring, not talking about btrfs).
I'm going to ignore the rest of this email as it isn't productive going
down that path. Suffice it to say that ideally _no_ operations will be
using the async offload, that's really only for regular file IO where
you cannot poll for readiness. And even those cases are gradually being
improved to not rely on it, like for async buffered reads. We still
need to handle writes better, and ideally things like this as well. But
that's a bit further out.
--
Jens Axboe