2020-05-01 17:56:30

by Jens Axboe

Subject: [PATCH v3b] eventfd: convert to f_op->read_iter()

eventfd is using ->read() as its file_operations read handler, but
this prevents passing in information about whether a given IO operation
is blocking or not. We can only use the file flags for that. To support
async (-EAGAIN/poll based) retries for io_uring, we need ->read_iter()
support. Convert eventfd to using ->read_iter().
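
For illustration only, here is a minimal caller-side sketch of what ->read_iter()
enables (the variables buf and file, and the surrounding context, are assumed and
are not part of this patch): a kiocb-based caller such as io_uring can request a
non-blocking attempt per call via IOCB_NOWAIT, which the plain
->read(file, buf, count, ppos) prototype has no way to convey:

	struct kiocb kiocb;
	struct iov_iter iter;
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(u64) };
	ssize_t ret;

	init_sync_kiocb(&kiocb, file);
	kiocb.ki_flags |= IOCB_NOWAIT;		/* per-call: do not sleep */
	iov_iter_init(&iter, READ, &iov, 1, iov.iov_len);

	ret = call_read_iter(file, &kiocb, &iter);
	if (ret == -EAGAIN) {
		/* would have blocked: arm a poll-based retry instead */
	}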

Signed-off-by: Jens Axboe <[email protected]>

---

Actually send out the right patch...

Since v2:

- Cleanup eventfd_read() as per Al's suggestions

Since v1:

- Add FMODE_NOWAIT to the eventfd file

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 78e41c7c3d05..c9fa1e9cf5e3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -216,32 +216,32 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *w
}
EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);

-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
- loff_t *ppos)
+static ssize_t eventfd_read(struct kiocb *iocb, struct iov_iter *to)
{
+ struct file *file = iocb->ki_filp;
struct eventfd_ctx *ctx = file->private_data;
- ssize_t res;
__u64 ucnt = 0;
DECLARE_WAITQUEUE(wait, current);

- if (count < sizeof(ucnt))
+ if (iov_iter_count(to) < sizeof(ucnt))
return -EINVAL;
-
spin_lock_irq(&ctx->wqh.lock);
- res = -EAGAIN;
- if (ctx->count > 0)
- res = sizeof(ucnt);
- else if (!(file->f_flags & O_NONBLOCK)) {
+ if (!ctx->count) {
+ if ((file->f_flags & O_NONBLOCK) ||
+ (iocb->ki_flags & IOCB_NOWAIT)) {
+ spin_unlock_irq(&ctx->wqh.lock);
+ return -EAGAIN;
+ }
__add_wait_queue(&ctx->wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
- if (ctx->count > 0) {
- res = sizeof(ucnt);
+ if (ctx->count)
break;
- }
if (signal_pending(current)) {
- res = -ERESTARTSYS;
- break;
+ __remove_wait_queue(&ctx->wqh, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&ctx->wqh.lock);
+ return -ERESTARTSYS;
}
spin_unlock_irq(&ctx->wqh.lock);
schedule();
@@ -250,17 +250,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
- if (likely(res > 0)) {
- eventfd_ctx_do_read(ctx, &ucnt);
- if (waitqueue_active(&ctx->wqh))
- wake_up_locked_poll(&ctx->wqh, EPOLLOUT);
- }
+ eventfd_ctx_do_read(ctx, &ucnt);
+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh, EPOLLOUT);
spin_unlock_irq(&ctx->wqh.lock);
-
- if (res > 0 && put_user(ucnt, (__u64 __user *)buf))
+ if (unlikely(copy_to_iter(&ucnt, sizeof(ucnt), to) != sizeof(ucnt)))
return -EFAULT;

- return res;
+ return sizeof(ucnt);
}

static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
@@ -329,7 +326,7 @@ static const struct file_operations eventfd_fops = {
#endif
.release = eventfd_release,
.poll = eventfd_poll,
- .read = eventfd_read,
+ .read_iter = eventfd_read,
.write = eventfd_write,
.llseek = noop_llseek,
};
@@ -427,8 +424,17 @@ static int do_eventfd(unsigned int count, int flags)

fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
- if (fd < 0)
+ if (fd < 0) {
eventfd_free_ctx(ctx);
+ } else {
+ struct file *file;
+
+ file = fget(fd);
+ if (file) {
+ file->f_mode |= FMODE_NOWAIT;
+ fput(file);
+ }
+ }

return fd;
}

--
Jens Axboe


2020-05-01 19:04:30

by Al Viro

Subject: Re: [PATCH v3b] eventfd: convert to f_op->read_iter()

On Fri, May 01, 2020 at 11:54:01AM -0600, Jens Axboe wrote:

> @@ -427,8 +424,17 @@ static int do_eventfd(unsigned int count, int flags)
>
> fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
> O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
> - if (fd < 0)
> + if (fd < 0) {
> eventfd_free_ctx(ctx);
> + } else {
> + struct file *file;
> +
> + file = fget(fd);
> + if (file) {
> + file->f_mode |= FMODE_NOWAIT;
> + fput(file);
> + }

No. The one and only thing you can do with the return value of anon_inode_getfd() is to
return the fscker to userland. You *CAN* *NOT* assume that the descriptor table is
still pointing to whatever you've just created.

As soon as it's in descriptor table, it's out of your hands. And frankly, if you
are playing with descriptors, you should be very well aware of that.

Descriptor tables are fundamentally shared objects; they *can* be accessed and
modified by other threads, right behind your back.

*IF* you are going to play with ->f_mode, you must use get_unused_fd_flags(),
anon_inode_getfile(), modify ->f_mode of the result and use fd_install() to
put it into descriptor table. With put_unused_fd() as cleanup in case
of allocation failure.
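
For concreteness, a minimal sketch of that pattern applied to do_eventfd() (this
anticipates what a follow-up version might look like, it is not the patch that was
posted; the ctx allocation and flag validation above this point are elided):

	fd = get_unused_fd_flags(O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
	if (fd < 0)
		goto err;

	file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
				  O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
	if (IS_ERR(file)) {
		put_unused_fd(fd);
		fd = PTR_ERR(file);
		goto err;
	}

	/* still a private reference; nothing can see the file until fd_install() */
	file->f_mode |= FMODE_NOWAIT;
	fd_install(fd, file);
	return fd;
err:
	eventfd_free_ctx(ctx);
	return fd;

The point is that the FMODE_NOWAIT tweak happens while the struct file is still a
private reference, before fd_install() publishes the descriptor.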

2020-05-01 19:10:52

by Jens Axboe

Subject: Re: [PATCH v3b] eventfd: convert to f_op->read_iter()

On 5/1/20 1:00 PM, Al Viro wrote:
> On Fri, May 01, 2020 at 11:54:01AM -0600, Jens Axboe wrote:
>
>> @@ -427,8 +424,17 @@ static int do_eventfd(unsigned int count, int flags)
>>
>> fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
>> O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
>> - if (fd < 0)
>> + if (fd < 0) {
>> eventfd_free_ctx(ctx);
>> + } else {
>> + struct file *file;
>> +
>> + file = fget(fd);
>> + if (file) {
>> + file->f_mode |= FMODE_NOWAIT;
>> + fput(file);
>> + }
>
> No. The one and only thing you can do with the return value of anon_inode_getfd() is to
> return the fscker to userland. You *CAN* *NOT* assume that the descriptor table is
> still pointing to whatever you've just created.
>
> As soon as it's in descriptor table, it's out of your hands. And frankly, if you
> are playing with descriptors, you should be very well aware of that.
>
> Descriptor tables are fundamentally shared objects; they *can* be accessed and
> modified by other threads, right behind your back.
>
> *IF* you are going to play with ->f_mode, you must use get_unused_fd_flags(),
> anon_inode_getfile(), modify ->f_mode of the result and use fd_install() to
> put it into descriptor table. With put_unused_fd() as cleanup in case
> of allocation failure.

OK, that makes sense: that way f_mode is set before fd_install() makes the fd
visible. I'll write that up, test it, and send out a v4... Thanks Al, this
is very helpful.

--
Jens Axboe