Hello,
syzbot found the following crash on:
HEAD commit: 7d762d69145a afs: Fix manually set volume location server ..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=131d832ac00000
kernel config: https://syzkaller.appspot.com/x/.config?x=b76ec970784287c
dashboard link: https://syzkaller.appspot.com/bug?extid=503d4cc169fcec1cb18c
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11934262c00000
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]
==================================================================
BUG: KASAN: use-after-free in unix_dgram_poll+0x5e1/0x690
net/unix/af_unix.c:2695
Read of size 4 at addr ffff88809292aae0 by task syz-executor.1/18946
CPU: 0 PID: 18946 Comm: syz-executor.1 Not tainted 5.0.0-rc8+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
__asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
unix_dgram_poll+0x5e1/0x690 net/unix/af_unix.c:2695
sock_poll+0x291/0x340 net/socket.c:1127
vfs_poll include/linux/poll.h:86 [inline]
aio_poll fs/aio.c:1766 [inline]
__io_submit_one fs/aio.c:1876 [inline]
io_submit_one+0xe3e/0x1cf0 fs/aio.c:1909
__do_sys_io_submit fs/aio.c:1954 [inline]
__se_sys_io_submit fs/aio.c:1924 [inline]
__x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1924
? 0xffffffff81000000
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd43ca93c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e29
RDX: 0000000020000600 RSI: 1ffffffffffffd70 RDI: 00007fd43ca73000
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd43ca946d4
R13: 00000000004bf02f R14: 00000000004d09b0 R15: 00000000ffffffff
Allocated by task 18946:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_kmalloc mm/kasan/common.c:495 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:468
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:503
slab_post_alloc_hook mm/slab.h:440 [inline]
slab_alloc mm/slab.c:3388 [inline]
kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3548
sk_prot_alloc+0x67/0x2e0 net/core/sock.c:1471
sk_alloc+0x39/0xf70 net/core/sock.c:1531
unix_create1+0xc3/0x530 net/unix/af_unix.c:764
unix_create+0x103/0x1e0 net/unix/af_unix.c:825
__sock_create+0x3e6/0x750 net/socket.c:1275
sock_create net/socket.c:1315 [inline]
__sys_socketpair+0x272/0x5e0 net/socket.c:1407
__do_sys_socketpair net/socket.c:1456 [inline]
__se_sys_socketpair net/socket.c:1453 [inline]
__x64_sys_socketpair+0x97/0xf0 net/socket.c:1453
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 18944:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:457
kasan_slab_free+0xe/0x10 mm/kasan/common.c:465
__cache_free mm/slab.c:3494 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3754
sk_prot_free net/core/sock.c:1512 [inline]
__sk_destruct+0x4b6/0x6d0 net/core/sock.c:1596
sk_destruct+0x7b/0x90 net/core/sock.c:1604
__sk_free+0xce/0x300 net/core/sock.c:1615
sk_free+0x42/0x50 net/core/sock.c:1626
sock_put include/net/sock.h:1707 [inline]
unix_release_sock+0x921/0xbb0 net/unix/af_unix.c:573
unix_release+0x44/0x90 net/unix/af_unix.c:835
__sock_release+0xd3/0x250 net/socket.c:579
sock_close+0x1b/0x30 net/socket.c:1139
__fput+0x2df/0x8d0 fs/file_table.c:278
____fput+0x16/0x20 fs/file_table.c:309
task_work_run+0x14a/0x1c0 kernel/task_work.c:113
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:166
prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
do_syscall_64+0x52d/0x610 arch/x86/entry/common.c:293
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff88809292a740
which belongs to the cache UNIX(49:syz1) of size 1728
The buggy address is located 928 bytes inside of
1728-byte region [ffff88809292a740, ffff88809292ae00)
The buggy address belongs to the page:
page:ffffea00024a4a80 count:1 mapcount:0 mapping:ffff8880920c0800 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00028223c8 ffffea0002581248 ffff8880920c0800
raw: 0000000000000000 ffff88809292a000 0000000100000002 ffff8880a9718ec0
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8880a9718ec0
Memory state around the buggy address:
ffff88809292a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88809292aa00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88809292aa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88809292ab00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88809292ab80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches
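For illustration, the pattern the reproducer exercises boils down to
roughly the following userspace sketch (hypothetical and heavily
simplified - the real reproducer is at the "syz repro" link above):

	/* submit IOCB_CMD_POLL on a unix datagram socket while another
	 * thread closes the descriptor out from under it */
	#include <linux/aio_abi.h>
	#include <poll.h>
	#include <pthread.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int fds[2];

	static void *closer(void *arg)
	{
		close(fds[0]);	/* drop the descriptor-table reference */
		return NULL;
	}

	int main(void)
	{
		aio_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		pthread_t t;

		socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
		syscall(SYS_io_setup, 1, &ctx);

		memset(&cb, 0, sizeof(cb));
		cb.aio_lio_opcode = IOCB_CMD_POLL;
		cb.aio_fildes = fds[0];
		cb.aio_buf = POLLIN;	/* events to wait for */

		pthread_create(&t, NULL, closer, NULL);
		syscall(SYS_io_submit, ctx, 1, cbs);	/* races with close() */
		pthread_join(t, NULL);
		return 0;
	}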
On Sun, Mar 03, 2019 at 02:22:04AM -0800, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: 7d762d69145a afs: Fix manually set volume location server ..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=131d832ac00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=b76ec970784287c
> dashboard link: https://syzkaller.appspot.com/bug?extid=503d4cc169fcec1cb18c
> compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11934262c00000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: [email protected]
>
> ==================================================================
> BUG: KASAN: use-after-free in unix_dgram_poll+0x5e1/0x690
> net/unix/af_unix.c:2695
> Read of size 4 at addr ffff88809292aae0 by task syz-executor.1/18946
>
> CPU: 0 PID: 18946 Comm: syz-executor.1 Not tainted 5.0.0-rc8+ #88
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x172/0x1f0 lib/dump_stack.c:113
> print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
> kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
> __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
> unix_dgram_poll+0x5e1/0x690 net/unix/af_unix.c:2695
> sock_poll+0x291/0x340 net/socket.c:1127
> vfs_poll include/linux/poll.h:86 [inline]
> aio_poll fs/aio.c:1766 [inline]
> __io_submit_one fs/aio.c:1876 [inline]
> io_submit_one+0xe3e/0x1cf0 fs/aio.c:1909
> __do_sys_io_submit fs/aio.c:1954 [inline]
> __se_sys_io_submit fs/aio.c:1924 [inline]
> __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1924
> ? 0xffffffff81000000
> do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
Apparently, a call of ->poll() overlapping with a call of ->release()
(if not outright happening after it). That should never happen...
Maybe unrelated to this bug, but... What's to prevent a wakeup
that happens just after we'd been added to a waitqueue by ->poll()
triggering aio_poll_wake(), which gets to aio_poll_complete()
with its fput() *before* we'd reached the end of ->poll() instance?
I wonder if adding
get_file(req->file); // make sure that early completion is safe
right after
req->file = fget(iocb->aio_fildes);
if (unlikely(!req->file))
return -EBADF;
paired with
fput(req->file);
right after out: in aio_poll() is needed... Am I missing something
obvious here? Christoph?
On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote:
> Maybe unrelated to this bug, but... What's to prevent a wakeup
> that happens just after we'd been added to a waitqueue by ->poll()
> triggering aio_poll_wake(), which gets to aio_poll_complete()
> with its fput() *before* we'd reached the end of ->poll() instance?
>
> I wonder if adding
> get_file(req->file); // make sure that early completion is safe
> right after
> req->file = fget(iocb->aio_fildes);
> if (unlikely(!req->file))
> return -EBADF;
> paired with
> fput(req->file);
> right after out: in aio_poll() is needed... Am I missing something
> obvious here? Christoph?
In more detail - normally IOCB_CMD_POLL handling goes like this:
1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
2) aio_poll() resolves the descriptor to struct file by
req->file = fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait()
had been called and only once") it locks the queue. That's what the extra
reference to iocb had been for - we know we can safely access it.
5) With queue locked, we check if ->woken has already been set to true
(by aio_poll_wake()) and, if it had been, we unlock the queue, drop
a reference to aio_kiocb and bugger off - at that point it's
the responsibility of aio_poll_wake() and the stuff called/scheduled by
it. That code will drop the reference to file in req->file, along
with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock
the queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with the
other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the queue
lock), so that wakeup won't get us. No async activity will be happening,
so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking safely.
And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually
we are saved by the still present reference from descriptor table -
that fput() is not the final one. But if another thread closes that
descriptor right after our fget() and wakeup does happen before
->poll() returns, we are in trouble - final fput() done while we
are in the middle of a method.
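Schematically, one possible interleaving (illustrative):

	aio_poll()                              another thread
	----------                              --------------
	req->file = fget(fd)   // f_count == 2
	                                        close(fd)  // f_count == 1
	vfs_poll(req->file, &apt.pt)
	  poll_wait() adds us to the queue
	  <wakeup fires>
	  aio_poll_wake() ends up in
	  aio_poll_complete() -> fput()         // f_count == 0, file freed
	  ... the rest of ->poll() runs on a freed file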
What we need is to hold on to the file reference the same way we do with
aio_kiocb. No need to store the reference to what we'd got in a separate
variable - req->file is never reassigned and we'd already made sure that
req won't be freed under us, so dropping the extra reference with
fput(req->file) is fine in all cases.
Fixes: bfe4037e722ec
Cc: [email protected]
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/aio.c b/fs/aio.c
index 3083180a54c8..7e88bfabdac2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
/* one for removal from waitqueue, one for this function */
refcount_set(&aiocb->ki_refcnt, 2);
+ get_file(req->file);
mask = vfs_poll(req->file, &apt.pt) & req->events;
if (unlikely(!req->head)) {
@@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
spin_unlock_irq(&ctx->ctx_lock);
out:
+ fput(req->file);
if (unlikely(apt.error)) {
fput(req->file);
return apt.error;
On 03/03/2019 07:18 AM, Al Viro wrote:
> Fixes: bfe4037e722ec
> Cc: [email protected]
> Signed-off-by: Al Viro <[email protected]>
> ---
> diff --git a/fs/aio.c b/fs/aio.c
> index 3083180a54c8..7e88bfabdac2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>
> /* one for removal from waitqueue, one for this function */
> refcount_set(&aiocb->ki_refcnt, 2);
> + get_file(req->file);
>
> mask = vfs_poll(req->file, &apt.pt) & req->events;
> if (unlikely(!req->head)) {
> @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
> spin_unlock_irq(&ctx->ctx_lock);
>
> out:
> + fput(req->file);
> if (unlikely(apt.error)) {
> fput(req->file);
> return apt.error;
>
Very nice changelog, Al, thanks for fixing this.
Reviewed-by: Eric Dumazet <[email protected]>
On Sun, Mar 3, 2019 at 7:18 AM Al Viro <[email protected]> wrote:
>
> > Maybe unrelated to this bug, but... What's to prevent a wakeup
> > that happens just after we'd been added to a waitqueue by ->poll()
> > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > with its fput() *before* we'd reached the end of ->poll() instance?
I'm assuming you're talking about the second vfs_poll() in
aio_poll_complete_work()? The one we call before we check for
"rew->cancelled" properly under the spinlock?
> 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
> 2) aio_poll() resolves the descriptor to struct file by
> req->file = fget(iocb->aio_fildes)
[...]
So honestly, the whole filp handling in aio looks overly complicated to me.
All the different operations do that silly fget/fput() dance, although
aio_read/write at least tried to make a common helper function for
handling it.
But as far as I can tell, they *all* could do:
- look up file in aio_submit() when allocating and creating the aio_kiocb
- drop the filp in 'iocb_put()' (which happens whether things
complete successfully or not).
and we'd have avoided a lot of complexity, and we'd have avoided this bug.
Your patch fixes the poll() case, but it does so by just letting the
existing complexity remain, and adding a second fget/fput pair in the
poll logic.
It would seem like it would be much better to rip all the complexity
out entirely, and replace it with sane, simple and obviously correct
code.
Hmm?
In other words, why wouldn't something like the attached work instead?
TOTALLY UNTESTED! It builds, and it looks sane, but maybe I'm
overlooking some obvious problem with it. But doesn't it look nice to
see
2 files changed, 41 insertions(+), 50 deletions(-)
with actual code reduction, and a fundamental simplification in
handling of the file pointer?
Linus
On Sun, Mar 3, 2019 at 11:44 AM Linus Torvalds
<[email protected]> wrote:
>
> But doesn't it look nice to see
>
> 2 files changed, 41 insertions(+), 50 deletions(-)
>
> with actual code reduction, and a fundamental simplification in
> handling of the file pointer?
A couple of the changes are "useless", and do the same thing as not
having them at all:
- struct inode *inode = file_inode(kiocb->ki_filp);
+ struct inode *inode = file_inode(iocb->ki_filp);
- file_end_write(kiocb->ki_filp);
+ file_end_write(iocb->ki_filp);
because the "ki_filp" ends up existing in both kiocb and iocb. At one
point of editing that file I decided to try to just remove it from the
sub-structs entirely and only keep it in the top-level structure, but
it needs to be inside the 'struct kiocb' anyway for all the other
users outside of fs/aio.c.
Anyway, I don't think the patch is wrong (although I haven't actually
_tested_ it) but I wanted to point out that those two one-liner
changes are just "noise" that doesn't matter for the working of the
patch.
In the above, we have 'kiocb' being the embedded 'struct kiocb', and
'iocb' is the 'struct aio_kiocb' that contains it. 'ki_filp' is the
exact same field in both cases.
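For reference, the aliasing described above ends up looking roughly
like this in the committed version (simplified sketch, not the literal
diff):

	/* every member of the union starts with a struct file *, so
	 * ki_filp aliases rw.ki_filp, fsync.file and poll.file */
	struct aio_kiocb {
		union {
			struct file		*ki_filp;
			struct kiocb		rw;	/* first member: ki_filp */
			struct fsync_iocb	fsync;	/* first member: file */
			struct poll_iocb	poll;	/* first member: file */
		};
		struct kioctx		*ki_ctx;
		/* ... */
	};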
Linus
On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote:
> On Sun, Mar 3, 2019 at 7:18 AM Al Viro <[email protected]> wrote:
> >
> > > Maybe unrelated to this bug, but... What's to prevent a wakeup
> > > that happens just after we'd been added to a waitqueue by ->poll()
> > > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > > with its fput() *before* we'd reached the end of ->poll() instance?
>
> I'm assuming you're talking about the second vfs_poll() in
> aio_poll_complete_work()? The one we call before we check for
> "rew->cancelled" properly under the spinlock?
No. The first one, right in aio_poll().
> But as far as I can tell, they *all* could do:
>
> - look up file in aio_submit() when allocating and creating the aio_kiocb
>
> - drop the filp in 'iocb_put()' (which happens whether things
> complete successfully or not).
>
> and we'd have avoided a lot of complexity, and we'd have avoided this bug.
>
> Your patch fixes the poll() case, but it does so by just letting the
> existing complexity remain, and adding a second fget/fput pair in the
> poll logic.
>
> It would seem like it would be much better to rip all the complexity
> out entirely, and replace it with sane, simple and obviously correct
> code.
>
> Hmm?
>
> In other words, why wouldn't something like the attached work instead?
I'll need to dig out the mail archive from last year, but IIRC this
had been considered and there'd been non-trivial problems. Give me
an hour or so and I'll dig that out (there'd been many rounds of
review, with long threads involved, some private, some on fsdevel).
> @@ -1060,6 +1071,7 @@ static inline void iocb_put(struct aio_kiocb *iocb)
> {
> if (refcount_read(&iocb->ki_refcnt) == 0 ||
> refcount_dec_and_test(&iocb->ki_refcnt)) {
> + fput(iocb->ki_filp);
Trivial oops here - it might be NULL.
On Sun, Mar 3, 2019 at 12:30 PM Al Viro <[email protected]> wrote:
>
> On Sun, Mar 03, 2019 at 11:44:33AM -0800, Linus Torvalds wrote:
> >
> > I'm assuming you're talking about the second vfs_poll() in
> > aio_poll_complete_work()? The one we call before we check for
> > "rew->cancelled" properly under the spinlock?
>
> No. The first one, right in aio_poll().
Ok, they are both confusing. The lifetime of that filp is just not
clear, with the whole "it could have been completed asynchronously"
issue.
> I'll need to dig out the mail archive from last year, but IIRC this
> had been considered and there'd been non-trivial problems. Give me
> an hour or so and I'll dig that out (there'd been many rounds of
> review, with long threads involved, some private, some on fsdevel).
Looking more at the patch, it still looks eminently sane to me. I fixed
the silly "filp can be NULL" issue you pointed out (I lazily thought we
allowed fput(NULL), but you're right, we definitely don't), and
simplified the patch to not do the unnecessary changes where we can
just access the file pointer multiple different ways.
And the resulting kernel boots fine, but I doubt I have anything that
actually uses io_submit(), so that doesn't mean much.
Slightly updated patch attached anyway, even smaller than before:
2 files changed, 36 insertions(+), 44 deletions(-)
with several of the new lines being comments about that file pointer
placement in the union.
Linus
On Sun, Mar 03, 2019 at 02:23:33PM -0800, Linus Torvalds wrote:
OK, having dug through the archives, the reasons were not strong.
So that part is OK...
> @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
> {
> if (refcount_read(&iocb->ki_refcnt) == 0 ||
> refcount_dec_and_test(&iocb->ki_refcnt)) {
> + if (iocb->ki_filp)
> + fput(iocb->ki_filp);
> percpu_ref_put(&iocb->ki_ctx->reqs);
> kmem_cache_free(kiocb_cachep, iocb);
> }
That reminds me - refcount_t here is rather ridiculous; what we
have is
* everything other than aio_poll: ki_refcnt is 0 all along
* aio_poll: originally 0, then set to 2, then two iocb_put()
are done (either both synchronous to aio_poll(), or one sync and one
async).
That's not a refcount at all. It's a flag, set only for aio_poll ones.
And that test in iocb_put() is "if flag is set, clear it and bugger off".
What's worse, AFAICS the callers in aio_poll() are buggered (see below)
> static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
> {
> - struct file *file = iocb->poll.file;
> -
> aio_complete(iocb, mangle_poll(mask), 0);
> - fput(file);
> }
No reason to keep that function at all now...
> - if (unlikely(apt.error)) {
> - fput(req->file);
> + if (unlikely(apt.error))
> return apt.error;
> - }
>
> if (mask)
> aio_poll_complete(aiocb, mask);
Looking at that thing... How does it manage to avoid leaks
when we try to use it on e.g. /dev/tty, which has
poll_wait(file, &tty->read_wait, wait);
poll_wait(file, &tty->write_wait, wait);
in n_tty_poll()?
AFAICS, we'll succeed adding it to the first queue, then have
aio_poll_queue_proc() fail and set apt.error to -EINVAL.
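For context, the queue-proc in question looks more or less like this in
5.0 (any second poll_wait() call finds ->head already set and bails out
with -EINVAL):

	static void
	aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
			struct poll_table_struct *p)
	{
		struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);

		/* multiple wait queues per file are not supported */
		if (unlikely(pt->iocb->poll.head)) {
			pt->error = -EINVAL;
			return;
		}

		pt->error = 0;
		pt->iocb->poll.head = head;
		add_wait_queue(head, &pt->iocb->poll.wait);
	}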
Suppose we are looking for EPOLLIN and there's nothing ready
to read. We'll go
mask = vfs_poll(req->file, &apt.pt) & req->events;
mask is 0.
if (unlikely(!req->head)) {
nope - it's &tty->read_wait, not NULL
/* we did not manage to set up a waitqueue, done */
goto out;
}
spin_lock_irq(&ctx->ctx_lock);
spin_lock(&req->head->lock);
if (req->woken) {
nope - no wakeups so far
/* wake_up context handles the rest */
mask = 0;
apt.error = 0;
} else if (mask || apt.error) {
apt.error is non-zero
/* if we get an error or a mask we are done */
WARN_ON_ONCE(list_empty(&req->wait.entry));
list_del_init(&req->wait.entry);
OK, removed from queue
} else {
/* actually waiting for an event */
list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
aiocb->ki_cancel = aio_poll_cancel;
}
spin_unlock(&req->head->lock);
spin_unlock_irq(&ctx->ctx_lock);
out:
if (unlikely(apt.error)) {
fput(req->file);
return apt.error;
... and we return -EINVAL to __io_submit_one(), where we hit
/*
* If ret is 0, we'd either done aio_complete() ourselves or have
* arranged for that to be done asynchronously. Anything non-zero
* means that we need to destroy req ourselves.
*/
if (ret)
goto out_put_req;
return 0;
out_put_req:
if (req->ki_eventfd)
eventfd_ctx_put(req->ki_eventfd);
iocb_put(req);
out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
and all knowledge of req is lost. But we'd done only *one* iocb_put() in
that case, and ->ki_refcnt had been set to 2 by aio_poll(). How could it
avoid a leak? The same goes for "->poll() returns something without
bothering to call poll_wait()" case, actually...
IOW, I would rather have aio_poll() (along with your fput-a-bit-later change)
do this -
out:
if (mask && !apt.error)
aio_complete(aiocb, mangle_poll(mask), 0);
iocb_put(aiocb);
return apt.error;
Comments?
On Sun, Mar 3, 2019 at 4:19 PM Al Viro <[email protected]> wrote:
>
> On Sun, Mar 03, 2019 at 01:55:02PM +0000, Al Viro wrote:
>
> > Maybe unrelated to this bug, but... What's to prevent a wakeup
> > that happens just after we'd been added to a waitqueue by ->poll()
> > triggering aio_poll_wake(), which gets to aio_poll_complete()
> > with its fput() *before* we'd reached the end of ->poll() instance?
> >
> > I wonder if adding
> > get_file(req->file); // make sure that early completion is safe
> > right after
> > req->file = fget(iocb->aio_fildes);
> > if (unlikely(!req->file))
> > return -EBADF;
> > paired with
> > fput(req->file);
> > right after out: in aio_poll() is needed... Am I missing something
> > obvious here? Christoph?
>
> In more details - normally IOCB_CMD_POLL handling looks so:
>
> 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll()
> 2) aio_poll() resolves the descriptor to struct file by
> req->file = fget(iocb->aio_fildes)
> 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
> aio_kiocb to 2 (bumps by 1, that is).
> 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait()
> had been called and only once") it locks the queue. That's what the extra
> reference to iocb had been for - we know we can safely access it.
> 5) With queue locked, we check if ->woken has already been set to true
> (by aio_poll_wake()) and, if it had been, we unlock the queue, drop
> a reference to aio_kiocb and bugger off - at that point it's
> a responsibility to aio_poll_wake() and the stuff called/scheduled by
> it. That code will drop the reference to file in req->file, along
> with the other reference to our aio_kiocb.
> 6) otherwise, we see whether we need to wait. If we do, we unlock
> the queue, drop one reference to aio_kiocb and go away - eventual
> wakeup (or cancel) will deal with the reference to file and with the
> other reference to aio_kiocb
> 7) otherwise we remove ourselves from waitqueue (still under the queue
> lock), so that wakeup won't get us. No async activity will be happening,
> so we can safely drop req->file and iocb ourselves.
>
> If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
> won't get freed under us, so we can do all the checks and locking safely.
> And we don't touch ->file if we detect that case.
>
> However, vfs_poll() most certainly *does* touch the file it had been
> given. So wakeup coming while we are still in ->poll() might end up
> doing fput() on that file. That case is not too rare, and usually
> we are saved by the still present reference from descriptor table -
> that fput() is not the final one. But if another thread closes that
> descriptor right after our fget() and wakeup does happen before
> ->poll() returns, we are in trouble - final fput() done while we
> are in the middle of a method.
>
> What we need is to hold on to the file reference the same way we do with
> aio_kiocb. No need to store the reference to what we'd got in a separate
> variable - req->file is never reassigned and we'd already made sure that
> req won't be freed under us, so dropping the extra reference with
> fput(req->file) is fine in all cases.
>
> Fixes: bfe4037e722ec
> Cc: [email protected]
> Signed-off-by: Al Viro <[email protected]>
Please add the Reported-by tag from the report.
If you don't add it, then either:
1. somebody (you) will need to remember to later go and associate the
fix with the report with the "#syz fix" command
2. or the bug will stay open and syzbot will never report
use-after-frees in unix_dgram_poll again (it's still the same already
reported bug)
3. or worse somebody will spend time re-debugging this bug just to
find that you already fixed it but did not include the tag
Thanks
> ---
> diff --git a/fs/aio.c b/fs/aio.c
> index 3083180a54c8..7e88bfabdac2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1767,6 +1767,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>
> /* one for removal from waitqueue, one for this function */
> refcount_set(&aiocb->ki_refcnt, 2);
> + get_file(req->file);
>
> mask = vfs_poll(req->file, &apt.pt) & req->events;
> if (unlikely(!req->head)) {
> @@ -1793,6 +1794,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
> spin_unlock_irq(&ctx->ctx_lock);
>
> out:
> + fput(req->file);
> if (unlikely(apt.error)) {
> fput(req->file);
> return apt.error;
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20190303151846.GQ2217%40ZenIV.linux.org.uk.
> For more options, visit https://groups.google.com/d/optout.
On Sun, Mar 3, 2019 at 6:36 PM Al Viro <[email protected]> wrote:
>
> OK, having dug through the archives, the reasons were not strong.
> So that part is OK...
I've committed the patch.
However, I didn't actually do the separate and independent cleanups:
> > @@ -1060,6 +1071,8 @@ static inline void iocb_put(struct aio_kiocb *iocb)
> > {
> > if (refcount_read(&iocb->ki_refcnt) == 0 ||
> > refcount_dec_and_test(&iocb->ki_refcnt)) {
> > + if (iocb->ki_filp)
> > + fput(iocb->ki_filp);
> > percpu_ref_put(&iocb->ki_ctx->reqs);
> > kmem_cache_free(kiocb_cachep, iocb);
> > }
>
> That reminds me - refcount_t here is rather ridiculous; what we
> have is
> * everything other than aio_poll: ki_refcnt is 0 all along
> * aio_poll: originally 0, then set to 2, then two iocb_put()
> are done (either both synchronous to aio_poll(), or one sync and one
> async).
>
> That's not a refcount at all. It's a flag, set only for aio_poll ones.
> And that test in iocb_put() is "if flag is set, clear it and bugger off".
>
> What's worse, AFAICS the callers in aio_poll() are buggered (see below)
Ok. Suggestions?
> > - if (unlikely(apt.error)) {
> > - fput(req->file);
> > + if (unlikely(apt.error))
> > return apt.error;
> > - }
> >
> > if (mask)
> > aio_poll_complete(aiocb, mask);
>
> Looking at that thing... How does it manage to avoid leaks
> when we try to use it on e.g. /dev/tty, which has
> poll_wait(file, &tty->read_wait, wait);
> poll_wait(file, &tty->write_wait, wait);
> in n_tty_poll()?
I don't think that's the only case that uses multiple poll entries.
It's how the poll() machinery is designed to be used, after all.
Although maybe everybody ends up using just a single wait-queue..
> AFAICS, we'll succeed adding it to the first queue, then have
> aio_poll_queue_proc() fail and set apt.error to -EINVAL.
Yeah, that's bogus, but documented. I guess nobody really expects to
use aio_poll on anything but a socket or something?
But your refcount leak looks valid. Hmm.
So yes, that whole ki_refcnt looks a bit odd and pointless. And
apparently buggy too.
> Comments?
Can we just
(a) remove ki_refcnt entirely
(b) remove the "iocb_put(aiocb);"
from aio_poll()?
Every actual successful io submission should end in aio_complete(), or
we free the aio iocb in the "out_put_req" error case. There's no point
in doing the refcount as you pointed out, and it seems to be actively
buggy.
Anyway, all of this (and your suggestion to just remove
"aio_poll_complete()" and just use "aio_complete()") is independent of
the file refcounting part, I feel. Which is why I just committed that
patch as-is (84c4e1f89fef "aio: simplify - and fix - fget/fput for
io_submit()").
Linus
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 5dd5f35d054c..5837a29e63fe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1066,6 +1066,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
static inline void iocb_put(struct aio_kiocb *iocb)
{
+ if (iocb->ki_eventfd)
+ eventfd_ctx_put(iocb->ki_eventfd);
if (iocb->ki_filp)
fput(iocb->ki_filp);
percpu_ref_put(&iocb->ki_ctx->reqs);
@@ -1142,10 +1144,8 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
* eventfd. The eventfd_signal() function is safe to be called
* from IRQ context.
*/
- if (iocb->ki_eventfd) {
+ if (iocb->ki_eventfd)
eventfd_signal(iocb->ki_eventfd, 1);
- eventfd_ctx_put(iocb->ki_eventfd);
- }
/*
* We have to order our ring_info tail store above and test
@@ -1819,18 +1819,19 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
+ struct eventfd_ctx *eventfd;
/*
* If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
* instance of the file* now. The file descriptor must be
* an eventfd() fd, and will be signaled for each completed
* event using the eventfd_signal() function.
*/
- req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
- if (IS_ERR(req->ki_eventfd)) {
- ret = PTR_ERR(req->ki_eventfd);
- req->ki_eventfd = NULL;
+ eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
+ if (IS_ERR(eventfd)) {
+ ret = PTR_ERR(eventfd);
goto out_put_req;
}
+ req->ki_eventfd = eventfd;
}
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
@@ -1881,8 +1882,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
return 0;
out_put_req:
- if (req->ki_eventfd)
- eventfd_ctx_put(req->ki_eventfd);
iocb_put(req);
out_put_reqs_available:
put_reqs_available(ctx, 1);
--
2.11.0
From: Al Viro <[email protected]>
We want iocb_put() happening on errors, to balance the extra reference
we'd taken. As it is, we end up with a leak. The rules should be
* error: iocb_put() to deal with the extra ref, return error,
let the caller do another iocb_put().
* async: iocb_put() to deal with the extra ref, return 0.
* no error, event present immediately: aio_poll_complete() to
report it, iocb_put() to deal with the extra ref, return 0.
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 3a8b894378e0..22b288997441 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1724,6 +1724,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
struct kioctx *ctx = aiocb->ki_ctx;
struct poll_iocb *req = &aiocb->poll;
struct aio_poll_table apt;
+ bool async = false;
__poll_t mask;
/* reject any unknown events outside the normal event mask. */
@@ -1760,30 +1761,26 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
spin_lock_irq(&ctx->ctx_lock);
spin_lock(&req->head->lock);
- if (req->woken) {
- /* wake_up context handles the rest */
- mask = 0;
+ if (req->woken) { /* already taken up by aio_poll_wake() */
+ async = true;
apt.error = 0;
- } else if (mask || apt.error) {
- /* if we get an error or a mask we are done */
- WARN_ON_ONCE(list_empty(&req->wait.entry));
- list_del_init(&req->wait.entry);
- } else {
- /* actually waiting for an event */
+ } else if (!mask && !apt.error) { /* actually waiting for an event */
list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
aiocb->ki_cancel = aio_poll_cancel;
+ async = true;
+ } else { /* if we get an error or a mask we are done */
+ WARN_ON_ONCE(list_empty(&req->wait.entry));
+ list_del_init(&req->wait.entry);
+ /* no wakeup in the future either; aiocb is ours to dispose of */
}
spin_unlock(&req->head->lock);
spin_unlock_irq(&ctx->ctx_lock);
out:
- if (unlikely(apt.error))
- return apt.error;
-
- if (mask)
+ if (async && !apt.error)
aio_poll_complete(aiocb, mask);
iocb_put(aiocb);
- return 0;
+ return apt.error;
}
static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
--
2.11.0
From: Al Viro <[email protected]>
The only reason for taking the extra ref to iocb is that we want
to access it after vfs_poll() and an early wakeup could have it
already completed by the time vfs_poll() returns.
That's very easy to avoid, though - we need to know which lock
to grab and, having grabbed it, we need to check if an early
wakeup has already happened. So let's just copy the reference
to waitqueue into aio_poll_table and instead of having the
"woken" flag in iocb, move it to aio_poll() stack frame and
put its address into iocb (and make sure it's cleared, so
aio_poll_wake() won't step onto it after aio_poll() exits).
That's enough to get rid of the refcount. In async case freeing
is done by aio_poll_complete() (and aio_poll() will only touch
the iocb past the vfs_poll() if it's guaranteed that aio_poll_complete()
has not happened yet), in all other cases we make sure that wakeups
hadn't and won't happen and deal with disposal of iocb ourselves.
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 55 +++++++++++++++++++++++++++----------------------------
1 file changed, 27 insertions(+), 28 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 22b288997441..ee062253e303 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -180,8 +180,8 @@ struct fsync_iocb {
struct poll_iocb {
struct file *file;
struct wait_queue_head *head;
+ bool *taken;
__poll_t events;
- bool woken;
bool cancelled;
struct wait_queue_entry wait;
struct work_struct work;
@@ -209,8 +209,6 @@ struct aio_kiocb {
struct list_head ki_list; /* the aio core uses this
* for cancellation */
- refcount_t ki_refcnt;
-
/*
* If the aio_resfd field of the userspace iocb is not zero,
* this is the underlying eventfd context to deliver events to.
@@ -1034,7 +1032,6 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
percpu_ref_get(&ctx->reqs);
req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
- refcount_set(&req->ki_refcnt, 0);
req->ki_eventfd = NULL;
return req;
}
@@ -1069,13 +1066,10 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
static inline void iocb_put(struct aio_kiocb *iocb)
{
- if (refcount_read(&iocb->ki_refcnt) == 0 ||
- refcount_dec_and_test(&iocb->ki_refcnt)) {
- if (iocb->ki_filp)
- fput(iocb->ki_filp);
- percpu_ref_put(&iocb->ki_ctx->reqs);
- kmem_cache_free(kiocb_cachep, iocb);
- }
+ if (iocb->ki_filp)
+ fput(iocb->ki_filp);
+ percpu_ref_put(&iocb->ki_ctx->reqs);
+ kmem_cache_free(kiocb_cachep, iocb);
}
static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
@@ -1672,8 +1666,10 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
if (mask && !(mask & req->events))
return 0;
- req->woken = true;
-
+ if (unlikely(req->taken)) {
+ *req->taken = true;
+ req->taken = NULL;
+ }
if (mask) {
/*
* Try to complete the iocb inline if we can. Use
@@ -1698,6 +1694,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
struct aio_poll_table {
struct poll_table_struct pt;
+ struct wait_queue_head *head;
struct aio_kiocb *iocb;
int error;
};
@@ -1715,7 +1712,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
}
pt->error = 0;
- pt->iocb->poll.head = head;
+ pt->head = pt->iocb->poll.head = head;
add_wait_queue(head, &pt->iocb->poll.wait);
}
@@ -1738,7 +1735,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
req->head = NULL;
- req->woken = false;
+ req->taken = &async;
req->cancelled = false;
apt.pt._qproc = aio_poll_queue_proc;
@@ -1750,36 +1747,38 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
INIT_LIST_HEAD(&req->wait.entry);
init_waitqueue_func_entry(&req->wait, aio_poll_wake);
- /* one for removal from waitqueue, one for this function */
- refcount_set(&aiocb->ki_refcnt, 2);
-
- mask = vfs_poll(req->file, &apt.pt) & req->events;
- if (unlikely(!req->head)) {
+ mask = req->events;
+ mask &= vfs_poll(req->file, &apt.pt);
+ /*
+ * Careful: we might've been put into waitqueue *and* already
+ * woken up before vfs_poll() returns. The caller is holding
+ * a reference to file, so it couldn't have been killed under
+ * us, no matter what. However, in case of early wakeup, @req
+ * might be already gone by now.
+ */
+ if (unlikely(!apt.head)) {
/* we did not manage to set up a waitqueue, done */
goto out;
}
-
spin_lock_irq(&ctx->ctx_lock);
- spin_lock(&req->head->lock);
- if (req->woken) { /* already taken up by aio_poll_wake() */
- async = true;
+ spin_lock(&apt.head->lock);
+ if (async) { /* already taken up by aio_poll_wake() */
apt.error = 0;
} else if (!mask && !apt.error) { /* actually waiting for an event */
list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
aiocb->ki_cancel = aio_poll_cancel;
+ req->taken = NULL;
async = true;
} else { /* if we get an error or a mask we are done */
WARN_ON_ONCE(list_empty(&req->wait.entry));
list_del_init(&req->wait.entry);
/* no wakeup in the future either; aiocb is ours to dispose of */
}
- spin_unlock(&req->head->lock);
+ spin_unlock(&apt.head->lock);
spin_unlock_irq(&ctx->ctx_lock);
-
out:
- if (async && !apt.error)
+ if (!async && !apt.error)
aio_poll_complete(aiocb, mask);
- iocb_put(aiocb);
return apt.error;
}
--
2.11.0
From: Al Viro <[email protected]>
that ssize_t is a rudiment of earlier calling conventions; it's been
used only to pass 0 and -E... since last autumn.
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index ee062253e303..5dd5f35d054c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1508,13 +1508,13 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
}
}
-static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
+static int aio_read(struct kiocb *req, const struct iocb *iocb,
bool vectored, bool compat)
{
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
struct file *file;
- ssize_t ret;
+ int ret;
ret = aio_prep_rw(req, iocb);
if (ret)
@@ -1536,13 +1536,13 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
return ret;
}
-static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
+static int aio_write(struct kiocb *req, const struct iocb *iocb,
bool vectored, bool compat)
{
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
struct file *file;
- ssize_t ret;
+ int ret;
ret = aio_prep_rw(req, iocb);
if (ret)
@@ -1716,7 +1716,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
add_wait_queue(head, &pt->iocb->poll.wait);
}
-static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
+static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
{
struct kioctx *ctx = aiocb->ki_ctx;
struct poll_iocb *req = &aiocb->poll;
@@ -1787,7 +1787,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
{
struct aio_kiocb *req;
struct file *file;
- ssize_t ret;
+ int ret;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved2)) {
--
2.11.0
From: Al Viro <[email protected]>
makes for somewhat cleaner control flow in __io_submit_one()
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 86 +++++++++++++++++++++++++++++++---------------------------------
1 file changed, 42 insertions(+), 44 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index af51b1360305..6993581b77b2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1788,36 +1788,15 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
}
static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
- struct iocb __user *user_iocb, bool compat)
+ struct iocb __user *user_iocb, struct aio_kiocb *req,
+ bool compat)
{
- struct aio_kiocb *req;
struct file *file;
int ret;
- /* enforce forwards compatibility on users */
- if (unlikely(iocb->aio_reserved2)) {
- pr_debug("EINVAL: reserve field set\n");
- return -EINVAL;
- }
-
- /* prevent overflows */
- if (unlikely(
- (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
- (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
- ((ssize_t)iocb->aio_nbytes < 0)
- )) {
- pr_debug("EINVAL: overflow check\n");
- return -EINVAL;
- }
-
- req = aio_get_req(ctx);
- if (unlikely(!req))
- return -EAGAIN;
-
req->ki_filp = fget(iocb->aio_fildes);
- ret = -EBADF;
if (unlikely(!req->ki_filp))
- goto out_put_req;
+ return -EBADF;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
struct eventfd_ctx *eventfd;
@@ -1828,17 +1807,15 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
* event using the eventfd_signal() function.
*/
eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
- if (IS_ERR(eventfd)) {
- ret = PTR_ERR(eventfd);
- goto out_put_req;
- }
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
req->ki_eventfd = eventfd;
}
- ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
- if (unlikely(ret)) {
+ if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
pr_debug("EFAULT: aio_key\n");
- goto out_put_req;
+ return -EFAULT;
}
req->ki_user_iocb = user_iocb;
@@ -1873,30 +1850,51 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
break;
}
fput(file);
-
- /*
- * If ret is 0, we'd either done aio_complete() ourselves or have
- * arranged for that to be done asynchronously. Anything non-zero
- * means that we need to destroy req ourselves.
- */
- if (ret)
- goto out_put_req;
- return 0;
-out_put_req:
- iocb_put(req);
- put_reqs_available(ctx, 1);
return ret;
}
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
bool compat)
{
+ struct aio_kiocb *req;
struct iocb iocb;
+ int err;
if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
return -EFAULT;
- return __io_submit_one(ctx, &iocb, user_iocb, compat);
+ /* enforce forwards compatibility on users */
+ if (unlikely(iocb.aio_reserved2)) {
+ pr_debug("EINVAL: reserve field set\n");
+ return -EINVAL;
+ }
+
+ /* prevent overflows */
+ if (unlikely(
+ (iocb.aio_buf != (unsigned long)iocb.aio_buf) ||
+ (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) ||
+ ((ssize_t)iocb.aio_nbytes < 0)
+ )) {
+ pr_debug("EINVAL: overflow check\n");
+ return -EINVAL;
+ }
+
+ req = aio_get_req(ctx);
+ if (unlikely(!req))
+ return -EAGAIN;
+
+ err = __io_submit_one(ctx, &iocb, user_iocb, req, compat);
+
+ /*
+ * If err is 0, we'd either done aio_complete() ourselves or have
+ * arranged for that to be done asynchronously. Anything non-zero
+ * means that we need to destroy req ourselves.
+ */
+ if (unlikely(err)) {
+ iocb_put(req);
+ put_reqs_available(ctx, 1);
+ }
+ return err;
}
/* sys_io_submit:
--
2.11.0
From: Al Viro <[email protected]>
In case of early wakeups, aio_poll() assumes that aio_poll_complete()
has either already happened or is imminent. In that case we do not
want to put iocb on the list of cancellables. However, ignored
wakeups need to be treated as if wakeup has not happened at all.
Trivially fixed by having aio_poll_wake() set ->woken only after
it's committed to taking iocb out of the waitqueue.
Spotted-by: zhengbin <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index ea30b78187ed..3a8b894378e0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1668,13 +1668,13 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
__poll_t mask = key_to_poll(key);
unsigned long flags;
+ /* for instances that support it check for an event match first: */
+ if (mask && !(mask & req->events))
+ return 0;
+
req->woken = true;
- /* for instances that support it check for an event match first: */
if (mask) {
- if (!(mask & req->events))
- return 0;
-
/*
* Try to complete the iocb inline if we can. Use
* irqsave/irqrestore because not all filesystems (e.g. fuse)
--
2.11.0
From: Al Viro <[email protected]>
simplifies the caller
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 5837a29e63fe..af51b1360305 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1029,6 +1029,11 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
return NULL;
+ if (unlikely(!get_reqs_available(ctx))) {
+ kfree(req);
+ return NULL;
+ }
+
percpu_ref_get(&ctx->reqs);
req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
@@ -1805,13 +1810,9 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return -EINVAL;
}
- if (!get_reqs_available(ctx))
- return -EAGAIN;
-
- ret = -EAGAIN;
req = aio_get_req(ctx);
if (unlikely(!req))
- goto out_put_reqs_available;
+ return -EAGAIN;
req->ki_filp = fget(iocb->aio_fildes);
ret = -EBADF;
@@ -1883,7 +1884,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return 0;
out_put_req:
iocb_put(req);
-out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
}
--
2.11.0
From: Al Viro <[email protected]>
"aio: remove the extra get_file/fput pair in io_submit_one" was
too optimistic - not dereferencing file pointer after e.g.
->write_iter() returns is not enough; that reference might've been
the only thing that kept alive objects that are referenced
*before* the method returns. Such as inode, for example...
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 3d9669d011b9..ea30b78187ed 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1790,6 +1790,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
struct iocb __user *user_iocb, bool compat)
{
struct aio_kiocb *req;
+ struct file *file;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1844,6 +1845,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
req->ki_user_iocb = user_iocb;
req->ki_user_data = iocb->aio_data;
+ file = get_file(req->ki_filp); /* req can die too early */
switch (iocb->aio_lio_opcode) {
case IOCB_CMD_PREAD:
@@ -1872,6 +1874,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
ret = -EINVAL;
break;
}
+ fput(file);
/*
* If ret is 0, we'd either done aio_complete() ourselves or have
--
2.11.0
On Wed, Mar 6, 2019 at 4:03 PM Al Viro <[email protected]> wrote:
>
> From: Al Viro <[email protected]>
>
> "aio: remove the extra get_file/fput pair in io_submit_one" was
> too optimistic - not dereferencing file pointer after e.g.
> ->write_iter() returns is not enough; that reference might've been
> the only thing that kept alive objects that are referenced
> *before* the method returns. Such as inode, for example...
I still think that this is actually _worse_ than just having the
refcount on the req instead.
As it is, we have that completely insane "ref can go away from under
us", because nothing keeps that around, which then causes all those
other crazy issues with "woken" etc garbage.
I think we should be able to get rid of those entirely. Make the
poll() case just return zero if it has added the entry successfully to
poll queue. No need for "woken", no need for all that odd "oh, but
now the req might no longer exist".
The refcount wasn't the problem. Everything *else* was the problem,
including only using the refcount for the poll case etc.
Linus
On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote:
> On Wed, Mar 6, 2019 at 4:03 PM Al Viro <[email protected]> wrote:
> >
> > From: Al Viro <[email protected]>
> >
> > "aio: remove the extra get_file/fput pair in io_submit_one" was
> > too optimistic - not dereferencing file pointer after e.g.
> > ->write_iter() returns is not enough; that reference might've been
> > the only thing that kept alive objects that are referenced
> > *before* the method returns. Such as inode, for example...
>
> I still; think that this is actually _worse_ than just having the
> refcount on the req instead.
>
> As it is, we have that completely insane "ref can go away from under
> us", because nothing keeps that around, which then causes all those
> other crazy issues with "woken" etc garbage.
>
> I think we should be able to get rid of those entirely. Make the
> poll() case just return zero if it has added the entry successfully to
> poll queue. No need for "woken", no need for all that odd "oh, but
> now the req might no longer exist".
Not really. Sure, you can get rid of "might no longer exist"
considerations, but you still need to decide which way we want to
handle it. There are four cases:
* it's already taken up; don't put on the list for possible
cancel, don't call aio_complete().
* will eventually be woken up; put on the list for possible
cancel, don't call aio_complete().
* wanted to be on several queues, fortunately not woken up
yet. Make sure it's gone from queue, return an error.
* none of the above, and ->poll() has reported what we wanted
from the very beginning. Remove from queue, call aio_complete().
You'll need some logic to handle that. I can buy the "if we know
the req is still alive, we can check if it's still queued instead of
a separate woken flag", but it won't win you much ;-/
On Thu, Mar 07, 2019 at 12:41:59AM +0000, Al Viro wrote:
> On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote:
> > On Wed, Mar 6, 2019 at 4:03 PM Al Viro <[email protected]> wrote:
> > >
> > > From: Al Viro <[email protected]>
> > >
> > > "aio: remove the extra get_file/fput pair in io_submit_one" was
> > > too optimistic - not dereferencing file pointer after e.g.
> > > ->write_iter() returns is not enough; that reference might've been
> > > the only thing that kept alive objects that are referenced
> > > *before* the method returns. Such as inode, for example...
> >
> > I still; think that this is actually _worse_ than just having the
> > refcount on the req instead.
> >
> > As it is, we have that completely insane "ref can go away from under
> > us", because nothing keeps that around, which then causes all those
> > other crazy issues with "woken" etc garbage.
> >
> > I think we should be able to get rid of those entirely. Make the
> > poll() case just return zero if it has added the entry successfully to
> > poll queue. No need for "woken", no need for all that odd "oh, but
> > now the req might no longer exist".
>
> Not really. Sure, you can get rid of "might no longer exist"
> considerations, but you still need to decide which way do we want to
> handle it. There are 3 cases:
> * it's already taken up; don't put on the list for possible
> cancel, don't call aio_complete().
> * will eventually be woken up; put on the list for possible
> cancle, don't call aio_complete().
> * wanted to be on several queues, fortunately not woken up
> yet. Make sure it's gone from queue, return an error.
> * none of the above, and ->poll() has reported what we wanted
> from the very beginning. Remove from queue, call aio_complete().
>
> You'll need some logics to handle that. I can buy the "if we know
> the req is still alive, we can check if it's still queued instead of
> separate woken flag", but but it won't win you much ;-/
If anything, the one good reason for refcount would be the risk that
some ->read_iter() or ->write_iter() will try to dereference iocb
after having decided to return -EIOCBQUEUED and submitted all bios.
I think that doesn't happen, but making sure it doesn't would be
a good argument in favour of that refcount.
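For illustration, the shape of the hazard (a made-up instance, not a
real driver):

	static ssize_t example_write_iter(struct kiocb *iocb, struct iov_iter *from)
	{
		/* ... set up and submit the bio(s); once the last one is
		 * in flight, its completion can call aio_complete() and
		 * free the aio_kiocb that embeds this kiocb ... */
		pr_debug("queued at %lld\n", iocb->ki_pos);	/* potential UAF */
		return -EIOCBQUEUED;
	}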
On Thu, Mar 07, 2019 at 12:48:28AM +0000, Al Viro wrote:
> On Thu, Mar 07, 2019 at 12:41:59AM +0000, Al Viro wrote:
> > On Wed, Mar 06, 2019 at 04:23:04PM -0800, Linus Torvalds wrote:
> > > On Wed, Mar 6, 2019 at 4:03 PM Al Viro <[email protected]> wrote:
> > > >
> > > > From: Al Viro <[email protected]>
> > > >
> > > > "aio: remove the extra get_file/fput pair in io_submit_one" was
> > > > too optimistic - not dereferencing file pointer after e.g.
> > > > ->write_iter() returns is not enough; that reference might've been
> > > > the only thing that kept alive objects that are referenced
> > > > *before* the method returns. Such as inode, for example...
> > >
> > > I still; think that this is actually _worse_ than just having the
> > > refcount on the req instead.
> > >
> > > As it is, we have that completely insane "ref can go away from under
> > > us", because nothing keeps that around, which then causes all those
> > > other crazy issues with "woken" etc garbage.
> > >
> > > I think we should be able to get rid of those entirely. Make the
> > > poll() case just return zero if it has added the entry successfully to
> > > poll queue. No need for "woken", no need for all that odd "oh, but
> > > now the req might no longer exist".
> >
> > Not really. Sure, you can get rid of "might no longer exist"
> > considerations, but you still need to decide which way do we want to
> > handle it. There are 3 cases:
> > * it's already taken up; don't put on the list for possible
> > cancel, don't call aio_complete().
> > * will eventually be woken up; put on the list for possible
> > cancle, don't call aio_complete().
> > * wanted to be on several queues, fortunately not woken up
> > yet. Make sure it's gone from queue, return an error.
> > * none of the above, and ->poll() has reported what we wanted
> > from the very beginning. Remove from queue, call aio_complete().
> >
> > You'll need some logics to handle that. I can buy the "if we know
> > the req is still alive, we can check if it's still queued instead of
> > separate woken flag", but but it won't win you much ;-/
>
> If anything, the one good reason for refcount would be the risk that
> some ->read_iter() or ->write_iter() will try to dereference iocb
> after having decided to return -EIOCBQUEUED and submitted all bios.
> I think that doesn't happen, but making sure it doesn't would be
> a good argument in favour of that refcount.
*grumble*
It is a good argument, unfortunately ;-/ Proof that instances do not
step into that is rather subtle and won't take much to break. OK...
I'll try to massage that series on top of your patch; I still hate the
post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour
and I'll have something to post.
On Wed, Mar 6, 2019 at 5:20 PM Al Viro <[email protected]> wrote:
>
> I'll try to massage that series on top of your patch; I still hate the
> post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour
> and I'll have something to post.
No inherent hurry, I sent the ping just to make sure it hadn't gotten lost.
And yeah, I think the post-vfs_poll() logic cannot possibly be
necessary. My gut feel is that *if* we have the refcounting right,
then we should be able to just let the wakeup come in at any later
point, and ordering shouldn't matter all that much, and we shouldn't
even need any locking.
I'd like to think that it can be done with something like "just 'or'
in the mask atomically" (so that we don't care about ordering between
the synchronous vfs_poll() and the async poll wakeup), together with
"when refcount goes to zero, finish the thing off and complete it" (so
that we don't care who finishes first).
No "woken" logic, no "who fired first" logic, no BS. Just make the
operations work regardless of ordering.
And maybe it can't be done. But the current model seems just so hacky
that it can't be the right model.
Linus
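A rough sketch of the model being proposed (illustration only; it
assumes an atomic poll_mask field added to struct aio_kiocb and is not
the code that eventually went in):

static inline void iocb_put(struct aio_kiocb *iocb)
{
	if (refcount_dec_and_test(&iocb->ki_refcnt)) {
		aio_complete(iocb);	/* consumes iocb->poll_mask */
		iocb_destroy(iocb);
	}
}

/*
 * Called by both the submitter (after vfs_poll()) and the wakeup;
 * ordering between them stops mattering because events are merged
 * with an atomic 'or' and completion happens exactly once, at the
 * final reference drop.
 */
static void aio_poll_report(struct aio_kiocb *iocb, __poll_t mask)
{
	atomic_or((__force int)mask, &iocb->poll_mask);
	iocb_put(iocb);
}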
+ if (async && !apt.error)   ---> maybe this should be "if (!async && !apt.error)"?
On 2019/3/7 8:03, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> We want iocb_put() happening on errors, to balance the extra reference
> we'd taken. As it is, we end up with a leak. The rules should be
> * error: iocb_put() to deal with the extra ref, return error,
> let the caller do another iocb_put().
> * async: iocb_put() to deal with the extra ref, return 0.
> * no error, event present immediately: aio_poll_complete() to
> report it, iocb_put() to deal with the extra ref, return 0.
>
> Signed-off-by: Al Viro <[email protected]>
> ---
> fs/aio.c | 25 +++++++++++--------------
> 1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 3a8b894378e0..22b288997441 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1724,6 +1724,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
> struct kioctx *ctx = aiocb->ki_ctx;
> struct poll_iocb *req = &aiocb->poll;
> struct aio_poll_table apt;
> + bool async = false;
> __poll_t mask;
>
> /* reject any unknown events outside the normal event mask. */
> @@ -1760,30 +1761,26 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>
> spin_lock_irq(&ctx->ctx_lock);
> spin_lock(&req->head->lock);
> - if (req->woken) {
> - /* wake_up context handles the rest */
> - mask = 0;
> + if (req->woken) { /* already taken up by aio_poll_wake() */
> + async = true;
> apt.error = 0;
> - } else if (mask || apt.error) {
> - /* if we get an error or a mask we are done */
> - WARN_ON_ONCE(list_empty(&req->wait.entry));
> - list_del_init(&req->wait.entry);
> - } else {
> - /* actually waiting for an event */
> + } else if (!mask && !apt.error) { /* actually waiting for an event */
> list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
> aiocb->ki_cancel = aio_poll_cancel;
> + async = true;
> + } else { /* if we get an error or a mask we are done */
> + WARN_ON_ONCE(list_empty(&req->wait.entry));
> + list_del_init(&req->wait.entry);
> + /* no wakeup in the future either; aiocb is ours to dispose of */
> }
> spin_unlock(&req->head->lock);
> spin_unlock_irq(&ctx->ctx_lock);
>
> out:
> - if (unlikely(apt.error))
> - return apt.error;
> -
> - if (mask)
> + if (async && !apt.error)
> aio_poll_complete(aiocb, mask);
> iocb_put(aiocb);
> - return 0;
> + return apt.error;
> }
>
> static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
>
On Thu, Mar 07, 2019 at 12:03:10AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> In case of early wakeups, aio_poll() assumes that aio_poll_complete()
> has either already happened or is imminent. In that case we do not
> want to put iocb on the list of cancellables. However, ignored
> wakeups need to be treated as if wakeup has not happened at all.
> Trivially fixed by having aio_poll_wake() set ->woken only after
> it's committed to taking iocb out of the waitqueue.
>
> Spotted-by: zhengbin <[email protected]>
> Signed-off-by: Al Viro <[email protected]>
... and unfortunately it's worse than just that - what both of us
have missed is that one could have non-specific wakeup + schedule_work +
aio_poll_complete_work() rechecking ->poll(), seeing nothing of
interest and reinserting into queue. All before vfs_poll() manages
to return into aio_poll(). The window is harder to hit, but it's
still there, with exact same "failed to add to cancel list" kind of bug
if we do hit it ;-/
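Spelled out as an interleaving (reconstruction, for illustration):

    submitter                          wakeup / workqueue
    ---------                          ------------------
    vfs_poll() puts req on the
    waitqueue
                                       non-specific wakeup (mask == 0):
                                           list_del_init(&req->wait.entry)
                                           ->woken = true
                                           schedule_work(&req->work)
                                       aio_poll_complete_work():
                                           ->poll() reports nothing of
                                           interest, add_wait_queue()
                                           puts req back
    aio_poll() resumes, sees ->woken
    set and assumes completion is
    done or imminent; the iocb never
    makes it onto the cancel list,
    yet it is back on the waitqueue
    and may stick around indefinitely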
On Wed, Mar 06, 2019 at 05:30:21PM -0800, Linus Torvalds wrote:
> On Wed, Mar 6, 2019 at 5:20 PM Al Viro <[email protected]> wrote:
> >
> > I'll try to massage that series on top of your patch; I still hate the
> > post-vfs_poll() logics in aio_poll() ;-/ Give me about half an hour
> > and I'll have something to post.
>
> No inherent hurry, I sent the ping just to make sure it hadn't gotten lost.
>
> And yeah, I think the post-vfs_poll() logic cannot possibly be
> necessary. My gut feel is that *if* we have the refcounting right,
> then we should be able to just let the wakeup come in at any later
> point, and ordering shouldn't matter all that much, and we shouldn't
> even need any locking.
>
> I'd like to think that it can be done with something like "just 'or'
> in the mask atomically" (so that we don't care about ordering between
> the synchronous vfs_poll() and the async poll wakeup), together with
> "when refcount goes to zero, finish the thing off and complete it" (so
> that we don't care who finishes first).
>
> No "woken" logic, no "who fired first" logic, no BS. Just make the
> operations work regardless of ordering.
>
> And maybe it can't be done. But the current model seems just so hacky
> that it can't be the right model.
Umm... It is kinda-sorta doable; we do need something vaguely similar
to ->woken ("should we add it to the list of cancellables, or is the
async reference already gone?"), but other than that it seems to be
feasible.
See vfs.git#work.aio; the crucial bits are in these commits:
keep io_event in aio_kiocb
get rid of aio_complete() res/res2 arguments
move aio_complete() to final iocb_put(), try to fix aio_poll() logics
The first two are preparations, the last is where the fixes (hopefully)
happen.
The logics in aio_poll() after vfs_poll():
* we might want to steal the async reference (e.g. due to event
returned from the very beginning, or due to attempt to put on more than
one waitqueue, which makes results unreliable). That's _NOT_ possible
if the thing had been put on a waitqueue but currently isn't there.
It might be either due to early wakeup having done everything or the
same having scheduled aio_poll_complete_work(). In either case, the
best we can do is to ignore the return value of vfs_poll() and, in
case of error, mark the sucker cancelled. We *can't* return an error
in that case.
* if we want to and can steal the async reference, rip it from
waitqueue; otherwise, put it on the "cancellable" list, unless it's
already gone or unless we are simulating the cancel ourselves.
* if vfs_poll() has reported something we want and we have
successfully stolen the iocb, store the result in it, drop the
reference we'd taken over and return 0
Comments?
So, what should we do?
On 2019/3/7 10:18, Al Viro wrote:
> On Thu, Mar 07, 2019 at 12:03:10AM +0000, Al Viro wrote:
>> From: Al Viro <[email protected]>
>>
>> In case of early wakeups, aio_poll() assumes that aio_poll_complete()
>> has either already happened or is imminent. In that case we do not
>> want to put iocb on the list of cancellables. However, ignored
>> wakeups need to be treated as if wakeup has not happened at all.
>> Trivially fixed by having aio_poll_wake() set ->woken only after
>> it's committed to taking iocb out of the waitqueue.
>>
>> Spotted-by: zhengbin <[email protected]>
>> Signed-off-by: Al Viro <[email protected]>
>
> ... and unfortunately it's worse than just that - what both of us
> have missed is that one could have non-specific wakeup + schedule_work +
> aio_poll_complete_work() rechecking ->poll(), seeing nothing of
> interest and reinserting into queue. All before vfs_poll() manages
> to return into aio_poll(). The window is harder to hit, but it's
> still there, with exact same "failed to add to cancel list" kind of bug
> if we do hit it ;-/
On Fri, Mar 08, 2019 at 03:36:50AM +0000, Al Viro wrote:
> See vfs.git#work.aio; the crucial bits are in these commits:
> keep io_event in aio_kiocb
> get rid of aio_complete() res/res2 arguments
> move aio_complete() to final iocb_put(), try to fix aio_poll() logics
> The first two are preparations, the last is where the fixes (hopefully)
> happen.
Looks sensible. I'll try to run the tests over it, and I've added
Avi so that maybe he can make sure that scylladb is also happy with it;
that was usually the best way to find aio poll bugs.
On Fri, Mar 08, 2019 at 03:36:50AM +0000, Al Viro wrote:
> See vfs.git#work.aio; the crucial bits are in these commits:
> keep io_event in aio_kiocb
> get rid of aio_complete() res/res2 arguments
> move aio_complete() to final iocb_put(), try to fix aio_poll() logics
> The first two are preparations, the last is where the fixes (hopefully)
> happen.
OK, refactored, cleaned up and force-pushed. Current state:
Al Viro (7):
keep io_event in aio_kiocb
aio: store event at final iocb_put()
Fix aio_poll() races
make aio_read()/aio_write() return int
move dropping ->ki_eventfd into iocb_destroy()
deal with get_reqs_available() in aio_get_req() itself
aio: move sanity checks and request allocation to io_submit_one()
Linus Torvalds (1):
pin iocb through aio.
fs/aio.c | 327 ++++++++++++++++++++++++++++-----------------------------------
1 file changed, 146 insertions(+), 181 deletions(-)
From: Linus Torvalds <[email protected]>
aio_poll() is not the only case that needs file pinned; worse, while
aio_read()/aio_write() can live without pinning iocb itself, the
proof is rather brittle and can easily break on later changes.
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 3d9669d011b9..363d7d7c8bff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1022,6 +1022,9 @@ static bool get_reqs_available(struct kioctx *ctx)
/* aio_get_req
* Allocate a slot for an aio request.
* Returns NULL if no requests are free.
+ *
+ * The refcount is initialized to 2 - one for the async op completion,
+ * one for the synchronous code that does this.
*/
static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
{
@@ -1034,7 +1037,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
percpu_ref_get(&ctx->reqs);
req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
- refcount_set(&req->ki_refcnt, 0);
+ refcount_set(&req->ki_refcnt, 2);
req->ki_eventfd = NULL;
return req;
}
@@ -1067,15 +1070,18 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}
+static inline void iocb_destroy(struct aio_kiocb *iocb)
+{
+ if (iocb->ki_filp)
+ fput(iocb->ki_filp);
+ percpu_ref_put(&iocb->ki_ctx->reqs);
+ kmem_cache_free(kiocb_cachep, iocb);
+}
+
static inline void iocb_put(struct aio_kiocb *iocb)
{
- if (refcount_read(&iocb->ki_refcnt) == 0 ||
- refcount_dec_and_test(&iocb->ki_refcnt)) {
- if (iocb->ki_filp)
- fput(iocb->ki_filp);
- percpu_ref_put(&iocb->ki_ctx->reqs);
- kmem_cache_free(kiocb_cachep, iocb);
- }
+ if (refcount_dec_and_test(&iocb->ki_refcnt))
+ iocb_destroy(iocb);
}
static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
@@ -1749,9 +1755,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
INIT_LIST_HEAD(&req->wait.entry);
init_waitqueue_func_entry(&req->wait, aio_poll_wake);
- /* one for removal from waitqueue, one for this function */
- refcount_set(&aiocb->ki_refcnt, 2);
-
mask = vfs_poll(req->file, &apt.pt) & req->events;
if (unlikely(!req->head)) {
/* we did not manage to set up a waitqueue, done */
@@ -1782,7 +1785,6 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
if (mask)
aio_poll_complete(aiocb, mask);
- iocb_put(aiocb);
return 0;
}
@@ -1873,18 +1875,21 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
break;
}
+ /* Done with the synchronous reference */
+ iocb_put(req);
+
/*
* If ret is 0, we'd either done aio_complete() ourselves or have
* arranged for that to be done asynchronously. Anything non-zero
* means that we need to destroy req ourselves.
*/
- if (ret)
- goto out_put_req;
- return 0;
+ if (!ret)
+ return 0;
+
out_put_req:
if (req->ki_eventfd)
eventfd_ctx_put(req->ki_eventfd);
- iocb_put(req);
+ iocb_destroy(req);
out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
--
2.11.0
From: Al Viro <[email protected]>
Instead of having aio_complete() set ->ki_res.{res,res2}, do that
explicitly in its callers, drop the reference (as aio_complete()
used to do) and delay the rest until the final iocb_put().
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 45 ++++++++++++++++++++-------------------------
1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 2249a7a1d6b3..b9c4c1894020 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
kmem_cache_free(kiocb_cachep, iocb);
}
-static inline void iocb_put(struct aio_kiocb *iocb)
-{
- if (refcount_dec_and_test(&iocb->ki_refcnt))
- iocb_destroy(iocb);
-}
-
-static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
- long res, long res2)
-{
- iocb->ki_res.res = res;
- iocb->ki_res.res2 = res2;
- *ev = iocb->ki_res;
-}
-
/* aio_complete
* Called when the io request on the given iocb is complete.
*/
-static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
+static void aio_complete(struct aio_kiocb *iocb)
{
struct kioctx *ctx = iocb->ki_ctx;
struct aio_ring *ring;
@@ -1118,14 +1104,14 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;
- aio_fill_event(event, iocb, res, res2);
+ *event = iocb->ki_res;
kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
- pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
+ pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n",
ctx, tail, iocb, (void __user *)iocb->ki_res.obj, iocb->ki_res.data,
- res, res2);
+ iocb->ki_res.res, iocb->ki_res.res2);
/* after flagging the request as done, we
* must never even look at it again
@@ -1167,7 +1153,14 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
if (waitqueue_active(&ctx->wait))
wake_up(&ctx->wait);
- iocb_put(iocb);
+}
+
+static inline void iocb_put(struct aio_kiocb *iocb)
+{
+ if (refcount_dec_and_test(&iocb->ki_refcnt)) {
+ aio_complete(iocb);
+ iocb_destroy(iocb);
+ }
}
/* aio_read_events_ring
@@ -1441,7 +1434,9 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
file_end_write(kiocb->ki_filp);
}
- aio_complete(iocb, res, res2);
+ iocb->ki_res.res = res;
+ iocb->ki_res.res2 = res2;
+ iocb_put(iocb);
}
static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
@@ -1589,11 +1584,10 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
static void aio_fsync_work(struct work_struct *work)
{
- struct fsync_iocb *req = container_of(work, struct fsync_iocb, work);
- int ret;
+ struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, fsync.work);
- ret = vfs_fsync(req->file, req->datasync);
- aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
+ iocb->ki_res.res = vfs_fsync(iocb->fsync.file, iocb->fsync.datasync);
+ iocb_put(iocb);
}
static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
@@ -1614,7 +1608,8 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
{
- aio_complete(iocb, mangle_poll(mask), 0);
+ iocb->ki_res.res = mangle_poll(mask);
+ iocb_put(iocb);
}
static void aio_poll_complete_work(struct work_struct *work)
--
2.11.0
From: Al Viro <[email protected]>
aio_poll() has to cope with several unpleasant problems:
* requests that might stay around indefinitely need to
be made visible for io_cancel(2); that must not be done to
a request already completed, though.
* in cases when ->poll() has placed us on a waitqueue,
wakeup might have happened (and request completed) before ->poll()
returns.
* worse, in some early wakeup cases request might end
up re-added into the queue later - we can't treat "woken up and
currently not in the queue" as "it's not going to stick around
indefinitely"
* ... moreover, ->poll() might have decided not to
put it on any queues to start with, and that needs to be distinguished
from the previous case
* ->poll() might have tried to put us on more than one queue.
Only the first will succeed for aio poll, so we might end up missing
wakeups. OTOH, we might very well notice that only after the
wakeup hits and request gets completed (all before ->poll() gets
around to the second poll_wait()). In that case it's too late to
decide that we have an error.
req->woken was an attempt to deal with that. Unfortunately, it was
broken. What we need to keep track of is not that wakeup has happened -
the thing might come back after that. It's that the async reference is
already gone and won't come back, so we can't (and needn't) put the
request on the list of cancellables.
The easiest case is "request hadn't been put on any waitqueues"; we
can tell by seeing NULL apt.head, and in that case there won't be
anything async. We should either complete the request ourselves
(if vfs_poll() reports anything of interest) or return an error.
In all other cases we get exclusion with wakeups by grabbing the
queue lock.
If request is currently on queue and we have something interesting
from vfs_poll(), we can steal it and complete the request ourselves.
If it's on queue and vfs_poll() has not reported anything interesting,
we either put it on the cancellable list, or, if we know that it
hadn't been put on all queues ->poll() wanted it on, we steal it and
return an error.
If it's _not_ on queue, it's either been already dealt with (in which
case we do nothing), or there's aio_poll_complete_work() about to be
executed. In that case we either put it on the cancellable list,
or, if we know it hadn't been put on all queues ->poll() wanted it on,
simulate what cancel would've done.
It's a lot more convoluted than I'd like it to be. Single-consumer APIs
suck, and unfortunately aio is not an exception...
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 71 +++++++++++++++++++++++++++++-----------------------------------
1 file changed, 32 insertions(+), 39 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index b9c4c1894020..f47a29f7f201 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -181,7 +181,7 @@ struct poll_iocb {
struct file *file;
struct wait_queue_head *head;
__poll_t events;
- bool woken;
+ bool done;
bool cancelled;
struct wait_queue_entry wait;
struct work_struct work;
@@ -1606,12 +1606,6 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
return 0;
}
-static inline void aio_poll_complete(struct aio_kiocb *iocb, __poll_t mask)
-{
- iocb->ki_res.res = mangle_poll(mask);
- iocb_put(iocb);
-}
-
static void aio_poll_complete_work(struct work_struct *work)
{
struct poll_iocb *req = container_of(work, struct poll_iocb, work);
@@ -1637,9 +1631,11 @@ static void aio_poll_complete_work(struct work_struct *work)
return;
}
list_del_init(&iocb->ki_list);
+ iocb->ki_res.res = mangle_poll(mask);
+ req->done = true;
spin_unlock_irq(&ctx->ctx_lock);
- aio_poll_complete(iocb, mask);
+ iocb_put(iocb);
}
/* assumes we are called with irqs disabled */
@@ -1671,7 +1667,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
if (mask && !(mask & req->events))
return 0;
- req->woken = true;
+ list_del_init(&req->wait.entry);
if (mask) {
/*
@@ -1682,15 +1678,14 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
*/
if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
list_del(&iocb->ki_list);
+ iocb->ki_res.res = mangle_poll(mask);
+ req->done = true;
spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
-
- list_del_init(&req->wait.entry);
- aio_poll_complete(iocb, mask);
+ iocb_put(iocb);
return 1;
}
}
- list_del_init(&req->wait.entry);
schedule_work(&req->work);
return 1;
}
@@ -1723,6 +1718,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
struct kioctx *ctx = aiocb->ki_ctx;
struct poll_iocb *req = &aiocb->poll;
struct aio_poll_table apt;
+ bool cancel = false;
__poll_t mask;
/* reject any unknown events outside the normal event mask. */
@@ -1736,7 +1732,7 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
req->head = NULL;
- req->woken = false;
+ req->done = false;
req->cancelled = false;
apt.pt._qproc = aio_poll_queue_proc;
@@ -1749,36 +1745,33 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
init_waitqueue_func_entry(&req->wait, aio_poll_wake);
mask = vfs_poll(req->file, &apt.pt) & req->events;
- if (unlikely(!req->head)) {
- /* we did not manage to set up a waitqueue, done */
- goto out;
- }
-
spin_lock_irq(&ctx->ctx_lock);
- spin_lock(&req->head->lock);
- if (req->woken) {
- /* wake_up context handles the rest */
- mask = 0;
+ if (likely(req->head)) {
+ spin_lock(&req->head->lock);
+ if (unlikely(list_empty(&req->wait.entry))) {
+ if (apt.error)
+ cancel = true;
+ apt.error = 0;
+ mask = 0;
+ }
+ if (mask || apt.error) {
+ list_del_init(&req->wait.entry);
+ } else if (cancel) {
+ WRITE_ONCE(req->cancelled, true);
+ } else if (!req->done) { /* actually waiting for an event */
+ list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
+ aiocb->ki_cancel = aio_poll_cancel;
+ }
+ spin_unlock(&req->head->lock);
+ }
+ if (mask) { /* no async, we'd stolen it */
+ aiocb->ki_res.res = mangle_poll(mask);
apt.error = 0;
- } else if (mask || apt.error) {
- /* if we get an error or a mask we are done */
- WARN_ON_ONCE(list_empty(&req->wait.entry));
- list_del_init(&req->wait.entry);
- } else {
- /* actually waiting for an event */
- list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
- aiocb->ki_cancel = aio_poll_cancel;
}
- spin_unlock(&req->head->lock);
spin_unlock_irq(&ctx->ctx_lock);
-
-out:
- if (unlikely(apt.error))
- return apt.error;
-
if (mask)
- aio_poll_complete(aiocb, mask);
- return 0;
+ iocb_put(aiocb);
+ return apt.error;
}
static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
--
2.11.0
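For orientation, the post-vfs_poll() logic in the patch above condenses
to the following decision table (paraphrase of the code, assuming the
patch as posted):

    state after vfs_poll()              action
    ----------------------              ------
    req->head == NULL                   nothing async could have
                                        happened; complete inline if
                                        mask, else return apt.error
    on queue, mask or apt.error         steal it: list_del_init(),
                                        then complete (mask) or fail
                                        the submission (error)
    on queue, neither, !req->done       genuinely waiting: add to
                                        active_reqs, arm aio_poll_cancel
    off queue, apt.error set            async side owns it: mark
                                        req->cancelled, suppress the
                                        error (we can't fail now)
    off queue, no error, !req->done     aio_poll_complete_work() is in
                                        flight; add to active_reqs so
                                        it stays cancellable
    off queue, no error, req->done      already completed by the
                                        wakeup; nothing to do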
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 56 +++++++++++++++++++++-----------------------------------
1 file changed, 21 insertions(+), 35 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 363d7d7c8bff..2249a7a1d6b3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -204,8 +204,7 @@ struct aio_kiocb {
struct kioctx *ki_ctx;
kiocb_cancel_fn *ki_cancel;
- struct iocb __user *ki_user_iocb; /* user's aiocb */
- __u64 ki_user_data; /* user's data for completion */
+ struct io_event ki_res;
struct list_head ki_list; /* the aio core uses this
* for cancellation */
@@ -1087,10 +1086,9 @@ static inline void iocb_put(struct aio_kiocb *iocb)
static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
long res, long res2)
{
- ev->obj = (u64)(unsigned long)iocb->ki_user_iocb;
- ev->data = iocb->ki_user_data;
- ev->res = res;
- ev->res2 = res2;
+ iocb->ki_res.res = res;
+ iocb->ki_res.res2 = res2;
+ *ev = iocb->ki_res;
}
/* aio_complete
@@ -1126,7 +1124,7 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
- ctx, tail, iocb, iocb->ki_user_iocb, iocb->ki_user_data,
+ ctx, tail, iocb, (void __user *)iocb->ki_res.obj, iocb->ki_res.data,
res, res2);
/* after flagging the request as done, we
@@ -1674,13 +1672,13 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
__poll_t mask = key_to_poll(key);
unsigned long flags;
+ /* for instances that support it check for an event match first: */
+ if (mask && !(mask & req->events))
+ return 0;
+
req->woken = true;
- /* for instances that support it check for an event match first: */
if (mask) {
- if (!(mask & req->events))
- return 0;
-
/*
* Try to complete the iocb inline if we can. Use
* irqsave/irqrestore because not all filesystems (e.g. fuse)
@@ -1844,8 +1842,10 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
}
- req->ki_user_iocb = user_iocb;
- req->ki_user_data = iocb->aio_data;
+ req->ki_res.obj = (u64)(unsigned long)user_iocb;
+ req->ki_res.data = iocb->aio_data;
+ req->ki_res.res = 0;
+ req->ki_res.res2 = 0;
switch (iocb->aio_lio_opcode) {
case IOCB_CMD_PREAD:
@@ -2002,24 +2002,6 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
}
#endif
-/* lookup_kiocb
- * Finds a given iocb for cancellation.
- */
-static struct aio_kiocb *
-lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb)
-{
- struct aio_kiocb *kiocb;
-
- assert_spin_locked(&ctx->ctx_lock);
-
- /* TODO: use a hash or array, this sucks. */
- list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
- if (kiocb->ki_user_iocb == iocb)
- return kiocb;
- }
- return NULL;
-}
-
/* sys_io_cancel:
* Attempts to cancel an iocb previously passed to io_submit. If
* the operation is successfully cancelled, the resulting event is
@@ -2037,6 +2019,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
struct aio_kiocb *kiocb;
int ret = -EINVAL;
u32 key;
+ u64 obj = (u64)(unsigned long)iocb;
if (unlikely(get_user(key, &iocb->aio_key)))
return -EFAULT;
@@ -2048,10 +2031,13 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
return -EINVAL;
spin_lock_irq(&ctx->ctx_lock);
- kiocb = lookup_kiocb(ctx, iocb);
- if (kiocb) {
- ret = kiocb->ki_cancel(&kiocb->rw);
- list_del_init(&kiocb->ki_list);
+ /* TODO: use a hash or array, this sucks. */
+ list_for_each_entry(kiocb, &ctx->active_reqs, ki_list) {
+ if (kiocb->ki_res.obj == obj) {
+ ret = kiocb->ki_cancel(&kiocb->rw);
+ list_del_init(&kiocb->ki_list);
+ break;
+ }
}
spin_unlock_irq(&ctx->ctx_lock);
--
2.11.0
From: Al Viro <[email protected]>
that ssize_t is a rudiment of earlier calling conventions; it's been
used only to pass 0 and -E... since last autumn.
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index f47a29f7f201..f9e8f1edfe36 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1513,13 +1513,13 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
}
}
-static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
+static int aio_read(struct kiocb *req, const struct iocb *iocb,
bool vectored, bool compat)
{
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
struct file *file;
- ssize_t ret;
+ int ret;
ret = aio_prep_rw(req, iocb);
if (ret)
@@ -1541,13 +1541,13 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
return ret;
}
-static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
+static int aio_write(struct kiocb *req, const struct iocb *iocb,
bool vectored, bool compat)
{
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct iov_iter iter;
struct file *file;
- ssize_t ret;
+ int ret;
ret = aio_prep_rw(req, iocb);
if (ret)
@@ -1713,7 +1713,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
add_wait_queue(head, &pt->iocb->poll.wait);
}
-static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
+static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
{
struct kioctx *ctx = aiocb->ki_ctx;
struct poll_iocb *req = &aiocb->poll;
@@ -1778,7 +1778,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
struct iocb __user *user_iocb, bool compat)
{
struct aio_kiocb *req;
- ssize_t ret;
+ int ret;
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved2)) {
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index f9e8f1edfe36..595c19965a8b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1071,6 +1071,8 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
static inline void iocb_destroy(struct aio_kiocb *iocb)
{
+ if (iocb->ki_eventfd)
+ eventfd_ctx_put(iocb->ki_eventfd);
if (iocb->ki_filp)
fput(iocb->ki_filp);
percpu_ref_put(&iocb->ki_ctx->reqs);
@@ -1138,10 +1140,8 @@ static void aio_complete(struct aio_kiocb *iocb)
* eventfd. The eventfd_signal() function is safe to be called
* from IRQ context.
*/
- if (iocb->ki_eventfd) {
+ if (iocb->ki_eventfd)
eventfd_signal(iocb->ki_eventfd, 1);
- eventfd_ctx_put(iocb->ki_eventfd);
- }
/*
* We have to order our ring_info tail store above and test
@@ -1810,18 +1810,19 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
goto out_put_req;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
+ struct eventfd_ctx *eventfd;
/*
* If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
* instance of the file* now. The file descriptor must be
* an eventfd() fd, and will be signaled for each completed
* event using the eventfd_signal() function.
*/
- req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
- if (IS_ERR(req->ki_eventfd)) {
- ret = PTR_ERR(req->ki_eventfd);
- req->ki_eventfd = NULL;
+ eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
+ if (IS_ERR(eventfd)) {
+ ret = PTR_ERR(eventfd);
goto out_put_req;
}
+ req->ki_eventfd = eventfd;
}
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
@@ -1875,8 +1876,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return 0;
out_put_req:
- if (req->ki_eventfd)
- eventfd_ctx_put(req->ki_eventfd);
iocb_destroy(req);
out_put_reqs_available:
put_reqs_available(ctx, 1);
--
2.11.0
From: Al Viro <[email protected]>
simplifies the caller
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 595c19965a8b..66ca52c57cca 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1033,6 +1033,11 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
return NULL;
+ if (unlikely(!get_reqs_available(ctx))) {
+ kmem_cache_free(kiocb_cachep, req);
+ return NULL;
+ }
+
percpu_ref_get(&ctx->reqs);
req->ki_ctx = ctx;
INIT_LIST_HEAD(&req->ki_list);
@@ -1796,13 +1801,9 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return -EINVAL;
}
- if (!get_reqs_available(ctx))
- return -EAGAIN;
-
- ret = -EAGAIN;
req = aio_get_req(ctx);
if (unlikely(!req))
- goto out_put_reqs_available;
+ return -EAGAIN;
req->ki_filp = fget(iocb->aio_fildes);
ret = -EBADF;
@@ -1877,7 +1878,6 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
out_put_req:
iocb_destroy(req);
-out_put_reqs_available:
put_reqs_available(ctx, 1);
return ret;
}
--
2.11.0
From: Al Viro <[email protected]>
makes for somewhat cleaner control flow in __io_submit_one()
Signed-off-by: Al Viro <[email protected]>
---
fs/aio.c | 119 ++++++++++++++++++++++++++++-----------------------------------
1 file changed, 53 insertions(+), 66 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 66ca52c57cca..2d1db52b1268 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1780,35 +1780,12 @@ static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
}
static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
- struct iocb __user *user_iocb, bool compat)
+ struct iocb __user *user_iocb, struct aio_kiocb *req,
+ bool compat)
{
- struct aio_kiocb *req;
- int ret;
-
- /* enforce forwards compatibility on users */
- if (unlikely(iocb->aio_reserved2)) {
- pr_debug("EINVAL: reserve field set\n");
- return -EINVAL;
- }
-
- /* prevent overflows */
- if (unlikely(
- (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
- (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
- ((ssize_t)iocb->aio_nbytes < 0)
- )) {
- pr_debug("EINVAL: overflow check\n");
- return -EINVAL;
- }
-
- req = aio_get_req(ctx);
- if (unlikely(!req))
- return -EAGAIN;
-
req->ki_filp = fget(iocb->aio_fildes);
- ret = -EBADF;
if (unlikely(!req->ki_filp))
- goto out_put_req;
+ return -EBADF;
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
struct eventfd_ctx *eventfd;
@@ -1819,17 +1796,15 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
* event using the eventfd_signal() function.
*/
eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
- if (IS_ERR(eventfd)) {
- ret = PTR_ERR(eventfd);
- goto out_put_req;
- }
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
req->ki_eventfd = eventfd;
}
- ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
- if (unlikely(ret)) {
+ if (unlikely(put_user(KIOCB_KEY, &user_iocb->aio_key))) {
pr_debug("EFAULT: aio_key\n");
- goto out_put_req;
+ return -EFAULT;
}
req->ki_res.obj = (u64)(unsigned long)user_iocb;
@@ -1839,58 +1814,70 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
switch (iocb->aio_lio_opcode) {
case IOCB_CMD_PREAD:
- ret = aio_read(&req->rw, iocb, false, compat);
- break;
+ return aio_read(&req->rw, iocb, false, compat);
case IOCB_CMD_PWRITE:
- ret = aio_write(&req->rw, iocb, false, compat);
- break;
+ return aio_write(&req->rw, iocb, false, compat);
case IOCB_CMD_PREADV:
- ret = aio_read(&req->rw, iocb, true, compat);
- break;
+ return aio_read(&req->rw, iocb, true, compat);
case IOCB_CMD_PWRITEV:
- ret = aio_write(&req->rw, iocb, true, compat);
- break;
+ return aio_write(&req->rw, iocb, true, compat);
case IOCB_CMD_FSYNC:
- ret = aio_fsync(&req->fsync, iocb, false);
- break;
+ return aio_fsync(&req->fsync, iocb, false);
case IOCB_CMD_FDSYNC:
- ret = aio_fsync(&req->fsync, iocb, true);
- break;
+ return aio_fsync(&req->fsync, iocb, true);
case IOCB_CMD_POLL:
- ret = aio_poll(req, iocb);
- break;
+ return aio_poll(req, iocb);
default:
pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
- ret = -EINVAL;
- break;
+ return -EINVAL;
}
-
- /* Done with the synchronous reference */
- iocb_put(req);
-
- /*
- * If ret is 0, we'd either done aio_complete() ourselves or have
- * arranged for that to be done asynchronously. Anything non-zero
- * means that we need to destroy req ourselves.
- */
- if (!ret)
- return 0;
-
-out_put_req:
- iocb_destroy(req);
- put_reqs_available(ctx, 1);
- return ret;
}
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
bool compat)
{
+ struct aio_kiocb *req;
struct iocb iocb;
+ int err;
if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
return -EFAULT;
- return __io_submit_one(ctx, &iocb, user_iocb, compat);
+ /* enforce forwards compatibility on users */
+ if (unlikely(iocb.aio_reserved2)) {
+ pr_debug("EINVAL: reserve field set\n");
+ return -EINVAL;
+ }
+
+ /* prevent overflows */
+ if (unlikely(
+ (iocb.aio_buf != (unsigned long)iocb.aio_buf) ||
+ (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) ||
+ ((ssize_t)iocb.aio_nbytes < 0)
+ )) {
+ pr_debug("EINVAL: overflow check\n");
+ return -EINVAL;
+ }
+
+ req = aio_get_req(ctx);
+ if (unlikely(!req))
+ return -EAGAIN;
+
+ err = __io_submit_one(ctx, &iocb, user_iocb, req, compat);
+
+ /* Done with the synchronous reference */
+ iocb_put(req);
+
+ /*
+ * If err is 0, we'd either done aio_complete() ourselves or have
+ * arranged for that to be done asynchronously. Anything non-zero
+ * means that we need to destroy req ourselves.
+ */
+ if (unlikely(err)) {
+ iocb_destroy(req);
+ put_reqs_available(ctx, 1);
+ }
+ return err;
}
/* sys_io_submit:
--
2.11.0
On Sun, Mar 10, 2019 at 07:06:18AM +0000, Al Viro wrote:
> OK, refactored, cleaned up and force-pushed. Current state:
This survives the libaio test suite at least.
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Sun, Mar 10, 2019 at 07:08:16AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> Signed-off-by: Al Viro <[email protected]>
This could use a little changelog at least that explains the why.
It also removes lookup_kiocb and folds that into the only caller,
which is at least worth mentioning (and probably should be a separate
patch).
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> Instead of having aio_complete() set ->ki_res.{res,res2}, do that
> explicitly in its callers, drop the reference (as aio_complete()
> used to do) and delay the rest until the final iocb_put().
>
> Signed-off-by: Al Viro <[email protected]>
> ---
> fs/aio.c | 45 ++++++++++++++++++++-------------------------
> 1 file changed, 20 insertions(+), 25 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 2249a7a1d6b3..b9c4c1894020 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
> kmem_cache_free(kiocb_cachep, iocb);
> }
>
> -static inline void iocb_put(struct aio_kiocb *iocb)
> -{
> - if (refcount_dec_and_test(&iocb->ki_refcnt))
> - iocb_destroy(iocb);
> -}
Maybe iocb_put should just have been added in the place you move
it to in patch 1?
On Sun, Mar 10, 2019 at 07:08:20AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> Signed-off-by: Al Viro <[email protected]>
Looks sensible, but a changelog would be nice.
> if (iocb->aio_flags & IOCB_FLAG_RESFD) {
> + struct eventfd_ctx *eventfd;
> /*
> * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
> * instance of the file* now. The file descriptor must be
> * an eventfd() fd, and will be signaled for each completed
> * event using the eventfd_signal() function.
> */
> + eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
I don't think there is any point in the cast in either the old or new
code.
On Sun, Mar 10, 2019 at 07:08:21AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> simplifies the caller
>
> Signed-off-by: Al Viro <[email protected]>
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Sun, Mar 10, 2019 at 07:08:22AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> makes for somewhat cleaner control flow in __io_submit_one()
>
> Signed-off-by: Al Viro <[email protected]>
I wonder if we should even bother keeping __io_submit_one. Splitting
that out was prep work from Jens for something that eventually turned
into io_uring.
Where do we put the second iocb reference in case we return from
vfs_poll without ever being woken?
Also it seems like the completion code would still benefit from a little
helper, something like:
diff --git a/fs/aio.c b/fs/aio.c
index b2a5c7b3a1fe..8415e5e484ce 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1611,6 +1611,13 @@ static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
return 0;
}
+static void aio_poll_finish(struct aio_kiocb *iocb, __poll_t mask)
+{
+ list_del_init(&iocb->ki_list);
+ iocb->ki_res.res = mangle_poll(mask);
+ iocb->poll.done = true;
+}
+
static void aio_poll_complete_work(struct work_struct *work)
{
struct poll_iocb *req = container_of(work, struct poll_iocb, work);
@@ -1635,9 +1642,7 @@ static void aio_poll_complete_work(struct work_struct *work)
spin_unlock_irq(&ctx->ctx_lock);
return;
}
- list_del_init(&iocb->ki_list);
- iocb->ki_res.res = mangle_poll(mask);
- req->done = true;
+ aio_poll_finish(iocb, mask);
spin_unlock_irq(&ctx->ctx_lock);
iocb_put(iocb);
@@ -1674,24 +1679,20 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&req->wait.entry);
- if (mask) {
+ if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
/*
* Try to complete the iocb inline if we can. Use
* irqsave/irqrestore because not all filesystems (e.g. fuse)
* call this function with IRQs disabled and because IRQs
* have to be disabled before ctx_lock is obtained.
*/
- if (spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
- list_del(&iocb->ki_list);
- iocb->ki_res.res = mangle_poll(mask);
- req->done = true;
- spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
- iocb_put(iocb);
- return 1;
- }
+ aio_poll_finish(iocb, mask);
+ spin_unlock_irqrestore(&iocb->ki_ctx->ctx_lock, flags);
+ iocb_put(iocb);
+ } else {
+ schedule_work(&req->work);
}
- schedule_work(&req->work);
return 1;
}
On Mon, Mar 11, 2019 at 08:58:31PM +0100, Christoph Hellwig wrote:
> Where do we put the second iocb reference in case we return from
> vfs_poll without ever being woken?
Depends. If mask is non-zero (i.e. vfs_poll() has returned something
we care about) and it has never been woken, we steal it and drop the
reference ourselves. If it is zero and we see that ->poll() has tried
to put it on two queues, we steal it (again, assuming it's not on
waitqueue and _can_ be stolen) and return -EINVAL. In that case
__io_submit_one() (or, by the end of the series, io_submit_one())
will call iocb_destroy(). And in the normal waiting case (nothing
interesting reported and no errors) it will end up on the list of
cancellables. Then it either will get completed by later wakeup, which
will drop the reference, or it will get eventually cancelled, which will
hit the same aio_poll_complete_work() and drop the reference...
> Also it seems like the completion code would still benefit from a little
> helper, something like:
Umm... Not sure I like the name (something like aio_poll_done() seems
to be better), but other than that - no problem.
On Mon, Mar 11, 2019 at 08:48:30PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2019 at 07:08:22AM +0000, Al Viro wrote:
> > From: Al Viro <[email protected]>
> >
> > makes for somewhat cleaner control flow in __io_submit_one()
> >
> > Signed-off-by: Al Viro <[email protected]>
>
> I wonder if we should even bother keeping __io_submit_one. Splitting
> that out was prep work from Jens for something that eventually turned
> into io_uring.
*shrug*
I think it's a bit easier to keep track of control flow - failure exits
are simpler that way.
On Mon, Mar 11, 2019 at 08:44:31PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote:
> > From: Al Viro <[email protected]>
> >
> > Instead of having aio_complete() set ->ki_res.{res,res2}, do that
> > explicitly in its callers, drop the reference (as aio_complete()
> > used to do) and delay the rest until the final iocb_put().
> >
> > Signed-off-by: Al Viro <[email protected]>
> > ---
> > fs/aio.c | 45 ++++++++++++++++++++-------------------------
> > 1 file changed, 20 insertions(+), 25 deletions(-)
> >
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 2249a7a1d6b3..b9c4c1894020 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
> > kmem_cache_free(kiocb_cachep, iocb);
> > }
> >
> > -static inline void iocb_put(struct aio_kiocb *iocb)
> > -{
> > - if (refcount_dec_and_test(&iocb->ki_refcnt))
> > - iocb_destroy(iocb);
> > -}
>
> Maybe iocb_put should just have been added in the place you move
> it to in patch 1?
Might as well...
On Mon, Mar 11, 2019 at 08:43:06PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2019 at 07:08:16AM +0000, Al Viro wrote:
> > From: Al Viro <[email protected]>
> >
> > Signed-off-by: Al Viro <[email protected]>
>
> This could use a little changelog at least that explains the why.
>
> It also removes lookup_kiocb and folds that into the only caller,
> which is at least worth mentioning (and probably should be a separate
> patch).
More confusingly, it contains a pointless rudiment in aio_poll_wake();
that chunk doesn't solve the problems with ->woken, and the whole
thing gets killed off shortly...
On Mon, Mar 11, 2019 at 09:13:28PM +0000, Al Viro wrote:
> On Mon, Mar 11, 2019 at 08:44:31PM +0100, Christoph Hellwig wrote:
> > On Sun, Mar 10, 2019 at 07:08:17AM +0000, Al Viro wrote:
> > > From: Al Viro <[email protected]>
> > >
> > > Instead of having aio_complete() set ->ki_res.{res,res2}, do that
> > > explicitly in its callers, drop the reference (as aio_complete()
> > > used to do) and delay the rest until the final iocb_put().
> > >
> > > Signed-off-by: Al Viro <[email protected]>
> > > ---
> > > fs/aio.c | 45 ++++++++++++++++++++-------------------------
> > > 1 file changed, 20 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/fs/aio.c b/fs/aio.c
> > > index 2249a7a1d6b3..b9c4c1894020 100644
> > > --- a/fs/aio.c
> > > +++ b/fs/aio.c
> > > @@ -1077,24 +1077,10 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
> > > kmem_cache_free(kiocb_cachep, iocb);
> > > }
> > >
> > > -static inline void iocb_put(struct aio_kiocb *iocb)
> > > -{
> > > - if (refcount_dec_and_test(&iocb->ki_refcnt))
> > > - iocb_destroy(iocb);
> > > -}
> >
> > Maybe iocb_put should just have been added in the place you move
> > it to in patch 1?
>
> Might as well...
Actually, that wouldn't be any better - it would need at least a declaration
before aio_complete(), since in the original patch aio_complete() calls it.
So that wouldn't be less noisy...
On Mon, Mar 11, 2019 at 09:06:18PM +0000, Al Viro wrote:
> On Mon, Mar 11, 2019 at 08:58:31PM +0100, Christoph Hellwig wrote:
> > Where do we put the second iocb reference in case we return from
> > vfs_poll without ever being woken?
>
> Depends. If mask is non-zero (i.e. vfs_poll() has returned something
> we care about) and it has never been woken, we steal it and drop the
> reference ourselves. If it is zero and we see that ->poll() has tried
> to put it on two queues, we steal it (again, assuming it's not on
> waitqueue and _can_ be stolen) and return -EINVAL. In that case
> __io_submit_one() (or, by the end of the series, io_submit_one())
> will call iocb_destroy(). And in the normal waiting case (nothing
> interesting reported and no errors) it will end up on the list of
> cancellables. Then it either will get completed by later wakeup, which
> will drop the reference, or it will get eventually cancelled, which will
> hit the same aio_poll_complete_work() and drop the reference...
Ok, seems like the logic is sane. I was missing how the actual
mask logic worked in aio_poll().
> > Also it seems like the completion code would still benefit from a little
> > helper, something like:
>
> Umm... Not sure I like the name (something like aio_poll_done() seems
> to be better), but other than that - no problem.
I don't care about the name. Feel free to change it to whatever suits
you.