2016-06-17 04:09:36

by Oleg Drokin

Subject: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

Hello!

So I think I finally kind of understand this other d_splice_alias problem
I've been having.

Imagine that we have a negative dentry in the cache.

Now two threads try to open this file in parallel. The fs is NFS, so
->atomic_open is defined; it's not a create, so the parent is locked in
shared mode, and both threads first enter lookup_open() and then
(because the dentry is negative) nfs_atomic_open().

In nfs_atomic_open() the two threads run alongside each other trying to
open this file, and both fail (the code here includes Al's earlier patch
for the other d_splice_alias problem).

	inode = NFS_PROTO(dir)->open_context(dir, ctx, open_flags, &attr, opened);
	if (IS_ERR(inode)) {
		err = PTR_ERR(inode);
		trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
		put_nfs_open_context(ctx);
		d_drop(dentry);
		...
		case -ENOTDIR:
			goto no_open;
	...
no_open:
	res = nfs_lookup(dir, dentry, lookup_flags);


So they both do d_drop(), the dentry is now unhashed, and they both
dive into nfs_lookup(), where eventually they both call

res = d_splice_alias(inode, dentry);

And so the lucky first one continues on its merry way with a hashed dentry,
but the less lucky second one ends up calling into d_splice_alias() with a
dentry that's already hashed and hits the very familiar assertion.
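
For reference, the assertion in question is the unhashed check at the top
of d_splice_alias() in fs/dcache.c (sketched from memory, with the
surrounding code elided):

struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
{
	if (IS_ERR(inode))
		return ERR_CAST(inode);

	/* the caller must hand us an unhashed dentry; the loser of the
	 * race above gets here only after the winner's d_splice_alias()
	 * has rehashed the dentry, and trips this */
	BUG_ON(!d_unhashed(dentry));
	...
}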

I took a brief look at ceph and it looks like a very similar thing might
happen there: handle_reply() for two parallel replies can call into
ceph_fill_trace() and then splice_dentry()->d_splice_alias(). The unhashed
check it does is not under any locks, so it's unsafe, and the problem might
be more generic than just NFS.
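
Schematically, the pattern there is a classic check-then-act (paraphrased,
not the exact ceph code):

	/* roughly fs/ceph/inode.c:splice_dentry(); nothing keeps dn's
	 * hashed state stable between the check and the splice, so a
	 * parallel reply can rehash dn via its own d_splice_alias()
	 * call in the window, and this thread then hands a hashed
	 * dentry to d_splice_alias() and trips the BUG_ON */
	if (!d_unhashed(dn))
		d_drop(dn);
	realdn = d_splice_alias(in, dn);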

So I wonder how best to fix this? Hold some sort of dentry lock across the
call into ->atomic_open in the VFS? We cannot just make d_splice_alias()
callers take inode->i_lock, because the dentry might be negative.

I know this is really happening because I in fact have three threads doing
the above-described race, one luckily passing through and two hitting the
assertion:
PID: 2689619 TASK: ffff8800ca740540 CPU: 6 COMMAND: "ls"
#0 [ffff8800c6da3758] machine_kexec at ffffffff81042462
#1 [ffff8800c6da37b8] __crash_kexec at ffffffff811365ff
#2 [ffff8800c6da3880] __crash_kexec at ffffffff811366d5
#3 [ffff8800c6da3898] crash_kexec at ffffffff8113671b
#4 [ffff8800c6da38b8] oops_end at ffffffff81018cb4
#5 [ffff8800c6da38e0] die at ffffffff8101917b
#6 [ffff8800c6da3910] do_trap at ffffffff81015b92
#7 [ffff8800c6da3960] do_error_trap at ffffffff81015d8b
#8 [ffff8800c6da3a20] do_invalid_op at ffffffff810166e0
#9 [ffff8800c6da3a30] invalid_op at ffffffff8188c91e
[exception RIP: d_splice_alias+478]
RIP: ffffffff8128784e RSP: ffff8800c6da3ae8 RFLAGS: 00010286
RAX: ffff8800d8661ab8 RBX: ffff8800d3b6b178 RCX: ffff8800ca740540
RDX: ffffffff81382f09 RSI: ffff8800d3b6b178 RDI: ffff8800d8661ab8
RBP: ffff8800c6da3b20 R8: 0000000000000000 R9: 0000000000000000
R10: ffff8800ca740540 R11: 0000000000000000 R12: ffff8800d6491600
R13: 0000000000000102 R14: ffff8800d39063d0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8800c6da3b28] nfs_lookup at ffffffff8137cd1c
#11 [ffff8800c6da3b80] nfs_atomic_open at ffffffff8137ed61
#12 [ffff8800c6da3c30] lookup_open at ffffffff8127901a
#13 [ffff8800c6da3d50] path_openat at ffffffff8127bf15
#14 [ffff8800c6da3dd8] do_filp_open at ffffffff8127d9b1
#15 [ffff8800c6da3ee0] do_sys_open at ffffffff81269f90
#16 [ffff8800c6da3f40] sys_open at ffffffff8126a09e
#17 [ffff8800c6da3f50] entry_SYSCALL_64_fastpath at ffffffff8188aebc
RIP: 00007f92a76d4d20 RSP: 00007fff640c1160 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f92a76d4d20
RDX: 000055c9f5be9f50 RSI: 0000000000090800 RDI: 000055c9f5bea1d0
RBP: 0000000000000000 R8: 000055c9f5bea340 R9: 0000000000000000
R10: 000055c9f5bea300 R11: 0000000000000206 R12: 00007f92a800d6b0
R13: 000055c9f5bf3000 R14: 000055c9f5bea3e0 R15: 0000000000000000
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

PID: 2689597 TASK: ffff880109b40bc0 CPU: 7 COMMAND: "ls"
#0 [ffff8800c07178e0] die at ffffffff8101914e
#1 [ffff8800c0717910] do_trap at ffffffff81015b92
#2 [ffff8800c0717960] do_error_trap at ffffffff81015d8b
#3 [ffff8800c0717a20] do_invalid_op at ffffffff810166e0
#4 [ffff8800c0717a30] invalid_op at ffffffff8188c91e
[exception RIP: d_splice_alias+478]
RIP: ffffffff8128784e RSP: ffff8800c0717ae8 RFLAGS: 00010286
RAX: ffff8800d8661ab8 RBX: ffff8800d3b6b178 RCX: ffff880109b40bc0
RDX: ffffffff81382f09 RSI: ffff8800d3b6b178 RDI: ffff8800d8661ab8
RBP: ffff8800c0717b20 R8: 0000000000000000 R9: 0000000000000000
R10: ffff880109b40bc0 R11: 0000000000000000 R12: ffff8800d6a29e40
R13: 0000000000000102 R14: ffff8800d39063d0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff8800c0717b28] nfs_lookup at ffffffff8137cd1c
#6 [ffff8800c0717b80] nfs_atomic_open at ffffffff8137ed61
#7 [ffff8800c0717c30] lookup_open at ffffffff8127901a
#8 [ffff8800c0717d50] path_openat at ffffffff8127bf15
#9 [ffff8800c0717dd8] do_filp_open at ffffffff8127d9b1
#10 [ffff8800c0717ee0] do_sys_open at ffffffff81269f90
#11 [ffff8800c0717f40] sys_open at ffffffff8126a09e
#12 [ffff8800c0717f50] entry_SYSCALL_64_fastpath at ffffffff8188aebc
RIP: 00007f5cef6f8d20 RSP: 00007ffcc98f87b0 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f5cef6f8d20
RDX: 000055de64230f50 RSI: 0000000000090800 RDI: 000055de642311d0
RBP: 0000000000000000 R8: 000055de64231320 R9: 0000000000000000
R10: 000055de64231300 R11: 0000000000000202 R12: 00007f5cf00316b0
R13: 000055de6423a000 R14: 000055de642313c0 R15: 0000000000000000
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

The dentry is 0xffff8800d3b6b178 in both cases.




2016-06-17 04:29:18

by Al Viro

Subject: Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

On Fri, Jun 17, 2016 at 12:09:19AM -0400, Oleg Drokin wrote:

> So they both do d_drop(), the dentry is now unhashed, and they both
> dive into nfs_lookup(), where eventually they both call
>
> res = d_splice_alias(inode, dentry);
>
> And so the lucky first one continues on its merry way with a hashed dentry,
> but the less lucky second one ends up calling into d_splice_alias() with a
> dentry that's already hashed and hits the very familiar assertion.
>
> I took a brief look at ceph and it looks like a very similar thing might
> happen there: handle_reply() for two parallel replies can call into
> ceph_fill_trace() and then splice_dentry()->d_splice_alias(). The unhashed
> check it does is not under any locks, so it's unsafe, and the problem might
> be more generic than just NFS.
>
> So I wonder how best to fix this? Hold some sort of dentry lock across the
> call into ->atomic_open in the VFS? We cannot just make d_splice_alias()
> callers take inode->i_lock, because the dentry might be negative.

Oh, lovely... So basically the problem is that we violate the "no lookups on
the same name in parallel" rule on those fallbacks from foo_atomic_open() to
foo_lookup(). The thing is, a lot of ->atomic_open() instances have such
fallbacks, and I wonder if that's a sign that we need to lift some of that
to fs/namei.c...
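
The shape in question, schematically (illustrative only, with made-up foo_
names, not any single filesystem's code):

static int foo_atomic_open(struct inode *dir, struct dentry *dentry,
			   struct file *file, unsigned open_flags,
			   umode_t mode, int *opened)
{
	int err = foo_do_open(...);	/* talk to the server */

	if (err == -ENOTDIR /* or similar */)
		goto no_open;
	...
no_open:
	/* dentry was d_drop()ed on the error path, so it is unhashed but
	 * NOT marked in-lookup; a parallel opener can be running this
	 * same fallback on the same name at the same time */
	return finish_no_open(file, foo_lookup(dir, dentry, 0));
}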

Hell knows; the alternative is to have that d_drop() followed by
d_alloc_parallel(), feeding that dentry to lookup (something with the
shape sketched below, completely untested). I'll play with that a bit
and see what's better; hopefully I'll have something by tomorrow.
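
For illustration only, a sketch of that second alternative; placement and
error handling are hand-waved, this is not a patch:

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
	struct dentry *new;

	d_drop(dentry);
	/* d_alloc_parallel() either starts a new in-lookup dentry or
	 * waits for and returns the one another thread is already
	 * looking up, which restores the "no lookups on the same name
	 * in parallel" guarantee for the fallback */
	new = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
	if (IS_ERR(new))
		return PTR_ERR(new);
	if (!d_in_lookup(new)) {
		/* somebody else's lookup finished first; reuse its result */
		...
	}
	res = foo_lookup(dir, new, lookup_flags);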