2013-09-10 17:14:29

by Josh Boyer

[permalink] [raw]
Subject: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

Hi All,

We've had a user report a backtrace from hitting the
BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure
(commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace
below. The user has btrfs, NFS, and sshfs in usage with this oops.

I've not seen anything similar, but I could have missed it. Does this
look familiar to anyone?

josh

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1006472

kernel: [ 765.817482] ------------[ cut here ]------------
kernel: [ 765.818426] kernel BUG at fs/dcache.c:648!
kernel: [ 765.818426] invalid opcode: 0000 [#1] SMP
kernel: [ 765.818426] Modules linked in: rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache fuse rfcomm bnep ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack w83627ehf
hwmon_vid snd_hda_codec_realtek snd_hda_intel snd_hda_codec coretemp
kvm_intel snd_hwdep kvm snd_seq snd_seq_device snd_pcm iTCO_wdt
iTCO_vendor_support microcode serio_raw i2c_i801 lpc_ich mfd_core
btusb bluetooth r8169 rfkill mii snd_page_alloc shpchp snd_timer snd
soundcore acpi_cpufreq nfsd auth_rpcgss uinput nfs_acl lockd sunrpc
binfmt_misc ata_generic pata_acpi btrfs libcrc32c xor raid1 raid0
raid6_pq i915 firewire_ohci firewire_core crc_itu_t video pata_it8213
i2c_algo_bit drm_kms_helper drm i2c_dev i2c_core
kernel: [ 765.818426] CPU: 0 PID: 13007 Comm: gkrellm Not tainted
3.12.0-0.rc0.git16.2.fc21.x86_64 #1
kernel: [ 765.818426] Hardware name: Supermicro C2SEA/C2SEA, BIOS 1.0c
03/11/2009
kernel: [ 765.818426] task: ffff8801a1ff6c80 ti: ffff8801cb902000
task.ti: ffff8801cb902000
kernel: [ 765.818426] RIP: 0010:[<ffffffff811c03cd>]
[<ffffffff811c03cd>] dget_parent+0x6d/0x80
kernel: [ 765.818426] RSP: 0018:ffff8801cb903c38 EFLAGS: 00010246
kernel: [ 765.818426] RAX: 0000000000000000 RBX: ffff88022baca600 RCX:
00000000cced6f6f
kernel: [ 765.818426] RDX: 000000000000006f RSI: 0000000000000000 RDI:
ffff88022baca658
kernel: [ 765.818426] RBP: ffff8801cb903c50 R08: 8080808080808080 R09:
fefefefefefefeff
kernel: [ 765.818426] R10: 2f2f2f2f2f2f2f2f R11: 722e7172742e2d2d R12:
ffff88022bb2bd80
kernel: [ 765.818426] R13: ffff88022baca658 R14: ffff8801a1ff6c80 R15:
0000000000000002
kernel: [ 765.818426] FS: 00007fbda2e39a00(0000)
GS:ffff880237c00000(0000) knlGS:0000000000000000
kernel: [ 765.818426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 765.818426] CR2: 000000000042f860 CR3: 00000001a1d26000 CR4:
00000000000007f0
kernel: [ 765.818426] Stack:
kernel: [ 765.818426] ffff8801cb903de0 ffff88022bb2bd80
0000000000000000 ffff8801cb903c70
kernel: [ 765.818426] ffffffff811b5508 ffff8801cb903de0
ffff88023761b003 ffff8801cb903d10
kernel: [ 765.818426] ffffffff811b6225 0000000000000000
000201da00000000 ffff8801a1ff6c80
kernel: [ 765.818426] Call Trace:
kernel: [ 765.818426] [<ffffffff811b5508>] handle_dots+0x1c8/0x290
kernel: [ 765.818426] [<ffffffff811b6225>] link_path_walk+0x335/0x8a0
kernel: [ 765.818426] [<ffffffff8113ed21>] ? read_cache_page+0x21/0x30
kernel: [ 765.818426] [<ffffffff811b6bb4>] path_lookupat+0x334/0x7b0
kernel: [ 765.818426] [<ffffffff811b705b>] filename_lookup+0x2b/0xd0
kernel: [ 765.818426] [<ffffffff811baa64>] user_path_at_empty+0x54/0x90
kernel: [ 765.818426] [<ffffffff811baab1>] user_path_at+0x11/0x20
kernel: [ 765.818426] [<ffffffff811af2d0>] vfs_fstatat+0x50/0xa0
kernel: [ 765.818426] [<ffffffff811af4cf>] SYSC_newstat+0x1f/0x40
kernel: [ 765.818426] [<ffffffff810e7cc6>] ? __audit_syscall_exit+0x1f6/0x2a0
kernel: [ 765.818426] [<ffffffff811af91e>] SyS_newstat+0xe/0x10
kernel: [ 765.818426] [<ffffffff81667a19>] system_call_fastpath+0x16/0x1b
kernel: [ 765.818426] Code: 18 4c 8d 6b 58 4c 89 ef e8 b1 ef 49 00 49
3b 5c 24 18 75 19 8b 43 5c 85 c0 74 10 83 c0 01 89 43 5c 41 80 45 00
01 48 89 d8 eb c2 <0f> 0b 80 43 58 01 eb c9 66 66 2e 0f 1f 84 00 00 00
00 00 66 66
kernel: [ 765.818426] RIP [<ffffffff811c03cd>] dget_parent+0x6d/0x80
kernel: [ 765.818426] RSP <ffff8801cb903c38>
kernel: [ 765.929977] ---[ end trace 384aa993d7ac57c0 ]---


2013-09-10 17:33:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 10:14 AM, Josh Boyer <[email protected]> wrote:
>
> We've had a user report a backtrace from hitting the
> BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure
> (commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace
> below. The user has btrfs, NFS, and sshfs in usage with this oops.
>
> I've not seen anything similar, but I could have missed it. Does this
> look familiar to anyone?

Nope. And the dget_parent() case itself hasn't even changed - that
BUG_ON() wasn't really added by the lockref code, it's just a
search-and-replace change of a BUG_ON(!d_count) to
BUG_ON(!d_lockref.count). The BUG_ON() existed before.

That whole "dget_parent()" thing is also in the _simple_ case (not RCU
mode), and the BUG_ON is for when the dentry is properly locked, so
that's all "safe" code. The refcount must have gotten corrupted
earlier.

Do you have the mainline git ID of that rawhide kernel? Because there
*was* a real bug in d_rcu_to_refcount. I don't see how it could
trigger that particular issue, but it could trigger scheduling while
in the rcu-protected region and that in turn could result in odd
things down the line, so..

That particular bug exists between commits 15570086b590 ("vfs:
reimplement d_rcu_to_refcount() using lockref_get_or_lock()") that
introduced it, and e5c832d55588 ("vfs: fix dentry RCU to refcounting
possibly sleeping dput()") that should have fixed it. But I don't know
what mainline kernel that "kernel-3.12.0-0.rc0.git16.2.fc21.x86_64" is
based on. I'm sure that information exists somewhere..

Linus

2013-09-10 17:39:21

by Josh Boyer

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 1:33 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Sep 10, 2013 at 10:14 AM, Josh Boyer <[email protected]> wrote:
>>
>> We've had a user report a backtrace from hitting the
>> BUG_ON(!ret->d_lockref.count) added with the lockref infrastructure
>> (commit 98474236f72) on rawhide today[1]. I've grabbed the backtrace
>> below. The user has btrfs, NFS, and sshfs in usage with this oops.
>>
>> I've not seen anything similar, but I could have missed it. Does this
>> look familiar to anyone?
>
> Nope. And the dget_parent() case itself hasn't even changed - that
> BUG_ON() wasn't really added by the lockref code, it's just a
> search-and-replace change of a BUG_ON(!d_count) to
> BUG_ON(!d_lockref.count). The BUG_ON() existed before.
>
> That whole "dget_parent()" thing is also in the _simple_ case (not RCU
> mode), and the BUG_ON is for when the dentry is properly locked, so
> that's all "safe" code. The refcount must have gotten corrupted
> earlier.
>
> Do you have the mainline git ID of that rawhide kernel? Because there
> *was* a real bug in d_rcu_to_refcount. I don't see how it could
> trigger that particular issue, but it could trigger scheduling while
> in the rcu-protected region and that in turn could result in odd
> things down the line, so..
>
> That particular bug exists between commits 15570086b590 ("vfs:
> reimplement d_rcu_to_refcount() using lockref_get_or_lock()") that
> introduced it, and e5c832d55588 ("vfs: fix dentry RCU to refcounting
> possibly sleeping dput()") that should have fixed it. But I don't know
> what mainline kernel that "kernel-3.12.0-0.rc0.git16.2.fc21.x86_64" is
> based on. I'm sure that information exists somewhere..

The subject says v3.11-7890-ge5c832d, which is the git-describe output
of the mainline kernel for that Fedora build. Sorry, I should have
made that clearer. So according to that, it should be based on the
actual commit you identified as the fix.

I have your latest tree as of this morning (v3.11-8716-g26b0332)
building right now and have asked Moneta to test it.

josh

2013-09-10 17:47:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 10:39 AM, Josh Boyer <[email protected]> wrote:
>
> The subject says v3.11-7890-ge5c832d, which is the git-describe output
> of the mainline kernel for that Fedora build.

Duh. I just read the bugzilla and the oops ;)

But yes, e5c832d is obviously the "fixed" kernel. Let me think about this.

Linus

2013-09-10 18:25:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 10:47 AM, Linus Torvalds
<[email protected]> wrote:
>
> But yes, e5c832d is obviously the "fixed" kernel. Let me think about this.

Ok, I think I found it.

I missed that "terminate_walk()" for the RCU case does this:

nd->flags &= ~LOOKUP_RCU;
if (!(nd->flags & LOOKUP_ROOT))
nd->root.mnt = NULL;
unlock_rcu_walk();

and my unlazy_walk() essentially terminated the walk _without_
clearing that nd->root.mnt thing (it did clear the LOOKUP_RCU bit and
unlock_rcy_walk(). So then later, we'd end up doing an extra
path_put(). Explaining a zero d_lockref.count.

The whole damn root.mnt behavior with !LOOKUP_ROOT is a mystery and
needs more comments. But the attached trivial patch should do the
missing portion of terminate_walk().

Al, can you walk us through the rules for what "root.mnt == NULL"
really means? It's basically used as a flag for whether we've gotten
the root pointer or not. But it's pretty damn esoteric.

Now I'm starting to wonder how come _I_ never saw any issues. Maybe it
ends up underflowing so quickly that most people just see a big
negative number..

Patch is entirely untested. Not that my testing apparently has been much good.

Moneta, are you comfortable compiling a test-kernel, or does this need
to become a rawhide package?

Linus


Attachments:
patch.diff (442.00 B)

2013-09-10 18:32:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 11:29 AM, Mace Moneta <[email protected]> wrote:
> I can patch and compile a kernel if you post the patch you want tested.

It was attached in that previous email..

Apparently I'm not the only one who misses things like versions in
subject lines or other small "details" in emails ;)

Linus

2013-09-10 18:43:31

by Al Viro

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 11:25:44AM -0700, Linus Torvalds wrote:
> nd->flags &= ~LOOKUP_RCU;
> if (!(nd->flags & LOOKUP_ROOT))
> nd->root.mnt = NULL;
> unlock_rcu_walk();
>
> and my unlazy_walk() essentially terminated the walk _without_
> clearing that nd->root.mnt thing (it did clear the LOOKUP_RCU bit and
> unlock_rcy_walk(). So then later, we'd end up doing an extra
> path_put(). Explaining a zero d_lockref.count.
>
> The whole damn root.mnt behavior with !LOOKUP_ROOT is a mystery and
> needs more comments. But the attached trivial patch should do the
> missing portion of terminate_walk().
>
> Al, can you walk us through the rules for what "root.mnt == NULL"
> really means? It's basically used as a flag for whether we've gotten
> the root pointer or not. But it's pretty damn esoteric.

LOOKUP_ROOT: the caller has set nd->root and we shouldn't touch that
at all.

!LOOKUP_ROOT: we set nd->root the first time we need / (in the very
beginning if it's an absolute pathname, on the first absolute symlink
otherwise). In non-RCU mode we hold a reference to it; in RCU mode
we do not. As the result, leaving RCU mode should either grab
a reference to the damn thing (if we intend to go on) or zero it out.

2013-09-10 19:01:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 11:43 AM, Al Viro <[email protected]> wrote:
>
> !LOOKUP_ROOT: we set nd->root the first time we need / (in the very
> beginning if it's an absolute pathname, on the first absolute symlink
> otherwise). In non-RCU mode we hold a reference to it; in RCU mode
> we do not. As the result, leaving RCU mode should either grab
> a reference to the damn thing (if we intend to go on) or zero it out.

Yeah, that was what I was thinking. But in particular, I was wondering
if we could simplify unlazy_walk() to _not_ take that root reference
at all, and just always zero it out even if we succeed.

IOW, doing the attached patch (_instead_ of the one I sent out).

Or is there something in path lookup retrying that might get uphappy
if we go back to a NULL root.mnt, and doesn't check it?

Because this patch actually simplifies that nasty/complex
unlazy_walk() a lot, and makes it much more understnadable, I think.
It always looked really odd to me how it used to do "if LOOKUP_ROOT is
_not_ set, let's take a reference count to the root"). With this patch
in place, I really like how straightforward unlazy_walk() is. That
used to be just about _the_ most subtle part of the whole
rcu-to-refcount thing (as demonstrated by the number of bugs it has
exposed in the patches), and now it looks almost trivial.

Mace - ignore this newer patch for now, the one I want you to test is
the minimal one you already have, not this cleanup one. But since you
clearly saw the problem I never did, if you get bored _after_ testing
the first patch, feel free to give this one a whirl too. But notice
that this patch is a replacement for - not in addition to - the first
one.

Linus


Attachments:
patch.diff (1.31 kB)

2013-09-10 19:13:25

by Al Viro

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 12:01:22PM -0700, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 11:43 AM, Al Viro <[email protected]> wrote:
> >
> > !LOOKUP_ROOT: we set nd->root the first time we need / (in the very
> > beginning if it's an absolute pathname, on the first absolute symlink
> > otherwise). In non-RCU mode we hold a reference to it; in RCU mode
> > we do not. As the result, leaving RCU mode should either grab
> > a reference to the damn thing (if we intend to go on) or zero it out.
>
> Yeah, that was what I was thinking. But in particular, I was wondering
> if we could simplify unlazy_walk() to _not_ take that root reference
> at all, and just always zero it out even if we succeed.
>
> IOW, doing the attached patch (_instead_ of the one I sent out).
>
> Or is there something in path lookup retrying that might get uphappy
> if we go back to a NULL root.mnt, and doesn't check it?

Ugh... I really don't like that - your patch introduces the situations
when race with chroot can lead to two absolute symlinks in the same path
being interpreted wrt different roots. And yes, sure, anybody who gets
in that kind of races isn't going to get particulary sane results, but
still, I'd rather have simpler semantics here...

Anyway, I think I see how to tidy the things up in a different way; let's
go with your original fix for now and deal with the cleanups a bit later.

Another thing: could you merge vfs.git#for-linus? It's getting really
messy to keep track of changes both in mainline and in vfs.git, especially
since there's a pile of pending stuff (lru_list work) that needs to be
resolved wrt both...

2013-09-10 19:27:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 12:13 PM, Al Viro <[email protected]> wrote:
>
> Ugh... I really don't like that - your patch introduces the situations
> when race with chroot can lead to two absolute symlinks in the same path
> being interpreted wrt different roots. And yes, sure, anybody who gets
> in that kind of races isn't going to get particulary sane results, but
> still, I'd rather have simpler semantics here...

Well, that "root is whatever it happens to be when we encounter the
slash" semantics is actually what we used to have before nameidata
anyway, so I don't actually believe it's anything new.

But I'll commit the simple patch that doesn't clean things up. I guess
we can always revisit this later (although I fear that once it all
works, we'll just leave it in the ugly state).

> Another thing: could you merge vfs.git#for-linus? It's getting really
> messy to keep track of changes both in mainline and in vfs.git, especially
> since there's a pile of pending stuff (lru_list work) that needs to be
> resolved wrt both...

Will do. I was hoping to get this sorted out first.

Linus

2013-09-10 20:25:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta <[email protected]> wrote:
> The (first) patch looks good; no recurrence. It has only taken 3-5 minutes
> before, and I've been up for about half an hour now.

Ok, good. It's pushed out.

Al, your third pile of VFS stuff is also merged. Waiman, that means
that your RCU path creation stuff is in. What else did you have
pending for scalability?

Linus

2013-09-10 21:51:38

by Josh Boyer

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 4:25 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta <[email protected]> wrote:
>> The (first) patch looks good; no recurrence. It has only taken 3-5 minutes
>> before, and I've been up for about half an hour now.
>
> Ok, good. It's pushed out.

Thanks to the both of you for the quick resolution.

josh

2013-09-11 01:49:09

by Waiman Long

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On 09/10/2013 04:25 PM, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 12:57 PM, Mace Moneta<[email protected]> wrote:
>> The (first) patch looks good; no recurrence. It has only taken 3-5 minutes
>> before, and I've been up for about half an hour now.
> Ok, good. It's pushed out.
>
> Al, your third pile of VFS stuff is also merged. Waiman, that means
> that your RCU path creation stuff is in. What else did you have
> pending for scalability?
>
> Linus

I need to clean up some comments in the code. The other thing that I
want to do is to introduce read_seqlock/read_sequnlock() primitives that
do the locking without incrementing the sequence number. Then all the
name lookup and translation code can use the new primitives as they
don't change any of the protected structures. This will prevent one
sequence number check failure from cascading into a series of failures
because of the sequence number change. I will have a patch ready by
tomorrow morning.

Regards,
Longman

2013-09-11 03:22:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 6:48 PM, Waiman Long <[email protected]> wrote:
>
> I need to clean up some comments in the code. The other thing that I want to
> do is to introduce read_seqlock/read_sequnlock() primitives that do the
> locking without incrementing the sequence number.

I don't understand. That's the whole *point* of sequence numbers. The
read-locker doesn't ever change the number, only the write-locker
does.

So I don't see what you would change.

Linus

2013-09-11 03:56:42

by Al Viro

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 08:22:56PM -0700, Linus Torvalds wrote:
> On Tue, Sep 10, 2013 at 6:48 PM, Waiman Long <[email protected]> wrote:
> >
> > I need to clean up some comments in the code. The other thing that I want to
> > do is to introduce read_seqlock/read_sequnlock() primitives that do the
> > locking without incrementing the sequence number.
>
> I don't understand. That's the whole *point* of sequence numbers. The
> read-locker doesn't ever change the number, only the write-locker
> does.
>
> So I don't see what you would change.

I do. What we need on the second pass (one where we currently
take seq_writelock()) is exclusion against writers; nothing we are
doing is worth disturbing the readers - we don't change any data
structures. And simple grabbing the spinlock, without touching the
sequence number would achieve exactly that. Writers will have to
wait and won't be able to disturb us, readers won't notice anything
happening. So yes, this extra primitive does make sense here.

2013-09-11 04:12:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/dcache.c:648! with v3.11-7890-ge5c832d

On Tue, Sep 10, 2013 at 8:56 PM, Al Viro <[email protected]> wrote:
>
> I do. What we need on the second pass (one where we currently
> take seq_writelock()) is exclusion against writers; nothing we are
> doing is worth disturbing the readers - we don't change any data
> structures. And simple grabbing the spinlock, without touching the
> sequence number would achieve exactly that. Writers will have to
> wait and won't be able to disturb us, readers won't notice anything
> happening. So yes, this extra primitive does make sense here.

Ahh. Yes, as a fallback from the reader-side sequence lock that makes
perfect sense..

Linus