2014-11-12 16:20:00

by Jeff Layton

[permalink] [raw]
Subject: BUG: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]

Hi Bruce,

I've seen the following crash happen on a couple of different kernels
recently. This one is from an older kernel that I just happened to be
testing some other code on, but I've seen the same oops as well on
v3.17.2 kernel from the fedora repos:

[ 891.382557] BUG: unable to handle kernel paging request at ffffffffa026b848
[ 891.383474] IP: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
[ 891.383474] PGD 1c0f067 PUD 1c10063 PMD da7ec067 PTE 0
[ 891.383474] Oops: 0000 [#1] SMP
[ 891.383474] Modules linked in: rpcsec_gss_krb5 microcode pvpanic virtio_net virtio_balloon nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 qxl drm_kms_helper ttm drm virtio_blk i2c_core
[ 891.383474] CPU: 1 PID: 653 Comm: nfsd Not tainted 3.14.15-.ds.1+ #5
[ 891.383474] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 891.383474] task: ffff8800d1aa0000 ti: ffff8800d1a9c000 task.ti: ffff8800d1a9c000
[ 891.383474] RIP: 0010:[<ffffffffa01dac02>] [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
[ 891.383474] RSP: 0018:ffff8800d1a9dda8 EFLAGS: 00010202
[ 891.383474] RAX: 0000000000075b40 RBX: ffff880118f47000 RCX: ffff880118f46080
[ 891.383474] RDX: 0000000000000006 RSI: ffff880118f46000 RDI: 0000000000000003
[ 891.383474] RBP: ffff8800d1a9ddf0 R08: ffffffffa01f6120 R09: 0000000000000000
[ 891.383474] R10: 0000000000000400 R11: ffff880118f46220 R12: ffff880118fc4000
[ 891.383474] R13: ffff880118f47040 R14: ffff880118f46000 R15: 0000000000000080
[ 891.383474] FS: 0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
[ 891.383474] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 891.383474] CR2: ffffffffa026b848 CR3: 0000000118d01000 CR4: 00000000000006e0
[ 891.383474] Stack:
[ 891.383474] ffff880118f46220 ffffffffa01f6120 ffff880118f46220 ffff880118f47180
[ 891.383474] ffff880118fc4000 ffffffffa01fe6b8 ffff8800d1ace018 000000000000001c
[ 891.383474] ffff8800d1ace000 ffff8800d1a9de28 ffffffffa01c7e1b ffff880118fc4000
[ 891.383474] Call Trace:
[ 891.383474] [<ffffffffa01c7e1b>] nfsd_dispatch+0xbb/0x210 [nfsd]
[ 891.383474] [<ffffffffa017dbf1>] svc_process_common+0x421/0x690 [sunrpc]
[ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 891.383474] [<ffffffffa017df73>] svc_process+0x113/0x1a0 [sunrpc]
[ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 891.383474] [<ffffffffa01c7777>] nfsd+0xc7/0x130 [nfsd]
[ 891.383474] [<ffffffff810a5bfb>] kthread+0xdb/0x100
[ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
[ 891.383474] [<ffffffff8162857c>] ret_from_fork+0x7c/0xb0
[ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
[ 891.383474] Code: 5d 1f a0 10 0f 84 c6 fd ff ff 3b 46 6c 0f 84 bd fd ff ff 48 8d 14 40 48 8d 04 90 48 c1 e0 05 48 63 04 01 48 8d 04 40 48 c1 e0 04 <f6> 80 08 5d 1f a0 08 0f 85 98 fd ff ff 48 8b bb d0 00 00 00 4c
[ 891.383474] RIP [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
[ 891.383474] RSP <ffff8800d1a9dda8>
[ 891.383474] CR2: ffffffffa026b848
[ 891.528734] ---[ end trace cb3422188344cc64 ]---

gdb says:

(gdb) list *(nfsd4_proc_compound+0x432)
0x13c32 is in nfsd4_proc_compound (fs/nfsd/nfs4proc.c:1376).
1371 opdesc->op_set_currentstateid(cstate, &op->u);
1372
1373 if (opdesc->op_flags & OP_CLEAR_STATEID)
1374 clear_current_stateid(cstate);
1375
1376 if (need_wrongsec_check(rqstp))
1377 op->status = check_nfsd_access(cstate->current_fh.fh_export, rqstp);
1378 }
1379
1380 encode_op:

...and looking at the disassembly, it looks like it's falling down in
this line in need_wrongsec_check:

return !(nextd->op_flags & OP_HANDLES_WRONGSEC);

0xffffffffa01dabea <nfsd4_proc_compound+0x41a>: lea (%rax,%rax,2),%rdx
0xffffffffa01dabee <nfsd4_proc_compound+0x41e>: lea (%rax,%rdx,4),%rax
0xffffffffa01dabf2 <nfsd4_proc_compound+0x422>: shl $0x5,%rax
0xffffffffa01dabf6 <nfsd4_proc_compound+0x426>: movslq (%rcx,%rax,1),%rax
0xffffffffa01dabfa <nfsd4_proc_compound+0x42a>: lea (%rax,%rax,2),%rax
0xffffffffa01dabfe <nfsd4_proc_compound+0x42e>: shl $0x4,%rax
0xffffffffa01dac02 <nfsd4_proc_compound+0x432>: testb $0x8,-0x5fe0a2f8(%rax) <<<< CRASH HERE

I think that means that "nextd" was probably bogus, so maybe we walked
off the end of the argp->ops array? I haven't been able to reproduce it
at will yet, but both times it happened while using v4.1.


Seen anything like this before?

--
Jeff Layton <[email protected]>


2014-11-12 16:24:51

by J. Bruce Fields

[permalink] [raw]
Subject: Re: BUG: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]

On Wed, Nov 12, 2014 at 11:19:55AM -0500, Jeff Layton wrote:
> Hi Bruce,
>
> I've seen the following crash happen on a couple of different kernels
> recently. This one is from an older kernel that I just happened to be
> testing some other code on, but I've seen the same oops as well on
> v3.17.2 kernel from the fedora repos:

Do you have 51904b08072a "nfsd4: fix crash on unknown operation number"
?

--b.

>
> [ 891.382557] BUG: unable to handle kernel paging request at ffffffffa026b848
> [ 891.383474] IP: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> [ 891.383474] PGD 1c0f067 PUD 1c10063 PMD da7ec067 PTE 0
> [ 891.383474] Oops: 0000 [#1] SMP
> [ 891.383474] Modules linked in: rpcsec_gss_krb5 microcode pvpanic virtio_net virtio_balloon nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 qxl drm_kms_helper ttm drm virtio_blk i2c_core
> [ 891.383474] CPU: 1 PID: 653 Comm: nfsd Not tainted 3.14.15-.ds.1+ #5
> [ 891.383474] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 891.383474] task: ffff8800d1aa0000 ti: ffff8800d1a9c000 task.ti: ffff8800d1a9c000
> [ 891.383474] RIP: 0010:[<ffffffffa01dac02>] [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> [ 891.383474] RSP: 0018:ffff8800d1a9dda8 EFLAGS: 00010202
> [ 891.383474] RAX: 0000000000075b40 RBX: ffff880118f47000 RCX: ffff880118f46080
> [ 891.383474] RDX: 0000000000000006 RSI: ffff880118f46000 RDI: 0000000000000003
> [ 891.383474] RBP: ffff8800d1a9ddf0 R08: ffffffffa01f6120 R09: 0000000000000000
> [ 891.383474] R10: 0000000000000400 R11: ffff880118f46220 R12: ffff880118fc4000
> [ 891.383474] R13: ffff880118f47040 R14: ffff880118f46000 R15: 0000000000000080
> [ 891.383474] FS: 0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> [ 891.383474] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 891.383474] CR2: ffffffffa026b848 CR3: 0000000118d01000 CR4: 00000000000006e0
> [ 891.383474] Stack:
> [ 891.383474] ffff880118f46220 ffffffffa01f6120 ffff880118f46220 ffff880118f47180
> [ 891.383474] ffff880118fc4000 ffffffffa01fe6b8 ffff8800d1ace018 000000000000001c
> [ 891.383474] ffff8800d1ace000 ffff8800d1a9de28 ffffffffa01c7e1b ffff880118fc4000
> [ 891.383474] Call Trace:
> [ 891.383474] [<ffffffffa01c7e1b>] nfsd_dispatch+0xbb/0x210 [nfsd]
> [ 891.383474] [<ffffffffa017dbf1>] svc_process_common+0x421/0x690 [sunrpc]
> [ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
> [ 891.383474] [<ffffffffa017df73>] svc_process+0x113/0x1a0 [sunrpc]
> [ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
> [ 891.383474] [<ffffffffa01c7777>] nfsd+0xc7/0x130 [nfsd]
> [ 891.383474] [<ffffffff810a5bfb>] kthread+0xdb/0x100
> [ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
> [ 891.383474] [<ffffffff8162857c>] ret_from_fork+0x7c/0xb0
> [ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
> [ 891.383474] Code: 5d 1f a0 10 0f 84 c6 fd ff ff 3b 46 6c 0f 84 bd fd ff ff 48 8d 14 40 48 8d 04 90 48 c1 e0 05 48 63 04 01 48 8d 04 40 48 c1 e0 04 <f6> 80 08 5d 1f a0 08 0f 85 98 fd ff ff 48 8b bb d0 00 00 00 4c
> [ 891.383474] RIP [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> [ 891.383474] RSP <ffff8800d1a9dda8>
> [ 891.383474] CR2: ffffffffa026b848
> [ 891.528734] ---[ end trace cb3422188344cc64 ]---
>
> gdb says:
>
> (gdb) list *(nfsd4_proc_compound+0x432)
> 0x13c32 is in nfsd4_proc_compound (fs/nfsd/nfs4proc.c:1376).
> 1371 opdesc->op_set_currentstateid(cstate, &op->u);
> 1372
> 1373 if (opdesc->op_flags & OP_CLEAR_STATEID)
> 1374 clear_current_stateid(cstate);
> 1375
> 1376 if (need_wrongsec_check(rqstp))
> 1377 op->status = check_nfsd_access(cstate->current_fh.fh_export, rqstp);
> 1378 }
> 1379
> 1380 encode_op:
>
> ...and looking at the disassembly, it looks like it's falling down in
> this line in need_wrongsec_check:
>
> return !(nextd->op_flags & OP_HANDLES_WRONGSEC);
>
> 0xffffffffa01dabea <nfsd4_proc_compound+0x41a>: lea (%rax,%rax,2),%rdx
> 0xffffffffa01dabee <nfsd4_proc_compound+0x41e>: lea (%rax,%rdx,4),%rax
> 0xffffffffa01dabf2 <nfsd4_proc_compound+0x422>: shl $0x5,%rax
> 0xffffffffa01dabf6 <nfsd4_proc_compound+0x426>: movslq (%rcx,%rax,1),%rax
> 0xffffffffa01dabfa <nfsd4_proc_compound+0x42a>: lea (%rax,%rax,2),%rax
> 0xffffffffa01dabfe <nfsd4_proc_compound+0x42e>: shl $0x4,%rax
> 0xffffffffa01dac02 <nfsd4_proc_compound+0x432>: testb $0x8,-0x5fe0a2f8(%rax) <<<< CRASH HERE
>
> I think that means that "nextd" was probably bogus, so maybe we walked
> off the end of the argp->ops array? I haven't been able to reproduce it
> at will yet, but both times it happened while using v4.1.
>
>
> Seen anything like this before?
>
> --
> Jeff Layton <[email protected]>

2014-11-12 16:53:34

by J. Bruce Fields

[permalink] [raw]
Subject: Re: BUG: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]

On Wed, Nov 12, 2014 at 11:28:21AM -0500, Jeff Layton wrote:
> On Wed, 12 Nov 2014 11:24:51 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Wed, Nov 12, 2014 at 11:19:55AM -0500, Jeff Layton wrote:
> > > Hi Bruce,
> > >
> > > I've seen the following crash happen on a couple of different kernels
> > > recently. This one is from an older kernel that I just happened to be
> > > testing some other code on, but I've seen the same oops as well on
> > > v3.17.2 kernel from the fedora repos:
> >
> > Do you have 51904b08072a "nfsd4: fix crash on unknown operation number"
> > ?
> >
> > --b.
> >
>
> Ahh no...neither one has that. I'll apply it to the test kernel I'm
> using and see if it goes away. Thanks!
>
> That said...wonder what the client would be sending that would cause
> this to fire?

There was a mis-merge, and CAP_SEEK is set in the wrong place in
fs/nfs/nfs4proc.c. If you grep for CAP_SEEK in that file you'll see the
problem.

Trond, Anna, that still seems to be wrong upstream? What happened to
that fix?

--b.

2014-11-12 16:28:26

by Jeff Layton

[permalink] [raw]
Subject: Re: BUG: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]

On Wed, 12 Nov 2014 11:24:51 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Nov 12, 2014 at 11:19:55AM -0500, Jeff Layton wrote:
> > Hi Bruce,
> >
> > I've seen the following crash happen on a couple of different kernels
> > recently. This one is from an older kernel that I just happened to be
> > testing some other code on, but I've seen the same oops as well on
> > v3.17.2 kernel from the fedora repos:
>
> Do you have 51904b08072a "nfsd4: fix crash on unknown operation number"
> ?
>
> --b.
>

Ahh no...neither one has that. I'll apply it to the test kernel I'm
using and see if it goes away. Thanks!

That said...wonder what the client would be sending that would cause
this to fire?

> >
> > [ 891.382557] BUG: unable to handle kernel paging request at ffffffffa026b848
> > [ 891.383474] IP: [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> > [ 891.383474] PGD 1c0f067 PUD 1c10063 PMD da7ec067 PTE 0
> > [ 891.383474] Oops: 0000 [#1] SMP
> > [ 891.383474] Modules linked in: rpcsec_gss_krb5 microcode pvpanic virtio_net virtio_balloon nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 qxl drm_kms_helper ttm drm virtio_blk i2c_core
> > [ 891.383474] CPU: 1 PID: 653 Comm: nfsd Not tainted 3.14.15-.ds.1+ #5
> > [ 891.383474] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 891.383474] task: ffff8800d1aa0000 ti: ffff8800d1a9c000 task.ti: ffff8800d1a9c000
> > [ 891.383474] RIP: 0010:[<ffffffffa01dac02>] [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> > [ 891.383474] RSP: 0018:ffff8800d1a9dda8 EFLAGS: 00010202
> > [ 891.383474] RAX: 0000000000075b40 RBX: ffff880118f47000 RCX: ffff880118f46080
> > [ 891.383474] RDX: 0000000000000006 RSI: ffff880118f46000 RDI: 0000000000000003
> > [ 891.383474] RBP: ffff8800d1a9ddf0 R08: ffffffffa01f6120 R09: 0000000000000000
> > [ 891.383474] R10: 0000000000000400 R11: ffff880118f46220 R12: ffff880118fc4000
> > [ 891.383474] R13: ffff880118f47040 R14: ffff880118f46000 R15: 0000000000000080
> > [ 891.383474] FS: 0000000000000000(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> > [ 891.383474] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 891.383474] CR2: ffffffffa026b848 CR3: 0000000118d01000 CR4: 00000000000006e0
> > [ 891.383474] Stack:
> > [ 891.383474] ffff880118f46220 ffffffffa01f6120 ffff880118f46220 ffff880118f47180
> > [ 891.383474] ffff880118fc4000 ffffffffa01fe6b8 ffff8800d1ace018 000000000000001c
> > [ 891.383474] ffff8800d1ace000 ffff8800d1a9de28 ffffffffa01c7e1b ffff880118fc4000
> > [ 891.383474] Call Trace:
> > [ 891.383474] [<ffffffffa01c7e1b>] nfsd_dispatch+0xbb/0x210 [nfsd]
> > [ 891.383474] [<ffffffffa017dbf1>] svc_process_common+0x421/0x690 [sunrpc]
> > [ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
> > [ 891.383474] [<ffffffffa017df73>] svc_process+0x113/0x1a0 [sunrpc]
> > [ 891.383474] [<ffffffffa01c76b0>] ? nfsd_destroy+0x80/0x80 [nfsd]
> > [ 891.383474] [<ffffffffa01c7777>] nfsd+0xc7/0x130 [nfsd]
> > [ 891.383474] [<ffffffff810a5bfb>] kthread+0xdb/0x100
> > [ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
> > [ 891.383474] [<ffffffff8162857c>] ret_from_fork+0x7c/0xb0
> > [ 891.383474] [<ffffffff810a5b20>] ? insert_kthread_work+0x40/0x40
> > [ 891.383474] Code: 5d 1f a0 10 0f 84 c6 fd ff ff 3b 46 6c 0f 84 bd fd ff ff 48 8d 14 40 48 8d 04 90 48 c1 e0 05 48 63 04 01 48 8d 04 40 48 c1 e0 04 <f6> 80 08 5d 1f a0 08 0f 85 98 fd ff ff 48 8b bb d0 00 00 00 4c
> > [ 891.383474] RIP [<ffffffffa01dac02>] nfsd4_proc_compound+0x432/0x6c0 [nfsd]
> > [ 891.383474] RSP <ffff8800d1a9dda8>
> > [ 891.383474] CR2: ffffffffa026b848
> > [ 891.528734] ---[ end trace cb3422188344cc64 ]---
> >
> > gdb says:
> >
> > (gdb) list *(nfsd4_proc_compound+0x432)
> > 0x13c32 is in nfsd4_proc_compound (fs/nfsd/nfs4proc.c:1376).
> > 1371 opdesc->op_set_currentstateid(cstate, &op->u);
> > 1372
> > 1373 if (opdesc->op_flags & OP_CLEAR_STATEID)
> > 1374 clear_current_stateid(cstate);
> > 1375
> > 1376 if (need_wrongsec_check(rqstp))
> > 1377 op->status = check_nfsd_access(cstate->current_fh.fh_export, rqstp);
> > 1378 }
> > 1379
> > 1380 encode_op:
> >
> > ...and looking at the disassembly, it looks like it's falling down in
> > this line in need_wrongsec_check:
> >
> > return !(nextd->op_flags & OP_HANDLES_WRONGSEC);
> >
> > 0xffffffffa01dabea <nfsd4_proc_compound+0x41a>: lea (%rax,%rax,2),%rdx
> > 0xffffffffa01dabee <nfsd4_proc_compound+0x41e>: lea (%rax,%rdx,4),%rax
> > 0xffffffffa01dabf2 <nfsd4_proc_compound+0x422>: shl $0x5,%rax
> > 0xffffffffa01dabf6 <nfsd4_proc_compound+0x426>: movslq (%rcx,%rax,1),%rax
> > 0xffffffffa01dabfa <nfsd4_proc_compound+0x42a>: lea (%rax,%rax,2),%rax
> > 0xffffffffa01dabfe <nfsd4_proc_compound+0x42e>: shl $0x4,%rax
> > 0xffffffffa01dac02 <nfsd4_proc_compound+0x432>: testb $0x8,-0x5fe0a2f8(%rax) <<<< CRASH HERE
> >
> > I think that means that "nextd" was probably bogus, so maybe we walked
> > off the end of the argp->ops array? I haven't been able to reproduce it
> > at will yet, but both times it happened while using v4.1.
> >
> >
> > Seen anything like this before?
> >
> > --
> > Jeff Layton <[email protected]>


--
Jeff Layton <[email protected]>