2018-11-08 21:44:54

by J. Bruce Fields

Subject: NULL dereference in rpcauth_lookup_credcache

Since -rc1 my regression tests crash my client. Is this a known
problem? I'll investigate some more; I haven't even looked at the code
yet or checked which test exactly is hitting this.

--b.

[ 164.109570] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 164.111207] PGD 0 P4D 0
[ 164.111528] Oops: 0000 [#1] PREEMPT SMP PTI
[ 164.112303] CPU: 2 PID: 2947 Comm: kworker/u8:5 Not tainted 4.20.0-rc1-13223-gafb6d1c474ef #1898
[ 164.113487] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 164.115301] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 164.115920] RIP: 0010:rpcauth_lookup_credcache+0x3d/0x450 [sunrpc]
[ 164.116700] Code: 89 f5 41 54 41 89 d4 53 48 83 ec 38 89 4d b0 4c 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8d 45 c0 48 89 45 c8 <41> 8b 77 08 48 89 45 c0 48 8b 47 10 4c 89 ef 48 8b 40 28 e8 cb d2
[ 164.119299] RSP: 0018:ffffc90001ee3cf0 EFLAGS: 00010246
[ 164.119872] RAX: ffffc90001ee3d10 RBX: ffff88007cc18180 RCX: 0000000000600040
[ 164.120800] RDX: 0000000000000001 RSI: ffffc90001ee3d60 RDI: ffff88007cafb198
[ 164.121643] RBP: ffffc90001ee3d50 R08: 0000000000000000 R09: 0000000000000000
[ 164.122464] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 164.123373] R13: ffffc90001ee3d60 R14: ffff88007cafb198 R15: 0000000000000000
[ 164.124296] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 164.125322] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 164.126006] CR2: 0000000000000008 CR3: 000000007829c003 CR4: 00000000001606e0
[ 164.126860] Call Trace:
[ 164.127045] ? call_retry_reserve+0x30/0x30 [sunrpc]
[ 164.127622] rpcauth_lookupcred+0xa0/0xc0 [sunrpc]
[ 164.128200] rpcauth_refreshcred+0x15f/0x170 [sunrpc]
[ 164.128807] __rpc_execute+0xa9/0x460 [sunrpc]
[ 164.129281] process_one_work+0x227/0x630
[ 164.129684] worker_thread+0x3c/0x390
[ 164.130062] ? process_one_work+0x630/0x630
[ 164.130609] kthread+0x11d/0x140
[ 164.130936] ? kthread_park+0x80/0x80
[ 164.131339] ret_from_fork+0x3a/0x50
[ 164.131676] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs lockd grace auth_rpcgss sunrpc
[ 164.132719] CR2: 0000000000000008
[ 164.133050] ---[ end trace b4028a6781a696ad ]---
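
A note on reading the oops above, offered as a sketch rather than a
diagnosis: the byte marked <41> in the Code: line begins the faulting
instruction, 41 8b 77 08, and a few bytes earlier 4c 8b 7f 20 loads R15.
The Python helper below decodes both. decode_mov_load is a hypothetical
name, and it handles only this one "REX prefix + 8b + ModRM + disp8" form,
not x86-64 in general. The output is consistent with R15 = 0 and CR2 = 8 in
the register dump: a pointer loaded from offset 0x20 off RDI was NULL and
was then read at offset 8. Which structure fields those offsets correspond
to is not established by the trace itself.

```python
# Register names indexed by the 4-bit register number (3 ModRM bits + 1 REX bit).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{i}" for i in range(8, 16)
]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [
    f"r{i}d" for i in range(8, 16)
]


def decode_mov_load(insn: bytes) -> str:
    """Decode a 4-byte `REX 8b ModRM disp8` load: mov reg, [base+disp8]."""
    if len(insn) != 4:
        raise ValueError("expected exactly 4 bytes")
    rex, opcode, modrm, disp = insn
    if rex & 0xF0 != 0x40 or opcode != 0x8B:
        raise ValueError("only REX-prefixed opcode 8b is handled")
    mod = modrm >> 6
    if mod != 1 or (modrm & 7) == 4:
        raise ValueError("only disp8 addressing without a SIB byte is handled")
    # REX.R (bit 2) extends the destination, REX.B (bit 0) extends the base.
    reg = ((modrm >> 3) & 7) | (((rex >> 2) & 1) << 3)
    base = (modrm & 7) | ((rex & 1) << 3)
    # REX.W (bit 3) selects a 64-bit destination, otherwise 32-bit.
    dest = (REG64 if rex & 0x08 else REG32)[reg]
    # disp8 is a signed byte.
    if disp > 127:
        disp -= 256
    return f"mov {dest}, [{REG64[base]}{disp:+#x}]"


print(decode_mov_load(bytes.fromhex("4c8b7f20")))  # load of R15
print(decode_mov_load(bytes.fromhex("418b7708")))  # faulting instruction
```

For a full disassembly of the Code: line, the kernel tree's
scripts/decodecode is the usual tool; the helper above only illustrates
why a fault address of 8 points at a NULL base pointer.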



2018-11-09 18:01:36

by Chuck Lever

Subject: Re: NULL dereference in rpcauth_lookup_credcache



> On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <[email protected]> wrote:
>
> Since -rc1 my regression tests crash my client. Is this a known
> problem? I'll investigate some more; I haven't even looked at the code
> yet or checked which test exactly is hitting this.
>
> --b.
>

I just encountered this repeatedly with cthon04 general tests.

MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"


--
Chuck Lever
[email protected]




2018-11-10 21:49:41

by J. Bruce Fields

Subject: Re: NULL dereference in rpcauth_lookup_credcache

Looks like it's the fault of

07d02a67b7faae "SUNRPC: Simplify lookup code"

--b.


2018-11-12 17:59:40

by Trond Myklebust

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> Looks like it's the fault of
>
> 07d02a67b7faae "SUNRPC: Simplify lookup code"

I'm having trouble reproducing this bug. I've tried both cthon and
xfstests in a loop, so far without success (both NFSv3 and v4.1, but
only sec=sys). Is there anything else you're doing that I might try?

e.g. Are you running multiple workloads in parallel? Different users?..

--
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
http://www.hammer.space


2018-11-12 18:16:51

by Chuck Lever

Subject: Re: NULL dereference in rpcauth_lookup_credcache



> On Nov 12, 2018, at 9:59 AM, Trond Myklebust <[email protected]> wrote:
>
> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>> Looks like it's the fault of
>>
>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
>
> I'm having trouble reproducing this bug. I've tried both cthon and
> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
> only sec=sys). Is there anything else you're doing that I might try?
>
> e.g. Are you running multiple workloads in parallel? Different users?..

Some observations, for what they are worth:

Single user test running with no other NFS workload.

I see the BUG fire at umount time, not during the test.

My client is a two-node NUMA system with 12 cores, which
could be more likely to trigger races.

Export is tmpfs.



--
Chuck Lever
[email protected]




2018-11-12 18:19:38

by Trond Myklebust

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, 2018-11-12 at 10:16 -0800, Chuck Lever wrote:
> > On Nov 12, 2018, at 9:59 AM, Trond Myklebust <
> > [email protected]> wrote:
> >
> > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > Looks like it's the fault of
> > >
> > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> >
> > I'm having trouble reproducing this bug. I've tried both cthon and
> > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > but
> > only sec=sys). Is there anything else you're doing that I might
> > try?
> >
> > e.g. Are you running multiple workloads in parallel? Different
> > users?..
>
> Some observations, for what they are worth:
>
> Single user test running with no other NFS workload.
>
> I see the BUG fire at umount time, not during the test.
>
> My client is a two-node NUMA system with 12 cores, which
> could be more likely to trigger races.
>
> Export is tmpfs.
>

Thanks! That's useful info. Particularly the observation that you're
seeing it at umount time...

--
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
http://www.hammer.space


2018-11-12 18:24:56

by J. Bruce Fields

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > Looks like it's the fault of
> >
> > 07d02a67b7faae "SUNRPC: Simplify lookup code"
>
> I'm having trouble reproducing this bug. I've tried both cthon and
> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
> only sec=sys). Is there anything else you're doing that I might try?
>
> e.g. Are you running multiple workloads in parallel? Different users?..

Nothing that interesting. Currently it's connectathon over v4, v3,
v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just serially
one after the other. Then some pynfs tests (which bypass the client),
then xfstests over v4.2/sys. And also a few one-off locking tests of my
own that probably aren't a factor here.

(Hah, I just realized I was mounting with vers=4 and assuming that meant
4.0, but actually it's changed over time depending on the defaults, so
currently those "v4" runs are actually all 4.2. Gah.)

--b.
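
One way to avoid the vers=4 surprise described above is to read the
negotiated version back from the mount table instead of trusting the
option that was passed to mount. The sketch below assumes the client
reports the effective vers= option in /proc/mounts for nfs and nfs4
entries; nfs_mount_versions and the sample line are made up for
illustration.

```python
def nfs_mount_versions(mounts_text):
    """Map each NFS mount point to the vers= value listed in its options."""
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            options[key] = value
        # None means the entry carried no vers= option at all.
        versions[fields[1]] = options.get("vers")
    return versions


# Hypothetical /proc/mounts line for a mount requested with vers=4.
sample = "server:/export /mnt/test nfs4 rw,relatime,vers=4.2,proto=tcp,sec=sys 0 0\n"
print(nfs_mount_versions(sample))
```

On a live client the input would be the contents of /proc/mounts, for
example open("/proc/mounts").read().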


2018-11-12 21:17:22

by Trond Myklebust

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > Looks like it's the fault of
> > >
> > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> >
> > I'm having trouble reproducing this bug. I've tried both cthon and
> > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > but
> > only sec=sys). Is there anything else you're doing that I might
> > try?
> >
> > e.g. Are you running multiple workloads in parallel? Different
> > users?..
>
> Nothing that interesting. Currently it's connectathon over v4, v3,
> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> serially
> one after the other. Then some pynfs tests (which bypass the
> client),
> then xfstests over v4.2/sys. And also a few one-off locking tests of
> my
> own that probably aren't a factor here.
>
> (Hah, I just realized I was mounting with vers=4 and assuming that
> meant
> 4.0, but actually it's changed over time depending on the defaults,
> so
> currently those "v4" runs are actually all 4.2. Gah.)

Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
EXCHANGE_ID authentication? The client will attempt to use that by
default if rpc.gssd is running.

I ask because I think the issue might be with RPCSEC_GSS, specifically
with the RPCSEC_GSS context destroy code, hence the 2 patches that I
just sent out.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2018-11-12 23:01:51

by J. Bruce Fields

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
> > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > Looks like it's the fault of
> > > >
> > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > >
> > > I'm having trouble reproducing this bug. I've tried both cthon and
> > > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > > but
> > > only sec=sys). Is there anything else you're doing that I might
> > > try?
> > >
> > > e.g. Are you running multiple workloads in parallel? Different
> > > users?..
> >
> > Nothing that interesting. Currently it's connectathon over v4, v3,
> > v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> > serially
> > one after the other. Then some pynfs tests (which bypass the
> > client),
> > then xfstests over v4.2/sys. And also a few one-off locking tests of
> > my
> > own that probably aren't a factor here.
> >
> > (Hah, I just realized I was mounting with vers=4 and assuming that
> > meant
> > 4.0, but actually it's changed over time depending on the defaults,
> > so
> > currently those "v4" runs are actually all 4.2. Gah.)
>
> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
> EXCHANGE_ID authentication? The client will attempt to use that by
> default if rpc.gssd is running.

Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
are using krb5i for EXCHANGE_ID.

> I ask because I think the issue might be with RPCSEC_GSS, specifically
> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
> just sent out.

Looks like my tests pass after applying those two patches.

--b.

2018-11-12 23:58:24

by Trond Myklebust

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, 2018-11-12 at 18:01 -0500, [email protected] wrote:
> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> > On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
> > > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > > Looks like it's the fault of
> > > > >
> > > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > > >
> > > > I'm having trouble reproducing this bug. I've tried both cthon
> > > > and
> > > > xfstests in a loop, so far without success (both NFSv3 and
> > > > v4.1,
> > > > but
> > > > only sec=sys). Is there anything else you're doing that I might
> > > > try?
> > > >
> > > > e.g. Are you running multiple workloads in parallel? Different
> > > > users?..
> > >
> > > Nothing that interesting. Currently it's connectathon over v4,
> > > v3,
> > > v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> > > serially
> > > one after the other. Then some pynfs tests (which bypass the
> > > client),
> > > then xfstests over v4.2/sys. And also a few one-off locking
> > > tests of
> > > my
> > > own that probably aren't a factor here.
> > >
> > > (Hah, I just realized I was mounting with vers=4 and assuming
> > > that
> > > meant
> > > 4.0, but actually it's changed over time depending on the
> > > defaults,
> > > so
> > > currently those "v4" runs are actually all 4.2. Gah.)
> >
> > Are you perhaps both using RPCSEC_GSS w/ integrity checking for
> > your
> > EXCHANGE_ID authentication? The client will attempt to use that by
> > default if rpc.gssd is running.
>
> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p
> mounts
> are using krb5i for EXCHANGE_ID.
>
> > I ask because I think the issue might be with RPCSEC_GSS,
> > specifically
> > with the RPCSEC_GSS context destroy code, hence the 2 patches that
> > I
> > just sent out.
>
> Looks like my tests pass after applying those two patches.
>

Cool! Thanks for testing.

Chuck, do you think the above might also explain your sighting of the
same Oops?

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2018-11-13 00:01:01

by Chuck Lever III

Subject: Re: NULL dereference in rpcauth_lookup_credcache


> On Nov 12, 2018, at 3:57 PM, Trond Myklebust <[email protected]> wrote:
>
>> On Mon, 2018-11-12 at 18:01 -0500, [email protected] wrote:
>>> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
>>>> On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
>>>>> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
>>>>>> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>>>>>> Looks like it's the fault of
>>>>>>
>>>>>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
>>>>>
>>>>> I'm having trouble reproducing this bug. I've tried both cthon
>>>>> and
>>>>> xfstests in a loop, so far without success (both NFSv3 and
>>>>> v4.1,
>>>>> but
>>>>> only sec=sys). Is there anything else you're doing that I might
>>>>> try?
>>>>>
>>>>> e.g. Are you running multiple workloads in parallel? Different
>>>>> users?..
>>>>
>>>> Nothing that interesting. Currently it's connectathon over v4, v3,
>>>> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just serially
>>>> one after the other. Then some pynfs tests (which bypass the client),
>>>> then xfstests over v4.2/sys. And also a few one-off locking tests of my
>>>> own that probably aren't a factor here.
>>>>
>>>> (Hah, I just realized I was mounting with vers=4 and assuming that meant
>>>> 4.0, but actually it's changed over time depending on the defaults, so
>>>> currently those "v4" runs are actually all 4.2. Gah.)
>>>
>>> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
>>> EXCHANGE_ID authentication? The client will attempt to use that by
>>> default if rpc.gssd is running.
>>
>> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
>> are using krb5i for EXCHANGE_ID.
>>
>>> I ask because I think the issue might be with RPCSEC_GSS, specifically
>>> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
>>> just sent out.
>>
>> Looks like my tests pass after applying those two patches.
>>
>
> Cool! Thanks for testing.
>
> Chuck, do you think the above might also explain your sighting of the
> same Oops?

Could be, I don’t think I saw it until I started testing NFSv4.
I won’t be able to confirm that until next week.


> Cheers
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>


2018-11-13 00:08:45

by Trond Myklebust

Subject: Re: NULL dereference in rpcauth_lookup_credcache

On Mon, 2018-11-12 at 16:00 -0800, Chuck Lever wrote:
> > On Nov 12, 2018, at 3:57 PM, Trond Myklebust <[email protected]> wrote:
> >
> > > On Mon, 2018-11-12 at 18:01 -0500, [email protected] wrote:
> > > > On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> > > > > On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
> > > > > > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > > > > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > > > > Looks like it's the fault of
> > > > > > >
> > > > > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > > > > >
> > > > > > I'm having trouble reproducing this bug. I've tried both cthon and
> > > > > > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > > > > > but only sec=sys). Is there anything else you're doing that I
> > > > > > might try?
> > > > > >
> > > > > > e.g. Are you running multiple workloads in parallel? Different
> > > > > > users?..
> > > > >
> > > > > Nothing that interesting. Currently it's connectathon over v4, v3,
> > > > > v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> > > > > serially one after the other. Then some pynfs tests (which bypass
> > > > > the client), then xfstests over v4.2/sys. And also a few one-off
> > > > > locking tests of my own that probably aren't a factor here.
> > > > >
> > > > > (Hah, I just realized I was mounting with vers=4 and assuming that
> > > > > meant 4.0, but actually it's changed over time depending on the
> > > > > defaults, so currently those "v4" runs are actually all 4.2. Gah.)
> > > >
> > > > Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
> > > > EXCHANGE_ID authentication? The client will attempt to use that by
> > > > default if rpc.gssd is running.
> > >
> > > Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p
> > > mounts are using krb5i for EXCHANGE_ID.
> > >
> > > > I ask because I think the issue might be with RPCSEC_GSS,
> > > > specifically with the RPCSEC_GSS context destroy code, hence the
> > > > 2 patches that I just sent out.
> > >
> > > Looks like my tests pass after applying those two patches.
> > >
> >
> > Cool! Thanks for testing.
> >
> > Chuck, do you think the above might also explain your sighting of the
> > same Oops?
>
> Could be, I don’t think I saw it until I started testing NFSv4.
> I won’t be able to confirm that until next week.
>

OK. Either way, I know that part of the GSS code needs to be fixed in
order to deal with the reference count being 0, so I think it is worth
merging this patch now, and then we can see if there is more to the
regression when you can get back to your test rig.
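
As a rough illustration of what "dealing with the reference count being 0"
generally means in cache lookup code, here is a minimal userspace sketch of
the "take a reference only if the count is still nonzero" pattern. It is not
the patch under discussion: struct cred_entry, cred_get_unless_zero() and
cred_put() are hypothetical names, and C11 atomics stand in for the kernel's
own primitives (the kernel has refcount_inc_not_zero() for the same idea).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cached credential; not struct rpc_cred. */
struct cred_entry {
	atomic_int refcount;	/* 0 means the entry is being torn down */
};

/* Take a reference only if the entry is still live (count > 0). */
static bool cred_get_unless_zero(struct cred_entry *c)
{
	int old = atomic_load(&c->refcount);

	while (old > 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&c->refcount, &old, old + 1))
			return true;
	}
	return false;	/* count already hit 0: caller must not use the entry */
}

/* Drop a reference; returns true if this was the last one. */
static bool cred_put(struct cred_entry *c)
{
	int old = atomic_fetch_sub(&c->refcount, 1);

	assert(old > 0);	/* dropping below zero would be a bug */
	return old == 1;
}
```

A lookup that finds an entry whose count has already reached 0 should treat
it as absent instead of handing it back to the caller.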

Thanks
Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2018-11-13 00:17:23

by Chuck Lever III

Subject: Re: NULL dereference in rpcauth_lookup_credcache


> On Nov 12, 2018, at 4:08 PM, Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2018-11-12 at 16:00 -0800, Chuck Lever wrote:
>>> On Nov 12, 2018, at 3:57 PM, Trond Myklebust <[email protected]> wrote:
>>>
>>>>> On Mon, 2018-11-12 at 18:01 -0500, [email protected] wrote:
>>>>> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
>>>>>> On Mon, 2018-11-12 at 13:24 -0500, [email protected] wrote:
>>>>>>> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
>>>>>>>> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>>>>>>>> Looks like it's the fault of
>>>>>>>>
>>>>>>>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
>>>>>>>
>>>>>>> I'm having trouble reproducing this bug. I've tried both cthon and
>>>>>>> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
>>>>>>> only sec=sys). Is there anything else you're doing that I might try?
>>>>>>>
>>>>>>> e.g. Are you running multiple workloads in parallel? Different
>>>>>>> users?..
>>>>>>
>>>>>> Nothing that interesting. Currently it's connectathon over v4, v3,
>>>>>> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
>>>>>> serially one after the other. Then some pynfs tests (which bypass the
>>>>>> client), then xfstests over v4.2/sys. And also a few one-off locking
>>>>>> tests of my own that probably aren't a factor here.
>>>>>>
>>>>>> (Hah, I just realized I was mounting with vers=4 and assuming that
>>>>>> meant 4.0, but actually it's changed over time depending on the
>>>>>> defaults, so currently those "v4" runs are actually all 4.2. Gah.)
>>>>>
>>>>> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
>>>>> EXCHANGE_ID authentication? The client will attempt to use that by
>>>>> default if rpc.gssd is running.
>>>>
>>>> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
>>>> are using krb5i for EXCHANGE_ID.
>>>>
>>>>> I ask because I think the issue might be with RPCSEC_GSS, specifically
>>>>> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
>>>>> just sent out.
>>>>
>>>> Looks like my tests pass after applying those two patches.
>>>>
>>>
>>> Cool! Thanks for testing.
>>>
>>> Chuck, do you think the above might also explain your sighting of the
>>> same Oops?
>>
>> Could be, I don’t think I saw it until I started testing NFSv4.
>> I won’t be able to confirm that until next week.
>>
>
> OK. Either way, I know that part of the GSS code needs to be fixed in
> order to deal with the reference count being 0, so I think it is worth
> merging this patch now, and then we can see if there is more to the
> regression when you can get back to your test rig.

Sounds fine to me.


> Thanks
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>