2022-01-14 21:24:59

by Jonathan Woithe

Subject: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

Hi all

Recently we migrated an NFS server from a 32-bit environment running
kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
unchanged between the two systems.

On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
under 5.15.12) the kernel has oopsed at around the time that an NFS client
machine is turned on for the day. On both occasions the call trace was
essentially identical. The full oops sequence is at the end of this email.
The oops was not observed when running the 4.14.128 kernel.

Is there anything more I can provide to help track down the cause of the
oops?

Regards
jonathan

Oops under 5.15.12:

Jan 14 08:48:30 nfssvr kernel: BUG: kernel NULL pointer dereference, address: 0000000000000110
Jan 14 08:48:30 nfssvr kernel: #PF: supervisor read access in kernel mode
Jan 14 08:48:30 nfssvr kernel: #PF: error_code(0x0000) - not-present page
Jan 14 08:48:30 nfssvr kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Jan 14 08:48:30 nfssvr kernel: CPU: 0 PID: 2935 Comm: lockd Not tainted 5.15.12 #1
Jan 14 08:48:30 nfssvr kernel: Hardware name: /DG31PR, BIOS PRG3110H.86A.0038.2007.1221.1757 12/21/2007
Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0
Jan 14 08:48:30 nfssvr kernel: Call Trace:
Jan 14 08:48:30 nfssvr kernel: <TASK>
Jan 14 08:48:30 nfssvr kernel: nlm_unlock_files+0x6e/0xb0
Jan 14 08:48:30 nfssvr kernel: ? __skb_recv_udp+0x198/0x330
Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock+0x13/0x2e
Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_traverse_blocks+0x36/0x120
Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
Jan 14 08:48:30 nfssvr kernel: nlm_traverse_files+0x152/0x280
Jan 14 08:48:30 nfssvr kernel: nlmsvc_free_host_resources+0x27/0x40
Jan 14 08:48:30 nfssvr kernel: nlm_host_rebooted+0x23/0x90
Jan 14 08:48:30 nfssvr kernel: nlmsvc_proc_sm_notify+0xae/0x110
Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_decode_reboot+0x8b/0xc0
Jan 14 08:48:30 nfssvr kernel: nlmsvc_dispatch+0x89/0x180
Jan 14 08:48:30 nfssvr kernel: svc_process_common+0x3ce/0x6f0
Jan 14 08:48:30 nfssvr kernel: ? lockd_inet6addr_event+0xf0/0xf0
Jan 14 08:48:30 nfssvr kernel: svc_process+0xb7/0xf0
Jan 14 08:48:30 nfssvr kernel: lockd+0xca/0x1b0
Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock_irqsave+0x19/0x40
Jan 14 08:48:30 nfssvr kernel: ? set_grace_period+0x90/0x90
Jan 14 08:48:30 nfssvr kernel: kthread+0x141/0x170
Jan 14 08:48:30 nfssvr kernel: ? set_kthread_struct+0x40/0x40
Jan 14 08:48:30 nfssvr kernel: ret_from_fork+0x22/0x30
Jan 14 08:48:30 nfssvr kernel: </TASK>
Jan 14 08:48:30 nfssvr kernel: Modules linked in: tun nf_nat_ftp nf_conntrack_ftp xt_REDIRECT xt_nat xt_conntrack xt_tcpudp xt_NFLOG nfnetlink_log nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables ipv6 hid_generic usbhid hi
Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110
Jan 14 08:48:30 nfssvr kernel: ---[ end trace f8f28acee6f24340 ]---
Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0


2022-01-14 22:11:07

by Chuck Lever III

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

Hi Jonathan-

> On Jan 14, 2022, at 5:39 AM, Jonathan Woithe <[email protected]> wrote:
>
> Hi all
>
> Recently we migrated an NFS server from a 32-bit environment running
> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> unchanged between the two systems.
>
> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> machine is turned on for the day. On both occasions the call trace was
> essentially identical. The full oops sequence is at the end of this email.
> The oops was not observed when running the 4.14.128 kernel.
>
> Is there anything more I can provide to help track down the cause of the
> oops?

A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each nlm_file"),
which was introduced in or around v5.15. You could try a simple test and back
the server down to v5.14.y to see if the problem persists.

Otherwise, Bruce, can you have a look at this?


> Regards
> jonathan
>
> Oops under 5.15.12:
>
> Jan 14 08:48:30 nfssvr kernel: BUG: kernel NULL pointer dereference, address: 0000000000000110
> Jan 14 08:48:30 nfssvr kernel: #PF: supervisor read access in kernel mode
> Jan 14 08:48:30 nfssvr kernel: #PF: error_code(0x0000) - not-present page
> Jan 14 08:48:30 nfssvr kernel: Oops: 0000 [#1] PREEMPT SMP PTI
> Jan 14 08:48:30 nfssvr kernel: CPU: 0 PID: 2935 Comm: lockd Not tainted 5.15.12 #1
> Jan 14 08:48:30 nfssvr kernel: Hardware name: /DG31PR, BIOS PRG3110H.86A.0038.2007.1221.1757 12/21/2007
> Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
> Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
> Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
> Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
> Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
> Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
> Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
> Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
> Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
> Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0
> Jan 14 08:48:30 nfssvr kernel: Call Trace:
> Jan 14 08:48:30 nfssvr kernel: <TASK>
> Jan 14 08:48:30 nfssvr kernel: nlm_unlock_files+0x6e/0xb0
> Jan 14 08:48:30 nfssvr kernel: ? __skb_recv_udp+0x198/0x330
> Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock+0x13/0x2e
> Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_traverse_blocks+0x36/0x120
> Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
> Jan 14 08:48:30 nfssvr kernel: nlm_traverse_files+0x152/0x280
> Jan 14 08:48:30 nfssvr kernel: nlmsvc_free_host_resources+0x27/0x40
> Jan 14 08:48:30 nfssvr kernel: nlm_host_rebooted+0x23/0x90
> Jan 14 08:48:30 nfssvr kernel: nlmsvc_proc_sm_notify+0xae/0x110
> Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_decode_reboot+0x8b/0xc0
> Jan 14 08:48:30 nfssvr kernel: nlmsvc_dispatch+0x89/0x180
> Jan 14 08:48:30 nfssvr kernel: svc_process_common+0x3ce/0x6f0
> Jan 14 08:48:30 nfssvr kernel: ? lockd_inet6addr_event+0xf0/0xf0
> Jan 14 08:48:30 nfssvr kernel: svc_process+0xb7/0xf0
> Jan 14 08:48:30 nfssvr kernel: lockd+0xca/0x1b0
> Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
> Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock_irqsave+0x19/0x40
> Jan 14 08:48:30 nfssvr kernel: ? set_grace_period+0x90/0x90
> Jan 14 08:48:30 nfssvr kernel: kthread+0x141/0x170
> Jan 14 08:48:30 nfssvr kernel: ? set_kthread_struct+0x40/0x40
> Jan 14 08:48:30 nfssvr kernel: ret_from_fork+0x22/0x30
> Jan 14 08:48:30 nfssvr kernel: </TASK>
> Jan 14 08:48:30 nfssvr kernel: Modules linked in: tun nf_nat_ftp nf_conntrack_ftp xt_REDIRECT xt_nat xt_conntrack xt_tcpudp xt_NFLOG nfnetlink_log nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables ipv6 hid_generic usbhid hi
> Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110
> Jan 14 08:48:30 nfssvr kernel: ---[ end trace f8f28acee6f24340 ]---
> Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
> Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
> Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
> Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
> Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
> Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
> Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
> Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
> Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
> Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0

--
Chuck Lever



2022-01-15 20:39:32

by Jonathan Woithe

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

Hi Chuck

Thanks for your response.

On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> > Recently we migrated an NFS server from a 32-bit environment running
> > kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> > unchanged between the two systems.
> >
> > On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> > under 5.15.12) the kernel has oopsed at around the time that an NFS client
> > machine is turned on for the day. On both occasions the call trace was
> > essentially identical. The full oops sequence is at the end of this email.
> > The oops was not observed when running the 4.14.128 kernel.
> >
> > Is there anything more I can provide to help track down the cause of the
> > oops?
>
> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> nlm_file"), which was introduced in or around v5.15. You could try a
> simple test and back the server down to v5.14.y to see if the problem
> persists.

I could do this, but only perhaps on Monday when I'm next on site. It may
take a while to get an answer though, since it seems we hit the fault only
around once every 2 weeks. Since it's a production server we are of course
limited in the things I can do.

I *may* be able to set up another system as an NFS server and hit that with
repeated mount requests. That could help reduce the time we have to wait
for an answer.

Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
many later patches depended on it.

> Otherwise, Bruce, can you have a look at this?

Regards
jonathan

2022-01-16 16:23:15

by Chuck Lever III

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount


> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
>
> Hi Chuck
>
> Thanks for your response.
>
> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>> Recently we migrated an NFS server from a 32-bit environment running
>>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
>>> unchanged between the two systems.
>>>
>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>> machine is turned on for the day. On both occasions the call trace was
>>> essentially identical. The full oops sequence is at the end of this email.
>>> The oops was not observed when running the 4.14.128 kernel.
>>>
>>> Is there anything more I can provide to help track down the cause of the
>>> oops?
>>
>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>> nlm_file"), which was introduced in or around v5.15. You could try a
>> simple test and back the server down to v5.14.y to see if the problem
>> persists.
>
> I could do this, but only perhaps on Monday when I'm next on site. It may
> take a while to get an answer though, since it seems we hit the fault only
> around once every 2 weeks. Since it's a production server we are of course
> limited in the things I can do.
>
> I *may* be able to set up another system as an NFS server and hit that with
> repeated mount requests. That could help reduce the time we have to wait
> for an answer.

Given the call trace you provided, I believe that the problem
is due to a client reboot, not a mount request. The trace shows the
crash occurs while your server is processing an SM_NOTIFY request from
one of your clients.


> Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
> many later patches depended on it.

You can try reverting 7f024fcd5c97, but as I recall there are some
subsequent changes that depend on that one.

--
Chuck Lever



2022-01-16 16:23:15

by Jonathan Woithe

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
> > On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
> > On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> >>> Recently we migrated an NFS server from a 32-bit environment running
> >>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> >>> unchanged between the two systems.
> >>>
> >>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> >>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> >>> machine is turned on for the day. On both occasions the call trace was
> >>> essentially identical. The full oops sequence is at the end of this email.
> >>> The oops was not observed when running the 4.14.128 kernel.
> >>>
> >>> Is there anything more I can provide to help track down the cause of the
> >>> oops?
> >>
> >> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> >> nlm_file"), which was introduced in or around v5.15. You could try a
> >> simple test and back the server down to v5.14.y to see if the problem
> >> persists.
> >
> > I could do this, but only perhaps on Monday when I'm next on site. It may
> > take a while to get an answer though, since it seems we hit the fault only
> > around once every 2 weeks. Since it's a production server we are of course
> > limited in the things I can do.
> >
> > I *may* be able to set up another system as an NFS server and hit that with
> > repeated mount requests. That could help reduce the time we have to wait
> > for an answer.
>
> Given the call trace you provided, I believe that the problem
> is due to a client reboot, not a mount request. The trace shows the
> crash occurs while your server is processing an SM_NOTIFY request from
> one of your clients.

FYI the situation we have is that most clients (being desktops) are shut
down (cleanly, including umount of all NFS volumes) at the end of each work
day and rebooted the following morning. I had assumed that it would be the
mount on reboot that was the potential trigger, but it seems there might be
something else going on during the boot process before the mount if I
understand you correctly.

If that's the case it will make it a little trickier to set up a test system
since I don't think I have a second machine I can repeatedly reboot. I'll
check.

The times of the two crashes we've seen so far do correspond to the time
when the clients tend to get turned on. Could the problem be dependent on
the client version in any way? We do have a little variation in this across
the network. I could check to see which client's boot triggered the oops if
that would be useful.

> > Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
> > many later patches depended on it.
>
> You can try reverting 7f024fcd5c97, but as I recall there are some
> subsequent changes that depend on that one.

Right, that's what I was wondering. Thus, it sounds like trying a 5.14.y
kernel is perhaps the best option at this stage.

Regards
jonathan

2022-01-17 11:06:41

by Chuck Lever III

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

Hi Jonathan-

> On Jan 16, 2022, at 5:06 PM, Jonathan Woithe <[email protected]> wrote:
>
> On Sun, Jan 16, 2022 at 07:53:36AM +1030, Jonathan Woithe wrote:
>> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
>>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>>>>> Recently we migrated an NFS server from a 32-bit environment running
>>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
>>>>>> unchanged between the two systems.
>>>>>>
>>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>>>>> machine is turned on for the day. On both occasions the call trace was
>>>>>> essentially identical. The full oops sequence is at the end of this email.
>>>>>> The oops was not observed when running the 4.14.128 kernel.
>>>>>>
>>>>>> Is there anything more I can provide to help track down the cause of the
>>>>>> oops?
>>>>>
>>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>>>>> nlm_file"), which was introduced in or around v5.15. You could try a
>>>>> simple test and back the server down to v5.14.y to see if the problem
>>>>> persists.
>
> FYI I have now put the kernel.org 5.14.21 kernel on the affected system and
> booted it. Since the oops has taken between 1 and 2 weeks to be triggered
> in the past, we may have to wait a few weeks to be certain of an outcome.
> If there's anything else you need from me in the interim please ask.

If you identify a particular client that triggers the issue, it would be
helpful to know:

- The client's kernel version
- What was running on the client before it was shut down
- Whether the application and client shut down was clean


--
Chuck Lever



2022-01-17 11:06:41

by Jonathan Woithe

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Sun, Jan 16, 2022 at 07:53:36AM +1030, Jonathan Woithe wrote:
> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
> > > On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
> > > On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> > >>> Recently we migrated an NFS server from a 32-bit environment running
> > >>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> > >>> unchanged between the two systems.
> > >>>
> > >>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> > >>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> > >>> machine is turned on for the day. On both occasions the call trace was
> > >>> essentially identical. The full oops sequence is at the end of this email.
> > >>> The oops was not observed when running the 4.14.128 kernel.
> > >>>
> > >>> Is there anything more I can provide to help track down the cause of the
> > >>> oops?
> > >>
> > >> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> > >> nlm_file"), which was introduced in or around v5.15. You could try a
> > >> simple test and back the server down to v5.14.y to see if the problem
> > >> persists.

FYI I have now put the kernel.org 5.14.21 kernel on the affected system and
booted it. Since the oops has taken between 1 and 2 weeks to be triggered
in the past, we may have to wait a few weeks to be certain of an outcome.
If there's anything else you need from me in the interim please ask.

Regards
jonathan

2022-01-17 16:02:18

by Jonathan Woithe

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

Hi Chuck

On Sun, Jan 16, 2022 at 10:30:43PM +0000, Chuck Lever III wrote:
> > On Jan 16, 2022, at 5:06 PM, Jonathan Woithe <[email protected]> wrote:
> >
> > On Sun, Jan 16, 2022 at 07:53:36AM +1030, Jonathan Woithe wrote:
> >> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
> >>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
> >>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> >>>>>> Recently we migrated an NFS server from a 32-bit environment running
> >>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> >>>>>> unchanged between the two systems.
> >>>>>>
> >>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> >>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> >>>>>> machine is turned on for the day. On both occasions the call trace was
> >>>>>> essentially identical. The full oops sequence is at the end of this email.
> >>>>>> The oops was not observed when running the 4.14.128 kernel.
> >>>>>>
> >>>>>> Is there anything more I can provide to help track down the cause of the
> >>>>>> oops?
> >>>>>
> >>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> >>>>> nlm_file"), which was introduced in or around v5.15. You could try a
> >>>>> simple test and back the server down to v5.14.y to see if the problem
> >>>>> persists.
> >
> > FYI I have now put the kernel.org 5.14.21 kernel on the affected system and
> > booted it. Since the oops has taken between 1 and 2 weeks to be triggered
> > in the past, we may have to wait a few weeks to be certain of an outcome.
> > If there's anything else you need from me in the interim please ask.
>
> If you identify a particular client that triggers the issue, it would be
> helpful to know:
>
> - The client's kernel version
> - What was running on the client before it was shut down
> - Whether the application and client shut down was clean

I have been able to identify the client involved. It was the same client
on both occasions. That client is running the 4.4.14 kernel.

Unfortunately I have no way to determine what was running on the client when
it was shut down. However, the logs do tell me that the client was NOT
cleanly shut down prior to both oopses being triggered on the server by
the next boot. These are the only times when the client wasn't shut down
cleanly; the client WAS shut down cleanly on every other day since 23
December (when the server was moved to the 5.15.x kernel). It is therefore
possible that the server oops was triggered only when the client was not
shut down cleanly.

I will ask the user if they remember anything happening differently on the
days of the server oops. My suspicion is that there wasn't anything, and
that the power to the bench which supplies the client was turned off
accidentally before shutting the computer down. We have a new staff member
who knows the correct procedure, but maybe they forgot on a couple of
occasions. If this is the case the PC is unlikely to have had much running
at the time of the shutdown. The xfce4 desktop is perhaps a given. Other
possibilities are firefox, thunderbird and libreoffice.

With the server running 5.14.21, I did a reset of the client (that is,
unclean shutdown) just before I left this evening. The server did not oops
when the client was rebooted a minute or so later. I will see if I can
repeat the test with 5.15.12 tomorrow morning before others get in if you
think that will be helpful in light of the above observations.

Regards
jonathan

2022-01-18 02:36:07

by J. Bruce Fields

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>
> > On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
> >
> > Hi Chuck
> >
> > Thanks for your response.
> >
> > On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> >>> Recently we migrated an NFS server from a 32-bit environment running
> >>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> >>> unchanged between the two systems.
> >>>
> >>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> >>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
> >>> machine is turned on for the day. On both occasions the call trace was
> >>> essentially identical. The full oops sequence is at the end of this email.
> >>> The oops was not observed when running the 4.14.128 kernel.
> >>>
> >>> Is there anything more I can provide to help track down the cause of the
> >>> oops?
> >>
> >> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> >> nlm_file"), which was introduced in or around v5.15. You could try a
> >> simple test and back the server down to v5.14.y to see if the problem
> >> persists.
> >
> > I could do this, but only perhaps on Monday when I'm next on site. It may
> > take a while to get an answer though, since it seems we hit the fault only
> > around once every 2 weeks. Since it's a production server we are of course
> > limited in the things I can do.
> >
> > I *may* be able to set up another system as an NFS server and hit that with
> > repeated mount requests. That could help reduce the time we have to wait
> > for an answer.
>
> Given the call trace you provided, I believe that the problem
> is due to a client reboot, not a mount request. The trace shows the
> crash occurs while your server is processing an SM_NOTIFY request from
> one of your clients.
>
>
> > Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
> > many later patches depended on it.
>
> You can try reverting 7f024fcd5c97, but as I recall there are some
> subsequent changes that depend on that one.

NLM locking on reexports would stop working. That's a new (and
imperfect) feature, so it's less important than avoiding this NULL
dereference, if push came to shove. But let's see if we can just fix
it.....

--b.

2022-01-18 02:36:17

by J. Bruce Fields

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
> Hi Jonathan-
>
> > On Jan 14, 2022, at 5:39 AM, Jonathan Woithe <[email protected]> wrote:
> >
> > Hi all
> >
> > Recently we migrated an NFS server from a 32-bit environment running
> > kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
> > unchanged between the two systems.
> >
> > On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
> > under 5.15.12) the kernel has oopsed at around the time that an NFS client
> > machine is turned on for the day. On both occasions the call trace was
> > essentially identical. The full oops sequence is at the end of this email.
> > The oops was not observed when running the 4.14.128 kernel.
> >
> > Is there anything more I can provide to help track down the cause of the
> > oops?
>
> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each nlm_file"),
> which was introduced in or around v5.15.

Almost definitely it, yeah.

We should really have nlm reboot tests. I test nlm, and v4 reboot, but
not nlm reboot....

> You could try a simple test and back
> the server down to v5.14.y to see if the problem persists.
>
> Otherwise, Bruce, can you have a look at this?

Yep, just catching up....

Given my lack of nlm reboot testing (sorry) I wouldn't be surprised if
it's reproducible with something really simple, like: take a lock, then
restart the client (so that it notifies the server). Could still be
rare in production if rebooting while holding a lock is rare.

--b.

>
>
> > Regards
> > jonathan
> >
> > Oops under 5.15.12:
> >
> > Jan 14 08:48:30 nfssvr kernel: BUG: kernel NULL pointer dereference, address: 0000000000000110
> > Jan 14 08:48:30 nfssvr kernel: #PF: supervisor read access in kernel mode
> > Jan 14 08:48:30 nfssvr kernel: #PF: error_code(0x0000) - not-present page
> > Jan 14 08:48:30 nfssvr kernel: Oops: 0000 [#1] PREEMPT SMP PTI
> > Jan 14 08:48:30 nfssvr kernel: CPU: 0 PID: 2935 Comm: lockd Not tainted 5.15.12 #1
> > Jan 14 08:48:30 nfssvr kernel: Hardware name: /DG31PR, BIOS PRG3110H.86A.0038.2007.1221.1757 12/21/2007
> > Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
> > Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
> > Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
> > Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
> > Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
> > Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
> > Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
> > Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
> > Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
> > Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0
> > Jan 14 08:48:30 nfssvr kernel: Call Trace:
> > Jan 14 08:48:30 nfssvr kernel: <TASK>
> > Jan 14 08:48:30 nfssvr kernel: nlm_unlock_files+0x6e/0xb0
> > Jan 14 08:48:30 nfssvr kernel: ? __skb_recv_udp+0x198/0x330
> > Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock+0x13/0x2e
> > Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_traverse_blocks+0x36/0x120
> > Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
> > Jan 14 08:48:30 nfssvr kernel: nlm_traverse_files+0x152/0x280
> > Jan 14 08:48:30 nfssvr kernel: nlmsvc_free_host_resources+0x27/0x40
> > Jan 14 08:48:30 nfssvr kernel: nlm_host_rebooted+0x23/0x90
> > Jan 14 08:48:30 nfssvr kernel: nlmsvc_proc_sm_notify+0xae/0x110
> > Jan 14 08:48:30 nfssvr kernel: ? nlmsvc_decode_reboot+0x8b/0xc0
> > Jan 14 08:48:30 nfssvr kernel: nlmsvc_dispatch+0x89/0x180
> > Jan 14 08:48:30 nfssvr kernel: svc_process_common+0x3ce/0x6f0
> > Jan 14 08:48:30 nfssvr kernel: ? lockd_inet6addr_event+0xf0/0xf0
> > Jan 14 08:48:30 nfssvr kernel: svc_process+0xb7/0xf0
> > Jan 14 08:48:30 nfssvr kernel: lockd+0xca/0x1b0
> > Jan 14 08:48:30 nfssvr kernel: ? preempt_count_add+0x68/0xa0
> > Jan 14 08:48:30 nfssvr kernel: ? _raw_spin_lock_irqsave+0x19/0x40
> > Jan 14 08:48:30 nfssvr kernel: ? set_grace_period+0x90/0x90
> > Jan 14 08:48:30 nfssvr kernel: kthread+0x141/0x170
> > Jan 14 08:48:30 nfssvr kernel: ? set_kthread_struct+0x40/0x40
> > Jan 14 08:48:30 nfssvr kernel: ret_from_fork+0x22/0x30
> > Jan 14 08:48:30 nfssvr kernel: </TASK>
> > Jan 14 08:48:30 nfssvr kernel: Modules linked in: tun nf_nat_ftp nf_conntrack_ftp xt_REDIRECT xt_nat xt_conntrack xt_tcpudp xt_NFLOG nfnetlink_log nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables ipv6 hid_generic usbhid hi
> > Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110
> > Jan 14 08:48:30 nfssvr kernel: ---[ end trace f8f28acee6f24340 ]---
> > Jan 14 08:48:30 nfssvr kernel: RIP: 0010:vfs_lock_file+0x5/0x30
> > Jan 14 08:48:30 nfssvr kernel: Code: ff ff 41 89 c4 85 c0 0f 84 42 ff ff ff e9 f8 fe ff ff 0f 0b e8 2c bc d2 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 f3 dc
> > Jan 14 08:48:30 nfssvr kernel: RSP: 0018:ffffa478401a3c38 EFLAGS: 00010246
> > Jan 14 08:48:30 nfssvr kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
> > Jan 14 08:48:30 nfssvr kernel: RDX: ffffa478401a3c40 RSI: 0000000000000006 RDI: 00000000000000e8
> > Jan 14 08:48:30 nfssvr kernel: RBP: ffff946ead1ecc00 R08: ffff946f88ab1000 R09: ffff946f88b33a00
> > Jan 14 08:48:30 nfssvr kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa657ff30
> > Jan 14 08:48:30 nfssvr kernel: R13: ffff946e99df7c40 R14: ffff946e82fb0510 R15: ffff946ead1ecc00
> > Jan 14 08:48:30 nfssvr kernel: FS: 0000000000000000(0000) GS:ffff946fabc00000(0000) knlGS:0000000000000000
> > Jan 14 08:48:30 nfssvr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jan 14 08:48:30 nfssvr kernel: CR2: 0000000000000110 CR3: 000000010083a000 CR4: 00000000000006f0
>
> --
> Chuck Lever
>
>

2022-01-18 02:56:26

by Chuck Lever III

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount



> On Jan 17, 2022, at 10:50 AM, Bruce Fields <[email protected]> wrote:
>
> On Sat, Jan 15, 2022 at 07:46:06PM +0000, Chuck Lever III wrote:
>>
>>> On Jan 15, 2022, at 3:14 AM, Jonathan Woithe <[email protected]> wrote:
>>>
>>> Hi Chuck
>>>
>>> Thanks for your response.
>>>
>>> On Fri, Jan 14, 2022 at 03:18:01PM +0000, Chuck Lever III wrote:
>>>>> Recently we migrated an NFS server from a 32-bit environment running
>>>>> kernel 4.14.128 to a 64-bit 5.15.x kernel. The NFS configuration remained
>>>>> unchanged between the two systems.
>>>>>
>>>>> On two separate occasions since the upgrade (5 Jan under 5.15.10, 14 Jan
>>>>> under 5.15.12) the kernel has oopsed at around the time that an NFS client
>>>>> machine is turned on for the day. On both occasions the call trace was
>>>>> essentially identical. The full oops sequence is at the end of this email.
>>>>> The oops was not observed when running the 4.14.128 kernel.
>>>>>
>>>>> Is there anything more I can provide to help track down the cause of the
>>>>> oops?
>>>>
>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
>>>> nlm_file"), which was introduced in or around v5.15. You could try a
>>>> simple test and back the server down to v5.14.y to see if the problem
>>>> persists.
>>>
>>> I could do this, but only perhaps on Monday when I'm next on site. It may
>>> take a while to get an answer though, since it seems we hit the fault only
>>> around once every 2 weeks. Since it's a production server we are of course
>>> limited in the things I can do.
>>>
>>> I *may* be able to set up another system as an NFS server and hit that with
>>> repeated mount requests. That could help reduce the time we have to wait
>>> for an answer.
>>
>> Given the call trace you provided, I believe that the problem
>> is due to a client reboot, not a mount request. The trace shows the
>> crash occurs while your server is processing an SM_NOTIFY request from
>> one of your clients.
>>
>>
>>> Is it worth considering a revert of 7f024fcd5c97? I guess it depends on how
>>> many later patches depended on it.
>>
>> You can try reverting 7f024fcd5c97, but as I recall there are some
>> subsequent changes that depend on that one.
>
> NLM locking on reexports would stop working. Which is a new (and
> imperfect) feature, so less important than avoiding this NULL
> dereference, if push came to shove. But, let's see if we can just fix
> it.....

Agreed. I was suggesting reverting only as an experiment.

--
Chuck Lever



2022-01-18 03:03:20

by Jonathan Woithe

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Mon, Jan 17, 2022 at 06:14:30PM +1030, Jonathan Woithe wrote:
> > >>>>> A possible culprit is 7f024fcd5c97 ("Keep read and write fds with each
> > >>>>> nlm_file"), which was introduced in or around v5.15. You could try a
> > >>>>> simple test and back the server down to v5.14.y to see if the problem
> > >>>>> persists.
> > >
> > > FYI I have now put the kernel.org 5.14.21 kernel on the affected system and
> > > booted it. Since the oops has taken between 1 and 2 weeks to be triggered
> > > in the past, we may have to wait a few weeks to be certain of an outcome.
> > > If there's anything else you need from me in the interim please ask.
> >
> > If you identify a particular client that triggers the issue, it would be
> > helpful to know:
> >
> > - The client's kernel version
> > - What was running on the client before it was shut down
> > - Whether the application and client shut down was clean
>
> I have been able to identify the client involved. It was the same client
> on both occasions. That client is running the 4.4.14 kernel.
> :
> I will ask the user if they remember anything happening differently on the
> days of the server oops.

I have asked the user, and certainly in the case of the most recent oops the
previous day's usage (that is, the day of the unclean shutdown, the day
before the boot which triggered the server oops) was nothing out of the
ordinary. Firefox, thunderbird and libreoffice were the only applications
used, with the desktop file browser also getting an outing. The desktop is
xfce4. These programs would have been used variously over the course of the
day (roughly 7.5 hours on this particular date).

> With the server running 5.14.21, I did a reset of the client (that is,
> unclean shutdown) just before I left this evening. The server did not oops
> when the client was rebooted a minute or so later. I will see if I can
> repeat the test with 5.15.12 tomorrow morning before others get in if you
> think that will be helpful in light of the above observations.

I did this test this morning before others came in. The server (with
5.15.12 running) did not oops. However, with the recent mention of locking
this may not be surprising since no NFS locking had been attempted on the
client during the test (mainly because I had no easy way to elicit a lock).
I merely booted the client, reset it and let it boot again.

During the course of the day the client will run firefox, thunderbird and
libreoffice, all of which probably involve locking of various descriptions.
Thus a test without locking is perhaps not perfect.

I am happy to run further tests if it will help. Let me know if I can do
anything else.

Regards
jonathan

2022-01-18 03:03:23

by J. Bruce Fields

Subject: Re: [Bug report] Recurring oops, 5.15.x, possibly during or soon after client mount

On Tue, Jan 18, 2022 at 08:38:52AM +1030, Jonathan Woithe wrote:
> I am happy to run further tests if it will help. Let me know if I can do
> anything else.

I think we're good. Thanks for the report. I can reproduce just by
doing a v3 mount, getting a lock, then power-cycling the client and
waiting for it to come back up; testing a fix now.
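
For reference, a minimal sketch of a client-side lock holder (the path
and names below are placeholders, not anything from this thread): it
takes a whole-file POSIX lock on a file on the v3 mount with fcntl(),
which goes through NLM on NFSv3, and holds it until the machine is
power-cycled.

	/* hold_lock.c - take and hold a POSIX lock on a file that
	 * lives on the NFSv3 mount. The path is a placeholder. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		struct flock fl = {
			.l_type   = F_WRLCK,	/* exclusive lock */
			.l_whence = SEEK_SET,
			.l_start  = 0,
			.l_len    = 0,		/* 0 = whole file */
		};
		int fd = open("/mnt/nfs/lockfile", O_RDWR | O_CREAT, 0644);

		if (fd < 0 || fcntl(fd, F_SETLKW, &fl) < 0) {
			perror("hold_lock");
			return 1;
		}
		pause();	/* hold the lock until the client is reset */
		return 0;
	}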

--b.

2022-01-21 09:35:31

by J. Bruce Fields

Subject: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

From: "J. Bruce Fields" <[email protected]>

I thought I was iterating over the array when actually the iteration is
over the values contained in the array?

Ugh, keep it simple.

Symptoms were a null dereference in vfs_lock_file() when an NFSv3 client
that previously held a lock came back up and sent a notify.

Reported-by: Jonathan Woithe <[email protected]>
Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/lockd/svcsubs.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index cb3a7512c33e..54c2e42130ca 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -179,19 +179,20 @@ nlm_delete_file(struct nlm_file *file)
static int nlm_unlock_files(struct nlm_file *file)
{
struct file_lock lock;
- struct file *f;

lock.fl_type = F_UNLCK;
lock.fl_start = 0;
lock.fl_end = OFFSET_MAX;
- for (f = file->f_file[0]; f <= file->f_file[1]; f++) {
- if (f && vfs_lock_file(f, F_SETLK, &lock, NULL) < 0) {
- pr_warn("lockd: unlock failure in %s:%d\n",
- __FILE__, __LINE__);
- return 1;
- }
- }
+ if (file->f_file[O_RDONLY] &&
+ vfs_lock_file(file->f_file[O_RDONLY], F_SETLK, &lock, NULL))
+ goto out_err;
+ if (file->f_file[O_WRONLY] &&
+ vfs_lock_file(file->f_file[O_WRONLY], F_SETLK, &lock, NULL))
+ goto out_err;
return 0;
+out_err:
+ pr_warn("lockd: unlock failure in %s:%d\n", __FILE__, __LINE__);
+ return 1;
}

/*
--
2.34.1
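
To spell out the bug the patch fixes: the removed loop

	for (f = file->f_file[0]; f <= file->f_file[1]; f++)

initialized f to the pointer *value* stored in the first array slot and
then stepped through successive pointer values up to the value stored in
the second slot, rather than visiting the two slots themselves. With
f_file[0] == NULL, a single f++ advances f by sizeof(struct file), which
plausibly matches the small garbage pointer in the oops above (RDI:
00000000000000e8) and the fault 0x28 bytes beyond it (CR2:
0000000000000110). The fixed code indexes the slots directly; O_RDONLY
and O_WRONLY are 0 and 1, so they name the read and write entries. A
standalone userspace sketch of the pitfall (illustrative names only, not
kernel code):

	#include <stdio.h>

	static void visit(const char *p)
	{
		if (p)
			printf("visiting %s\n", p);
	}

	int main(void)
	{
		const char *slot[2] = { "read file", "write file" };

		/* Broken pattern: would walk raw pointer values from
		 * slot[0] up to slot[1], touching whatever happens to
		 * lie between them (left commented out on purpose):
		 *
		 *	for (const char *p = slot[0]; p <= slot[1]; p++)
		 *		visit(p);
		 */

		/* Intended pattern: walk the array by index. */
		for (int i = 0; i < 2; i++)
			visit(slot[i]);
		return 0;
	}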

2022-01-21 10:11:30

by Jonathan Woithe

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

On Tue, Jan 18, 2022 at 05:00:16PM -0500, Bruce Fields wrote:
> From: "J. Bruce Fields" <[email protected]>
>
> I thought I was iterating over the array when actually the iteration is
> over the values contained in the array?
> :

Would you like me to apply this against a 5.15.x kernel and test locally?
Or should I just wait for a 5.15.x stable series update which includes it?

Regards
jonathan

2022-01-21 10:12:34

by J. Bruce Fields

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

On Wed, Jan 19, 2022 at 08:50:50AM +1030, Jonathan Woithe wrote:
> On Tue, Jan 18, 2022 at 05:00:16PM -0500, Bruce Fields wrote:
> > From: "J. Bruce Fields" <[email protected]>
> >
> > I thought I was iterating over the array when actually the iteration is
> > over the values contained in the array?
> > :
>
> Would you like me to apply this against a 5.15.x kernel and test locally?
> Or should I just wait for a 5.15.x stable series update which includes it?

I'm pretty confident I'm reproducing the same problem you saw, so it'd
be fine to just wait for an update.

(But if you do test these patches, let us know, one more confirmation
never hurts.)

--b.

2022-01-21 19:47:50

by Chuck Lever III

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock



> On Jan 18, 2022, at 5:00 PM, Bruce Fields <[email protected]> wrote:
>
> From: "J. Bruce Fields" <[email protected]>
>
> I thought I was iterating over the array when actually the iteration is
> over the values contained in the array?
>
> Ugh, keep it simple.
>
> Symptoms were a null dereference in vfs_lock_file() when an NFSv3 client
> that previously held a lock came back up and sent a notify.
>
> Reported-by: Jonathan Woithe <[email protected]>
> Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> fs/lockd/svcsubs.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index cb3a7512c33e..54c2e42130ca 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -179,19 +179,20 @@ nlm_delete_file(struct nlm_file *file)
> static int nlm_unlock_files(struct nlm_file *file)
> {
> struct file_lock lock;
> - struct file *f;
>
> lock.fl_type = F_UNLCK;
> lock.fl_start = 0;
> lock.fl_end = OFFSET_MAX;
> - for (f = file->f_file[0]; f <= file->f_file[1]; f++) {
> - if (f && vfs_lock_file(f, F_SETLK, &lock, NULL) < 0) {
> - pr_warn("lockd: unlock failure in %s:%d\n",
> - __FILE__, __LINE__);
> - return 1;
> - }
> - }
> + if (file->f_file[O_RDONLY] &&
> + vfs_lock_file(file->f_file[O_RDONLY], F_SETLK, &lock, NULL))
> + goto out_err;
> + if (file->f_file[O_WRONLY] &&
> + vfs_lock_file(file->f_file[O_WRONLY], F_SETLK, &lock, NULL))
> + goto out_err;
> return 0;
> +out_err:
> + pr_warn("lockd: unlock failure in %s:%d\n", __FILE__, __LINE__);
> + return 1;
> }
>
> /*
> --
> 2.34.1
>

Hi Bruce, thanks for the fixes. They've passed my basic smoke tests.
I've applied both patches for the very next nfsd PR. See:

git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git for-next

--
Chuck Lever



2022-02-01 20:51:56

by Jonathan Woithe

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

On Wed, Jan 19, 2022 at 04:18:10PM +0000, Chuck Lever III wrote:
> > On Jan 18, 2022, at 5:00 PM, Bruce Fields <[email protected]> wrote:
> >
> > From: "J. Bruce Fields" <[email protected]>
> >
> > I thought I was iterating over the array when actually the iteration is
> > over the values contained in the array?
> >
> > Ugh, keep it simple.
> >
> > Symptoms were a null dereference in vfs_lock_file() when an NFSv3 client
> > that previously held a lock came back up and sent a notify.
> >
> > Reported-by: Jonathan Woithe <[email protected]>
> > Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > :
> Hi Bruce, thanks for the fixes. They've passed my basic smoke tests.
> I've applied both patches for the very next nfsd PR. See:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git for-next

Out of interest, is it expected that these fixes will appear in a 5.15.x
stable branch release at some point? I've looked at the 5.15.17 and 5.15.18
they don't appear to be mentioned yet.

Regards
jonathan

2022-02-02 02:45:16

by Chuck Lever III

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock



> On Jan 31, 2022, at 5:20 PM, Jonathan Woithe <[email protected]> wrote:
>
> On Wed, Jan 19, 2022 at 04:18:10PM +0000, Chuck Lever III wrote:
>>> On Jan 18, 2022, at 5:00 PM, Bruce Fields <[email protected]> wrote:
>>>
>>> From: "J. Bruce Fields" <[email protected]>
>>>
>>> I thought I was iterating over the array when actually the iteration is
>>> over the values contained in the array?
>>>
>>> Ugh, keep it simple.
>>>
>>> Symptoms were a null dereference in vfs_lock_file() when an NFSv3 client
>>> that previously held a lock came back up and sent a notify.
>>>
>>> Reported-by: Jonathan Woithe <[email protected]>
>>> Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
>>> Signed-off-by: J. Bruce Fields <[email protected]>
>>> ---
>>> :
>> Hi Bruce, thanks for the fixes. They've passed my basic smoke tests.
>> I've applied both patches for the very next nfsd PR. See:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git for-next
>
> OOI, is it expected that these fixes will appear in a 5.15.x stable branch
> patch at some point? I've looked at the 5.15.17 and 5.15.18 changelogs and
> they don't appear to be mentioned yet.

These are in linux-next right now. I intend to send a pull request this
week. I was waiting for some other fixes, but those are not going as
quickly as I hoped, so I'm going to send these two along with another
fix, all of which are ready now.

Once they are in v5.17-rc, they will be picked up automatically and
applied to open stable branches. If they do not apply cleanly, then
someone will have to construct and test a version of the fixes
specifically for each of the stable branches.

--
Chuck Lever



2022-03-24 14:32:58

by Jonathan Woithe

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

On Tue, Jan 18, 2022 at 05:27:03PM -0500, Bruce Fields wrote:
> On Wed, Jan 19, 2022 at 08:50:50AM +1030, Jonathan Woithe wrote:
> > On Tue, Jan 18, 2022 at 05:00:16PM -0500, Bruce Fields wrote:
> > > From: "J. Bruce Fields" <[email protected]>
> > >
> > > I thought I was iterating over the array when actually the iteration is
> > > over the values contained in the array?
> > > :
> >
> > Would you like me to apply this against a 5.15.x kernel and test locally?
> > Or should I just wait for a 5.15.x stable series update which includes it?
>
> I'm pretty confident I'm reproducing the same problem you saw, so it'd
> be fine to just wait for an update.
>
> (But if you do test these patches, let us know, one more confirmation
> never hurts.)

The shift back to a 5.15.x kernel ended up being delayed for a while for
various reasons. The server concerned was eventually upgraded to 5.15.27 on
9 March 2022. In that time, client machines have been turned on and off and
inevitably the conditions which caused the crash have been exercised many
times (libreoffice, firefox and thunderbird are used daily on almost all of
the clients). The server has not experienced the crash since the upgrade.
On the basis of this I think it's fair to consider our problem solved.

Thanks for your quick response to the report and the rapid provision of the
fix.

Regards
jonathan

2022-03-25 11:06:04

by J. Bruce Fields

Subject: Re: [PATCH 1/2] lockd: fix server crash on reboot of client holding lock

On Thu, Mar 24, 2022 at 10:03:23AM +1030, Jonathan Woithe wrote:
> On Tue, Jan 18, 2022 at 05:27:03PM -0500, Bruce Fields wrote:
> > On Wed, Jan 19, 2022 at 08:50:50AM +1030, Jonathan Woithe wrote:
> > > On Tue, Jan 18, 2022 at 05:00:16PM -0500, Bruce Fields wrote:
> > > > From: "J. Bruce Fields" <[email protected]>
> > > >
> > > > I thought I was iterating over the array when actually the iteration is
> > > > over the values contained in the array?
> > > > :
> > >
> > > Would you like me to apply this against a 5.15.x kernel and test locally?
> > > Or should I just wait for a 5.15.x stable series update which includes it?
> >
> > I'm pretty confident I'm reproducing the same problem you saw, so it'd
> > be fine to just wait for an update.
> >
> > (But if you do test these patches, let us know, one more confirmation
> > never hurts.)
>
> The shift back to a 5.15.x kernel ended up being delayed for a while for
> various reasons. The server concerned was eventually upgraded to 5.15.27 on
> 9 March 2022. In that time, client machines have been turned on and off and
> inevitably the conditions which caused the crash have been exercised many
> times (libreoffice, firefox and thunderbird are used daily on almost all of
> the clients). The server has not experienced the crash since the upgrade.
> On the basis of this I think it's fair to consider our problem solved.

Thanks for the confirmation. --b.