2020-06-07 15:34:52

by Hans-Peter Jansen

Subject: general protection fault, probably for non-canonical address in nfsd

Hi,

after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
crashes of nfsd here:

2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:303 for /work (/work)
2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:304 for /work/vmware (/work)
2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:305 for /work/vSphere (/work)
2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general protection fault, probably for non-canonical address 0xb9159d506ba40000: 0000 [#1] SMP PTI
2020-06-07T01:32:43.606284+02:00 server kernel: [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware name: System manufacturer System Product Name/P7F-E, BIOS 0906 09/20/2010
2020-06-07T01:32:43.606287+02:00 server kernel: [51901.089247] RIP: 0010:cgroup_sk_free+0x26/0x80
2020-06-07T01:32:43.606288+02:00 server kernel: [51901.089257] Code: 00 00 00 00 66 66 66 66 90 53 48 8b 07 48 c7 c3 30 72 07 b6 a8 01 75 07 48 85 c0 48 0f 45 d8 48 8b 83 18 09 00 00 a8 03 75 1a <65> 48 ff 08 f6 43 7c 01 74 02 5b c3 48 8b 43 18 a8 03 75 26 65 48
2020-06-07T01:32:43.606290+02:00 server kernel: [51901.089276] RSP: 0018:ffffb248c21e7e10 EFLAGS: 00010246
2020-06-07T01:32:43.606291+02:00 server kernel: [51901.089280] RAX: b91603a504000000 RBX: ffff99ab141a0000 RCX: 0000000000000021
2020-06-07T01:32:43.606292+02:00 server kernel: [51901.089284] RDX: ffffffffb6135ec4 RSI: 0000000000010080 RDI: ffff99a7159c1490
2020-06-07T01:32:43.606293+02:00 server kernel: [51901.089287] RBP: ffff99a7159c1200 R08: ffff99ab67a60c60 R09: 000000000002eb00
2020-06-07T01:32:43.606294+02:00 server kernel: [51901.089291] R10: ffffb248c0087dc0 R11: 00000000000000c6 R12: 0000000000000000
2020-06-07T01:32:43.606295+02:00 server kernel: [51901.089294] R13: 0000000000000103 R14: ffff99aae4934238 R15: ffff99ab31902000
2020-06-07T01:32:43.606296+02:00 server kernel: [51901.089299] FS: 0000000000000000(0000) GS:ffff99ab67a40000(0000) knlGS:0000000000000000
2020-06-07T01:32:43.606297+02:00 server kernel: [51901.089303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-06-07T01:32:43.606303+02:00 server kernel: [51901.089305] CR2: 00000000008e0000 CR3: 00000004df60a000 CR4: 00000000000026e0
2020-06-07T01:32:43.606304+02:00 server kernel: [51901.089307] Call Trace:
2020-06-07T01:32:43.606305+02:00 server kernel: [51901.089315] __sk_destruct+0x10d/0x1d0
2020-06-07T01:32:43.606306+02:00 server kernel: [51901.089319] inet_release+0x34/0x60
2020-06-07T01:32:43.606307+02:00 server kernel: [51901.089325] __sock_release+0x81/0xb0
2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089358] svc_sock_free+0x38/0x60 [sunrpc]
2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089374] svc_xprt_put+0x99/0xe0 [sunrpc]
2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089389] svc_recv+0x9c0/0xa40 [sunrpc]
2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089410] ? nfsd_destroy+0x60/0x60 [nfsd]
2020-06-07T01:32:43.606311+02:00 server kernel: [51901.089417] nfsd+0xd1/0x150 [nfsd]
2020-06-07T01:32:43.606312+02:00 server kernel: [51901.089420] kthread+0x10d/0x130
2020-06-07T01:32:43.606313+02:00 server kernel: [51901.089423] ? kthread_park+0x90/0x90
2020-06-07T01:32:43.606314+02:00 server kernel: [51901.089426] ret_from_fork+0x35/0x40

A vSphere 5.5 host accesses this Linux server with NFS v3 for backup
purposes (a Veeam backup server wants to store a new backup here).

The kernel is tainted due to vboxdrv. The OS is openSUSE Leap 15.1,
with the kernel and VirtualBox replaced with up-to-date versions from
proper rpm packages (built on that very vSphere host in an OBS server
VM).

I used to be subscribed to this ML, but that subscription was lost
on 04/09, so I cannot reply properly to the general protection fault
thread started on 05/12 by syzbot, with Bruce looking into it.

It seems somewhat related.

Interestingly, we're using a couple of NFS v4 mounts for subsets of
home here, and mount /work and other shares from various
Tumbleweed systems with NFS v4 here without any undesired effects.

Since the kernel upgrade, the crash happens every time this Veeam
setup triggers these v3 mounts. I've disabled this backup target for now
until the problem is resolved, because the crash effectively prevents
further NFS access to this server and blocks our desktops until the
server is rebooted.

A cursory look at the 5.6.{15,16} changelogs suggests that this
issue is still unresolved.

Let me know if I can provide any further information.

Thanks,
Pete
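
For readers following the oops above: "non-canonical" refers to the x86-64 rule that the unused upper bits of a virtual address must all equal the highest implemented bit. A minimal sketch of that check follows, assuming 4-level paging with 48 virtual address bits (an assumption about this machine, not something the log states). The helper name is_canonical is illustrative only and is not a kernel API.

```python
# Sketch: check whether a 64-bit value is a canonical x86-64 address.
# Assumption: 4-level paging, i.e. 48 implemented virtual address bits.
# is_canonical is a hypothetical helper for illustration, not a kernel API.

def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) are all zeros or all ones."""
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper in (0, all_ones)

FAULT_ADDR = 0xb9159d506ba40000  # from the "general protection fault" line
RBX = 0xffff99ab141a0000         # a register value from the same dump

print(is_canonical(FAULT_ADDR))
print(is_canonical(RBX))
```

Under that assumption the reported fault address fails the check while RBX passes it, which matches the kernel's "probably for non-canonical address" wording.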



2020-06-07 16:10:59

by Anthony Joseph Messina

Subject: Re: general protection fault, probably for non-canonical address in nfsd

On Sunday, June 7, 2020 10:32:44 AM CDT Hans-Peter Jansen wrote:
> Hi,
>
> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
> crashes of nfsd here:
>
> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated
> mount request from 192.168.3.16:303 for /work (/work)
> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated
> mount request from 192.168.3.16:304 for /work/vmware (/work)
> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated
> mount request from 192.168.3.16:305 for /work/vSphere (/work)
> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general
> protection fault, probably for non-canonical address 0xb9159d506ba40000:
> 0000 [#1] SMP PTI 2020-06-07T01:32:43.606284+02:00 server kernel:
> [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O
> 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware
> name: System manufacturer System Product Name/P7F-E, BIOS 0906
> 09/20/2010 2020-06-07T01:32:43.606287+02:00 server kernel: [51901.089247]
> RIP: 0010:cgroup_sk_free+0x26/0x80 2020-06-07T01:32:43.606288+02:00 server
> kernel: [51901.089257] Code: 00 00 00 00 66 66 66 66 90 53 48 8b 07 48 c7
> c3 30 72 07 b6 a8 01 75 07 48 85 c0 48 0f 45 d8 48 8b 83 18 09 00 00 a8 03
> 75 1a <65> 48 ff 08 f6 43 7c 01 74 02 5b c3 48 8b 43 18 a8 03 75 26 65 48
> 2020-06-07T01:32:43.606290+02:00 server kernel: [51901.089276] RSP:
> 0018:ffffb248c21e7e10 EFLAGS: 00010246 2020-06-07T01:32:43.606291+02:00
> server kernel: [51901.089280] RAX: b91603a504000000 RBX: ffff99ab141a0000
> RCX: 0000000000000021 2020-06-07T01:32:43.606292+02:00 server kernel:
> [51901.089284] RDX: ffffffffb6135ec4 RSI: 0000000000010080 RDI:
> ffff99a7159c1490 2020-06-07T01:32:43.606293+02:00 server kernel:
> [51901.089287] RBP: ffff99a7159c1200 R08: ffff99ab67a60c60 R09:
> 000000000002eb00 2020-06-07T01:32:43.606294+02:00 server kernel:
> [51901.089291] R10: ffffb248c0087dc0 R11: 00000000000000c6 R12:
> 0000000000000000 2020-06-07T01:32:43.606295+02:00 server kernel:
> [51901.089294] R13: 0000000000000103 R14: ffff99aae4934238 R15:
> ffff99ab31902000 2020-06-07T01:32:43.606296+02:00 server kernel:
> [51901.089299] FS: 0000000000000000(0000) GS:ffff99ab67a40000(0000)
> knlGS:0000000000000000 2020-06-07T01:32:43.606297+02:00 server kernel:
> [51901.089303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2020-06-07T01:32:43.606303+02:00 server kernel: [51901.089305] CR2:
> 00000000008e0000 CR3: 00000004df60a000 CR4: 00000000000026e0
> 2020-06-07T01:32:43.606304+02:00 server kernel: [51901.089307] Call Trace:
> 2020-06-07T01:32:43.606305+02:00 server kernel: [51901.089315]
> __sk_destruct+0x10d/0x1d0 2020-06-07T01:32:43.606306+02:00 server kernel:
> [51901.089319] inet_release+0x34/0x60 2020-06-07T01:32:43.606307+02:00
> server kernel: [51901.089325] __sock_release+0x81/0xb0
> 2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089358]
> svc_sock_free+0x38/0x60 [sunrpc] 2020-06-07T01:32:43.606308+02:00 server
> kernel: [51901.089374] svc_xprt_put+0x99/0xe0 [sunrpc]
> 2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089389]
> svc_recv+0x9c0/0xa40 [sunrpc] 2020-06-07T01:32:43.606310+02:00 server
> kernel: [51901.089410] ? nfsd_destroy+0x60/0x60 [nfsd]
> 2020-06-07T01:32:43.606311+02:00 server kernel: [51901.089417]
> nfsd+0xd1/0x150 [nfsd] 2020-06-07T01:32:43.606312+02:00 server kernel:
> [51901.089420] kthread+0x10d/0x130 2020-06-07T01:32:43.606313+02:00 server
> kernel: [51901.089423] ? kthread_park+0x90/0x90
> 2020-06-07T01:32:43.606314+02:00 server kernel: [51901.089426]
> ret_from_fork+0x35/0x40
>
> A vSphere 5.5 host accesses this linux server with nfs v3 for backup
> purposes (a Veeam backup server want to store a new backup here).
>
> The kernel is tainted due to vboxdrv. The OS is openSUSE Leap 15.1,
> with the kernel and Virtualbox replaced with uptodate versions from
> proper rpm packages (built on that very vSphere host in a OBS server
> VM..).
>
> I used to be subscribed to this ML, but that subscription has been
> lost 04/09, thus I cannot reply properly to the general prot. fault
> thread, started 05/12 from syzbot with Bruce looking into it.
>
> It seems somewhat related.
>
> Interestingly, we're using a couple of NFS v4 mounts for subsets of
> home here, and mount /work and other shares from various
> Tumbleweed systems with NFS v4 here without any undesired effects.
>
> Since the kernel upgrade, every time, this Veeam thing triggers these
> v3 mounts, the crash happens. I've disabled this backup target for now
> until the problem is resolved, because it effectively prevents further
> nfs accesses to this server, and blocks our desktops until the server
> is rebooted.
>
> A cursory look into 5.6.{15,16} changelogs seems to imply, that this
> issue is still pending.
>
> Let me know, if I can provide any further info's.
>
> Thanks,
> Pete

I see similar issues in Fedora kernels 5.6.14 through 5.6.16
https://bugzilla.redhat.com/show_bug.cgi?id=1839287

On the client I mount /home with sec=krb5p, and /mnt/koji with sec=krb5

--
Anthony - https://messinet.com
F9B6 560E 68EA 037D 8C3D D1C9 FF31 3BDB D9D8 99B6



2020-06-07 17:45:26

by Hans-Peter Jansen

Subject: Re: general protection fault, probably for non-canonical address in nfsd

On Sunday, 7 June 2020, 18:01:55 CEST, Anthony Joseph Messina wrote:
> On Sunday, June 7, 2020 10:32:44 AM CDT Hans-Peter Jansen wrote:
> > Hi,
> >
> > after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
> > crashes of nfsd here:
> >
> > 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated
> > mount request from 192.168.3.16:303 for /work (/work)
> > 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated
> > mount request from 192.168.3.16:304 for /work/vmware (/work)
> > 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated
> > mount request from 192.168.3.16:305 for /work/vSphere (/work)
> > 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general
> > protection fault, probably for non-canonical address 0xb9159d506ba40000:
> > 0000 [#1] SMP PTI 2020-06-07T01:32:43.606284+02:00 server kernel:
> > [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O
> > 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> > 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware
> > name: System manufacturer System Product Name/P7F-E, BIOS 0906
>
> I see similar issues in Fedora kernels 5.6.14 through 5.6.16
> https://bugzilla.redhat.com/show_bug.cgi?id=1839287
>
> On the client I mount /home with sec=krb5p, and /mnt/koji with sec=krb5

Thanks for the confirmation.

Apart from the hassle of server reboots, this issue has some DoS potential,
I'm afraid.

Cheers,
Pete


2020-06-08 15:32:20

by Chuck Lever III

Subject: Re: general protection fault, probably for non-canonical address in nfsd



> On Jun 7, 2020, at 1:44 PM, Hans-Peter Jansen <[email protected]> wrote:
>
> On Sunday, 7 June 2020, 18:01:55 CEST, Anthony Joseph Messina wrote:
>> On Sunday, June 7, 2020 10:32:44 AM CDT Hans-Peter Jansen wrote:
>>> Hi,
>>>
>>> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
>>> crashes of nfsd here:
>>>
>>> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated
>>> mount request from 192.168.3.16:303 for /work (/work)
>>> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated
>>> mount request from 192.168.3.16:304 for /work/vmware (/work)
>>> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated
>>> mount request from 192.168.3.16:305 for /work/vSphere (/work)
>>> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general
>>> protection fault, probably for non-canonical address 0xb9159d506ba40000:
>>> 0000 [#1] SMP PTI 2020-06-07T01:32:43.606284+02:00 server kernel:
>>> [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O
>>> 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
>>> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware
>>> name: System manufacturer System Product Name/P7F-E, BIOS 0906
>>
>> I see similar issues in Fedora kernels 5.6.14 through 5.6.16
>> https://bugzilla.redhat.com/show_bug.cgi?id=1839287
>>
>> On the client I mount /home with sec=krb5p, and /mnt/koji with sec=krb5
>
> Thanks for confirmation.
>
> Apart from the hassle with server reboots, this issue has some DOS potential,
> I'm afraid.

If you have a reproducer (even a partial one) then bisecting between a
known good kernel and v5.6.14 (or 16) would be helpful.


--
Chuck Lever
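
For readers unfamiliar with bisecting: the real tool here would be git bisect run against the stable kernel tree, but the underlying search is a plain binary search over an ordered commit list. The sketch below only illustrates that search. The commit names and the is_bad predicate are hypothetical stand-ins for "build, boot, let the client issue its v3 mounts, and watch for the oops".

```python
# Sketch of the search that bisecting performs. Everything here is
# hypothetical: commit names are placeholders and is_bad stands in for a
# manual build/boot/reproduce cycle.
from typing import Callable, Sequence

def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Precondition: commits[0] is known good, commits[-1] is known bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

# Hypothetical history of 50 commits where the regression lands at c37.
history = [f"c{i}" for i in range(50)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 37)
print(culprit)
```

The number of test cycles grows only logarithmically with the number of commits in the range, which is why even a partial reproducer makes bisecting practical.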



2020-06-08 17:58:21

by Hans-Peter Jansen

Subject: Re: general protection fault, probably for non-canonical address in nfsd

On Monday, 8 June 2020, 17:28:53 CEST, Chuck Lever wrote:
> > On Jun 7, 2020, at 1:44 PM, Hans-Peter Jansen <[email protected]> wrote:
> >
> > On Sunday, 7 June 2020, 18:01:55 CEST, Anthony Joseph Messina wrote:
> >> On Sunday, June 7, 2020 10:32:44 AM CDT Hans-Peter Jansen wrote:
> >>> Hi,
> >>>
> >>> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
> >>> crashes of nfsd here:
> >>>
> >>> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated
> >>> mount request from 192.168.3.16:303 for /work (/work)
> >>> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated
> >>> mount request from 192.168.3.16:304 for /work/vmware (/work)
> >>> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated
> >>> mount request from 192.168.3.16:305 for /work/vSphere (/work)
> >>> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general
> >>> protection fault, probably for non-canonical address 0xb9159d506ba40000:
> >>> 0000 [#1] SMP PTI 2020-06-07T01:32:43.606284+02:00 server kernel:
> >>> [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O
> >>> 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> >>> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware
> >>> name: System manufacturer System Product Name/P7F-E, BIOS 0906
> >>
> >> I see similar issues in Fedora kernels 5.6.14 through 5.6.16
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1839287
> >>
> >> On the client I mount /home with sec=krb5p, and /mnt/koji with sec=krb5
> >
> > Thanks for confirmation.
> >
> > Apart from the hassle with server reboots, this issue has some DOS
> > potential, I'm afraid.
>
> If you have a reproducer (even a partial one) then bisecting between a
> known good kernel and v5.6.14 (or 16) would be helpful.

I would love to bisect, but this is my primary production machine, which needs
to be up as much as possible. Apart from that, I'm about to leave the site for
a week and will be severely time constrained for the next couple of weeks.

Sorry.

Anthony?
--
Pete


2020-06-08 18:35:27

by Anthony Joseph Messina

Subject: Re: general protection fault, probably for non-canonical address in nfsd

On Monday, June 8, 2020 12:53:26 PM CDT Hans-Peter Jansen wrote:
> On Monday, 8 June 2020, 17:28:53 CEST, Chuck Lever wrote:
> > > On Jun 7, 2020, at 1:44 PM, Hans-Peter Jansen <[email protected]> wrote:
> > >
> > > On Sunday, 7 June 2020, 18:01:55 CEST, Anthony Joseph Messina wrote:
> > >> On Sunday, June 7, 2020 10:32:44 AM CDT Hans-Peter Jansen wrote:
> > >>> Hi,
> > >>>
> > >>> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from
> > >>> regular
> > >>> crashes of nfsd here:
> > >>>
> > >>> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]:
> > >>> authenticated
> > >>> mount request from 192.168.3.16:303 for /work (/work)
> > >>> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]:
> > >>> authenticated
> > >>> mount request from 192.168.3.16:304 for /work/vmware (/work)
> > >>> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]:
> > >>> authenticated
> > >>> mount request from 192.168.3.16:305 for /work/vSphere (/work)
> > >>> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general
> > >>> protection fault, probably for non-canonical address
> > >>> 0xb9159d506ba40000:
> > >>> 0000 [#1] SMP PTI 2020-06-07T01:32:43.606284+02:00 server kernel:
> > >>> [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O
> > >>> 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> > >>> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234]
> > >>> Hardware
> > >>> name: System manufacturer System Product Name/P7F-E, BIOS 0906
> > >>
> > >> I see similar issues in Fedora kernels 5.6.14 through 5.6.16
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1839287
> > >>
> > >> On the client I mount /home with sec=krb5p, and /mnt/koji with sec=krb5
> > >
> > > Thanks for confirmation.
> > >
> > > Apart from the hassle with server reboots, this issue has some DOS
> > > potential, I'm afraid.
> >
> > If you have a reproducer (even a partial one) then bisecting between a
> > known good kernel and v5.6.14 (or 16) would be helpful.
>
> I would love to bisect, but this is my primary production machine, that
> needs to be up as much as possible. Apart from that, I'm about to leave the
> site for a week and been severely time constrained for the next couple of
> weeks..
>
> Sorry.
>
> Anthony?

Unfortunately, this is also my main workstation and I have no experience
building custom kernels. The diff in net/sunrpc between v5.6.13 and v5.6.14
is relatively small, though it may not point to the root issue. I'm typically
only able to "follow along" code like this to spot issues, being a nurse, not
a kernel programmer.

Thank you for your help. -A

--
Anthony - https://messinet.com
F9B6 560E 68EA 037D 8C3D D1C9 FF31 3BDB D9D8 99B6


Attachments:
net-sunrpc-v5.6.13_v5.6.14.txt (17.16 kB)

2020-06-08 19:30:22

by Chuck Lever III

Subject: Re: general protection fault, probably for non-canonical address in nfsd



> On Jun 7, 2020, at 11:32 AM, Hans-Peter Jansen <[email protected]> wrote:
>
> Hi,
>
> after upgrading the kernel from 5.6.11 to 5.6.14, we suffer from regular
> crashes of nfsd here:
>
> 2020-06-07T01:32:43.600306+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:303 for /work (/work)
> 2020-06-07T01:32:43.602594+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:304 for /work/vmware (/work)
> 2020-06-07T01:32:43.602971+02:00 server rpc.mountd[2664]: authenticated mount request from 192.168.3.16:305 for /work/vSphere (/work)
> 2020-06-07T01:32:43.606276+02:00 server kernel: [51901.089211] general protection fault, probably for non-canonical address 0xb9159d506ba40000: 0000 [#1] SMP PTI
> 2020-06-07T01:32:43.606284+02:00 server kernel: [51901.089226] CPU: 1 PID: 3190 Comm: nfsd Tainted: G O 5.6.14-lp151.2-default #1 openSUSE Tumbleweed (unreleased)
> 2020-06-07T01:32:43.606286+02:00 server kernel: [51901.089234] Hardware name: System manufacturer System Product Name/P7F-E, BIOS 0906 09/20/2010
> 2020-06-07T01:32:43.606287+02:00 server kernel: [51901.089247] RIP: 0010:cgroup_sk_free+0x26/0x80
> 2020-06-07T01:32:43.606288+02:00 server kernel: [51901.089257] Code: 00 00 00 00 66 66 66 66 90 53 48 8b 07 48 c7 c3 30 72 07 b6 a8 01 75 07 48 85 c0 48 0f 45 d8 48 8b 83 18 09 00 00 a8 03 75 1a <65> 48 ff 08 f6 43 7c 01 74 02 5b c3 48 8b 43 18 a8 03 75 26 65 48
> 2020-06-07T01:32:43.606290+02:00 server kernel: [51901.089276] RSP: 0018:ffffb248c21e7e10 EFLAGS: 00010246
> 2020-06-07T01:32:43.606291+02:00 server kernel: [51901.089280] RAX: b91603a504000000 RBX: ffff99ab141a0000 RCX: 0000000000000021
> 2020-06-07T01:32:43.606292+02:00 server kernel: [51901.089284] RDX: ffffffffb6135ec4 RSI: 0000000000010080 RDI: ffff99a7159c1490
> 2020-06-07T01:32:43.606293+02:00 server kernel: [51901.089287] RBP: ffff99a7159c1200 R08: ffff99ab67a60c60 R09: 000000000002eb00
> 2020-06-07T01:32:43.606294+02:00 server kernel: [51901.089291] R10: ffffb248c0087dc0 R11: 00000000000000c6 R12: 0000000000000000
> 2020-06-07T01:32:43.606295+02:00 server kernel: [51901.089294] R13: 0000000000000103 R14: ffff99aae4934238 R15: ffff99ab31902000
> 2020-06-07T01:32:43.606296+02:00 server kernel: [51901.089299] FS: 0000000000000000(0000) GS:ffff99ab67a40000(0000) knlGS:0000000000000000
> 2020-06-07T01:32:43.606297+02:00 server kernel: [51901.089303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2020-06-07T01:32:43.606303+02:00 server kernel: [51901.089305] CR2: 00000000008e0000 CR3: 00000004df60a000 CR4: 00000000000026e0
> 2020-06-07T01:32:43.606304+02:00 server kernel: [51901.089307] Call Trace:
> 2020-06-07T01:32:43.606305+02:00 server kernel: [51901.089315] __sk_destruct+0x10d/0x1d0
> 2020-06-07T01:32:43.606306+02:00 server kernel: [51901.089319] inet_release+0x34/0x60
> 2020-06-07T01:32:43.606307+02:00 server kernel: [51901.089325] __sock_release+0x81/0xb0
> 2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089358] svc_sock_free+0x38/0x60 [sunrpc]
> 2020-06-07T01:32:43.606308+02:00 server kernel: [51901.089374] svc_xprt_put+0x99/0xe0 [sunrpc]
> 2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089389] svc_recv+0x9c0/0xa40 [sunrpc]
> 2020-06-07T01:32:43.606310+02:00 server kernel: [51901.089410] ? nfsd_destroy+0x60/0x60 [nfsd]
> 2020-06-07T01:32:43.606311+02:00 server kernel: [51901.089417] nfsd+0xd1/0x150 [nfsd]
> 2020-06-07T01:32:43.606312+02:00 server kernel: [51901.089420] kthread+0x10d/0x130
> 2020-06-07T01:32:43.606313+02:00 server kernel: [51901.089423] ? kthread_park+0x90/0x90
> 2020-06-07T01:32:43.606314+02:00 server kernel: [51901.089426] ret_from_fork+0x35/0x40
>
> A vSphere 5.5 host accesses this linux server with nfs v3 for backup
> purposes (a Veeam backup server want to store a new backup here).
>
> The kernel is tainted due to vboxdrv. The OS is openSUSE Leap 15.1,
> with the kernel and Virtualbox replaced with uptodate versions from
> proper rpm packages (built on that very vSphere host in a OBS server
> VM..).
>
> I used to be subscribed to this ML, but that subscription has been
> lost 04/09, thus I cannot reply properly to the general prot. fault
> thread, started 05/12 from syzbot with Bruce looking into it.
>
> It seems somewhat related.

Your backtrace doesn't look anything like the syzbot crashes Bruce
is looking at, and there are no fs/nfsd/ changes between v5.6.11 and
v5.6.14. His crashes appear to be related entirely to the order of
destruction of net namespaces and NFS server data structures --
nothing at the socket layer.

The net/sunrpc/ changes in that commit range have nothing to do with
socket allocation. However, this:

[51901.089247] RIP: 0010:cgroup_sk_free+0x26/0x80

suggests something else. There is a cgroup/sk related change in that
commit range:

e2d928d5ee43 ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")

I'm not sure how to help you further, since you are not available to
test this theory for a few weeks. The best I can suggest for others is
to stick with v5.6.11-based kernels until someone with a reproducer
can bisect between .11 and .14 to confirm the theory.


> Interestingly, we're using a couple of NFS v4 mounts for subsets of
> home here, and mount /work and other shares from various
> Tumbleweed systems with NFS v4 here without any undesired effects.
>
> Since the kernel upgrade, every time, this Veeam thing triggers these
> v3 mounts, the crash happens. I've disabled this backup target for now
> until the problem is resolved, because it effectively prevents further
> nfs accesses to this server, and blocks our desktops until the server
> is rebooted.
>
> A cursory look into 5.6.{15,16} changelogs seems to imply, that this
> issue is still pending.
>
> Let me know, if I can provide any further info's.

--
Chuck Lever
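
A side note on the register dump, offered as one reading of the log rather than a confirmed diagnosis: the bytes at the fault marker in the Code: line, <65> 48 ff 08, appear to decode as a GS-relative 64-bit decrement through RAX, which is how a per-CPU counter is typically decremented. If that reading is right, the faulting address should be the GS base plus RAX, truncated to 64 bits. The sketch below checks only that arithmetic, using values copied from the oops. The helper name gs_relative is illustrative.

```python
# Sketch: reconstruct the faulting address from the register dump.
# Values are copied from the oops; the interpretation of the faulting
# instruction as a GS-relative access through RAX is an assumption.

MASK64 = (1 << 64) - 1

RAX = 0xb91603a504000000         # RAX at the time of the fault
GS_BASE = 0xffff99ab67a40000     # GS: value from the dump
FAULT_ADDR = 0xb9159d506ba40000  # address in the fault message

def gs_relative(base: int, offset: int) -> int:
    """Effective address of a GS-relative access, wrapped to 64 bits."""
    return (base + offset) & MASK64

effective = gs_relative(GS_BASE, RAX)
print(hex(effective))
print(effective == FAULT_ADDR)
```

The sum matches the reported address, which would be consistent with RAX already holding a garbage per-CPU pointer when cgroup_sk_free ran. That by itself does not say which change corrupted it, so bisecting remains the way to confirm the theory above.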