Hi all,
For several weeks now, I have regularly been seeing errors like the following in the system log of one of my NFSv4 file servers:
Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
Looks like premature closing of client connections.
The security flavor of the NFS export is set to krb5p (integrity+privacy).
Does anyone have a hint on how to track down the problem efficiently?
Best and thanks
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
On Fri, Jun 19, 2020 at 09:24:27PM +0000, Kraus, Sebastian wrote:
> Hi all,
> For several weeks now, I have regularly been seeing errors like the following in the system log of one of my NFSv4 file servers:
>
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
I'm not sure what exactly is happening.
Are the log messages the only problem you're seeing, or is there some
other problem?
--b.
>
> Looks like premature closing of client connections.
> The security flavor of the NFS export is set to krb5p (integrity+privacy).
>
> Does anyone have a hint on how to track down the problem efficiently?
>
>
> Best and thanks
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
Hi Bruce,
OK, here is the somewhat longer story 8-O:
I maintain several virtualized NFS server instances running on a VMware ESXi hypervisor. The operating system is Debian Stretch/Buster.
Most of the time, the NFS servers are nearly idle, and there is only moderate CPU load during peak hours. So the servers are far from being overloaded.
Anyway, more than a year ago, the rpc.gssd daemon started getting unstable in production use.
The daemon provokes segmentation violations several times a day, on an irregular basis and, unfortunately, without any obvious reason. :-(
The observed violations look like this:
Jun 11 21:52:08 all kernel: rpc.gssd[12043]: segfault at 0 ip 000056065d50e38e sp 00007fde27ffe880 error 4 in rpc.gssd[56065d50b000+9000]
or that:
Mar 17 10:32:10 all kernel: rpc.gssd[25793]: segfault at ffffffffffffffc0 ip 00007ffa61f246e4 sp 00007ffa6145f0f8 error 5 in libc-2.24.so[7ffa61ea4000+195000]
To manage the problem in a quick and dirty way, I activated automatic restart of the rpc-gssd.service unit on failure.
Several months ago, I decided to investigate the problem further by launching the rpc.svcgssd and rpc.gssd daemons with an increased debug level from their service units.
Sadly, this didn't give me any clue about the root cause of these strange segmentation violations.
Some of my colleagues urged me to migrate the server instances from Debian Stretch (current oldstable) to Debian Buster (current stable).
They argued that rpc.gssd's crashes might be rooted in NFS stack instabilities. So, about three weeks ago, I upgraded two of my server instances.
Unexpectedly, not only did the problem not disappear, but the frequency of the segmentation violations even increased slightly.
Debian Stretch ships with nfs-common v1.3.4-2.1 and Buster with nfs-common v1.3.4-2.5, so both are based on the same nfs-common point release.
Consequently, about a week ago, I decided to investigate the problem more deeply by stracing the rpc.gssd daemon while it was running.
Since then, the segmentation violations have been gone, but now lots of complaints of the following type appear in the system log:
Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
This behaviour seems rather strange to me.
But one possible explanation could be: the execution speed of rpc.gssd drops while it is being straced, and the "true" reason for the segmentation violations surfaces.
In any case, I would argue that rpc.gssd trying to parse non-existent files points to unsound and defective behaviour of the RPC GSS user-space daemon implementation.
Best and a nice weekend
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
________________________________________
From: J. Bruce Fields <[email protected]>
Sent: Saturday, June 20, 2020 00:04
To: Kraus, Sebastian
Cc: [email protected]
Subject: Re: RPC Pipefs: Frequent parsing errors in client database
On Fri, Jun 19, 2020 at 09:24:27PM +0000, Kraus, Sebastian wrote:
> Hi all,
> For several weeks now, I have regularly been seeing errors like the following in the system log of one of my NFSv4 file servers:
>
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
I'm not sure what exactly is happening.
Are the log messages the only problem you're seeing, or is there some
other problem?
--b.
>
> Looks like premature closing of client connections.
> The security flavor of the NFS export is set to krb5p (integrity+privacy).
>
> Does anyone have a hint on how to track down the problem efficiently?
>
>
> Best and thanks
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> In consequence, about a week ago, I decided to investigate the problem
> in a deep manner by stracing the rpc.gssd daemon while running. Since
> then, the segmentation violations were gone, but now lots of
> complaints of the following type appear in the system log:
>
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
> nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
> rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
>
>
> This behaviour seems somehow strange to me. But, one possible
> explanation could be: The execution speed of rpc.gssd slows down while
> being straced and the "true" reason for the segmentation violations
> pops up. I would argue, rpc.gssd trying to parse non-existing files
> points anyway to an insane and defective behaviour of the RPC GSS user
> space daemon implementation.
Those files under rpc_pipefs come and go. rpc.gssd monitors directories
under rpc_pipefs to learn about changes, and then tries to parse the
files under any new directories.
The thing is, if rpc.gssd is a little fast, I think it's possible that
it could get the notification about clnt3bb/ being created, and try to
look up "info", before "info" itself is actually created.
Or alternatively, if clnt3bb/ is short-lived, it might not look up
"info" until the directory's already been deleted again.
Neither problem should be fatal--rpc.gssd will get another update and
adjust to the new situation soon enough.
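As a minimal, self-contained sketch of that pattern (illustrative only, not gssd's actual code; the path and function name are made up), a watcher that gets the IN_CREATE event for a new client directory can simply treat a missing "info" file as a transient condition and wait for the next event:

/* Illustrative sketch only -- not the nfs-utils sources.  Watch one
 * rpc_pipefs-style "topdir" and tolerate the window in which a new
 * clntXXX/ directory exists but its "info" file does not yet (or no
 * longer) exist.  The path below is just an example.
 */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

static void try_parse_info(const char *topdir, const char *clntdir)
{
        char path[PATH_MAX];
        int fd;

        snprintf(path, sizeof(path), "%s/%s/info", topdir, clntdir);
        fd = open(path, O_RDONLY);
        if (fd < 0) {
                if (errno == ENOENT) {
                        /* Benign race: "info" not created yet, or the
                         * directory already deleted.  A later inotify
                         * event will bring us back here. */
                        printf("%s: info not there (yet), will retry\n", clntdir);
                        return;
                }
                perror(path);
                return;
        }
        /* ... parse server name, port and protocol from "info" ... */
        close(fd);
}

int main(void)
{
        const char *topdir = "/run/rpc_pipefs/nfsd4_cb";        /* example */
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        int ifd = inotify_init1(IN_CLOEXEC);

        if (ifd < 0 || inotify_add_watch(ifd, topdir, IN_CREATE | IN_DELETE) < 0)
                return 1;
        for (;;) {
                ssize_t len = read(ifd, buf, sizeof(buf));
                char *p = buf;

                while (len > 0 && p < buf + len) {
                        struct inotify_event *ev = (struct inotify_event *)p;

                        /* 0x40000100 in the log above = IN_CREATE | IN_ISDIR */
                        if ((ev->mask & IN_CREATE) && (ev->mask & IN_ISDIR))
                                try_parse_info(topdir, ev->name);
                        p += sizeof(*ev) + ev->len;
                }
        }
}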
So it may be that the real error here is an unconditional log message
in a case that's expected, not actually an error.
Or I could be wrong and maybe something else is happening.
But I think it'd be more useful to stay focused on the segfaults.
--b.
Hi Bruce,
>> But I think it'd be more useful to stay focused on the segfaults.
Is it a clever idea to analyze core dumps? Or are there other, much better debugging techniques w.r.t. RPC daemons?
I am now doing more tests while fiddling around with the timeout parameters "-T" and "-t" on the rpc.gssd command line.
There are several things I do not really understand about the trace shown below:
1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?
2) Why are there two identifiers clnt36e and clnt36f being used for the same client?
3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?
Jun 18 18:55:07 server rpcbind[30464]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server rpcbind[30465]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server rpcbind[30466]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server rpcbind[30467]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server rpcbind[30468]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server rpcbind[30469]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server rpcbind[30470]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server rpcbind[30471]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server rpc.svcgssd[20418]: leaving poll
Jun 18 18:55:07 server rpc.svcgssd[20418]: handling null request
Jun 18 18:55:07 server rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
Jun 18 18:55:07 server rpc.svcgssd[20418]: sname = host/[email protected]
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_getpwnam: name 'host/[email protected]' domain '(null)': resulting localname 'host/test0815.chem.tu-berlin.de'
Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_ldap: reconnecting to LDAP server...
Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_ldap: reconnected to LDAP server ldaps://ldap-server.tu-berlin.de after 1 attempt
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
Jun 18 18:55:07 server rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
Jun 18 18:55:07 server rpc.svcgssd[20418]: doing downcall
Jun 18 18:55:07 server rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: [email protected], uid: -1, gid: -1, num aux grps: 0:
Jun 18 18:55:07 server rpc.svcgssd[20418]: sending null reply
Jun 18 18:55:07 server rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
Jun 18 18:55:07 server rpc.svcgssd[20418]: finished handling null request
Jun 18 18:55:07 server rpc.svcgssd[20418]: entering poll
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36e) ev->mask (0x40000100)
Jun 18 18:55:07 server rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt36e/info: No such file or directory
Jun 18 18:55:07 server rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt36e/info
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (krb5) ev->mask (0x00000200)
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (gssd) ev->mask (0x00000200)
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (info) ev->mask (0x00000200)
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36f) ev->mask (0x40000100)
Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)
Jun 18 18:55:07 server rpc.svcgssd[20418]: leaving poll
Jun 18 18:55:07 server rpc.svcgssd[20418]: handling null request
Jun 18 18:55:07 server rpc.gssd[23620]:
handle_gssd_upcall: 'mech=krb5 uid=0 [email protected] service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt36f)
Jun 18 18:55:07 server rpc.gssd[23620]: krb5_use_machine_creds: uid 0 tgtname [email protected]
Jun 18 18:55:07 server rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
Jun 18 18:55:07 server rpc.gssd[23620]: Full hostname for 'client.domain.tu-berlin.de' is 'client.domain.tu-berlin.de'
Jun 18 18:55:07 server rpc.gssd[23620]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
Jun 18 18:55:07 server rpc.gssd[23620]: Success getting keytab entry for 'nfs/[email protected]'
Jun 18 18:55:07 server rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
Jun 18 18:55:07 server rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
Jun 18 18:55:07 server rpc.gssd[23620]: creating tcp client for server client.domain.tu-berlin.de
Jun 18 18:55:07 server rpc.gssd[23620]: DEBUG: port already set to 39495
Jun 18 18:55:07 server rpc.gssd[23620]: creating context with server [email protected]
Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_create_default()
Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_create()
Jun 18 18:55:07 server rpc.gssd[23620]: authgss_create: name is 0x7fa5ac011970
Jun 18 18:55:07 server rpc.gssd[23620]: authgss_create: gd->name is 0x7fa5ac005770
Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_refresh()
Jun 18 18:55:07 server rpc.svcgssd[20418]: sname = host/[email protected]
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_getpwnam: name 'host/[email protected]' domain '(null)': resulting localname 'host/client.domain.tu-berlin.de'
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
Jun 18 18:55:07 server rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
Jun 18 18:55:07 server rpc.svcgssd[20418]: doing downcall
Jun 18 18:55:07 server rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: [email protected], uid: -1, gid: -1, num aux grps: 0:
Jun 18 18:55:07 server rpc.svcgssd[20418]: sending null reply
Jun 18 18:55:07 server rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
Jun 18 18:55:07 server rpc.svcgssd[20418]: finished handling null request
Jun 18 18:55:07 server rpc.svcgssd[20418]: entering poll
Best and keep well
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
________________________________________
From: J. Bruce Fields <[email protected]>
Sent: Saturday, June 20, 2020 19:03
To: Kraus, Sebastian
Cc: [email protected]
Subject: Re: RPC Pipefs: Frequent parsing errors in client database
On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> In consequence, about a week ago, I decided to investigate the problem
> in a deep manner by stracing the rpc.gssd daemon while running. Since
> then, the segmentation violations were gone, but now lots of
> complaints of the following type appear in the system log:
>
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
> nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
> rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
>
>
> This behaviour seems somehow strange to me. But, one possible
> explanation could be: The execution speed of rpc.gssd slows down while
> being straced and the "true" reason for the segmentation violations
> pops up. I would argue, rpc.gssd trying to parse non-existing files
> points anyway to an insane and defective behaviour of the RPC GSS user
> space daemon implementation.
Those files under rpc_pipefs come and go. rpc.gssd monitors directories
under rpc_pipefs to learn about changes, and then tries to parse the
files under any new directories.
The thing is, if rpc.gssd is a little fast, I think it's possible that
it could get the notification about clnt3bb/ being created, and try to
look up "info", before "info" itself is actually created.
Or alternatively, if clnt3bb/ is short-lived, it might not look up
"info" until the directory's already been deleted again.
Neither problem should be fatal--rpc.gssd will get another update and
adjust to the new situation soon enough.
> So it may be that the real error here is an unconditional log message
in a case that's expected, not actually an error.
Or I could be wrong and maybe something else is happening.
But I think it'd be more useful to stay focused on the segfaults.
--b.
On Sat, Jun 20, 2020 at 09:08:55PM +0000, Kraus, Sebastian wrote:
> Hi Bruce,
>
> >> But I think it'd be more useful to stay focused on the segfaults.
>
> is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?
If we could at least get a backtrace out of the core dump that could be
useful.
> I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.
>
> There are several things I do not really understand about the trace shown below:
>
> 1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?
It doesn't know beforehand, in the scenarios I described.
> 2) Why are there two identifiers clnt36e and clnt36f being used for the same client?
This is actually happening on an NFS server, the rpc client in question
is the callback client used to do things like send delegation recalls
back to the NFS client.
I'm not sure why two different callback clients are being created here,
but there's nothing inherently weird about that.
> 3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?
Off the top of my head, I don't know, we'd probably need to look through
header files or inotify man pages for the definitions of those masks.
--b.
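For what it's worth, those masks decode directly against the constants in <sys/inotify.h>: 0x40000100 is IN_CREATE | IN_ISDIR, 0x00000200 is IN_DELETE, and 0x00008000 is IN_IGNORED, i.e. the watch was removed because the clnt36e directory itself went away (such events carry no name, which is presumably why gssd prints <?> there). A throwaway decoder, not part of nfs-utils, could look like this:

/* Decode inotify mask values such as the ones in the trace above.
 * Standalone helper for illustration only, e.g.:
 *   ./decode 0x40000100 0x200 0x8000
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>

static const struct { unsigned int bit; const char *name; } flags[] = {
        { IN_ACCESS, "IN_ACCESS" },           { IN_MODIFY, "IN_MODIFY" },
        { IN_ATTRIB, "IN_ATTRIB" },           { IN_CLOSE_WRITE, "IN_CLOSE_WRITE" },
        { IN_CLOSE_NOWRITE, "IN_CLOSE_NOWRITE" }, { IN_OPEN, "IN_OPEN" },
        { IN_MOVED_FROM, "IN_MOVED_FROM" },   { IN_MOVED_TO, "IN_MOVED_TO" },
        { IN_CREATE, "IN_CREATE" },           { IN_DELETE, "IN_DELETE" },
        { IN_DELETE_SELF, "IN_DELETE_SELF" }, { IN_MOVE_SELF, "IN_MOVE_SELF" },
        { IN_UNMOUNT, "IN_UNMOUNT" },         { IN_Q_OVERFLOW, "IN_Q_OVERFLOW" },
        { IN_IGNORED, "IN_IGNORED" },         { IN_ISDIR, "IN_ISDIR" },
};

int main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++) {
                unsigned int mask = (unsigned int)strtoul(argv[i], NULL, 0);

                printf("%#010x =", mask);
                for (size_t j = 0; j < sizeof(flags) / sizeof(flags[0]); j++)
                        if (mask & flags[j].bit)
                                printf(" %s", flags[j].name);
                printf("\n");
        }
        return 0;
}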
>
>
> Jun 18 18:55:07 server rpcbind[30464]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server rpcbind[30465]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server rpcbind[30466]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server rpcbind[30467]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server rpcbind[30468]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server rpcbind[30469]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server rpcbind[30470]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server rpcbind[30471]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server rpc.svcgssd[20418]: leaving poll
> Jun 18 18:55:07 server rpc.svcgssd[20418]: handling null request
> Jun 18 18:55:07 server rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
> Jun 18 18:55:07 server rpc.svcgssd[20418]: sname = host/[email protected]
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_getpwnam: name 'host/[email protected]' domain '(null)': resulting localname 'host/test0815.chem.tu-berlin.de'
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_ldap: reconnecting to LDAP server...
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_ldap: reconnected to LDAP server ldaps://ldap-server.tu-berlin.de after 1 attempt
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
> Jun 18 18:55:07 server rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
> Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
> Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
> Jun 18 18:55:07 server rpc.svcgssd[20418]: doing downcall
> Jun 18 18:55:07 server rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: [email protected], uid: -1, gid: -1, num aux grps: 0:
> Jun 18 18:55:07 server rpc.svcgssd[20418]: sending null reply
> Jun 18 18:55:07 server rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
> Jun 18 18:55:07 server rpc.svcgssd[20418]: finished handling null request
> Jun 18 18:55:07 server rpc.svcgssd[20418]: entering poll
>
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36e) ev->mask (0x40000100)
> Jun 18 18:55:07 server rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt36e/info: No such file or directory
> Jun 18 18:55:07 server rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt36e/info
>
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (krb5) ev->mask (0x00000200)
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (gssd) ev->mask (0x00000200)
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (info) ev->mask (0x00000200)
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36f) ev->mask (0x40000100)
> Jun 18 18:55:07 server rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)
> Jun 18 18:55:07 server rpc.svcgssd[20418]: leaving poll
> Jun 18 18:55:07 server rpc.svcgssd[20418]: handling null request
> Jun 18 18:55:07 server rpc.gssd[23620]:
> handle_gssd_upcall: 'mech=krb5 uid=0 [email protected] service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt36f)
> Jun 18 18:55:07 server rpc.gssd[23620]: krb5_use_machine_creds: uid 0 tgtname [email protected]
> Jun 18 18:55:07 server rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
> Jun 18 18:55:07 server rpc.gssd[23620]: Full hostname for 'client.domain.tu-berlin.de' is 'client.domain.tu-berlin.de'
> Jun 18 18:55:07 server rpc.gssd[23620]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
> Jun 18 18:55:07 server rpc.gssd[23620]: Success getting keytab entry for 'nfs/[email protected]'
> Jun 18 18:55:07 server rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
> Jun 18 18:55:07 server rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
> Jun 18 18:55:07 server rpc.gssd[23620]: creating tcp client for server client.domain.tu-berlin.de
> Jun 18 18:55:07 server rpc.gssd[23620]: DEBUG: port already set to 39495
> Jun 18 18:55:07 server rpc.gssd[23620]: creating context with server [email protected]
> Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_create_default()
> Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_create()
> Jun 18 18:55:07 server rpc.gssd[23620]: authgss_create: name is 0x7fa5ac011970
> Jun 18 18:55:07 server rpc.gssd[23620]: authgss_create: gd->name is 0x7fa5ac005770
> Jun 18 18:55:07 server rpc.gssd[23620]: in authgss_refresh()
> Jun 18 18:55:07 server rpc.svcgssd[20418]: sname = host/[email protected]
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nss_getpwnam: name 'host/[email protected]' domain '(null)': resulting localname 'host/client.domain.tu-berlin.de'
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
> Jun 18 18:55:07 server rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
> Jun 18 18:55:07 server rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
> Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
> Jun 18 18:55:07 server rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
> Jun 18 18:55:07 server rpc.svcgssd[20418]: doing downcall
> Jun 18 18:55:07 server rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: [email protected], uid: -1, gid: -1, num aux grps: 0:
> Jun 18 18:55:07 server rpc.svcgssd[20418]: sending null reply
> Jun 18 18:55:07 server rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
> Jun 18 18:55:07 server rpc.svcgssd[20418]: finished handling null request
> Jun 18 18:55:07 server rpc.svcgssd[20418]: entering poll
>
>
>
> Best and keep well
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> ________________________________________
> From: J. Bruce Fields <[email protected]>
> Sent: Saturday, June 20, 2020 19:03
> To: Kraus, Sebastian
> Cc: [email protected]
> Subject: Re: RPC Pipefs: Frequent parsing errors in client database
>
> On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> > In consequence, about a week ago, I decided to investigate the problem
> > in a deep manner by stracing the rpc.gssd daemon while running. Since
> > then, the segmentation violations were gone, but now lots of
> > complaints of the following type appear in the system log:
> >
> > Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
> > nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
> > rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
> >
> >
> > This behaviour seems somehow strange to me. But, one possible
> > explanation could be: The execution speed of rpc.gssd slows down while
> > being straced and the "true" reason for the segmentation violations
> > pops up. I would argue, rpc.gssd trying to parse non-existing files
> > points anyway to an insane and defective behaviour of the RPC GSS user
> > space daemon implementation.
>
> Those files under rpc_pipefs come and go. rpc.gssd monitors directories
> under rpc_pipefs to learn about changes, and then tries to parse the
> files under any new directories.
>
> The thing is, if rpc.gssd is a little fast, I think it's possible that
> it could get the notification about clnt3bb/ being created, and try to
> look up "info", before "info" itself is actually created.
>
> Or alternatively, if clnt3bb/ is short-lived, it might not look up
> "info" until the directory's already been deleted again.
>
> Neither problem should be fatal--rpc.gssd will get another update and
> adjust to the new situation soon enough.
>
> So it may be that the real error here is an unconditional log message
> in a case that's expected, not actually an error.
>
> Or I could be wrong and maybe something else is happening.
>
> But I think it'd be more useful to stay focused on the segfaults.
>
> --b.
Dear Bruce,
I got the following stack and back trace:
root@all:~# coredumpctl debug
PID: 6356 (rpc.gssd)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Thu 2020-06-25 11:46:08 CEST (3h 4min ago)
Command Line: /usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10
Executable: /usr/sbin/rpc.gssd
Control Group: /system.slice/rpc-gssd.service
Unit: rpc-gssd.service
Slice: system.slice
Boot ID: XXXXXXXXXXXXXXXXXXXXXXXXXXX
Machine ID: YYYYYYYYYYYYYYYYYYYYYYYYYYYY
Hostname: XYZ
Storage: /var/lib/systemd/coredump/core.rpc\x2egssd.0.7f31136228274af0a1a855b91ad1e75c.6356.1593078368000000.lz4
Message: Process 6356 (rpc.gssd) of user 0 dumped core.
Stack trace of thread 14174:
#0 0x000056233fff038e n/a (rpc.gssd)
#1 0x000056233fff09f8 n/a (rpc.gssd)
#2 0x000056233fff0b92 n/a (rpc.gssd)
#3 0x000056233fff13b3 n/a (rpc.gssd)
#4 0x00007fb2eb8dbfa3 start_thread (libpthread.so.0)
#5 0x00007fb2eb80c4cf __clone (libc.so.6)
Stack trace of thread 6356:
#0 0x00007fb2eb801819 __GI___poll (libc.so.6)
#1 0x00007fb2eb6e7207 send_dg (libresolv.so.2)
#2 0x00007fb2eb6e4c43 __GI___res_context_query (libresolv.so.2)
#3 0x00007fb2eb6bf536 __GI__nss_dns_gethostbyaddr2_r (libnss_dns.so.2)
#4 0x00007fb2eb6bf823 _nss_dns_gethostbyaddr_r (libnss_dns.so.2)
#5 0x00007fb2eb81dee2 __gethostbyaddr_r (libc.so.6)
#6 0x00007fb2eb8267d5 gni_host_inet_name (libc.so.6)
#7 0x000056233ffef455 n/a (rpc.gssd)
#8 0x000056233ffef82c n/a (rpc.gssd)
#9 0x000056233fff01d0 n/a (rpc.gssd)
#10 0x00007fb2ebab49ba n/a (libevent-2.1.so.6)
#11 0x00007fb2ebab5537 event_base_loop (libevent-2.1.so.6)
#12 0x000056233ffedeaa n/a (rpc.gssd)
#13 0x00007fb2eb73709b __libc_start_main (libc.so.6)
#14 0x000056233ffee03a n/a (rpc.gssd)
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
[...]
Reading symbols from /usr/sbin/rpc.gssd...(no debugging symbols found)...done.
[New LWP 14174]
[New LWP 6356]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000056233fff038e in ?? ()
[Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
(gdb) bt
#0 0x000056233fff038e in ?? ()
#1 0x000056233fff09f8 in ?? ()
#2 0x000056233fff0b92 in ?? ()
#3 0x000056233fff13b3 in ?? ()
#4 0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5 0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) quit
I am not an expert in analyzing stack and back traces. Is there anything meaningful you are able to extract from the trace?
As far as I can see, thread 14174 caused the segmentation violation just after its birth on clone.
Please correct me if I am in error.
It seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
So far, I have not been able to find such a package.
What's your opinion about the trace?
Best and Thanks
Sebastian
_____________________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Email: [email protected]
________________________________________
From: [email protected] <[email protected]> on behalf of J. Bruce Fields <[email protected]>
Sent: Tuesday, June 23, 2020 00:36
To: Kraus, Sebastian
Cc: [email protected]
Subject: Re: RPC Pipefs: Frequent parsing errors in client database
On Sat, Jun 20, 2020 at 09:08:55PM +0000, Kraus, Sebastian wrote:
> Hi Bruce,
>
> >> But I think it'd be more useful to stay focused on the segfaults.
>
> is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?
If we could at least get a backtrace out of the core dump that could be
useful.
> I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.
>
> There are several things I do not really understand about the trace shown below:
>
> 1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?
It doesn't know beforehand, in the scenarios I described.
> 2) Why are there two identifiers clnt36e and clnt36f being used for the same client?
This is actually happening on an NFS server, the rpc client in question
is the callback client used to do things like send delegation recalls
back to the NFS client.
I'm not sure why two different callback clients are being created here,
but there's nothing inherently weird about that.
> 3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?
Off the top of my head, I don't know, we'd probably need to look through
header files or inotify man pages for the definitions of those masks.
--b.
On Thu, Jun 25, 2020 at 05:43:53PM +0000, Kraus, Sebastian wrote:
> Dear Bruce,
> I got the following stack and back trace:
>
> root@all:~# coredumpctl debug
> PID: 6356 (rpc.gssd)
> UID: 0 (root)
> GID: 0 (root)
> Signal: 11 (SEGV)
> Timestamp: Thu 2020-06-25 11:46:08 CEST (3h 4min ago)
> Command Line: /usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10
> Executable: /usr/sbin/rpc.gssd
> Control Group: /system.slice/rpc-gssd.service
> Unit: rpc-gssd.service
> Slice: system.slice
> Boot ID: XXXXXXXXXXXXXXXXXXXXXXXXXXX
> Machine ID: YYYYYYYYYYYYYYYYYYYYYYYYYYYY
> Hostname: XYZ
> Storage: /var/lib/systemd/coredump/core.rpc\x2egssd.0.7f31136228274af0a1a855b91ad1e75c.6356.1593078368000000.lz4
> Message: Process 6356 (rpc.gssd) of user 0 dumped core.
>
> Stack trace of thread 14174:
> #0 0x000056233fff038e n/a (rpc.gssd)
> #1 0x000056233fff09f8 n/a (rpc.gssd)
> #2 0x000056233fff0b92 n/a (rpc.gssd)
> #3 0x000056233fff13b3 n/a (rpc.gssd)
> #4 0x00007fb2eb8dbfa3 start_thread (libpthread.so.0)
> #5 0x00007fb2eb80c4cf __clone (libc.so.6)
>
> Stack trace of thread 6356:
> #0 0x00007fb2eb801819 __GI___poll (libc.so.6)
> #1 0x00007fb2eb6e7207 send_dg (libresolv.so.2)
> #2 0x00007fb2eb6e4c43 __GI___res_context_query (libresolv.so.2)
> #3 0x00007fb2eb6bf536 __GI__nss_dns_gethostbyaddr2_r (libnss_dns.so.2)
> #4 0x00007fb2eb6bf823 _nss_dns_gethostbyaddr_r (libnss_dns.so.2)
> #5 0x00007fb2eb81dee2 __gethostbyaddr_r (libc.so.6)
> #6 0x00007fb2eb8267d5 gni_host_inet_name (libc.so.6)
> #7 0x000056233ffef455 n/a (rpc.gssd)
> #8 0x000056233ffef82c n/a (rpc.gssd)
> #9 0x000056233fff01d0 n/a (rpc.gssd)
> #10 0x00007fb2ebab49ba n/a (libevent-2.1.so.6)
> #11 0x00007fb2ebab5537 event_base_loop (libevent-2.1.so.6)
> #12 0x000056233ffedeaa n/a (rpc.gssd)
> #13 0x00007fb2eb73709b __libc_start_main (libc.so.6)
> #14 0x000056233ffee03a n/a (rpc.gssd)
>
> GNU gdb (Debian 8.2.1-2+b3) 8.2.1
> [...]
> Reading symbols from /usr/sbin/rpc.gssd...(no debugging symbols found)...done.
> [New LWP 14174]
> [New LWP 6356]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `/usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x000056233fff038e in ?? ()
> [Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
> (gdb) bt
> #0 0x000056233fff038e in ?? ()
> #1 0x000056233fff09f8 in ?? ()
> #2 0x000056233fff0b92 in ?? ()
> #3 0x000056233fff13b3 in ?? ()
> #4 0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> #5 0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) quit
>
>
> I am not an expert in analyzing stack and backtraces. Is there anything meaningful, you are able to extract from the trace?
> As far as I see, thread 14174 caused the segmentation violation just after its birth on clone.
> Please correct me, if I am in error.
> Seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
Have you reported a debian bug? They might know how to get a good trace
out of it.
--b.
> So far, I was not able to find such a package.
> What's your opinion about the trace?
>
>
> Best and Thanks
> Sebastian
>
> _____________________________
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Email: [email protected]
>
> ________________________________________
> From: [email protected] <[email protected]> on behalf of J. Bruce Fields <[email protected]>
> Sent: Tuesday, June 23, 2020 00:36
> To: Kraus, Sebastian
> Cc: [email protected]
> Subject: Re: RPC Pipefs: Frequent parsing errors in client database
>
> On Sat, Jun 20, 2020 at 09:08:55PM +0000, Kraus, Sebastian wrote:
> > Hi Bruce,
> >
> > >> But I think it'd be more useful to stay focused on the segfaults.
> >
> > is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?
>
> If we could at least get a backtrace out of the core dump that could be
> useful.
>
> > I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.
> >
> > There are several things I do not really understand about the trace shown below:
> >
> > 1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?
>
> It doesn't know beforehand, in the scenarios I described.
>
> > 2) Why are there two identifiers clnt36e and clnt36f being used for the same client?
>
> This is actually happening on an NFS server, the rpc client in question
> is the callback client used to do things like send delegation recalls
> back to the NFS client.
>
> I'm not sure why two different callback clients are being created here,
> but there's nothing inherently weird about that.
>
> > 3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?
>
> Off the top of my head, I don't know, we'd probably need to look through
> header files or inotify man pages for the definitions of those masks.
>
> --b.
On 2020-06-25 13:43, Kraus, Sebastian wrote:
> [Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
> (gdb) bt
> #0 0x000056233fff038e in ?? ()
> #1 0x000056233fff09f8 in ?? ()
> #2 0x000056233fff0b92 in ?? ()
> #3 0x000056233fff13b3 in ?? ()
> #4 0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> #5 0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) quit
>
>
> I am not an expert in analyzing stack and backtraces. Is there anything meaningful, you are able to extract from the trace?
> As far as I see, thread 14174 caused the segmentation violation just after its birth on clone.
> Please correct me, if I am in error.
> Seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
> So far, I was not able to find such a package.
> What's your opinion about the trace?
You'll need to install the debug symbols for your distribution/package.
A quick google links to https://wiki.debian.org/HowToGetABacktrace.
Those ?? lines should then be replaced with function, file & line numbers.
I've been following this with interest since it used to happen to me a
lot. It hasn't recently, even though every so often I spend a few hours
trying to re-create it to try to debug it.
Doug
Hi Doug, Dear Bruce,
FYI, I filed a bug against the nfs-common package in Debian Buster:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=963746
Hope this helps to track down the problem further and shed some light on the mysterious segfaults of rpc.gssd.
>> A quick google links to https://wiki.debian.org/HowToGetABacktrace.
BTW: Thanks very much for the link. It helped me find the right dbgsym packages to produce a more readable and hopefully more valuable backtrace. ;-)
I also came across a similar, still-open issue in the CentOS bug tracker:
https://bugs.centos.org/view.php?id=15895
Best and Thanks for your support up to now
Sebastian
___________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Email: [email protected]
________________________________________
From: Doug Nazar <[email protected]>
Sent: Thursday, June 25, 2020 23:44
To: Kraus, Sebastian; J. Bruce Fields
Cc: [email protected]
Subject: Re: Strange segmentation violations of rpc.gssd in Debian Buster
On 2020-06-25 13:43, Kraus, Sebastian wrote:
> [Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
> (gdb) bt
> #0 0x000056233fff038e in ?? ()
> #1 0x000056233fff09f8 in ?? ()
> #2 0x000056233fff0b92 in ?? ()
> #3 0x000056233fff13b3 in ?? ()
> #4 0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> #5 0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) quit
>
>
> I am not an expert in analyzing stack and backtraces. Is there anything meaningful, you are able to extract from the trace?
> As far as I see, thread 14174 caused the segmentation violation just after its birth on clone.
> Please correct me, if I am in error.
> Seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
> So far, I was not able to find such a package.
> What's your opinion about the trace?
You'll need to install the debug symbols for your distribution/package.
A quick google links to https://wiki.debian.org/HowToGetABacktrace.
Those ?? lines should then be replaced with function, file & line numbers.
I've been following this with interest since it used to happen to me a
lot. It hasn't recently, even though every so often I spend a few hours
trying to re-create it to try to debug it.
Doug
On 2020-06-26 08:31, Kraus, Sebastian wrote:
> Hi Doug, Dear Bruce,
> FYI, I filed a bug with nfs-common package in Debian Buster:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=963746
>
> Hope this helps to track down the problem further and shed some light on the mysterious segfaults of rpc.gssd.
Ok, I think I see what's going on. The struct clnt_info is getting freed
out from under the upcall thread. In this case it immediately got reused
for another client which zeroed the struct and was in the process of
looking up the info for it's client, hence the protocol & server fields
were null in the upcall thread.
That explains why I haven't been able to recreate it. Thanks for the stack
trace, Sebastian.
Bruce, I can't see any locking/reference counting around struct
clnt_info. It just gets destroyed when receiving the inotify. Or should
it be deep copied when starting an upcall? Am I missing something?
Doug
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50e) ev->mask (0x40000100)
Jun 25 11:46:08 server rpc.gssd[6356]: handle_gssd_upcall: 'mech=krb5 uid=0 [email protected] service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt50e)
Jun 25 11:46:08 server rpc.gssd[6356]: krb5_use_machine_creds: uid 0 tgtname [email protected]
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (krb5) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (gssd) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (info) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (<?>) ev->mask (0x00008000)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50f) ev->mask (0x40000100)
Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for '' is 'client.domain.tu-berlin.de'
Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
Jun 25 11:46:08 server rpc.gssd[6356]: Success getting keytab entry for 'nfs/[email protected]'
Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
Jun 25 11:46:08 server rpc.gssd[6356]: creating (null) client for server (null)
Jun 25 11:46:08 all kernel: rpc.gssd[14174]: segfault at 0 ip 000056233fff038e sp 00007fb2eaeb9880 error 4 in rpc.gssd[56233ffed000+9000]
Thread 1 (Thread 0x7fb2eaeba700 (LWP 14174)):
#0 0x000056233fff038e in create_auth_rpc_client (clp=clp@entry=0x562341008fa0, tgtname=tgtname@entry=0x562341011c8f "[email protected]", clnt_return=clnt_return@entry=0x7fb2eaeb9de8, auth_return=auth_return@entry=0x7fb2eaeb9d50, uid=uid@entry=0, cred=cred@entry=0x0, authtype=0) at gssd_proc.c:352
Thread 2 (Thread 0x7fb2eb6d9740 (LWP 6356)):
#12 0x000056233ffef82c in gssd_read_service_info (clp=0x562341008fa0, dirfd=11) at gssd.c:326
On Fri, Jun 26, 2020 at 01:23:54PM -0400, Doug Nazar wrote:
> Ok, I think I see what's going on. The struct clnt_info is getting
> freed out from under the upcall thread. In this case it immediately
> got reused for another client which zeroed the struct and was in the
> process of looking up the info for its client, hence the protocol &
> server fields were null in the upcall thread.
>
> Explains why I haven't been able to recreate it. Thanks for the
> stack trace Sebastian.
>
> Bruce, I can't see any locking/reference counting around struct
> clnt_info. It just gets destroyed when receiving the inotify. Or
> should it be deep copied when starting an upcall? Am I missing
> something?
Thanks for finding that!
Staring at that code in an attempt to catch up here....
Looks like there's one main thread that watches for upcalls and other
events, then creates a new short-lived thread for each upcall.
The main thread is the only one that really manipulates the data
structure with all the clients. So that data structure shouldn't need
any locking. Except, as you point out, to keep the clnt_info from
disappearing out from under them.
So, yeah, either a reference count or a deep copy is probably all that's
needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().
--b.
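For comparison with the refcounting route that Doug's patch below takes, the deep-copy alternative might look roughly like this; the snapshot struct and helper names here are invented for illustration, and only the field names mirror struct clnt_info:

/* Hypothetical sketch of the deep-copy alternative -- the struct and
 * helper names are invented; the fix discussed below uses refcounting
 * instead.  Idea: the main thread copies the strings an upcall thread
 * needs while clp is still known to be valid, so the thread never
 * touches clp after the main loop may have destroyed it.
 */
#include <stdlib.h>
#include <string.h>

struct upcall_snapshot {
        char *relpath;
        char *servername;
        char *servicename;
        char *protocol;
};

/* Main thread, e.g. from alloc_upcall_info(), while clp is alive. */
static int snapshot_take(struct upcall_snapshot *s, const char *relpath,
                         const char *servername, const char *servicename,
                         const char *protocol)
{
        memset(s, 0, sizeof(*s));
        if (!(s->relpath = strdup(relpath)) ||
            (servername  && !(s->servername  = strdup(servername))) ||
            (servicename && !(s->servicename = strdup(servicename))) ||
            (protocol    && !(s->protocol    = strdup(protocol))))
                return -1;      /* caller cleans up with snapshot_drop() */
        return 0;
}

/* Upcall thread, at the end of handle_krb5_upcall()/handle_gssd_upcall(). */
static void snapshot_drop(struct upcall_snapshot *s)
{
        free(s->relpath);
        free(s->servername);
        free(s->servicename);
        free(s->protocol);
}

One wrinkle with that approach: the krb5_fd/gssd_fd descriptors cannot be snapshotted the same way, so their lifetime would still need care, which is part of why refcounting the whole struct ends up simpler.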
> Doug
>
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50e) ev->mask (0x40000100)
> Jun 25 11:46:08 server rpc.gssd[6356]: handle_gssd_upcall: 'mech=krb5 uid=0 [email protected] service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt50e)
> Jun 25 11:46:08 server rpc.gssd[6356]: krb5_use_machine_creds: uid 0 tgtname [email protected]
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (krb5) ev->mask (0x00000200)
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (gssd) ev->mask (0x00000200)
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (info) ev->mask (0x00000200)
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (<?>) ev->mask (0x00008000)
> Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50f) ev->mask (0x40000100)
> Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for '' is 'client.domain.tu-berlin.de'
> Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
> Jun 25 11:46:08 server rpc.gssd[6356]: Success getting keytab entry for 'nfs/[email protected]'
> Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
> Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
> Jun 25 11:46:08 server rpc.gssd[6356]: creating (null) client for server (null)
> Jun 25 11:46:08 all kernel: rpc.gssd[14174]: segfault at 0 ip 000056233fff038e sp 00007fb2eaeb9880 error 4 in rpc.gssd[56233ffed000+9000]
>
>
> Thread 1 (Thread 0x7fb2eaeba700 (LWP 14174)):
> #0 0x000056233fff038e in create_auth_rpc_client (clp=clp@entry=0x562341008fa0, tgtname=tgtname@entry=0x562341011c8f "[email protected]", clnt_return=clnt_return@entry=0x7fb2eaeb9de8, auth_return=auth_return@entry=0x7fb2eaeb9d50, uid=uid@entry=0, cred=cred@entry=0x0, authtype=0) at gssd_proc.c:352
>
> Thread 2 (Thread 0x7fb2eb6d9740 (LWP 6356)):
> #12 0x000056233ffef82c in gssd_read_service_info (clp=0x562341008fa0, dirfd=11) at gssd.c:326
>
On 2020-06-26 15:46, J. Bruce Fields wrote:
> So, yeah, either a reference count or a deep copy is probably all that's
> needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().
Slightly more complex, to handle the error cases and event tear-down, but this seems to work. I'm not sure how much of a hot spot this is, so I just went with a global mutex. Strangely, there was an existing unused mutex and thread_started flag.
It survives basic testing, but I don't have a reproducer. Maybe blocking access to the KDC would do it. If I get time I'll try to set up a test environment.
Doug
On Fri, Jun 26, 2020 at 04:15:46PM -0400, Doug Nazar wrote:
> On 2020-06-26 15:46, J. Bruce Fields wrote:
> >So, yeah, either a reference count or a deep copy is probably all that's
> >needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().
>
> Slightly more complex, to handle the error cases & event tear-down
> but this seems to work. I'm not sure how much of a hot spot this is
> so I just went with a global mutex. Strangely there was an existing
> unused mutex & thread_started flag.
>
> Survives basic testing but I don't have a reproducer. Maybe blocking
> access to the kdc. If I get time I'll try to setup a test
> environment.
Thanks, looks good. The only thing I wonder about:
> @@ -359,9 +357,37 @@ out:
> free(port);
> }
>
> +/* Actually frees clp and fields that might be used from other
> + * threads if was last reference.
> + */
> +static void
> +gssd_free_client(struct clnt_info *clp)
> +{
> + int refcnt;
> +
> + pthread_mutex_lock(&clp_lock);
> + refcnt = --clp->refcount;
> + pthread_mutex_unlock(&clp_lock);
> + if (refcnt > 0)
> + return;
> +
> + printerr(3, "freeing client %s\n", clp->relpath);
> +
> + free(clp->relpath);
> + free(clp->servicename);
> + free(clp->servername);
> + free(clp->protocol);
> + free(clp);
> +}
> +
> +/* Called when removing from clnt_list to tear down event handling.
> + * Will then free clp if was last reference.
> + */
> static void
> gssd_destroy_client(struct clnt_info *clp)
> {
> + printerr(3, "destroying client %s\n", clp->relpath);
> +
> if (clp->krb5_fd >= 0) {
> close(clp->krb5_fd);
Unless I'm missing something--an upcall thread could still be using this
file descriptor.
If we're particularly unlucky, we could do a new open in a moment and
reuse this file descriptor number, and then the writes in do_downcall()
could end up going to some other random file.
I think we want these closes done by gssd_free_client() in the !refcnt
case?
--b.
> event_del(&clp->krb5_ev);
> @@ -373,11 +399,7 @@ gssd_destroy_client(struct clnt_info *clp)
> }
>
> inotify_rm_watch(inotify_fd, clp->wd);
> - free(clp->relpath);
> - free(clp->servicename);
> - free(clp->servername);
> - free(clp->protocol);
> - free(clp);
> + gssd_free_client(clp);
> }
>
> static void gssd_scan(void);
> @@ -416,11 +438,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
> info = malloc(sizeof(struct clnt_upcall_info));
> if (info == NULL)
> return NULL;
> +
> + pthread_mutex_lock(&clp_lock);
> + clp->refcount++;
> + pthread_mutex_unlock(&clp_lock);
> info->clp = clp;
>
> return info;
> }
>
> +void free_upcall_info(struct clnt_upcall_info *info)
> +{
> + gssd_free_client(info->clp);
> + free(info);
> +}
> +
> /* For each upcall read the upcall info into the buffer, then create a
> * thread in a detached state so that resources are released back into
> * the system without the need for a join.
> @@ -438,13 +470,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
> info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
> if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
> printerr(0, "WARNING: %s: failed reading request\n", __func__);
> - free(info);
> + free_upcall_info(info);
> return;
> }
> info->lbuf[info->lbuflen-1] = 0;
>
> if (start_upcall_thread(handle_gssd_upcall, info))
> - free(info);
> + free_upcall_info(info);
> }
>
> static void
> @@ -461,12 +493,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
> sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
> printerr(0, "WARNING: %s: failed reading uid from krb5 "
> "upcall pipe: %s\n", __func__, strerror(errno));
> - free(info);
> + free_upcall_info(info);
> return;
> }
>
> if (start_upcall_thread(handle_krb5_upcall, info))
> - free(info);
> + free_upcall_info(info);
> }
>
> static struct clnt_info *
> @@ -501,6 +533,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
> clp->name = clp->relpath + strlen(tdi->name) + 1;
> clp->krb5_fd = -1;
> clp->gssd_fd = -1;
> + clp->refcount = 1;
>
> TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
> return clp;
> @@ -651,7 +684,7 @@ gssd_scan_topdir(const char *name)
> if (clp->scanned)
> continue;
>
> - printerr(3, "destroying client %s\n", clp->relpath);
> + printerr(3, "orphaned client %s\n", clp->relpath);
> saveprev = clp->list.tqe_prev;
> TAILQ_REMOVE(&tdi->clnt_list, clp, list);
> gssd_destroy_client(clp);
> diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
> index f4f59754..d33e4dff 100644
> --- a/utils/gssd/gssd.h
> +++ b/utils/gssd/gssd.h
> @@ -63,12 +63,10 @@ extern unsigned int context_timeout;
> extern unsigned int rpc_timeout;
> extern char *preferred_realm;
> extern pthread_mutex_t ple_lock;
> -extern pthread_cond_t pcond;
> -extern pthread_mutex_t pmutex;
> -extern int thread_started;
>
> struct clnt_info {
> TAILQ_ENTRY(clnt_info) list;
> + int refcount;
> int wd;
> bool scanned;
> char *name;
> @@ -94,6 +92,7 @@ struct clnt_upcall_info {
>
> void handle_krb5_upcall(struct clnt_upcall_info *clp);
> void handle_gssd_upcall(struct clnt_upcall_info *clp);
> +void free_upcall_info(struct clnt_upcall_info *info);
>
>
> #endif /* _RPC_GSSD_H_ */
> diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
> index 8fe6605b..05c1da64 100644
> --- a/utils/gssd/gssd_proc.c
> +++ b/utils/gssd/gssd_proc.c
> @@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
> printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
>
> process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
> - free(info);
> + free_upcall_info(info);
> }
>
> void
> @@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
> out:
> free(upcall_str);
> out_nomem:
> - free(info);
> + free_upcall_info(info);
> return;
> }
> --
> 2.26.2
>
On 2020-06-26 17:02, J. Bruce Fields wrote:
> Unless I'm missing something--an upcall thread could still be using this
> file descriptor.
>
> If we're particularly unlucky, we could do a new open in a moment and
> reuse this file descriptor number, and then the writes in do_downcall()
> could end up going to some other random file.
>
> I think we want these closes done by gssd_free_client() in the !refcnt
> case?
Makes sense. I was thinking more that it was an abort situation and we
shouldn't be sending any data to the kernel, but re-use is definitely a
concern.
I've split it so that we are removed from the event loop in destroy()
but the close happens in free().
Doug
On Fri, Jun 26, 2020 at 05:30:15PM -0400, Doug Nazar wrote:
> On 2020-06-26 17:02, J. Bruce Fields wrote:
> >Unless I'm missing something--an upcall thread could still be using this
> >file descriptor.
> >
> >If we're particularly unlucky, we could do a new open in a moment and
> >reuse this file descriptor number, and then then writes in do_downcall()
> >could end up going to some other random file.
> >
> >I think we want these closes done by gssd_free_client() in the !refcnt
> >case?
>
> Makes sense. I was thinking more that it was an abort situation and
> we shouldn't be sending any data to the kernel but re-use is
> definitely a concern.
>
> I've split it so that we are removed from the event loop in
> destroy() but the close happens in free().
Looks good to me. Steve?
--b.
>
> Doug
>
> From 8ef49081e8a42bfa05bb63265cd4f951e2b23413 Mon Sep 17 00:00:00 2001
> From: Doug Nazar <[email protected]>
> Date: Fri, 26 Jun 2020 16:02:04 -0400
> Subject: [PATCH] gssd: Refcount struct clnt_info to protect multithread usage
>
> Struct clnt_info is shared with the various upcall threads so
> we need to ensure that it stays around even if the client dir
> gets removed.
>
> Reported-by: Sebastian Kraus <[email protected]>
> Signed-off-by: Doug Nazar <[email protected]>
> ---
> utils/gssd/gssd.c | 67 ++++++++++++++++++++++++++++++++----------
> utils/gssd/gssd.h | 5 ++--
> utils/gssd/gssd_proc.c | 4 +--
> 3 files changed, 55 insertions(+), 21 deletions(-)
>
> diff --git a/utils/gssd/gssd.c b/utils/gssd/gssd.c
> index 588da0fb..b40c3220 100644
> --- a/utils/gssd/gssd.c
> +++ b/utils/gssd/gssd.c
> @@ -90,9 +90,7 @@ char *ccachedir = NULL;
> /* Avoid DNS reverse lookups on server names */
> static bool avoid_dns = true;
> static bool use_gssproxy = false;
> -int thread_started = false;
> -pthread_mutex_t pmutex = PTHREAD_MUTEX_INITIALIZER;
> -pthread_cond_t pcond = PTHREAD_COND_INITIALIZER;
> +pthread_mutex_t clp_lock = PTHREAD_MUTEX_INITIALIZER;
>
> TAILQ_HEAD(topdir_list_head, topdir) topdir_list;
>
> @@ -359,20 +357,28 @@ out:
> free(port);
> }
>
> +/* Actually frees clp and fields that might be used from other
> + * threads if was last reference.
> + */
> static void
> -gssd_destroy_client(struct clnt_info *clp)
> +gssd_free_client(struct clnt_info *clp)
> {
> - if (clp->krb5_fd >= 0) {
> + int refcnt;
> +
> + pthread_mutex_lock(&clp_lock);
> + refcnt = --clp->refcount;
> + pthread_mutex_unlock(&clp_lock);
> + if (refcnt > 0)
> + return;
> +
> + printerr(3, "freeing client %s\n", clp->relpath);
> +
> + if (clp->krb5_fd >= 0)
> close(clp->krb5_fd);
> - event_del(&clp->krb5_ev);
> - }
>
> - if (clp->gssd_fd >= 0) {
> + if (clp->gssd_fd >= 0)
> close(clp->gssd_fd);
> - event_del(&clp->gssd_ev);
> - }
>
> - inotify_rm_watch(inotify_fd, clp->wd);
> free(clp->relpath);
> free(clp->servicename);
> free(clp->servername);
> @@ -380,6 +386,24 @@ gssd_destroy_client(struct clnt_info *clp)
> free(clp);
> }
>
> +/* Called when removing from clnt_list to tear down event handling.
> + * Will then free clp if was last reference.
> + */
> +static void
> +gssd_destroy_client(struct clnt_info *clp)
> +{
> + printerr(3, "destroying client %s\n", clp->relpath);
> +
> + if (clp->krb5_fd >= 0)
> + event_del(&clp->krb5_ev);
> +
> + if (clp->gssd_fd >= 0)
> + event_del(&clp->gssd_ev);
> +
> + inotify_rm_watch(inotify_fd, clp->wd);
> + gssd_free_client(clp);
> +}
> +
> static void gssd_scan(void);
>
> static int
> @@ -416,11 +440,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
> info = malloc(sizeof(struct clnt_upcall_info));
> if (info == NULL)
> return NULL;
> +
> + pthread_mutex_lock(&clp_lock);
> + clp->refcount++;
> + pthread_mutex_unlock(&clp_lock);
> info->clp = clp;
>
> return info;
> }
>
> +void free_upcall_info(struct clnt_upcall_info *info)
> +{
> + gssd_free_client(info->clp);
> + free(info);
> +}
> +
> /* For each upcall read the upcall info into the buffer, then create a
> * thread in a detached state so that resources are released back into
> * the system without the need for a join.
> @@ -438,13 +472,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
> info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
> if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
> printerr(0, "WARNING: %s: failed reading request\n", __func__);
> - free(info);
> + free_upcall_info(info);
> return;
> }
> info->lbuf[info->lbuflen-1] = 0;
>
> if (start_upcall_thread(handle_gssd_upcall, info))
> - free(info);
> + free_upcall_info(info);
> }
>
> static void
> @@ -461,12 +495,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
> sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
> printerr(0, "WARNING: %s: failed reading uid from krb5 "
> "upcall pipe: %s\n", __func__, strerror(errno));
> - free(info);
> + free_upcall_info(info);
> return;
> }
>
> if (start_upcall_thread(handle_krb5_upcall, info))
> - free(info);
> + free_upcall_info(info);
> }
>
> static struct clnt_info *
> @@ -501,6 +535,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
> clp->name = clp->relpath + strlen(tdi->name) + 1;
> clp->krb5_fd = -1;
> clp->gssd_fd = -1;
> + clp->refcount = 1;
>
> TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
> return clp;
> @@ -651,7 +686,7 @@ gssd_scan_topdir(const char *name)
> if (clp->scanned)
> continue;
>
> - printerr(3, "destroying client %s\n", clp->relpath);
> + printerr(3, "orphaned client %s\n", clp->relpath);
> saveprev = clp->list.tqe_prev;
> TAILQ_REMOVE(&tdi->clnt_list, clp, list);
> gssd_destroy_client(clp);
> diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
> index f4f59754..d33e4dff 100644
> --- a/utils/gssd/gssd.h
> +++ b/utils/gssd/gssd.h
> @@ -63,12 +63,10 @@ extern unsigned int context_timeout;
> extern unsigned int rpc_timeout;
> extern char *preferred_realm;
> extern pthread_mutex_t ple_lock;
> -extern pthread_cond_t pcond;
> -extern pthread_mutex_t pmutex;
> -extern int thread_started;
>
> struct clnt_info {
> TAILQ_ENTRY(clnt_info) list;
> + int refcount;
> int wd;
> bool scanned;
> char *name;
> @@ -94,6 +92,7 @@ struct clnt_upcall_info {
>
> void handle_krb5_upcall(struct clnt_upcall_info *clp);
> void handle_gssd_upcall(struct clnt_upcall_info *clp);
> +void free_upcall_info(struct clnt_upcall_info *info);
>
>
> #endif /* _RPC_GSSD_H_ */
> diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
> index 8fe6605b..05c1da64 100644
> --- a/utils/gssd/gssd_proc.c
> +++ b/utils/gssd/gssd_proc.c
> @@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
> printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
>
> process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
> - free(info);
> + free_upcall_info(info);
> }
>
> void
> @@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
> out:
> free(upcall_str);
> out_nomem:
> - free(info);
> + free_upcall_info(info);
> return;
> }
> --
> 2.26.2
>
Hi Doug,
thanks very much for your patch and efforts.
I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
I am now testing the modified build on one of my NFSv4 file servers. Looks promising.
One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?
Best Sebastian
__________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Email: [email protected]
On 2020-06-29 01:39, Kraus, Sebastian wrote:
> Hi Doug,
> thanks very much for your patch and efforts.
> I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
> I am now testing the modified build on one of my NFSv4 file servers. Looks promising.
>
> One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?
Yes, I'm working against upstream. I did check briefly that the code
hadn't changed too much since 1.3.4 in that area.
I've found one other place that has insufficient locking but the race to
hit it is fairly small. It's in the Kerberos machine principal cache
when it refreshes the machine credentials. I have a patch for that, but
it's pretty invasive due to some other changes I'm currently working on.
Let me know if you hit it, and I can work on a simple version to backport.
Doug
Hi Doug,
>> Yes, I'm working against upstream. I did check briefly that the code hadn't changed too much since 1.3.4 in that area.
OK, thanks for the info. I was wondering because your patch did not show up as a commit upstream.
Your patch seems to do a good job - no segfaults for a period of four days now. :-)
>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials.
These types of patches are always welcome. :-)
In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
Maybe the outages were connected to this type of race condition. But I do not know for sure, as the authentication errors happened on a rather sporadic basis.
>> I have a patch for that, but it's pretty invasive due to some other changes I'm currently working on. Let me know if you hit it, and I can work on a simple version to backport.
NFSv4+Kerberos is not for the faint-hearted. I am not afraid of invasive patches - as long as they are technically correct. ;-)
A question somewhat apart from this:
How widespread are NFSv4+Kerberos setups within the academic community and in commercial environments?
Are there, to your knowledge, any bigger on-premise or cloud setups out there?
And are there any companies running dedicated NFSv4+Kerberos setups?
Best and keep well and fit
Sebastian
_________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: [email protected]
>>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials.
> These types of patches are always welcome. :-)
> In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
> Maybe the outages were connected to this type of race condition. But I do not know for sure, as the authentication errors happened on a rather sporadic basis.
We (Linköping University in Sweden) have seen these problems before too. I sent a patch for rpc.gssd this spring that “fixed” this problem too (well, it fixed the symptom and not the root cause, so it wasn’t the right fix). Without that patch we typically had rpc.gssd crash on our multiuser client servers every other day. It was partly masked by Puppet detecting it down and restarting it, but the users reported strange errors, and by the time the support folks checked, everything was running again :-). It also crashed very often on a set of test machines that connect to our NFS servers every minute in order to verify that they are running and giving good response times. Multiple NFS connections being set up and torn down concurrently many times over easily forced this problem to happen after a day or two.
> A question somewhat apart from this:
> How widespread are NFSv4+Kerberos setups within the academic community and in commercial environments?
We are using NFSv4+Kerberos. Most of our users are SMBv3 clients (Windows & Mac, 10x the Linux users), but we also have some 600 NFS clients (99.9% Linux based, mostly CentOS & Ubuntu; the servers are FreeBSD with ZFS). We used to be a big Sun/Solaris NFS shop, so NFS comes “naturally” for us :-)
(We would have loved to use NFSv4+Kerberos on the MacOS clients, but unfortunately MacOS panics when the Kerberos ticket expires while an NFS share is still mounted, which is a bit of a bummer :-)
(Using NFS v3 or lower and without Kerberos isn’t really an option - real ACLs and some sort of security are really needed)
Anyway - it’s good to see that the root cause for this bug has been found and fixed the right way :-)
- Peter
On 2020-07-01 03:39, Kraus, Sebastian wrote:
> OK, thanks for the info. I was wondering because your patch did not show
> up as a commit upstream.
> Your patch seems to do a good job - no segfaults for a period of four days now. :-)
I'm not a maintainer, just an enthusiastic user with a compiler... ;-)
I'm sure it'll get applied in the near future, as time permits.
>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials.
> These types of patches are always welcome. :-)
> In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
> Maybe the outages were connected to this type of race condition. But I do not know for sure, as the authentication errors happened on a rather sporadic basis.
The previous bug could also cause authentication issues without crashing
depending on load, timing, memory usage, malloc library, etc. This one
would only crop up during machine credentials refresh, which by default
is once every 24 hours. I've just posted a patch, 'gssd: Fix locking for
machine principal list'; the interesting part for backporting is around
line 447. It used to always strdup() even if the cache name was the same.
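As a rough illustration of the part Doug points at - this is not the actual nfs-utils code, and the struct, field, and function names below are invented - the pattern is to replace the cached machine-credential ccache name only under a lock and only when it actually changes, rather than strdup()ing it unconditionally on every refresh:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-realm machine credential entry. */
struct machine_cred {
	pthread_mutex_t lock;
	char *ccname;           /* e.g. "FILE:/tmp/krb5ccmachine_EXAMPLE.ORG" */
};

/* Swap in the new ccache name under the lock, and only when it differs,
 * so concurrent upcall threads never see a freed or half-replaced pointer
 * and unchanged names cause no strdup()/free() churn. */
static int set_ccache_name(struct machine_cred *mc, const char *newname)
{
	int ret = 0;

	pthread_mutex_lock(&mc->lock);
	if (mc->ccname == NULL || strcmp(mc->ccname, newname) != 0) {
		char *dup = strdup(newname);
		if (dup == NULL) {
			ret = -1;       /* keep the old name on failure */
		} else {
			free(mc->ccname);
			mc->ccname = dup;
		}
	}
	pthread_mutex_unlock(&mc->lock);
	return ret;
}

int main(void)
{
	struct machine_cred mc = { PTHREAD_MUTEX_INITIALIZER, NULL };

	set_ccache_name(&mc, "FILE:/tmp/krb5ccmachine_EXAMPLE.ORG");
	set_ccache_name(&mc, "FILE:/tmp/krb5ccmachine_EXAMPLE.ORG"); /* no-op */
	free(mc.ccname);
	return 0;
}

The equality check is what avoids the needless strdup() Doug mentions; the mutex is what keeps a concurrent reader from racing the refresh.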
>> I have a patch for that, but it's pretty invasive due to some other changes I'm currently working on. Let me know if you hit it, and I can work on a simple version to backport.
> NFSv4+Kerberos is not for the faint-hearted. I am not afraid of invasive patches - as long as they are technically correct. ;-)
No guarantees... but I do try. ;-)
> A question somewhat apart from this:
> How widespread are NFSv4+Kerberos setups within the academic community and in commercial environments?
> Are there, to your knowledge, any bigger on-premise or cloud setups out there?
> And are there any companies running dedicated NFSv4+Kerberos setups?
I really have no idea. I only run it on my home network of a few dozen
(old) machines. From what I've seen while googling to figure out how the
code base works, there are a fair number of users. There's also been a
large amount of work in recent years, which would point to something
driving that.
Doug