2010-08-03 15:52:46

by Michael Guntsche

Subject: Kerberos auth Problem with nfs3/4

Hi,

I recently tried re-enabling a kerberos setup here after running with
sec=sys for a while. Now the problem is that mounting the export with
sec=krb5 just hangs.

To rule everything out I tried mounting from the server itself.

mount gibson:/export /mnt

The mount just hangs and does not return.
This is happening on a debian sid system with nfs-utils 1.2.2 installed.

rpc.svcgssd -vvf:
=================
entering poll
leaving poll
handling null request
sname = nfs/[email protected]
DEBUG: serialize_krb5_ctx: lucid version!
prepare_krb5_rfc1964_buffer: serializing keys with enctype 4 and length
8
doing downcall
mech: krb5, hndl len: 4, ctx len 85, timeout: 1280885973 (35783 from
now), clnt: [email protected], uid: -1, gid: -1, num aux grps: 0:
sending null reply
finished handling null request
entering poll

rpc.gssd -vvf:
==============
beginning poll
destroying client /var/lib/nfs/rpc_pipefs/nfs/clnt1b
destroying client /var/lib/nfs/rpc_pipefs/nfs/clnt1a
handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt1c)
handle_gssd_upcall: 'mech=krb5 uid=0 enctypes=18,17,16,23,3,1,2 '
handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt1c)
process_krb5_upcall: service is '<null>'
Successfully obtained machine credentials for principal
'nfs/[email protected]' stored in ccache
'FILE:/tmp/krb5cc_machine_COMSICK.AT'
INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_COMSICK.AT' are good
until 1280886246
using FILE:/tmp/krb5cc_machine_COMSICK.AT as credentials cache for
machine creds
using environment variable to select krb5 ccache
FILE:/tmp/krb5cc_machine_COMSICK.AT
creating context using fsuid 0 (save_uid 0)
creating tcp client for server gibson.comsick.at
DEBUG: port already set to 2049
creating context with server [email protected]
DEBUG: serialize_krb5_ctx: lucid version!
prepare_krb5_rfc1964_buffer: serializing keys with enctype 4 and length
8
doing downcall

After that, nothing. The same setup worked a while ago, but of course both
the kernel and nfs-utils have been updated in the meantime. I tried
this with both nfs3 and nfs4.

Please tell me if you need further information to help me debug this
problem.

Kind regards,
Michael


2010-08-03 19:45:05

by J. Bruce Fields

Subject: Re: Kerberos auth Problem with nfs3/4

On Tue, Aug 03, 2010 at 05:45:56PM +0200, Michael Guntsche wrote:
> Hi,
>
> I recently tried re-enabling a kerberos setup here after running with
> sec=sys for a while. Now the problem is that mounting the export with
> sec=krb5 just hangs.
>
> To rule everything out I tried mounting from the server itself.
>
> mount gibson:/export /mnt
>
> The mount just hangs and does not return.
> This is happening on a debian sid system with nfs-utils 1.2.2 installed.

You might try the following (in upstream nfs-utils)?

--b.

commit 6ca440c2661dccb05ae74ffb65817e9c30f05c8a
Author: Steve Dickson <[email protected]>
Date: Mon Mar 8 11:22:46 2010 -0500

mountd: fix --manage-gids hang due to int/uint bug

A uid or gid should be represented as unsigned, not signed.

The conversion to signed here could cause a hang on access by an unknown
user to a server running mountd with --manage-gids; such a user is
likely to be mapped to 2^32-1, which may be converted to 2^31-1 when
represented as an int, resulting in a downcall for uid 2^31-1, hence the
original rpc hanging forever waiting for a cache downcall for 2^32-1.

Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Steve Dickson <[email protected]>

diff --git a/support/nfs/cacheio.c b/support/nfs/cacheio.c
index bdf5d84..0587ecb 100644
--- a/support/nfs/cacheio.c
+++ b/support/nfs/cacheio.c
@@ -148,6 +148,11 @@ void qword_printint(FILE *f, int num)
fprintf(f, "%d ", num);
}

+void qword_printuint(FILE *f, unsigned int num)
+{
+ fprintf(f, "%u ", num);
+}
+
int qword_eol(FILE *f)
{
int err;
@@ -236,6 +241,20 @@ int qword_get_int(char **bpp, int *anint)
return 0;
}

+int qword_get_uint(char **bpp, unsigned int *anint)
+{
+ char buf[50];
+ char *ep;
+ unsigned int rv;
+ int len = qword_get(bpp, buf, 50);
+ if (len < 0) return -1;
+ if (len ==0) return -1;
+ rv = strtoul(buf, &ep, 0);
+ if (*ep) return -1;
+ *anint = rv;
+ return 0;
+}
+
#define READLINE_BUFFER_INCREMENT 2048

int readline(int fd, char **buf, int *lenp)
diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index d63e10a..b6c148f 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -125,7 +125,7 @@ void auth_unix_gid(FILE *f)
* reply is
* uid expiry count list of group ids
*/
- int uid;
+ uid_t uid;
struct passwd *pw;
gid_t glist[100], *groups = glist;
int ngroups = 100;
@@ -136,7 +136,7 @@ void auth_unix_gid(FILE *f)
return;

cp = lbuf;
- if (qword_get_int(&cp, &uid) != 0)
+ if (qword_get_uint(&cp, &uid) != 0)
return;

pw = getpwuid(uid);
@@ -153,14 +153,14 @@ void auth_unix_gid(FILE *f)
groups, &ngroups);
}
}
- qword_printint(f, uid);
- qword_printint(f, time(0)+30*60);
+ qword_printuint(f, uid);
+ qword_printuint(f, time(0)+30*60);
if (rv >= 0) {
- qword_printint(f, ngroups);
+ qword_printuint(f, ngroups);
for (i=0; i<ngroups; i++)
- qword_printint(f, groups[i]);
+ qword_printuint(f, groups[i]);
} else
- qword_printint(f, 0);
+ qword_printuint(f, 0);
qword_eol(f);

if (groups != glist)
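
The hang described in the commit message above can be illustrated with a
minimal, self-contained sketch. This is not the nfs-utils code: the names
parse_uid_signed, parse_uid_unsigned and reply_matches_request are
hypothetical, a 32-bit int is assumed, and the explicit clamp to INT_MAX is
an assumption made so that the result is the same on 32-bit and 64-bit
systems (it mirrors what strtol() does when long is 32 bits wide). The point
is only that a uid of 2^32-1 read through a signed int is written back as a
different number, so the reply no longer matches the request it was meant to
answer.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a uid through a signed int (the old behaviour).
 * ASSUMPTION: out-of-range values are clamped explicitly so the sketch is
 * deterministic; with a 32-bit long, strtol() saturates the same way. */
static int parse_uid_signed(const char *s, int *out)
{
    char *end;
    long v = strtol(s, &end, 0);

    if (end == s || *end != '\0')
        return -1;
    if (v > INT_MAX)
        v = INT_MAX;
    if (v < INT_MIN)
        v = INT_MIN;
    *out = (int)v;
    return 0;
}

/* Hypothetical helper: read a uid as unsigned (the fixed behaviour). */
static int parse_uid_unsigned(const char *s, unsigned int *out)
{
    char *end;
    unsigned long v = strtoul(s, &end, 0);

    if (end == s || *end != '\0' || v > UINT_MAX)
        return -1;
    *out = (unsigned int)v;
    return 0;
}

/* Returns 1 if the uid written back in the reply is textually identical to
 * the uid in the request, 0 if it differs, -1 on a parse error. A reply
 * carrying a different uid never satisfies the waiting request. */
static int reply_matches_request(const char *request, int use_unsigned)
{
    char reply[32];

    assert(request != NULL);
    if (use_unsigned) {
        unsigned int uid;
        if (parse_uid_unsigned(request, &uid) != 0)
            return -1;
        snprintf(reply, sizeof(reply), "%u", uid);
    } else {
        int uid;
        if (parse_uid_signed(request, &uid) != 0)
            return -1;
        snprintf(reply, sizeof(reply), "%d", uid);
    }
    return strcmp(reply, request) == 0;
}
```

With these definitions, reply_matches_request("4294967295", 0) returns 0,
because the signed path answers for 2147483647 instead, while
reply_matches_request("4294967295", 1) returns 1. Ordinary uids such as
"1000" match on both paths, which is consistent with the hang only showing
up for users mapped to 2^32-1.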

2010-08-03 23:17:47

by J. Bruce Fields

Subject: Re: Kerberos auth Problem with nfs3/4

On Tue, Aug 03, 2010 at 11:55:24PM +0200, Michael Guntsche wrote:
> On 03 Aug 10 17:36, J. Bruce Fields wrote:
> > > Aug 3 23:12:23 gibson kernel: RPC: AUTH_GSS upcall timed out.
> > > Aug 3 23:12:23 gibson kernel: Please check user daemon is running.
> > >
> > > Of course all the daemons on the server are running and the system seems
> > > to work fine otherwise.
> >
> > That's actually a client-side complaint--if you're seeing it on the
> > server then it's probably the server trying to do a callback to an NFSv4
> > client. Are you running rpc.gssd as well as rpc.svcgssd on the server?
> > Might want to if you want delegations to work (but it's not a critical
> > problem).
> Yes, rpc.gssd as well as rpc.svcgssd is running on the server. To make
> matters worse I noticed something else with sec=krb5. This message
> appears on first access to a file, either read or write. Not always; it
> seems that it reappears after some timeout, but it is spamming my logs
> nevertheless. But what's worse is that I now got a Stale NFS file handle
> on the lost+found directory of the export: a lot of question marks and
> then just the name. I am now running with sec=sys and up to now
> was not able to reproduce this problem. Is it possible that those two
> problems are related, or are they completely separate from each other?

I doubt they're related.

Is there something special about the lost+found directory that would
lead to stale filehandles? I can't think why there would be.

--b.

> FYI the patched nfs-utils version is only running on the server for now
> but I do not think that this is the problem.

2010-08-03 22:20:09

by Michael Guntsche

Subject: Re: Kerberos auth Problem with nfs3/4

On 2010.08.03 17:36:50 , J. Bruce Fields wrote:
> That's actually a client-side complaint--if you're seeing it on the
> server then it's probably the server trying to do a callback to an NFSv4
> client. Are you running rpc.gssd as well as rpc.svcgssd on the server?
> Might want to if you want delegations to work (but it's not a critical
> problem).

I started rpc.gssd in verbose mode on the server and actually saw this.

rpc.gssd -vvf:
==============

beginning poll
destroying client /var/lib/nfs/rpc_pipefs/nfsd4_cb/clnt46
handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfsd4_cb/clnt47)
handle_gssd_upcall: 'mech=krb5 uid=0 [email protected] service=* enctypes=18,17,16,23,3,1,2 '
handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfsd4_cb/clnt47)
process_krb5_upcall: service is '*'
Successfully obtained machine credentials for principal 'nfs/[email protected]' stored in ccache 'FILE:/tmp/krb5cc_machine_COMSICK.AT'
INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_COMSICK.AT' are good until 1280909701
using FILE:/tmp/krb5cc_machine_COMSICK.AT as credentials cache for machine creds
using environment variable to select krb5 ccache FILE:/tmp/krb5cc_machine_COMSICK.AT
creating context using fsuid 0 (save_uid 0)
creating tcp client for server zaphod.comsick.at
DEBUG: port already set to 32844
creating context with server [email protected]
WARNING: Failed to create krb5 context for user with uid 0 for server zaphod.comsick.at
WARNING: Failed to create machine krb5 context with credentials cache FILE:/tmp/krb5cc_machine_COMSICK.AT for server zaphod.comsick.at
WARNING: Machine cache is prematurely expired or corrupted trying to recreate cache for server zaphod.comsick.at
INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_COMSICK.AT' are good until 1280909701
INFO: Credentials in CC 'FILE:/tmp/krb5cc_machine_COMSICK.AT' are good until 1280909701
using FILE:/tmp/krb5cc_machine_COMSICK.AT as credentials cache for machine creds
using environment variable to select krb5 ccache FILE:/tmp/krb5cc_machine_COMSICK.AT
creating context using fsuid 0 (save_uid 0)
creating tcp client for server zaphod.comsick.at
DEBUG: port already set to 32844
creating context with server [email protected]
WARNING: Failed to create krb5 context for user with uid 0 for server zaphod.comsick.at
WARNING: Failed to create machine krb5 context with credentials cache FILE:/tmp/krb5cc_machine_COMSICK.AT for server zaphod.comsick.at
WARNING: Failed to create machine krb5 context with any credentials cache for server zaphod.comsick.at
doing error downcall
destroying client /var/lib/nfs/rpc_pipefs/nfsd4_cb/clnt47

gibson is the server and zaphod is the client here. As you said, the server tries to connect back to the client, which fails since rpc.svcgssd is not running on the client. Should the server try to connect back to the client this way in the first place, and if so, shouldn't it stop trying after seeing that it is not working?

Kind regards,
Michael

2010-08-03 23:15:41

by J. Bruce Fields

Subject: Re: Kerberos auth Problem with nfs3/4

On Wed, Aug 04, 2010 at 12:20:02AM +0200, Michael Guntsche wrote:
> gibson is the server and zaphod is the client here. As you said,
> the server tries to connect back to the client, which fails since
> rpc.svcgssd is not running on the client. Should the server try to
> connect back to the client this way in the first place, and if so,
> shouldn't it stop trying after seeing that it is not working?

What I'd expect would be for it to make one try with each new client and
then give up. (But actually if the client doesn't renew state while
files aren't open--it may also end up retrying if the client's idle for
a minute or so.) If it's retrying more often than that, that's a
problem.

In any case we should probably reconsider how that message is generated,
to prevent it going to the log by default in this case.

--b.

2010-08-03 21:38:12

by J. Bruce Fields

Subject: Re: Kerberos auth Problem with nfs3/4

On Tue, Aug 03, 2010 at 11:19:06PM +0200, Michael Guntsche wrote:
> Hello Bruce.
>
> I was a little bit early with my success report. I can now successfully
> mount the exports with sec=krb5 and can also read and write from linux
> clients. But now I get the following messages from the kernel on the
> server itself. This apparently happens when I try to read/write to the
> export.
>
> Aug 3 23:12:23 gibson kernel: RPC: AUTH_GSS upcall timed out.
> Aug 3 23:12:23 gibson kernel: Please check user daemon is running.
>
> Of course all the daemons on the server are running and the system seems
> to work fine otherwise.

That's actually a client-side complaint--if you're seeing it on the
server then it's probably the server trying to do a callback to an NFSv4
client. Are you running rpc.gssd as well as rpc.svcgssd on the server?
Might want to if you want delegations to work (but it's not a critical
problem).

--b.

2010-08-03 20:13:29

by Michael Guntsche

Subject: Re: Kerberos auth Problem with nfs3/4

On 03 Aug 10 15:43, J. Bruce Fields wrote:
> On Tue, Aug 03, 2010 at 05:45:56PM +0200, Michael Guntsche wrote:
> > Hi,
> >
> > I recently tried re-enabling a kerberos setup here after running with
> > sec=sys for a while. Now the problem is that mounting the export with
> > sec=krb5 just hangs.
> >
> > To rule everything out I tried mounting from the server itself.
> >
> > mount gibson:/export /mnt
> >
> > The mount just hangs and does not return.
> > This is happening on a debian sid system with nfs-utils 1.2.2 installed.
>
> You might try the following (in upstream nfs-utils)?
>
> --b.
>
> commit 6ca440c2661dccb05ae74ffb65817e9c30f05c8a
> Author: Steve Dickson <[email protected]>
> Date: Mon Mar 8 11:22:46 2010 -0500

Hello Bruce,

Yes, this fixed the problem. I recompiled with the patch applied and
installed the new binaries on the server. Now mounting both nfs3 and
nfs4 with sec=krb5 works again. Will there be a 1.2.3 release in the near
future? Otherwise I will file a bug with Debian to ask them to include
the patch in their package.

Thank you very much for the quick fix.

Kind regards,
Michael

2010-08-03 21:19:16

by Michael Guntsche

Subject: Re: Kerberos auth Problem with nfs3/4

Hello Bruce.

I was a little bit early with my success report. I can now successfully
mount the exports with sec=krb5 and can also read and write from linux
clients. But now I get the following messages from the kernel on the
server itself. This apparently happens when I try to read/write to the
export.

Aug 3 23:12:23 gibson kernel: RPC: AUTH_GSS upcall timed out.
Aug 3 23:12:23 gibson kernel: Please check user daemon is running.

Of course all the daemons on the server are running and the system seems
to work fine otherwise.

Kind regards,
Michael

2010-08-04 05:29:36

by Michael Guntsche

Subject: Re: Kerberos auth Problem with nfs3/4

On 03 Aug 10 19:16, J. Bruce Fields wrote:
> > Yes, rpc.gssd as well as rpc.svcgssd is running on the server. To make
> > matters worse I noticed something else with sec=krb5. This message
> > appears on first access to a file, either read or write. Not always; it
> > seems that it reappears after some timeout, but it is spamming my logs
> > nevertheless. But what's worse is that I now got a Stale NFS file handle
> > on the lost+found directory of the export: a lot of question marks and
> > then just the name. I am now running with sec=sys and up to now
> > was not able to reproduce this problem. Is it possible that those two
> > problems are related, or are they completely separate from each other?
>
> I doubt they're related.
>
> Is there something special about the lost+found directory that would
> lead to stale filehandles? I can't think why there would be.

Nothing really special as far as I can see. Just out of curiosity I
tried nfs-utils-1.2.3-rc4 from Steve Dickson's tree and was not even
able to mount the export (stale NFS file handles during mount).

Kind regards,
Michael

2010-08-03 21:55:32

by Michael Guntsche

Subject: Re: Kerberos auth Problem with nfs3/4

On 03 Aug 10 17:36, J. Bruce Fields wrote:
> > Aug 3 23:12:23 gibson kernel: RPC: AUTH_GSS upcall timed out.
> > Aug 3 23:12:23 gibson kernel: Please check user daemon is running.
> >
> > Of course all the daemons on the server are running and the system seems
> > to work fine otherwise.
>
> That's actually a client-side complaint--if you're seeing it on the
> server then it's probably the server trying to do a callback to an NFSv4
> client. Are you running rpc.gssd as well as rpc.svcgssd on the server?
> Might want to if you want delegations to work (but it's not a critical
> problem).

Yes, rpc.gssd as well as rpc.svcgssd is running on the server. To make
matters worse I noticed something else with sec=krb5. This message
appears on first access to a file, either read or write. Not always; it
seems that it reappears after some timeout, but it is spamming my logs
nevertheless. But what's worse is that I now got a Stale NFS file handle
on the lost+found directory of the export: a lot of question marks and
then just the name. I am now running with sec=sys and up to now
was not able to reproduce this problem. Is it possible that those two
problems are related, or are they completely separate from each other?

FYI the patched nfs-utils version is only running on the server for now
but I do not think that this is the problem.

Kind regards,
Michael