2015-08-24 13:02:48

by Ulrich Gemkow

[permalink] [raw]
Subject: NFSv4 mount fails on Sun Solaris 10 after reboot of client

Hello,

we have a weired problem with Linux NFSv4.0 Server (Vanilla
Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):

When mounting a share on the Solaris client and then rebooting
the client without unmounting the share first, after the reboot
every attempt to mount the share again gives an I/O error on
the client and the mount fails.

After a long time (serveral hours) the v4 mount suddenly works
again.

Mounting a share with vers=2 works always even in times when
the v4 mount fails.

So it seems the Linux NFSv4 server holds a state for the client
which prevents the re-mounting of the share and gives the
I/O-error on the client.

We use NFSv4 without idmapd.

Is there any tip how to debug or solve this?

Thank you and best regards

-Ulrich

--
|-----------------------------------------------------------------------
| Ulrich Gemkow
| University of Stuttgart, Germany
| Institute of Communication Networks and Computer Engineering (IKR)
|-----------------------------------------------------------------------


2015-08-24 20:14:03

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> we have a weired problem with Linux NFSv4.0 Server (Vanilla
> Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
>
> When mounting a share on the Solaris client and then rebooting
> the client without unmounting the share first, after the reboot
> every attempt to mount the share again gives an I/O error on
> the client and the mount fails.
>
> After a long time (serveral hours) the v4 mount suddenly works
> again.
>
> Mounting a share with vers=2 works always even in times when
> the v4 mount fails.
>
> So it seems the Linux NFSv4 server holds a state for the client
> which prevents the re-mounting of the share and gives the
> I/O-error on the client.
>
> We use NFSv4 without idmapd.
>
> Is there any tip how to debug or solve this?

Best is probably to get a packet trace. So something like:

tcpdump -s0 -iem0 -wtmp.pcap

and then try the client mount, then kill the tcpdump after the mount
fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
with wireshark. The interesting question is what kind of error the
server is returning when the client tries the mount after reboot.)

--b.

>
> Thank you and best regards
>
> -Ulrich
>
> --
> |-----------------------------------------------------------------------
> | Ulrich Gemkow
> | University of Stuttgart, Germany
> | Institute of Communication Networks and Computer Engineering (IKR)
> |-----------------------------------------------------------------------
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-08-25 17:28:14

by Ulrich Gemkow

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

Hello Bruce,

On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> >
> > When mounting a share on the Solaris client and then rebooting
> > the client without unmounting the share first, after the reboot
> > every attempt to mount the share again gives an I/O error on
> > the client and the mount fails.
> >
> > After a long time (serveral hours) the v4 mount suddenly works
> > again.
> >
> > Mounting a share with vers=2 works always even in times when
> > the v4 mount fails.
> >
> > So it seems the Linux NFSv4 server holds a state for the client
> > which prevents the re-mounting of the share and gives the
> > I/O-error on the client.
> >
> > We use NFSv4 without idmapd.
> >
> > Is there any tip how to debug or solve this?
>
> Best is probably to get a packet trace. So something like:
>
> tcpdump -s0 -iem0 -wtmp.pcap
>
> and then try the client mount, then kill the tcpdump after the mount
> fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
> with wireshark. The interesting question is what kind of error the
> server is returning when the client tries the mount after reboot.)

Thank you for your reply. The tcpdump is attached, the relevant
packets are 49..52. The error seems to be a SERVERFAULT. Can you
see more from the dump?

Thanks again and best regards

Ulrich

> --b.
>
> >
> > Thank you and best regards
> >
> > -Ulrich

> > |-----------------------------------------------------------------------
> > | Ulrich Gemkow
> > | University of Stuttgart, Germany
> > | Institute of Communication Networks and Computer Engineering (IKR)
> > |-----------------------------------------------------------------------


Attachments:
(No filename) (1.83 kB)
nfserr.pcap (7.26 kB)
Download all attachments

2015-08-25 21:54:57

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> Hello Bruce,
>
> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > >
> > > When mounting a share on the Solaris client and then rebooting
> > > the client without unmounting the share first, after the reboot
> > > every attempt to mount the share again gives an I/O error on
> > > the client and the mount fails.
> > >
> > > After a long time (serveral hours) the v4 mount suddenly works
> > > again.
> > >
> > > Mounting a share with vers=2 works always even in times when
> > > the v4 mount fails.
> > >
> > > So it seems the Linux NFSv4 server holds a state for the client
> > > which prevents the re-mounting of the share and gives the
> > > I/O-error on the client.
> > >
> > > We use NFSv4 without idmapd.
> > >
> > > Is there any tip how to debug or solve this?
> >
> > Best is probably to get a packet trace. So something like:
> >
> > tcpdump -s0 -iem0 -wtmp.pcap
> >
> > and then try the client mount, then kill the tcpdump after the mount
> > fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
> > with wireshark. The interesting question is what kind of error the
> > server is returning when the client tries the mount after reboot.)
>
> Thank you for your reply. The tcpdump is attached, the relevant
> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> see more from the dump?
>
> Thanks again and best regards

The SERVERFAULT is on SETCLIENTID_CONFIRM.

In nfsd4_setclientid_confirm():

conf = find_confirmed_client(clid, false, nn);
unconf = find_unconfirmed_client(clid, false, nn);
/*
* We try hard to give out unique clientid's, so if we get an
* attempt to confirm the same clientid with a different cred,
* there's a bug somewhere. Let's charitably assume it's our
* bug.
*/
status = nfserr_serverfault;
if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
goto out;
if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
goto out;

The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
auth_unix creds.

The clientid that were looking up there was returned from the previous
SETCLIENTID, generated by this logic:

if (conf && same_verf(&conf->cl_verifier, &clverifier))
/* case 1: probable callback update */
copy_clid(new, conf);
else /* case 4 (new client) or cases 2, 3 (client reboot): */
gen_clid(new, nn);

So it should be a brand new clientid, unless the client was reusing the old
verifier.

So perhaps the client is sending the SETCLIENTID with a verifier set to what it
used on the previous boot? That sounds like a client bug. The linux
client uses a timestamp for the verifier, looks like the Solaris client
might too. Is there some reason the clock on this client isn't
advancing on reboot?

--b.

2015-08-26 19:54:35

by Ulrich Gemkow

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

Hello Bruce,

On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> > Hello Bruce,
> >
> > On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> > > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > > >
> > > > When mounting a share on the Solaris client and then rebooting
> > > > the client without unmounting the share first, after the reboot
> > > > every attempt to mount the share again gives an I/O error on
> > > > the client and the mount fails.
> > > >
> > > > After a long time (serveral hours) the v4 mount suddenly works
> > > > again.
> > > >
> > > > Mounting a share with vers=2 works always even in times when
> > > > the v4 mount fails.
> > > >
> > > > So it seems the Linux NFSv4 server holds a state for the client
> > > > which prevents the re-mounting of the share and gives the
> > > > I/O-error on the client.
> > > >
> > > > We use NFSv4 without idmapd.
> > > >
> > > > Is there any tip how to debug or solve this?
> > >
> > > Best is probably to get a packet trace. So something like:
> > >
> > > tcpdump -s0 -iem0 -wtmp.pcap
> > >
> > > and then try the client mount, then kill the tcpdump after the mount
> > > fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
> > > with wireshark. The interesting question is what kind of error the
> > > server is returning when the client tries the mount after reboot.)
> >
> > Thank you for your reply. The tcpdump is attached, the relevant
> > packets are 49..52. The error seems to be a SERVERFAULT. Can you
> > see more from the dump?
> >
> > Thanks again and best regards
>
> The SERVERFAULT is on SETCLIENTID_CONFIRM.
>
> In nfsd4_setclientid_confirm():
>
> conf = find_confirmed_client(clid, false, nn);
> unconf = find_unconfirmed_client(clid, false, nn);
> /*
> * We try hard to give out unique clientid's, so if we get an
> * attempt to confirm the same clientid with a different cred,
> * there's a bug somewhere. Let's charitably assume it's our
> * bug.
> */
> status = nfserr_serverfault;
> if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> goto out;
> if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> goto out;
>
> The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> auth_unix creds.
>
> The clientid that were looking up there was returned from the previous
> SETCLIENTID, generated by this logic:
>
> if (conf && same_verf(&conf->cl_verifier, &clverifier))
> /* case 1: probable callback update */
> copy_clid(new, conf);
> else /* case 4 (new client) or cases 2, 3 (client reboot): */
> gen_clid(new, nn);
>
> So it should be a brand new clientid, unless the client was reusing the old
> verifier.
>
> So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> used on the previous boot? That sounds like a client bug. The linux
> client uses a timestamp for the verifier, looks like the Solaris client
> might too. Is there some reason the clock on this client isn't
> advancing on reboot?

Thank you for the analysis. But the clock of the client advances
regularely and as one would expect.

The client is SPARC Solaris 10 with the latest patches
applied - I cannot believe that this client has such a
basic NFS bug.

Can you think of any kind of server configuration bug
(as said, Vanilla Linux 4.1.6) that causes this error?
The NFS server startup system is "self-made"...

Thanks again and best regards

-Ulrich

>
> --b.
> --

--
|-----------------------------------------------------------------------
| Ulrich Gemkow
| University of Stuttgart, Germany
| Institute of Communication Networks and Computer Engineering (IKR)
|-----------------------------------------------------------------------

2015-08-26 20:09:40

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Wed, Aug 26, 2015 at 09:54:22PM +0200, Ulrich Gemkow wrote:
> Hello Bruce,
>
> On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> >
> > In nfsd4_setclientid_confirm():
> >
> > conf = find_confirmed_client(clid, false, nn);
> > unconf = find_unconfirmed_client(clid, false, nn);
> > /*
> > * We try hard to give out unique clientid's, so if we get an
> > * attempt to confirm the same clientid with a different cred,
> > * there's a bug somewhere. Let's charitably assume it's our
> > * bug.
> > */
> > status = nfserr_serverfault;
> > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > goto out;
> > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > goto out;
> >
> > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > auth_unix creds.
> >
> > The clientid that were looking up there was returned from the previous
> > SETCLIENTID, generated by this logic:
> >
> > if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > /* case 1: probable callback update */
> > copy_clid(new, conf);
> > else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > gen_clid(new, nn);
> >
> > So it should be a brand new clientid, unless the client was reusing the old
> > verifier.
> >
> > So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> > used on the previous boot? That sounds like a client bug. The linux
> > client uses a timestamp for the verifier, looks like the Solaris client
> > might too. Is there some reason the clock on this client isn't
> > advancing on reboot?
>
> Thank you for the analysis. But the clock of the client advances
> regularely and as one would expect.

OK, thanks for checking that.

> The client is SPARC Solaris 10 with the latest patches
> applied - I cannot believe that this client has such a
> basic NFS bug.

To confirm or deny my hypothesis, I think what we want is a longer
capture that gets the failing SETCLIENTID_CONFIRM (as seen in the
previous capture) but also shows what clientid the client was using
before the reboot. So ideal might be something like:

- start the capture
- mount
- create a file (I just want to make sure the client does at
least one open)
- reboot the client
- mount again, see the failure
- stop the capture

> Can you think of any kind of server configuration bug
> (as said, Vanilla Linux 4.1.6) that causes this error?
> The NFS server startup system is "self-made"...

I can't think of any off hand, if there's a server-side problem here I'd
suspect the code before the configuration.

--b.

2015-08-27 06:43:54

by Mkrtchyan, Tigran

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client



----- Original Message -----
> From: "J. Bruce Fields" <[email protected]>
> To: "Ulrich Gemkow" <[email protected]>
> Cc: [email protected]
> Sent: Tuesday, August 25, 2015 11:54:56 PM
> Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

> On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
>> Hello Bruce,
>>
>> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
>> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
>> > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
>> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
>> > >
>> > > When mounting a share on the Solaris client and then rebooting
>> > > the client without unmounting the share first, after the reboot
>> > > every attempt to mount the share again gives an I/O error on
>> > > the client and the mount fails.
>> > >
>> > > After a long time (serveral hours) the v4 mount suddenly works
>> > > again.
>> > >
>> > > Mounting a share with vers=2 works always even in times when
>> > > the v4 mount fails.
>> > >
>> > > So it seems the Linux NFSv4 server holds a state for the client
>> > > which prevents the re-mounting of the share and gives the
>> > > I/O-error on the client.
>> > >
>> > > We use NFSv4 without idmapd.
>> > >
>> > > Is there any tip how to debug or solve this?
>> >
>> > Best is probably to get a packet trace. So something like:
>> >
>> > tcpdump -s0 -iem0 -wtmp.pcap
>> >
>> > and then try the client mount, then kill the tcpdump after the mount
>> > fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
>> > with wireshark. The interesting question is what kind of error the
>> > server is returning when the client tries the mount after reboot.)
>>
>> Thank you for your reply. The tcpdump is attached, the relevant
>> packets are 49..52. The error seems to be a SERVERFAULT. Can you
>> see more from the dump?
>>
>> Thanks again and best regards
>
> The SERVERFAULT is on SETCLIENTID_CONFIRM.
>
> In nfsd4_setclientid_confirm():
>
> conf = find_confirmed_client(clid, false, nn);
> unconf = find_unconfirmed_client(clid, false, nn);
> /*
> * We try hard to give out unique clientid's, so if we get an
> * attempt to confirm the same clientid with a different cred,
> * there's a bug somewhere. Let's charitably assume it's our
> * bug.
> */
> status = nfserr_serverfault;
> if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> goto out;
> if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> goto out;
>
> The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> auth_unix creds.
>
> The clientid that were looking up there was returned from the previous
> SETCLIENTID, generated by this logic:
>
> if (conf && same_verf(&conf->cl_verifier, &clverifier))
> /* case 1: probable callback update */
> copy_clid(new, conf);
> else /* case 4 (new client) or cases 2, 3 (client reboot): */
> gen_clid(new, nn);
>
> So it should be a brand new clientid, unless the client was reusing the old
> verifier.
>
> So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> used on the previous boot? That sounds like a client bug. The linux
> client uses a timestamp for the verifier, looks like the Solaris client
> might too. Is there some reason the clock on this client isn't
> advancing on reboot?

probably NFS4ERR_STALE_CLIENTID is a better error code for this scenario.

Tigran.

>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-08-27 18:29:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Thu, Aug 27, 2015 at 08:43:51AM +0200, Mkrtchyan, Tigran wrote:
>
>
> ----- Original Message -----
> > From: "J. Bruce Fields" <[email protected]>
> > To: "Ulrich Gemkow" <[email protected]>
> > Cc: [email protected]
> > Sent: Tuesday, August 25, 2015 11:54:56 PM
> > Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client
>
> > On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> >> Hello Bruce,
> >>
> >> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> >> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> >> > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> >> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> >> > >
> >> > > When mounting a share on the Solaris client and then rebooting
> >> > > the client without unmounting the share first, after the reboot
> >> > > every attempt to mount the share again gives an I/O error on
> >> > > the client and the mount fails.
> >> > >
> >> > > After a long time (serveral hours) the v4 mount suddenly works
> >> > > again.
> >> > >
> >> > > Mounting a share with vers=2 works always even in times when
> >> > > the v4 mount fails.
> >> > >
> >> > > So it seems the Linux NFSv4 server holds a state for the client
> >> > > which prevents the re-mounting of the share and gives the
> >> > > I/O-error on the client.
> >> > >
> >> > > We use NFSv4 without idmapd.
> >> > >
> >> > > Is there any tip how to debug or solve this?
> >> >
> >> > Best is probably to get a packet trace. So something like:
> >> >
> >> > tcpdump -s0 -iem0 -wtmp.pcap
> >> >
> >> > and then try the client mount, then kill the tcpdump after the mount
> >> > fails, and send us tmp.pcap. (And/or take a look at tmp.pcap yourself
> >> > with wireshark. The interesting question is what kind of error the
> >> > server is returning when the client tries the mount after reboot.)
> >>
> >> Thank you for your reply. The tcpdump is attached, the relevant
> >> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> >> see more from the dump?
> >>
> >> Thanks again and best regards
> >
> > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> >
> > In nfsd4_setclientid_confirm():
> >
> > conf = find_confirmed_client(clid, false, nn);
> > unconf = find_unconfirmed_client(clid, false, nn);
> > /*
> > * We try hard to give out unique clientid's, so if we get an
> > * attempt to confirm the same clientid with a different cred,
> > * there's a bug somewhere. Let's charitably assume it's our
> > * bug.
> > */
> > status = nfserr_serverfault;
> > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > goto out;
> > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > goto out;
> >
> > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > auth_unix creds.
> >
> > The clientid that were looking up there was returned from the previous
> > SETCLIENTID, generated by this logic:
> >
> > if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > /* case 1: probable callback update */
> > copy_clid(new, conf);
> > else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > gen_clid(new, nn);
> >
> > So it should be a brand new clientid, unless the client was reusing the old
> > verifier.
> >
> > So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> > used on the previous boot? That sounds like a client bug. The linux
> > client uses a timestamp for the verifier, looks like the Solaris client
> > might too. Is there some reason the clock on this client isn't
> > advancing on reboot?
>
> probably NFS4ERR_STALE_CLIENTID is a better error code for this scenario.

SERVERFAULT is obviously lame, but I don't know that STALE_CLIENTID is
right either.

Another thing that's weird is:

> After a long time (serveral hours) the v4 mount suddenly works
> again.

I'd expect the clent to expire after a lease period (default 90
seconds), I don't know what could be happening that would take hours.

Also I don't know why those creds would change after a reboot.

Anyway I think a trace covering the reboot is still the best hope of an
explanation.

--b.

2015-08-27 20:37:09

by Frank Filz

[permalink] [raw]
Subject: RE: NFSv4 mount fails on Sun Solaris 10 after reboot of client

> On Thu, Aug 27, 2015 at 08:43:51AM +0200, Mkrtchyan, Tigran wrote:
> >
> >
> > ----- Original Message -----
> > > From: "J. Bruce Fields" <[email protected]>
> > > To: "Ulrich Gemkow" <[email protected]>
> > > Cc: [email protected]
> > > Sent: Tuesday, August 25, 2015 11:54:56 PM
> > > Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of
> > > client
> >
> > > On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> > >> Hello Bruce,
> > >>
> > >> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > >> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > >> > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> > >> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > >> > >
> > >> > > When mounting a share on the Solaris client and then rebooting
> > >> > > the client without unmounting the share first, after the reboot
> > >> > > every attempt to mount the share again gives an I/O error on
> > >> > > the client and the mount fails.
> > >> > >
> > >> > > After a long time (serveral hours) the v4 mount suddenly works
> > >> > > again.
> > >> > >
> > >> > > Mounting a share with vers=2 works always even in times when
> > >> > > the v4 mount fails.
> > >> > >
> > >> > > So it seems the Linux NFSv4 server holds a state for the client
> > >> > > which prevents the re-mounting of the share and gives the
> > >> > > I/O-error on the client.
> > >> > >
> > >> > > We use NFSv4 without idmapd.
> > >> > >
> > >> > > Is there any tip how to debug or solve this?
> > >> >
> > >> > Best is probably to get a packet trace. So something like:
> > >> >
> > >> > tcpdump -s0 -iem0 -wtmp.pcap
> > >> >
> > >> > and then try the client mount, then kill the tcpdump after the
> > >> > mount fails, and send us tmp.pcap. (And/or take a look at
> > >> > tmp.pcap yourself with wireshark. The interesting question is
> > >> > what kind of error the server is returning when the client tries
> > >> > the mount after reboot.)
> > >>
> > >> Thank you for your reply. The tcpdump is attached, the relevant
> > >> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> > >> see more from the dump?
> > >>
> > >> Thanks again and best regards
> > >
> > > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> > >
> > > In nfsd4_setclientid_confirm():
> > >
> > > conf = find_confirmed_client(clid, false, nn);
> > > unconf = find_unconfirmed_client(clid, false, nn);
> > > /*
> > > * We try hard to give out unique clientid's, so if we get an
> > > * attempt to confirm the same clientid with a different cred,
> > > * there's a bug somewhere. Let's charitably assume it's our
> > > * bug.
> > > */
> > > status = nfserr_serverfault;
> > > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > > goto out;
> > > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > > goto out;

If the creds don't match, the return should be NFS4ERR_CLID_INUSE per
section 16.34.5. IMPLEMENTATION first bullet after DRC discussion.

At least the way I read RFC 7530...

> > > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > > auth_unix creds.
> > >
> > > The clientid that were looking up there was returned from the
> > > previous SETCLIENTID, generated by this logic:
> > >
> > > if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > > /* case 1: probable callback update */
> > > copy_clid(new, conf);
> > > else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > > gen_clid(new, nn);
> > >
> > > So it should be a brand new clientid, unless the client was reusing
> > > the old verifier.
> > >
> > > So perhaps the client is sending the SETCLIENTID with a verifier set
> > > to what it used on the previous boot? That sounds like a client
> > > bug. The linux client uses a timestamp for the verifier, looks like
> > > the Solaris client might too. Is there some reason the clock on
> > > this client isn't advancing on reboot?
> >
> > probably NFS4ERR_STALE_CLIENTID is a better error code for this
scenario.
>
> SERVERFAULT is obviously lame, but I don't know that STALE_CLIENTID is
> right either.
>
> Another thing that's weird is:
>
> > After a long time (serveral hours) the v4 mount suddenly works
> > again.
>
> I'd expect the clent to expire after a lease period (default 90 seconds),
I don't
> know what could be happening that would take hours.
>
> Also I don't know why those creds would change after a reboot.
>
> Anyway I think a trace covering the reboot is still the best hope of an
> explanation.

Frank


2015-08-28 18:06:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Thu, Aug 27, 2015 at 01:36:38PM -0700, Frank Filz wrote:
> > On Thu, Aug 27, 2015 at 08:43:51AM +0200, Mkrtchyan, Tigran wrote:
> > >
> > >
> > > ----- Original Message -----
> > > > From: "J. Bruce Fields" <[email protected]>
> > > > To: "Ulrich Gemkow" <[email protected]>
> > > > Cc: [email protected]
> > > > Sent: Tuesday, August 25, 2015 11:54:56 PM
> > > > Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of
> > > > client
> > >
> > > > On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> > > >> Hello Bruce,
> > > >>
> > > >> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > > >> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > >> > > we have a weired problem with Linux NFSv4.0 Server (Vanilla
> > > >> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > > >> > >
> > > >> > > When mounting a share on the Solaris client and then rebooting
> > > >> > > the client without unmounting the share first, after the reboot
> > > >> > > every attempt to mount the share again gives an I/O error on
> > > >> > > the client and the mount fails.
> > > >> > >
> > > >> > > After a long time (serveral hours) the v4 mount suddenly works
> > > >> > > again.
> > > >> > >
> > > >> > > Mounting a share with vers=2 works always even in times when
> > > >> > > the v4 mount fails.
> > > >> > >
> > > >> > > So it seems the Linux NFSv4 server holds a state for the client
> > > >> > > which prevents the re-mounting of the share and gives the
> > > >> > > I/O-error on the client.
> > > >> > >
> > > >> > > We use NFSv4 without idmapd.
> > > >> > >
> > > >> > > Is there any tip how to debug or solve this?
> > > >> >
> > > >> > Best is probably to get a packet trace. So something like:
> > > >> >
> > > >> > tcpdump -s0 -iem0 -wtmp.pcap
> > > >> >
> > > >> > and then try the client mount, then kill the tcpdump after the
> > > >> > mount fails, and send us tmp.pcap. (And/or take a look at
> > > >> > tmp.pcap yourself with wireshark. The interesting question is
> > > >> > what kind of error the server is returning when the client tries
> > > >> > the mount after reboot.)
> > > >>
> > > >> Thank you for your reply. The tcpdump is attached, the relevant
> > > >> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> > > >> see more from the dump?
> > > >>
> > > >> Thanks again and best regards
> > > >
> > > > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> > > >
> > > > In nfsd4_setclientid_confirm():
> > > >
> > > > conf = find_confirmed_client(clid, false, nn);
> > > > unconf = find_unconfirmed_client(clid, false, nn);
> > > > /*
> > > > * We try hard to give out unique clientid's, so if we get an
> > > > * attempt to confirm the same clientid with a different cred,
> > > > * there's a bug somewhere. Let's charitably assume it's our
> > > > * bug.
> > > > */
> > > > status = nfserr_serverfault;
> > > > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > > > goto out;
> > > > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > > > goto out;
>
> If the creds don't match, the return should be NFS4ERR_CLID_INUSE per
> section 16.34.5. IMPLEMENTATION first bullet after DRC discussion.
>
> At least the way I read RFC 7530...

I assumed that was only the case when the long-form client-provided
client identifier matched, but here we're looking at records matched by
the shorthand server-generated clientid. Very weird that that we'd get
to this case (and without hitting CLID_INUSE on the setclientid?).
There's something we don't understand.

Anyway, looking at the SETCLIENTID_CONFIRM description in 7530, I think
you're right, they're recommending CLIDN_INUSE for this case. Doubt
that would actually help in Ulrich's case, though.

--b.

2015-08-31 12:08:13

by Ulrich Gemkow

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

Hallo Bruce,

On Wednesday 26 August 2015 22:09:40 you wrote:
> On Wed, Aug 26, 2015 at 09:54:22PM +0200, Ulrich Gemkow wrote:
> > Hello Bruce,
> >
> > On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> > > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> > >
> > > In nfsd4_setclientid_confirm():
> > >
> > > conf = find_confirmed_client(clid, false, nn);
> > > unconf = find_unconfirmed_client(clid, false, nn);
> > > /*
> > > * We try hard to give out unique clientid's, so if we get an
> > > * attempt to confirm the same clientid with a different cred,
> > > * there's a bug somewhere. Let's charitably assume it's our
> > > * bug.
> > > */
> > > status = nfserr_serverfault;
> > > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > > goto out;
> > > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > > goto out;
> > >
> > > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > > auth_unix creds.
> > >
> > > The clientid that were looking up there was returned from the previous
> > > SETCLIENTID, generated by this logic:
> > >
> > > if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > > /* case 1: probable callback update */
> > > copy_clid(new, conf);
> > > else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > > gen_clid(new, nn);
> > >
> > > So it should be a brand new clientid, unless the client was reusing the old
> > > verifier.
> > >
> > > So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> > > used on the previous boot? That sounds like a client bug. The linux
> > > client uses a timestamp for the verifier, looks like the Solaris client
> > > might too. Is there some reason the clock on this client isn't
> > > advancing on reboot?
> >
> > Thank you for the analysis. But the clock of the client advances
> > regularely and as one would expect.
>
> OK, thanks for checking that.
>
> > The client is SPARC Solaris 10 with the latest patches
> > applied - I cannot believe that this client has such a
> > basic NFS bug.
>
> To confirm or deny my hypothesis, I think what we want is a longer
> capture that gets the failing SETCLIENTID_CONFIRM (as seen in the
> previous capture) but also shows what clientid the client was using
> before the reboot. So ideal might be something like:
>
> - start the capture
> - mount
> - create a file (I just want to make sure the client does at
> least one open)
> - reboot the client
> - mount again, see the failure
> - stop the capture

I tried but probably made a mistake: To be sure to have a
defined state for the test I rebooted the server while clearing
all its NFS state and I reinstalled the client - both with the
exact same configuration as before.

And now the bug unfortunately does not happen again, the mount
always succeeds. I did the reinstall of the client also before
my first mail to be sure so it seems that the server may have
reached an invalid state before - whatever this may has caused.

I can only wait until the bug happens again (hoping not :-).

Maybe you are able to find a reason from the information
given before. I regret to be of no more help. If I can do
something please tell me.

Thank you very much again and best regards

-Ulrich

--
| Ulrich Gemkow
| University of Stuttgart
| Institute of Communication Networks and Computer Engineering (IKR)

2015-08-31 14:51:50

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Mon, Aug 31, 2015 at 02:08:08PM +0200, Ulrich Gemkow wrote:
> Hallo Bruce,
>
> On Wednesday 26 August 2015 22:09:40 you wrote:
> > On Wed, Aug 26, 2015 at 09:54:22PM +0200, Ulrich Gemkow wrote:
> > > Hello Bruce,
> > >
> > > On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> > > > The SERVERFAULT is on SETCLIENTID_CONFIRM.
> > > >
> > > > In nfsd4_setclientid_confirm():
> > > >
> > > > conf = find_confirmed_client(clid, false, nn);
> > > > unconf = find_unconfirmed_client(clid, false, nn);
> > > > /*
> > > > * We try hard to give out unique clientid's, so if we get an
> > > > * attempt to confirm the same clientid with a different cred,
> > > > * there's a bug somewhere. Let's charitably assume it's our
> > > > * bug.
> > > > */
> > > > status = nfserr_serverfault;
> > > > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
> > > > goto out;
> > > > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
> > > > goto out;
> > > >
> > > > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> > > > auth_unix creds.
> > > >
> > > > The clientid that were looking up there was returned from the previous
> > > > SETCLIENTID, generated by this logic:
> > > >
> > > > if (conf && same_verf(&conf->cl_verifier, &clverifier))
> > > > /* case 1: probable callback update */
> > > > copy_clid(new, conf);
> > > > else /* case 4 (new client) or cases 2, 3 (client reboot): */
> > > > gen_clid(new, nn);
> > > >
> > > > So it should be a brand new clientid, unless the client was reusing the old
> > > > verifier.
> > > >
> > > > So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> > > > used on the previous boot? That sounds like a client bug. The linux
> > > > client uses a timestamp for the verifier, looks like the Solaris client
> > > > might too. Is there some reason the clock on this client isn't
> > > > advancing on reboot?
> > >
> > > Thank you for the analysis. But the clock of the client advances
> > > regularely and as one would expect.
> >
> > OK, thanks for checking that.
> >
> > > The client is SPARC Solaris 10 with the latest patches
> > > applied - I cannot believe that this client has such a
> > > basic NFS bug.
> >
> > To confirm or deny my hypothesis, I think what we want is a longer
> > capture that gets the failing SETCLIENTID_CONFIRM (as seen in the
> > previous capture) but also shows what clientid the client was using
> > before the reboot. So ideal might be something like:
> >
> > - start the capture
> > - mount
> > - create a file (I just want to make sure the client does at
> > least one open)
> > - reboot the client
> > - mount again, see the failure
> > - stop the capture
>
> I tried but probably made a mistake: To be sure to have a
> defined state for the test I rebooted the server while clearing
> all its NFS state and I reinstalled the client - both with the
> exact same configuration as before.
>
> And now the bug unfortunately does not happen again, the mount
> always succeeds. I did the reinstall of the client also before
> my first mail to be sure so it seems that the server may have
> reached an invalid state before - whatever this may has caused.

That's interesting!

> I can only wait until the bug happens again (hoping not :-).
>
> Maybe you are able to find a reason from the information
> given before. I regret to be of no more help. If I can do
> something please tell me.

I'm not coming up with any ideas right now. Do let us know if you get
into that state again.

--b.

2015-08-31 15:52:41

by Mkrtchyan, Tigran

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client



----- Original Message -----
> From: "J. Bruce Fields" <[email protected]>
> To: "Ulrich Gemkow" <[email protected]>
> Cc: [email protected]
> Sent: Monday, August 31, 2015 4:51:48 PM
> Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

> On Mon, Aug 31, 2015 at 02:08:08PM +0200, Ulrich Gemkow wrote:
>> Hallo Bruce,
>>
>> On Wednesday 26 August 2015 22:09:40 you wrote:
>> > On Wed, Aug 26, 2015 at 09:54:22PM +0200, Ulrich Gemkow wrote:
>> > > Hello Bruce,
>> > >
>> > > On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
>> > > > The SERVERFAULT is on SETCLIENTID_CONFIRM.
>> > > >
>> > > > In nfsd4_setclientid_confirm():
>> > > >
>> > > > conf = find_confirmed_client(clid, false, nn);
>> > > > unconf = find_unconfirmed_client(clid, false, nn);
>> > > > /*
>> > > > * We try hard to give out unique clientid's, so if we get an
>> > > > * attempt to confirm the same clientid with a different cred,
>> > > > * there's a bug somewhere. Let's charitably assume it's our
>> > > > * bug.
>> > > > */
>> > > > status = nfserr_serverfault;
>> > > > if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
>> > > > goto out;
>> > > > if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
>> > > > goto out;
>> > > >
>> > > > The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
>> > > > auth_unix creds.
>> > > >
>> > > > The clientid that were looking up there was returned from the previous
>> > > > SETCLIENTID, generated by this logic:
>> > > >
>> > > > if (conf && same_verf(&conf->cl_verifier, &clverifier))
>> > > > /* case 1: probable callback update */
>> > > > copy_clid(new, conf);
>> > > > else /* case 4 (new client) or cases 2, 3 (client reboot): */
>> > > > gen_clid(new, nn);
>> > > >
>> > > > So it should be a brand new clientid, unless the client was reusing the old
>> > > > verifier.
>> > > >
>> > > > So perhaps the client is sending the SETCLIENTID with a verifier set to what it
>> > > > used on the previous boot? That sounds like a client bug. The linux
>> > > > client uses a timestamp for the verifier, looks like the Solaris client
>> > > > might too. Is there some reason the clock on this client isn't
>> > > > advancing on reboot?
>> > >
>> > > Thank you for the analysis. But the clock of the client advances
>> > > regularely and as one would expect.
>> >
>> > OK, thanks for checking that.
>> >
>> > > The client is SPARC Solaris 10 with the latest patches
>> > > applied - I cannot believe that this client has such a
>> > > basic NFS bug.
>> >
>> > To confirm or deny my hypothesis, I think what we want is a longer
>> > capture that gets the failing SETCLIENTID_CONFIRM (as seen in the
>> > previous capture) but also shows what clientid the client was using
>> > before the reboot. So ideal might be something like:
>> >
>> > - start the capture
>> > - mount
>> > - create a file (I just want to make sure the client does at
>> > least one open)
>> > - reboot the client
>> > - mount again, see the failure
>> > - stop the capture
>>
>> I tried but probably made a mistake: To be sure to have a
>> defined state for the test I rebooted the server while clearing
>> all its NFS state and I reinstalled the client - both with the
>> exact same configuration as before.
>>
>> And now the bug unfortunately does not happen again, the mount
>> always succeeds. I did the reinstall of the client also before
>> my first mail to be sure so it seems that the server may have
>> reached an invalid state before - whatever this may has caused.
>
> That's interesting!
>
>> I can only wait until the bug happens again (hoping not :-).
>>
>> Maybe you are able to find a reason from the information
>> given before. I regret to be of no more help. If I can do
>> something please tell me.
>
> I'm not coming up with any ideas right now. Do let us know if you get
> into that state again.


To me it sounds like server still has a reference by client's ownerid and fails
to detect that verifier is not valid any more. Some kind of leak in client
disconnect/reboot detection code (although code looks like doing the right thing).
I don't have much inside Linux server to verify that.

Tigran.


>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-09-01 17:43:37

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Fri, Aug 28, 2015 at 02:06:12PM -0400, 'J. Bruce Fields' wrote:
> On Thu, Aug 27, 2015 at 01:36:38PM -0700, Frank Filz wrote:
> > If the creds don't match, the return should be NFS4ERR_CLID_INUSE per
> > section 16.34.5. IMPLEMENTATION first bullet after DRC discussion.
> >
> > At least the way I read RFC 7530...
>
> I assumed that was only the case when the long-form client-provided
> client identifier matched, but here we're looking at records matched by
> the shorthand server-generated clientid. Very weird that that we'd get
> to this case (and without hitting CLID_INUSE on the setclientid?).
> There's something we don't understand.
>
> Anyway, looking at the SETCLIENTID_CONFIRM description in 7530, I think
> you're right, they're recommending CLIDN_INUSE for this case. Doubt
> that would actually help in Ulrich's case, though.

May as well fix that anyway, I guess.--b.

commit 3d70e158ea9b
Author: J. Bruce Fields <[email protected]>
Date: Tue Sep 1 13:40:53 2015 -0400

nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case

Somebody with a Solaris client was hitting this case. We haven't
figured out why yet, and don't have a reproducer. Meanwhile Frank
noticed that RFC 7530 actually recommends CLID_INUSE for this case.
Unlikely to help the original reporter, but may as well fix it.

Reported-by: Frank Filz <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 416f32e34a33..08746ec1d44a 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3111,10 +3111,11 @@ nfsd4_setclientid_confirm(struct svc_rqst *rqstp,
/*
* We try hard to give out unique clientid's, so if we get an
* attempt to confirm the same clientid with a different cred,
- * there's a bug somewhere. Let's charitably assume it's our
- * bug.
+ * the client may be buggy; this should never happen.
+ *
+ * Nevertheless, RFC 7530 recommends INUSE for this case:
*/
- status = nfserr_serverfault;
+ status = nfserr_clid_inuse;
if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
goto out;
if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))