From: Ulrich Gemkow <ulrich.gemkow@ikr.uni-stuttgart.de>
To: "J. Bruce Fields" <bfields@fieldses.org>
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client
Date: Wed, 26 Aug 2015 21:54:22 +0200
Cc: linux-nfs@vger.kernel.org
References: <201508241452.57718.ulrich.gemkow@ikr.uni-stuttgart.de> <201508251928.06201.ulrich.gemkow@ikr.uni-stuttgart.de> <20150825215456.GF8579@fieldses.org>
In-Reply-To: <20150825215456.GF8579@fieldses.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Message-Id: <201508262154.24455.ulrich.gemkow@ikr.uni-stuttgart.de>
Sender: linux-nfs-owner@vger.kernel.org

Hello Bruce,

On Tuesday 25 August 2015 23:54:56 J. Bruce Fields wrote:
> On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> > Hello Bruce,
> > 
> > On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > > we have a weird problem with a Linux NFSv4.0 server (vanilla
> > > > kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > > > 
> > > > When mounting a share on the Solaris client and then rebooting
> > > > the client without unmounting the share first, after the reboot
> > > > every attempt to mount the share again gives an I/O error on
> > > > the client and the mount fails.
> > > > 
> > > > After a long time (several hours) the v4 mount suddenly works
> > > > again.
> > > > 
> > > > Mounting a share with vers=2 always works, even at times when
> > > > the v4 mount fails.
> > > > 
> > > > So it seems the Linux NFSv4 server holds a state for the client
> > > > which prevents the re-mounting of the share and gives the
> > > > I/O-error on the client.
> > > > 
> > > > We use NFSv4 without idmapd.
> > > > 
> > > > Are there any tips on how to debug or solve this?
> > > 
> > > Best is probably to get a packet trace.  So something like:
> > > 
> > > 	tcpdump -s0 -iem0 -wtmp.pcap
> > > 
> > > and then try the client mount, then kill the tcpdump after the mount
> > > fails, and send us tmp.pcap.  (And/or take a look at tmp.pcap yourself
> > > with wireshark.  The interesting question is what kind of error the
> > > server is returning when the client tries the mount after reboot.)
> > 
> > Thank you for your reply. The tcpdump is attached, the relevant
> > packets are 49..52. The error seems to be a SERVERFAULT. Can you
> > see more from the dump?
> > 
> > Thanks again and best regards
> 
> The SERVERFAULT is on SETCLIENTID_CONFIRM.
> 
> In nfsd4_setclientid_confirm():
> 
> 	conf = find_confirmed_client(clid, false, nn);
> 	unconf = find_unconfirmed_client(clid, false, nn);
> 	/*
>          * We try hard to give out unique clientid's, so if we get an
>          * attempt to confirm the same clientid with a different cred,
>          * there's a bug somewhere.  Let's charitably assume it's our
>          * bug.
>          */
>         status = nfserr_serverfault;
>         if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
>                 goto out;
>         if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
>                 goto out;
> 
> The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
> auth_unix creds.
> 
> The clientid that we're looking up there was returned from the previous
> SETCLIENTID, generated by this logic:
> 
> 	if (conf && same_verf(&conf->cl_verifier, &clverifier))
>                 /* case 1: probable callback update */
>                 copy_clid(new, conf);
>         else /* case 4 (new client) or cases 2, 3 (client reboot): */
>                 gen_clid(new, nn);
> 
> So it should be a brand new clientid, unless the client was reusing the old
> verifier.
> 
> So perhaps the client is sending the SETCLIENTID with a verifier set to what it
> used on the previous boot?  That sounds like a client bug.  The linux
> client uses a timestamp for the verifier, looks like the Solaris client
> might too.  Is there some reason the clock on this client isn't
> advancing on reboot?
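
For reference, the clientid decision quoted above can be modeled in a few
lines of standalone C. This is only a simplified sketch and not the kernel
code: the names client_rec, choose_clientid and next_clientid are made up
for illustration, and the verifier is treated as 8 opaque bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* NFSv4.0 verifiers are 8 opaque bytes */

/* Hypothetical stand-in for the server's record of a confirmed client. */
struct client_rec {
	uint64_t clientid;
	unsigned char verifier[VERF_SIZE];
};

/* Hypothetical counter standing in for gen_clid(); starts at an arbitrary value. */
uint64_t next_clientid = 100;

/*
 * Model of the SETCLIENTID branch quoted above:
 *  - confirmed record with the same verifier: reuse its clientid
 *    (the "probable callback update" case)
 *  - anything else (new client or rebooted client): hand out a new clientid
 */
uint64_t choose_clientid(const struct client_rec *conf,
			 const unsigned char verf[VERF_SIZE])
{
	assert(verf != NULL);

	if (conf && memcmp(conf->verifier, verf, VERF_SIZE) == 0)
		return conf->clientid;

	return next_clientid++;
}
```

In this model a client that sends the same verifier bytes before and after
a reboot gets its old clientid back, while any change in the verifier
yields a fresh one. Comparing the verifier fields of the SETCLIENTID calls
in the trace should show which branch applies here.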

Thank you for the analysis. But the clock of the client advances
normally, as one would expect.

The client is SPARC Solaris 10 with the latest patches
applied - I cannot believe that this client has such a
basic NFS bug.

Can you think of any kind of server configuration bug
(as said, Vanilla Linux 4.1.6) that causes this error?
The NFS server startup system is "self-made"...

Thanks again and best regards

-Ulrich

> 
> --b.
> --

-- 
|-----------------------------------------------------------------------
| Ulrich Gemkow
| University of Stuttgart, Germany
| Institute of Communication Networks and Computer Engineering (IKR)
|-----------------------------------------------------------------------