Date: Tue, 25 Aug 2015 17:54:56 -0400
From: "J. Bruce Fields"
To: Ulrich Gemkow
Cc: linux-nfs@vger.kernel.org
Subject: Re: NFSv4 mount fails on Sun Solaris 10 after reboot of client

On Tue, Aug 25, 2015 at 07:28:03PM +0200, Ulrich Gemkow wrote:
> Hello Bruce,
>
> On Monday 24 August 2015 22:14:01 J. Bruce Fields wrote:
> > On Mon, Aug 24, 2015 at 02:52:55PM +0200, Ulrich Gemkow wrote:
> > > we have a weird problem with Linux NFSv4.0 Server (Vanilla
> > > Kernel 4.1.6) and a Sun Solaris 10 client (all patches applied):
> > >
> > > When mounting a share on the Solaris client and then rebooting
> > > the client without unmounting the share first, after the reboot
> > > every attempt to mount the share again gives an I/O error on
> > > the client and the mount fails.
> > >
> > > After a long time (several hours) the v4 mount suddenly works
> > > again.
> > >
> > > Mounting a share with vers=2 works always even in times when
> > > the v4 mount fails.
> > >
> > > So it seems the Linux NFSv4 server holds a state for the client
> > > which prevents the re-mounting of the share and gives the
> > > I/O-error on the client.
> > >
> > > We use NFSv4 without idmapd.
> > >
> > > Is there any tip how to debug or solve this?
> >
> > Best is probably to get a packet trace.  So something like:
> >
> > 	tcpdump -s0 -iem0 -wtmp.pcap
> >
> > and then try the client mount, then kill the tcpdump after the mount
> > fails, and send us tmp.pcap.  (And/or take a look at tmp.pcap yourself
> > with wireshark.  The interesting question is what kind of error the
> > server is returning when the client tries the mount after reboot.)
>
> Thank you for your reply. The tcpdump is attached, the relevant
> packets are 49..52. The error seems to be a SERVERFAULT. Can you
> see more from the dump?
>
> Thanks again and best regards

The SERVERFAULT is on SETCLIENTID_CONFIRM.  In nfsd4_setclientid_confirm():

	conf = find_confirmed_client(clid, false, nn);
	unconf = find_unconfirmed_client(clid, false, nn);
	/*
	 * We try hard to give out unique clientid's, so if we get an
	 * attempt to confirm the same clientid with a different cred,
	 * there's a bug somewhere.  Let's charitably assume it's our
	 * bug.
	 */
	status = nfserr_serverfault;
	if (unconf && !same_creds(&unconf->cl_cred, &rqstp->rq_cred))
		goto out;
	if (conf && !same_creds(&conf->cl_cred, &rqstp->rq_cred))
		goto out;

The SETCLIENTID and SETCLIENTID_CONFIRM are done with identical
auth_unix creds.  The clientid that we're looking up there was returned
from the previous SETCLIENTID, generated by this logic:

	if (conf && same_verf(&conf->cl_verifier, &clverifier))
		/* case 1: probable callback update */
		copy_clid(new, conf);
	else /* case 4 (new client) or cases 2, 3 (client reboot): */
		gen_clid(new, nn);

So it should be a brand new clientid, unless the client was reusing the
old verifier.

So perhaps the client is sending the SETCLIENTID with a verifier set to
what it used on the previous boot?  That sounds like a client bug.  The
Linux client uses a timestamp for the verifier; it looks like the Solaris
client might too.  Is there some reason the clock on this client isn't
advancing on reboot?

--b.
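As a rough illustration of the reasoning above, here is a small, self-contained C model of the two server-side steps. Everything in it is hypothetical: the names (model_setclientid(), model_setclientid_confirm(), model_boot_verifier(), struct model_client) are invented for this sketch, creds are reduced to a uid/gid pair, and the verifier is assumed to be nothing more than the client's boot time in seconds. It is not the nfsd code. It only shows how a client that resends its previous boot's verifier gets the old clientid back, so that SETCLIENTID_CONFIRM also finds the confirmed record from before the reboot; if that leftover record carries different creds (an assumption made here for illustration), the cred check returns SERVERFAULT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical, simplified model; NOT the kernel's nfsd structures. */
struct model_cred {
	uint32_t uid;
	uint32_t gid;
};

struct model_client {
	bool present;            /* does this record exist? */
	uint64_t verifier;       /* boot verifier sent by the client */
	uint64_t clientid;       /* clientid handed out by the server */
	struct model_cred cred;  /* creds the record was created with */
};

enum model_status { MODEL_OK, MODEL_SERVERFAULT };

/* Assumption: the verifier is just the boot time in seconds, so a
 * clock that does not advance across a reboot repeats it. */
static uint64_t model_boot_verifier(time_t boot_seconds)
{
	return (uint64_t)boot_seconds;
}

/* Stand-in for gen_clid(): hands out increasing ids. */
static uint64_t next_clientid = 1;

/* SETCLIENTID: keep the confirmed clientid only if the verifier
 * matches (case 1); otherwise mint a new one (cases 2, 3 and 4). */
static uint64_t model_setclientid(const struct model_client *conf,
				  uint64_t clverifier)
{
	if (conf->present && conf->verifier == clverifier)
		return conf->clientid;   /* like copy_clid(new, conf) */
	return next_clientid++;          /* like gen_clid(new, nn) */
}

/* Simplified same_creds(): compares uid and gid only. */
static bool model_same_creds(const struct model_cred *a,
			     const struct model_cred *b)
{
	return a->uid == b->uid && a->gid == b->gid;
}

/* SETCLIENTID_CONFIRM: records are looked up by clientid, and each
 * record found must carry the same creds as the request. */
static enum model_status
model_setclientid_confirm(const struct model_client *conf,
			  const struct model_client *unconf,
			  uint64_t clid, const struct model_cred *rq_cred)
{
	bool have_conf = conf->present && conf->clientid == clid;
	bool have_unconf = unconf->present && unconf->clientid == clid;

	if (have_unconf && !model_same_creds(&unconf->cred, rq_cred))
		return MODEL_SERVERFAULT;
	if (have_conf && !model_same_creds(&conf->cred, rq_cred))
		return MODEL_SERVERFAULT;
	return MODEL_OK;
}
```

With a confirmed record created under model_boot_verifier(1000), a SETCLIENTID carrying model_boot_verifier(2000) gets a fresh clientid and the confirm step returns MODEL_OK. Replaying model_boot_verifier(1000) hands back the old clientid, and under the differing-cred assumption the confirm step returns MODEL_SERVERFAULT.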