Return-Path:
Received: from mail-qg0-f44.google.com ([209.85.192.44]:33323 "EHLO
        mail-qg0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751791AbbLFNJ6 (ORCPT );
        Sun, 6 Dec 2015 08:09:58 -0500
Received: by qgea14 with SMTP id a14so124792069qge.0
        for ; Sun, 06 Dec 2015 05:09:57 -0800 (PST)
Date: Sun, 6 Dec 2015 08:09:54 -0500
From: Jeff Layton
To: Christoph Hellwig
Cc: "J. Bruce Fields", Kinglong Mee, linux-nfs@vger.kernel.org
Subject: Re: [PATCH RFC] nfsd: serialize layout stateid morphing operations
Message-ID: <20151206080954.1fe7e5c9@tlielax.poochiereds.net>
In-Reply-To: <20151205072409.46d66109@tlielax.poochiereds.net>
References: <20151129084614.42fb1272@tlielax.poochiereds.net>
        <565BBB03.7020206@gmail.com>
        <20151130213420.GA31564@fieldses.org>
        <20151130193313.5bb10791@synchrony.poochiereds.net>
        <20151201115600.GA1557@lst.de>
        <20151201174800.407e2c40@synchrony.poochiereds.net>
        <20151202072504.GA15839@lst.de>
        <20151203220850.GC19518@fieldses.org>
        <20151204083803.GA2440@lst.de>
        <20151204155110.64a352dd@tlielax.poochiereds.net>
        <20151205120222.GA27009@lst.de>
        <20151205072409.46d66109@tlielax.poochiereds.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Sat, 5 Dec 2015 07:24:09 -0500
Jeff Layton wrote:

> On Sat, 5 Dec 2015 13:02:22 +0100
> Christoph Hellwig wrote:
>
> > On Fri, Dec 04, 2015 at 03:51:10PM -0500, Jeff Layton wrote:
> > > > There is no reason not to do it, except for the significant effort
> > > > to implement it as well as a synthetic test case to actually reproduce
> > > > the behavior we want to handle.
> > >
> > > Could you end up livelocking here? Suppose you issue the callback and
> > > the client returns success. He then returns the layout and gets a new
> > > one just before the delay timer pops. We then end up recalling _that_
> > > layout...rinse, repeat...
> >
> > If we start allowing layoutgets before the whole range has been
> > returned there is a great chance for livelocks, yes. But I don't think
> > we should allow layoutgets to proceed before that.
>
> Maybe I didn't describe it well enough. I think you can still end up
> looping even if you don't allow LAYOUTGETs before the entire range is
> returned.
>
> If we treat NFS4_OK and NFS4ERR_DELAY equivalently, then we're
> expecting the client to eventually return NFS4ERR_NOMATCHING_LAYOUT (or
> a different error) to break the cycle of retransmissions. But, HZ/100
> is enough time for the client to return a layout and request a new one.
> We may never see that error -- only a continual cycle of
> CB_LAYOUTRECALL/LAYOUTRETURN/LAYOUTGET.
>
> I think we need a more reliable way to break that cycle so we don't end
> up looping like that. We should either cancel any active callbacks
> before reallowing LAYOUTGETs, or move the timeout handling outside of
> the RPC state machine (like Bruce was suggesting).
>

Either way...in the near term we should probably take the patch that I
originally proposed, just to ensure that no one hits the bugs that
Kinglong hit. That does still leave some gaps in the seqid handling,
but those are preferable to the warning and deadlock.

Bruce, does that sound reasonable? I can send that patch in a separate
email if you'd prefer.

--
Jeff Layton
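
For anyone following the thread, the retransmission cycle described in the
quoted text can be illustrated with a toy model. This is only a sketch under
stated assumptions, not nfsd code: simulate_recall(), its parameters, and the
cancel_callback_on_return flag are hypothetical names invented for this
example. It assumes the client always manages a LAYOUTRETURN followed by a
fresh LAYOUTGET inside the retry delay, and it models just one of the two
suggested fixes (cancelling the active callback before LAYOUTGETs are allowed
again).

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_NOMATCHING_LAYOUT = "NFS4ERR_NOMATCHING_LAYOUT"


def simulate_recall(max_rounds, cancel_callback_on_return):
    """Toy model of the CB_LAYOUTRECALL retry loop (hypothetical helper).

    Returns (recalls_sent, callback_still_active).
    """
    client_holds_layout = True
    callback_active = True
    recalls_sent = 0

    while callback_active and recalls_sent < max_rounds:
        # Server sends CB_LAYOUTRECALL; the client answers based on
        # whether it currently holds a matching layout.
        recalls_sent += 1
        if client_holds_layout:
            reply = NFS4_OK
        else:
            reply = NFS4ERR_NOMATCHING_LAYOUT

        if reply != NFS4_OK:
            # Only an error reply ends the retransmissions in this model.
            callback_active = False
            break

        # NFS4_OK is treated like NFS4ERR_DELAY: arm the delay timer and
        # retransmit later. Meanwhile the client returns the whole range.
        client_holds_layout = False
        if cancel_callback_on_return:
            # Assumed fix: drop the pending retry before LAYOUTGETs are
            # allowed again.
            callback_active = False

        # The whole range has been returned, so a new LAYOUTGET succeeds
        # before the delay timer pops (assumption of this model).
        client_holds_layout = True

    return recalls_sent, callback_active


if __name__ == "__main__":
    print(simulate_recall(1000, cancel_callback_on_return=False))
    print(simulate_recall(1000, cancel_callback_on_return=True))
```

Without the cancellation, the model never sees NFS4ERR_NOMATCHING_LAYOUT and
only stops because of the max_rounds cap. With it, a single recall is sent.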