Date: Fri, 21 Jun 2019 14:47:23 -0600
From: Alan Post
To: Benjamin Coddington
Cc: linux-nfs
Subject: Re: User process NFS write hang in wait_on_commit with kworker
Message-ID: <20190621204723.GU4158@turtle.email>
References: <20190618000613.GR4158@turtle.email>
 <6DE07E49-D450-4BF7-BC61-0973A14CD81B@redhat.com>
 <20190619000746.GT4158@turtle.email>
 <25608EB2-87F0-4196-BEF9-8AB8FC72270B@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <25608EB2-87F0-4196-BEF9-8AB8FC72270B@redhat.com>
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-nfs@vger.kernel.org

On Wed, Jun 19, 2019 at 08:38:02AM -0400, Benjamin Coddington wrote:
> TCP drops or overruns should not be a problem since the TCP layer will
> retransmit packets that are not acked. The issue would be if the NFS
> server is perhaps silently dropping a response to an IO RPC. Or, an
> intelligent middle-box that keeps its own stateful transparent TCP handling
> between client and server existed (you clearly don't have that here).
>

My conclusion as well. As part of debugging a variety of reliability
issues with the cluster, we've found that some workloads are more
likely to lead to an NFS client hang. We've migrated the exports
used by those workloads to dedicated NFS servers, one of which is
the server under discussion here.

> So I recall some knfsd issues dropping replies in that era of kernel
> versions when the GSS sequencing grew out of a window. Are you using a
> sec=krb5* on these mounts, or is it all sec=sys? Perhaps that's the problem
> you are seeing. Again, just some guessing.
>

We're using sec=sys for the NFS clients that hung on wait_on_commit,
but have in the past used Kerberos.

I'm still chasing down at least one intermittent, lingering issue
where an open(2) will return EIO, while on the wire those procedures
are returning NFS4ERR_EXPIRED. What appears to be happening, though
I'm not certain yet, is that a RENEW CID is, or tries to be, done
with Kerberos when it was not previously; it succeeds, but only in
this degraded manner.

I cannot then rule out something of the sort you're describing.
Thank you for bringing it to my attention.

> Verifying this is the problem could be done by setting up some rolling
> network captures.. but sometimes it can be hard to not have the capture
> fill up with continuing traffic from other processes.
>

I did go ahead and set up a rolling capture between this NFS server
and one rack of clients. I hope I can catch the event as it happens.
Time will tell.

Regards,

-A
-- 
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: adp@prgmr.com