From: "Benjamin Coddington"
To: "Alan Post"
Cc: linux-nfs
Subject: Re: User process NFS write hang in wait_on_commit with kworker
Date: Wed, 19 Jun 2019 08:38:02 -0400
Message-ID: <25608EB2-87F0-4196-BEF9-8AB8FC72270B@redhat.com>
In-Reply-To: <20190619000746.GT4158@turtle.email>
References: <20190618000613.GR4158@turtle.email> <6DE07E49-D450-4BF7-BC61-0973A14CD81B@redhat.com> <20190619000746.GT4158@turtle.email>
X-Mailing-List: linux-nfs@vger.kernel.org

On 18 Jun 2019, at 20:07, Alan Post wrote:

> On Tue, Jun 18, 2019 at 11:29:16AM -0400, Benjamin Coddington wrote:
>> I think that your transport or NFS server is dropping the response to an
>> RPC. The NFS client will not retransmit on an established connection.
>>
>> What server are you using? Any middle boxes on the network that could be
>> transparently dropping transmissions (less likely, but I have seen them)?
>>
>
> I've found 8 separate NFS client hangs of the sort I reported here,
> and in all cases the same NFS server was involved: an Ubuntu Trusty
> system running 4.4.0. I've been upgrading all of these NFS servers,
> haven't done this one yet--the complicity of NFS hangs I've been
> seeing have slowed me down.
>
> Of the 8 NFS clients with a hang to this server, about half are in
> the same computer room where packets only transit rack switches, with
> the other half also going through a computer room router.
>
> I see positive dropped and overrun packet counts on the NFS server
> interface, along with a similar magnitude of pause counts on the
> switch port for the NFS server. Given the occurrences of this issue
> only this rack switch and a redundant pair of top-of-rack switches in
> the rack with the NFS server are in-common between all 8 NFS clients
> with write hangs.

TCP drops or overruns should not be a problem, since the TCP layer will
retransmit packets that are not acked. The issue would be if the NFS
server is silently dropping a response to an IO RPC. The other
possibility would be an intelligent middle-box that keeps its own
stateful, transparent TCP handling between client and server (you
clearly don't have that here).

I recall some knfsd issues with dropped replies in that era of kernel
versions, when the GSS sequencing grew out of a window. Are you using
sec=krb5* on these mounts, or is it all sec=sys?
Perhaps that's the problem you are seeing. Again, just some guessing.

Verifying that this is the problem could be done by setting up some
rolling network captures, but it can be hard to keep the capture from
filling up with continuing traffic from other processes.

Ben
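
The two checks suggested above (which sec= flavor the mounts use, and a rolling capture that does not fill up with unrelated traffic) could be sketched roughly as follows. This is only an illustration under stated assumptions, not something from the thread: the helper names `nfs_sec_flavors` and `rolling_capture_argv`, the output file name, and the ring sizes are made up for the example, and the filter assumes the NFS traffic of interest is on the default port 2049. The tcpdump options used are `-C` (rotate the output file after roughly that many million bytes) and `-W` (keep that many files as a ring).

```python
def nfs_sec_flavors(mounts_text):
    """Map each NFS mount point to its sec= flavor.

    Expects text in /proc/mounts format:
    device, mount point, fstype, options, dump, pass.
    """
    flavors = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue  # skip malformed lines and non-NFS filesystems
        sec = "unknown"  # no sec= option listed for this mount
        for opt in fields[3].split(","):
            if opt.startswith("sec="):
                sec = opt[len("sec="):]
        flavors[fields[1]] = sec
    return flavors


def rolling_capture_argv(interface, server, file_mb=100, file_count=20,
                         out="nfs-hang.pcap"):
    """Build a tcpdump command line for a bounded, rolling capture.

    The host/port filter keeps traffic to other servers out of the
    capture; port 2049 is an assumption (the default NFS port).
    """
    return [
        "tcpdump", "-i", interface,
        "-s", "0",                # capture full packets
        "-C", str(file_mb),       # rotate after ~file_mb million bytes
        "-W", str(file_count),    # keep at most file_count files (ring)
        "-w", out,
        "host", server, "and", "port", "2049",
    ]


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            for mount_point, sec in nfs_sec_flavors(f.read()).items():
                print(mount_point, sec)
    except OSError:
        print("could not read /proc/mounts")
    # 192.0.2.10 and eth0 are placeholders, not values from the thread.
    print(" ".join(rolling_capture_argv("eth0", "192.0.2.10")))
```

With these example defaults the capture is bounded at about 2 GB on disk (20 files of roughly 100 MB), so it can be left running until the next hang. Other processes on the client talking to the same server will still land in the capture, so the ring may need to be larger on a busy client.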