Received: by 2002:a05:6a10:c7d3:0:0:0:0 with SMTP id h19csp628710pxy; Sat, 14 Aug 2021 18:26:54 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwcpF9z0xf5f8dOjcfvC67VP/u9fWutEcmPPLINbreJTo5YXuc5mSGczzobTnUwNc+nCiQ0 X-Received: by 2002:a5d:97cc:: with SMTP id k12mr7443309ios.4.1628990814756; Sat, 14 Aug 2021 18:26:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1628990814; cv=none; d=google.com; s=arc-20160816; b=HRY4f6PKHHcTHdzATYo7UWTskXlenXD9EjDR1shc4ioTFqMu1VtXToANX99GyK8/FC +b2xIltbVdPOVQBwkisJCAyPmSOH7YbfDxQeWYfi0oBcrpossjtIQpsWDSs14szVH5Qh HxwW560/nvCpWlcbdxI+qF4L+/IoMdc4MFHW3+K3TTg3+96/N714WS1gidq9TXRljEsF /8oh0U8ot9y9SpkhRUnmPa/bE+2uUB07g1V0VhU1Fyt2qXdx2Rody9bUJHI01019VUaU 7fD0U54F/6mp9bAK9vtyh66jBrj6ZyXayhm1RQdbceewYm6JamOY0Z/xhXFnvUWEyyFV u75w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=71/rYXFFR+lYpJLFLwn15vWcietufe8jW+ZEomEFgs4=; b=l/VH0Zd32gsLunppaS2lCt+jaHqHkVTEKahJp5scNazAXJ557fKocMVyp87+zy7Juq qtJtAr4mLgyXjP/UJifWcxUBODPHenuj6KoHxxwPBqxbKgsxYDVP+zlv58GKuii+ua/i 3J210gtnv2PE7d07G3FPQNdxx1H0cay8U4ufExW1r7LBIxpnoMCGuwhbQMJFnmbt2QBU ZDe8g5yYXUB8khi3onz6P/BYsgKLmvDfCaDLsXBqrM2+pv0tgfarBNs+qr+obqdvf1mo aaJbj7nuN3zi79/uZofK17nCvA7033FwFhqo/pRo1/BH/Zra/d7C6x08wM77JmZjmtSR aZLw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=pq98J5hd; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id f18si569571iox.32.2021.08.14.18.26.32; Sat, 14 Aug 2021 18:26:54 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=pq98J5hd; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233237AbhHOBY2 (ORCPT + 99 others); Sat, 14 Aug 2021 21:24:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229453AbhHOBY1 (ORCPT ); Sat, 14 Aug 2021 21:24:27 -0400 Received: from mail-pl1-x634.google.com (mail-pl1-x634.google.com [IPv6:2607:f8b0:4864:20::634]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6AEA2C061764 for ; Sat, 14 Aug 2021 18:23:58 -0700 (PDT) Received: by mail-pl1-x634.google.com with SMTP id d1so16713164pll.1 for ; Sat, 14 Aug 2021 18:23:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=71/rYXFFR+lYpJLFLwn15vWcietufe8jW+ZEomEFgs4=; b=pq98J5hd8qObSM7B1+++fJRnfwAazRvxXMx381VfR0q3cvYcDiLiIakbgA489cqFrP JSBGVfrXtK0y99QCgOuuJG0YgqISI9yLXD/3gNIeqto/qKL+s4ahe3hptnhd34EUffkb Zdr80q+YB17PzU1BNuYOzBjaXS8a06kU7tqxnevwP13zqJSI9WES9ZMDQ/GEYVxFjvFZ OgzQe89yVNaHfOA6qQ+hEnjkXxKnGK5PerVLz3BD8RwdS5VPOOChVnSGbY+JvCT0sAnX EJvpgDZ3dqgaegpXEZOdc0h1GPN3tZyHdMn//33yDAvOKWQcrSZ5OmqytW4W2wwr3KGS XPKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=71/rYXFFR+lYpJLFLwn15vWcietufe8jW+ZEomEFgs4=; b=pe3zISFveNYCqLJIfHg8GJDweKSxQXAagp9mNTPy7evJkNAb3X87mrztaybhmAz35s N+l05I/ZaYOIlbAUwJ3dHWxplGCtgn9uKgZcACAOH44OzJagb4Ny6SrOcivTV8OPFI8b Uz209RKCj1zvQ1PSGCC8FMzi6VvXxoDZvYK1zgauE/9JMcn7jqiP0Y4lKMph8Mm1KHz3 CypYFS3Wcr1G8Z6CxioaqVkR3J4zqB+iUa05iUbJ5GtTIfwoVEuXvWVKlavPkksDgMrv LwnJyF3yoWj+DQ2FnJFb0tmx82HOeqaOlssMnRkz/Ks10SCV5Zuadrf6d6pHym1Xnl95 9uhw== X-Gm-Message-State: AOAM531kDZeahUklqKBGocvVqPodc7UOzEdUBdN6QqraaOHoDV83JjZK YwW8LWvPXkbIsu/YhaYn7sMtY2g464s3rEdLwWU= X-Received: by 2002:a65:4486:: with SMTP id l6mr8894837pgq.145.1628990637772; Sat, 14 Aug 2021 18:23:57 -0700 (PDT) MIME-Version: 1.0 References: <162846730406.22632.14734595494457390936@noble.neil.brown.name> <162855893202.12431.3423894387218130632@noble.neil.brown.name> <162882238416.1695.4958036322575947783@noble.neil.brown.name> In-Reply-To: From: Mike Javorski Date: Sat, 14 Aug 2021 18:23:46 -0700 Message-ID: Subject: Re: NFS server regression in kernel 5.13 (tested w/ 5.13.9) To: NeilBrown Cc: linux-nfs@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org I managed to get a cap with several discreet freezes in it, and I included a chunk with 5 of them in the span of ~3000 frames. I added packet comments at each frame that the tshark command reported as > 1 sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet details. This is with kernel 5.13.10 provided by Arch (See https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1 for diff vs mainline, nothing NFS/RPC related I can identify). I tried unsuccessfully to get any failures with the 5.12.15 kernel. https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing File should be downloadable anonymously. - mike On Thu, Aug 12, 2021 at 7:53 PM Mike Javorski wrote: > > The "semi-known-good" has been the client. I tried updating the server > multiple times to a 5.13 kernel and each time had to downgrade to the > last 5.12 kernel that ArchLinux released (5.12.15) to stabilize > performance. At each attempt, the client was running the same 5.13 > kernel that was being deployed to the server. I never downgraded the > client. > > Thank you for the scripts and all the details, I will test things out > this weekend when I can dedicate time to it. > > - mike > > On Thu, Aug 12, 2021 at 7:39 PM NeilBrown wrote: > > > > On Fri, 13 Aug 2021, Mike Javorski wrote: > > > Neil: > > > > > > Apologies for the delay, your message didn't get properly flagged in my email. > > > > :-) > > > > > > > > To answer your questions, both client (my Desktop PC) and server (my > > > NAS) are running ArchLinux; client w/ current kernel (5.13.9), server > > > w/ current or alternate testing kernels (see below). > > > > So the bug could be in the server or the client. I assume you are > > careful to test a client against a know-good server, or a server against > > a known-good client. > > > > > I > > > intend to spend some time this weekend attempting to get the tcpdump. > > > My initial attempts wound up with 400+Mb files which would be > > > difficult to ship and use for diagnostics. > > > > Rather than you sending me the dump, I'll send you the code. > > > > Run > > tshark -r filename -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.time > 1' > > > > This will ensure the NFS traffic is actually decoded as NFS and then > > report only NFS(rpc) replies that arrive more than 1 second after the > > request. > > You can add > > > > -T fields -e frame.number -e rpc.time > > > > to find out what the actual delay was. > > > > If it reports any, that will be interesting. Try with a larger time if > > necessary to get a modest number of hits. Using editcap and the given > > frame number you can select out 1000 packets either side of the problem > > and that should compress to be small enough to transport. > > > > However it might not find anything. If the reply never arrives, you'll > > never get a reply with a long timeout. So we need to check that > > everything got a reply... > > > > tshark -r filename -t tcp.port==2049,rpc \ > > -Y 'tcp.port==2049 && rpc.msg == 0' -T fields \ > > -e rpc.xid -e frame.number | sort > /tmp/requests > > > > tshark -r filename -t tcp.port==2049,rpc \ > > -Y 'tcp.port==2049 && rpc.msg == 1' -T fields \ > > -e rpc.xid -e frame.number | sort > /tmp/replies > > > > join -a1 /tmp/requests /tmp/replies | awk 'NF==2' > > > > This should list the xid and frame number of all requests that didn't > > get a reply. Again, editcap can extract a range of frames into a file of > > manageable size. > > > > Another possibility is that requests are getting replies, but the reply > > says "NFS4ERR_DELAY" > > > > tshark -r filename -t tcp.port==2049,rpc -Y nfs.nfsstat4==10008 > > > > should report any reply with that error code. > > > > Hopefully something there will be interesting. > > > > NeilBrown