Date: Thu, 6 Apr 2023 17:33:12 +0200
From: Christian Herzog
To: Chuck Lever III
Cc: Linux NFS Mailing List
Subject: Re: file server freezes with all nfsds stuck in D state after upgrade to Debian bookworm
In-Reply-To: <6785EFE7-2CE1-45CD-8643-C40CCCDADEB8@oracle.com>

Dear Chuck,

> > for our researchers we are running file servers in the hundreds-of-TiB to
> > low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
> > LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
> > we prepared an upgrade to Debian bookworm and tests went well. About a week
> > after one of the upgrades, we ran into the first occurrence of our problem: all
> > of a sudden, all nfsds enter the D state and are not recoverable. However, the
> > underlying file systems seem fine and can be read and written to. The only way
> > out appears to be to reboot the server.
> > The only clues are the frozen nfsds
> > and stack traces like
> >
> > [<0>] rq_qos_wait+0xbc/0x130
> > [<0>] wbt_wait+0xa2/0x110
> >
> Hi Christian, you have a pretty deep storage stack!
> rq_qos_wait is a few layers below NFSD. Jens Axboe
> and linux-block are the folks who maintain that.

Are you saying the root cause isn't nfs*, but the file system? That was our
first idea too, but we haven't found any indication that this is the case. The
xfs file systems seem perfectly fine when all nfsds are in D state, and we can
read from them and write to them. If xfs were to block nfs IO, this should
affect other processes too, right?

Thanks and Happy Easter,
-Christian
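
The frozen nfsds and their stack traces mentioned above could be collected
along the following lines. This is only a sketch under assumptions: it assumes
a Linux /proc that exposes /proc/<pid>/stack (normally readable by root only),
and the helper names parse_stat and blocked_tasks are made up for
illustration. The thread does not say how the traces were actually gathered.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return comm, state


def blocked_tasks(proc_root="/proc", name="nfsd"):
    """Yield (pid, kernel_stack) for tasks called `name` that are in state D."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited or stat was unreadable
        if comm != name or state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = "(stack not readable)"
        yield int(entry), stack
```

Run as root on the affected server, iterating over blocked_tasks() and printing
each pid and stack should list every nfsd thread in uninterruptible sleep
together with the kernel function it is waiting in.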