Date: Thu, 27 Apr 2023 11:27:33 +0200
From: Christian Herzog
To: linux-nfs@vger.kernel.org
Subject: re: file server freezes with all nfsds stuck in D state after upgrade to Debian

Hello again,

Three weeks ago we reported on nfsd D-state-induced freezes on our bookworm-upgraded file servers [1]. The general consensus at the time seems to have been that the real issue was deeper in our storage stack, so we headed over to linux-block, but we were never able to pinpoint the issue.
We just had another instance where all 64 of our nfsd processes were stuck in D state. This time the stack traces look different, we have some more hints in our logs, and this time we're fairly sure it's nfsd and not general block I/O. All 64 nfsds have similar stack traces:

14 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_shutdown_callback+0x49/0x130 [nfsd]
[<0>] __destroy_client+0x1f3/0x290 [nfsd]
[<0>] nfsd4_exchange_id+0x752/0x760 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

9 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_shutdown_callback+0x49/0x130 [nfsd]
[<0>] __destroy_client+0x1f3/0x290 [nfsd]
[<0>] nfsd4_exchange_id+0x358/0x760 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

41 processes:
[<0>] __flush_workqueue+0x152/0x420
[<0>] nfsd4_destroy_session+0x1b6/0x250 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

20 minutes prior to the first frozen nfsds, we saw messages similar to

  receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000fcdd40ac xid 182df75c

These messages appear to come from receive_cb_reply [2], and it looks like xprt_lookup_rqst cannot find the RPC request belonging to a certain transaction. We see these messages with different values for xpt_bc_xprt, which we think correspond to the different NFS clients.

All of this is on production file servers running Debian bookworm with iSCSI block devices and XFS file systems.

Does anyone have suggestions on how to debug this further? Unfortunately we have yet to find a way to trigger it deliberately; for the time being it happens whenever it happens. (A minimal sketch of how the per-thread stacks above can be collected is appended after the signature.)

thanks and best regards,
-Christian

[1] https://www.spinics.net/lists/linux-nfs/msg96048.html
[2] https://elixir.bootlin.com/linux/v6.1.20/source/net/sunrpc/svcsock.c#L902

--
Dr. Christian Herzog                      support: +41 44 633 26 68
Head, IT Services Group, HPT H 8          voice:   +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland
http://isg.phys.ethz.ch/
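PS: in case it helps anyone trying to capture the same data: the per-thread traces above are in the /proc/<pid>/stack format. Below is a minimal sketch of a collection script that dumps the kernel stack of every nfsd thread currently in D state; it assumes root and a kernel where /proc/<pid>/stack is available, and the grouping of identical traces shown above is not part of it. It is a simplified illustration, not a polished tool.

#!/usr/bin/env python3
# Dump the kernel stack of every nfsd process currently in D (uninterruptible) state.
# Needs root; /proc/<pid>/stack must be available on the running kernel.
import os

def comm(pid):
    """Return the short command name of a process, or None if it went away."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return None

def state(pid):
    """Return the one-letter state from /proc/<pid>/stat (the field after the comm)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # The comm may contain spaces or parentheses, so split on the last ')'.
            return f.read().rsplit(")", 1)[1].split()[0]
    except (OSError, IndexError):
        return None

for pid in sorted(int(p) for p in os.listdir("/proc") if p.isdigit()):
    if comm(pid) == "nfsd" and state(pid) == "D":
        try:
            with open(f"/proc/{pid}/stack") as f:
                print(f"=== nfsd pid {pid} ===")
                print(f.read())
        except OSError:
            pass  # process exited or stack not readable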