Date: Sun, 5 Nov 2023 12:02:05 +0000
From: "Dr. David Alan Gilbert"
To: Donald Buczek
Cc: Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org
Subject: Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
References: <77184fcc-46ab-4d69-b163-368264fa49f7@molgen.mpg.de>
In-Reply-To: <77184fcc-46ab-4d69-b163-368264fa49f7@molgen.mpg.de>
User-Agent: Mutt/2.2.12 (2023-09-09)
X-Mailing-List: linux-kernel@vger.kernel.org

* Donald Buczek (buczek@molgen.mpg.de) wrote:
> Hello, experts,
>
> we have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years and with v5.15.86 for 8 month.
> Then we've updated to 6.1.52 and after a few hours it froze: No more I/O activity to one of its filesystems, processes trying to access it blocked until we reboot.
>
> Of course, at first we blamed the kernel as this happened after an upgrade. But after several experiments with different kernel versions, we've returned to the v5.15.86 kernel we used before, but still experienced the problem. Then we suspected, that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit and removed it. That didn't fix it either. For all I can say, all software is back to the state which worked before.

I'm not sure, but did you check /proc/cpuinfo after that revert to confirm that the microcode version dropped back (or physically power cycle the machine)? I'm not sure whether a reboot reverts the microcode version.

> Now the strange part: What we usually do, when we have a situation like this, is that we run a script which takes several procfs and sysfs information which happened to be useful in the past. It was soon discovered, that just running this script unblocks the system. I/O continues as if nothing ever happened. Then we singled-stepped the operations of the script to find out, what action exactly gets the system to resume. It is this part:
>
> for task in /proc/*/task/*; do
>     echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>     cmd cat $task/stack
> done
>
> which can further be reduced to
>
> for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
>
> This is absolutely reproducible. Above line unblocks the system reliably.
>
> Another remarkable thing: We've modified above code to do the processes slowly one by one and checking after each step if I/O resumed. And each time we've tested that, it was one of the 64 nfsd processes (but not the very first one tried).
> While the systems exports filesystems, we have absolutely no reason to assume, that any client actually tries to access this nfs server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
>
> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such an "healing" effect?

Not me; but have you tried something simpler, like sysrq-d or sysrq-w, to show held locks and blocked tasks?

> I'm well aware, that, for example, a hardware problem might result in just anything and that the question might not be answerable at all. If so: please excuse the noise.

It seems a weird hardware problem to have that specific a way to unblock it.

Dave

> Thanks
> Donald
> --
> Donald Buczek
> buczek@molgen.mpg.de
> Tel: +49 30 8413 1433
-- 
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/