Received: by 2002:a05:6500:1b8f:b0:1fa:5c73:8e2d with SMTP id df15csp406611lqb; Tue, 28 May 2024 21:39:35 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCUa3b5qspZG7kVS8+zh6sw2eOpnObVbWSZ1cbmnNiboK/3eEgmg2ARgu5A8A0sSoRyW39HnLoS7w9r8a/15JMkpbSgp1RLj9+3WpCoiEQ== X-Google-Smtp-Source: AGHT+IEGf7R9WtGelgV9q3MFN5i87TyT8fO3F7X+V+R44ZcwpBn5wOfjxAFyAAVfeRF01+Sm/gL9 X-Received: by 2002:ac8:5984:0:b0:43e:3c19:a2dd with SMTP id d75a77b69052e-43fb0efed1dmr181424591cf.46.1716957575111; Tue, 28 May 2024 21:39:35 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1716957575; cv=pass; d=google.com; s=arc-20160816; b=SkxKAgzDgdNtAJPNumrT0uG0b2DFfraN6+uuINs4UZaM8MYUrz87dI5Rc5pnC66DLC Ovm87cBMJiC6dv4hgSv50VzsN1Fe0oCLHkQZKlFbmzFPYlFNa9UbSBTteZeFisU6w3dQ azWc5x3OOmJaM0AAuQ44L5nO1XHXkSUdvGlfWebnK9rB0In8/X4TNhwCQv4fYYnXvLsb 8UYFL9843E7ZinH9bnPxo5GGA3VNMIzaCe2RKW9D3+vZ7jR4sMoRmYY7qzO4KPO89jfZ b5v6vmB+GtNvvBBnJzuN1/0UN5GmE1J0kmJh7ZOrPBKZryyOOjeJ9WkuwUBpmBvo0k8r Yw0g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:date:message-id; bh=erM23kK9vzq2Est2HYQO4qfMZC+Q2p/Bt5zesIKH+ng=; fh=FSCrSZwImijT0S64kia3nPuYNZpARZ7VJE0ViULyq3k=; b=U7ytUCU9SjJv3yNT8/xwiRocpYKTWxJ9+V2JOdsJucN8eae7zo1JnF5q69jvrVMXmU mwMudGX2oXLv/81+dz6FnztwdmQQFZJj1bv2umMEZhaUY2pPkC+0Vui3HutHaC25VpHj A6DBLNNMiWMOjkQxBGRFJYEo0hECtwFdniKEbxp/CX5ePs0msW+AGkHfxU8/6bC1mli/ Zvwm155Gb9bX8OokVbbC1mRm1B0/tnGQ/dsmDnrxy1sxcPcMuywo/MKEm+Re/HsqnPbE H47hy5VTufPvYNARL/9jb6bmYAtZqDXw3dTc1Q5teqVzDjuZOBG8YTh1KqyxRW5/LUvY nxFw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=zhaoxin.com); spf=pass (google.com: domain of linux-kernel+bounces-193423-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-193423-linux.lists.archive=gmail.com@vger.kernel.org" Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [147.75.199.223]) by mx.google.com with ESMTPS id d75a77b69052e-43fb18c7421si115913551cf.624.2024.05.28.21.39.34 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 28 May 2024 21:39:35 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-193423-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=zhaoxin.com); spf=pass (google.com: domain of linux-kernel+bounces-193423-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-193423-linux.lists.archive=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id BF5481C21DE0 for ; Wed, 29 May 2024 04:39:34 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id E86B415B158; Wed, 29 May 2024 04:39:27 +0000 (UTC) Received: from mx2.zhaoxin.com (mx2.zhaoxin.com [203.110.167.99]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E3F5828E8 for ; Wed, 29 May 2024 04:39:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=203.110.167.99 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716957567; cv=none; b=Efm19qF+FlOpMM7EYq3eQ92hsQoYYiEpO/Gw+DK87AfmtNCdWWotFp45+QJkOqJvdrLGREJpveCUjId6EKvv3BqzXCUEI833XOVAuadlbTRQGMPx3g3acWJsNkX8SfiMx34CDPaNufqmmrc11RF1NUcwlgct+0/iNkgjo1ypzw8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716957567; c=relaxed/simple; bh=w+SMDGzTJFQuqdvJfu1aC+qtLD5AS6Lz4nehZdCVabs=; h=Message-ID:Date:MIME-Version:Subject:To:CC:References:From: In-Reply-To:Content-Type; b=Lmvzwm9rqnSlLC2zpXhyAafP6JKX4lDhlISGODD8bGRAZRaabI0tVpYnpYdG/0qRv+rOlgpxmoIYALLhAClcYHXefFDeZoCsOCNQWDyWLRjELEeAsmraTL/ErH35vsWrsTStXExvEiGmsDdM5N+9I1cu2+p8a9g7jIBHeCFZ7s4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=zhaoxin.com; spf=pass smtp.mailfrom=zhaoxin.com; arc=none smtp.client-ip=203.110.167.99 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=zhaoxin.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=zhaoxin.com X-ASG-Debug-ID: 1716957556-1eb14e15780f850001-xx1T2L Received: from ZXSHMBX1.zhaoxin.com (ZXSHMBX1.zhaoxin.com [10.28.252.163]) by mx2.zhaoxin.com with ESMTP id 0VsrEwHzzn7pxCgS (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO); Wed, 29 May 2024 12:39:16 +0800 (CST) X-Barracuda-Envelope-From: TonyWWang-oc@zhaoxin.com X-Barracuda-RBL-Trusted-Forwarder: 10.28.252.163 Received: from ZXBJMBX03.zhaoxin.com (10.29.252.7) by ZXSHMBX1.zhaoxin.com (10.28.252.163) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.27; Wed, 29 May 2024 12:39:16 +0800 Received: from [10.32.65.162] (10.32.65.162) by ZXBJMBX03.zhaoxin.com (10.29.252.7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.27; Wed, 29 May 2024 12:39:14 +0800 X-Barracuda-RBL-Trusted-Forwarder: 10.28.252.163 Message-ID: <2553dd17-f763-4894-89b7-5f76c03d3a37@zhaoxin.com> X-Barracuda-RBL-Trusted-Forwarder: 10.32.65.162 Date: Wed, 29 May 2024 12:39:02 +0800 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress To: Thomas Gleixner , Dave Hansen , , , , , , , , , , , , , , , , , X-ASG-Orig-Subj: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress CC: , , , Linus Torvalds References: <20240528063836.5248-1-TonyWWang-oc@zhaoxin.com> <50fc1bd3-909e-41c4-a991-9d81e32ef92c@intel.com> <87wmnda8mc.ffs@tglx> Content-Language: en-US From: Tony W Wang-oc In-Reply-To: <87wmnda8mc.ffs@tglx> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-ClientProxiedBy: ZXSHCAS2.zhaoxin.com (10.28.252.162) To ZXBJMBX03.zhaoxin.com (10.29.252.7) X-Barracuda-Connect: ZXSHMBX1.zhaoxin.com[10.28.252.163] X-Barracuda-Start-Time: 1716957556 X-Barracuda-Encrypted: ECDHE-RSA-AES128-GCM-SHA256 X-Barracuda-URL: https://10.28.252.36:4443/cgi-mod/mark.cgi X-Virus-Scanned: by bsmtpd at zhaoxin.com X-Barracuda-Scan-Msg-Size: 6655 X-Barracuda-BRTS-Status: 1 X-Barracuda-Bayes: INNOCENT GLOBAL 0.0000 1.0000 -2.0210 X-Barracuda-Spam-Score: -2.02 X-Barracuda-Spam-Status: No, SCORE=-2.02 using global scores of TAG_LEVEL=1000.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=9.0 tests= X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.3.125490 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- On 2024/5/29 06:12, Thomas Gleixner wrote: > > > [这封邮件来自外部发件人 谨防风险] > > On Tue, May 28 2024 at 07:18, Dave Hansen wrote: >> On 5/27/24 23:38, Tony W Wang-oc wrote: >> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c >>> index c96ae8fee95e..ecadd0698d6a 100644 >>> --- a/arch/x86/kernel/hpet.c >>> +++ b/arch/x86/kernel/hpet.c >>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs) >>> if (in_nmi()) >>> return (u64)hpet_readl(HPET_COUNTER); >>> >>> + /* >>> + * Read HPET directly if panic in progress. >>> + */ >>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID)) >>> + return (u64)hpet_readl(HPET_COUNTER); >>> + >> >> There is literally one other piece of the code in the kernel doing >> something similar: the printk() implementation. There's no other >> clocksource or timekeeping code that does this on any architecture. >> >> Why doesn't this problem apply to any other clock sources? > > I principle it applies to any clocksource which needs a spinlock to > serialize access. HPET is not the only insanity here. > > Think about i8253 :) > > Most real clocksources, like TSC and the majority of the preferred clock > sources on other architectures are perfectly fine. They just read and be > done. > >> Why should the problem be fixed in the clock sources themselves? Why >> doesn't printk() deadlock on systems using the HPET? > > Because regular printk()s are deferred to irq work when in NMI and > similar contexts, but that obviously does not apply to panic > situations. Also NMI is treated special even in the HPET code. See > below. > >> In other words, I think we should fix pstore to be more like printk >> rather than hacking around this in each clock source. > > pstore is perfectly fine. It uses a NMI safe time accessor function > which is then tripping over the HPET lock. That's really a HPET specific > problem. > > Though what I read out of the changelog is that the MCE hits the same > CPU 'x' which holds the lock. But that's fairy tale material as you can > see in the patch above: > > if (in_nmi()) > return (u64)hpet_readl(HPET_COUNTER); > > For that particular case the dead lock, which would actually be a live > lock, cannot happen because in kernel MCEs are NMI class exceptions and > therefore in_nmi() evaluates to true and that new voodoo can't be > reached at all. > > Now there are two other scenarios which really can make that happen: > > 1) A non-NMI class exception within the lock held region > > CPU A > acquire(hpet_lock); > ... <- #PF, #GP, #DE, ... -> panic() > > If any of that happens within that lock held section then the live > lock on the hpet_lock is the least of your worries. Seriously, I > don't care about this at all. > > 2) The actual scenario is: > > CPU A CPU B > lock(hpet_lock) > MCE hits user space > ... > exc_machine_check_user() > irqentry_enter_from_user_mode(regs); > > irqentry_enter_from_user_mode() obviously does not mark the > exception as NMI class, so in_nmi() evaluates to false. That would > actually dead lock if CPU A is not making progress and releases > hpet_lock. > > Sounds unlikely to happen, right? But in reality it can because of > MCE broadcast. Assume that both CPUs go into MCE: > > CPU A CPU B > lock(hpet_lock) > exc_machine_check_user() > irqentry_enter_from_user_mode(); > exc_machine_check_kernel() do_machine_check() > irqentry_nmi_enter(); mce_panic() > do_machine_check() if (atomic_inc_return(&mce_panicked) > 1) > mce_panic() wait_for_panic(); <- Not taken > > if (atomic_inc_return(&mce_panicked) > 1) > wait_for_panic(); <- Taken > > .... > hpet_read() > > -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A > obviously can't release the lock. > Because MCE handler will call printk() before call the panic, so printk() deadlock may happen in this scenario: CPU A CPU B printk() lock(console_owner_lock) exc_machine_check_user() irqentry_enter_from_user_mode() exc_machine_check_kernel() do_machine_check() irqentry_nmi_enter() mce_panic() do_machine_check() printk_mce() #A mce_panic() ... wait_for_panic() panic() printk deadlock will happened at #A because in_nmi() evaluates to false on CPU B and CPU B do not enter the panic() AT #A. Update user space MCE handler to NMI class context is preferred? Sincerely TonyWWang-oc > So the proposed patch makes sense to some extent. But it only cures the > symptom. The real underlying questions are: > > 1) Should we provide a panic mode read callback for clocksources which > are affected by this? > > 2) Is it correct to claim that a MCE which hits user space and ends up in > mce_panic() is still just a regular exception or should we upgrade to > NMI class context when we enter mce_panic() or even go as far to > upgrade to NMI class context for any panic() invocation? > > #1 Solves it at the clocksource level. It still needs HPET specific > changes. > > #2 Solves a whole class of issues > > ... while potentially introducing new ones :) > > To me upgrading any panic() invocation to NMI class context makes a > lot of sense because in that case all bets are off. > > in_nmi() is used in quite some places to avoid such problems. IOW, > that would kill a whole class of issues instead of "curing" the HPET > problem locally for the price of an extra conditional. Not that the > extra conditional matters much if HPET is the clocksource as that's > awfully slow anyway and I really don't care about that. > > But I very much care about avoiding to sprinkle panic_cpu checks all > over the place. > > Thanks, > > tglx