Received: by 2002:a05:6500:1b8f:b0:1fa:5c73:8e2d with SMTP id df15csp267856lqb; Tue, 28 May 2024 15:13:15 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCWlp7n/hwPYnSaGcvMZqLteHc8oPlaHP+QuvgALAOnZZCxPcyYwu/xYQ/VlRexUC/VdD9YLZRu91sBxMaSNv6vjZaC2QwHBKCgdSBiFtw== X-Google-Smtp-Source: AGHT+IHVh5/UJ2SCDaUGgd+IB/DcrfVZqIowg2mtHACQOP8oUFvHuw5hFRR31ZQBgCLS3k3j4DDJ X-Received: by 2002:a54:441a:0:b0:3c9:6e80:8ea2 with SMTP id 5614622812f47-3d1a6019290mr13575668b6e.32.1716934395132; Tue, 28 May 2024 15:13:15 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1716934395; cv=pass; d=google.com; s=arc-20160816; b=nrBJ+5lJQaeqGwwX3p6GSiKGeaGzhZ/Dr5/nUyNJQ44FWlQD/JUdOP+iXPPqKN0Wwo U6YVVIBCksfBbxl+vA4PjgUIBdk57DkVaNV+w8y7zqKfCwM04MY6LFGJtsZYQy/6TPpV YvkowCf04iDHXd2YUCUAZXCy33Xe/K7nsOcRzb2jFIg49zfuW1vLx0rvyBSRzLiCQyEE k8caevQNCPnDwNYVllwLFFGI2kldS0Snag8KWehxX+/zfmZZ5B48EA2cl7FE4TbGWD7z XmsFjqE3zrgNks/9Uj0lr8eX4TsSudgBq34o6jY7lRCmlYi4xxIWfCVATEe3Dfp5TcdS AN8g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=mime-version:list-unsubscribe:list-subscribe:list-id:precedence :message-id:date:references:in-reply-to:subject:cc:to:dkim-signature :dkim-signature:from; bh=GZub3AKGtf3m0WqaMw01L0KFTwsMRo2LAh+Cg+JRsv0=; fh=a5v44vo4thnqBZmYQwmDC7EHnyzvULj7NgnoUIkhGe8=; b=DMsFsGDHzEo8OTMMG8ncr+fJCuoBJIk4iilvLEQMOeleOcpM/0XZ3eXMuC3BJZ2C17 39HT3iOnET0bJ7KdGtd5hI27mtTvZkQ2FJvNr2jaOhDpeB93xwGRlXFy+QqQv5M7FM6i 13oiWIYr3Cfq4dclR7guC7M++HcaQiHIjv6+w00sd2dHocmvzuR3PH2tjO+0cDAwdqNh PM7gJeRQ7rSzTv+lgSKQvMUZhHcYWYQNXEOC863CzDUyGtfY/0ONJEKhmsp79oNb1oNW Gcy2Jm29C1I0/ifziSY4nwOwr8K+Uuwb85d+Nz+8YR8eK6YyRw+kfCIx7a514dABf1cc Hl4Q==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=10EpJ6hN; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; arc=pass (i=1 spf=pass spfdomain=linutronix.de dkim=pass dkdomain=linutronix.de dmarc=pass fromdomain=linutronix.de); spf=pass (google.com: domain of linux-kernel+bounces-193149-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-193149-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [147.75.199.223]) by mx.google.com with ESMTPS id 6a1803df08f44-6ac06f5031fsi115780886d6.2.2024.05.28.15.13.14 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 28 May 2024 15:13:15 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-193149-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=10EpJ6hN; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; arc=pass (i=1 spf=pass spfdomain=linutronix.de dkim=pass dkdomain=linutronix.de dmarc=pass fromdomain=linutronix.de); spf=pass (google.com: domain of linux-kernel+bounces-193149-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-193149-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 4D0921C23851 for ; Tue, 28 May 2024 22:12:54 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 465DF13DDAA; Tue, 28 May 2024 22:12:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="10EpJ6hN"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="YNfAxC6G" Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A78A913DDA1; Tue, 28 May 2024 22:12:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716934368; cv=none; b=d6krU/cLQRfbqsnAYUSIQ2QMXa3iWMmic/HQRgPcfLVmm+p0FhltNJYO99KW0F3+3xFsI+W5pK5kbKqUuNZ208WdVRDG194Syn979uqJTI12AEEkGDD+hzgrgtQOCtG7C5lVmUqnP8MbGIn8ZYAjl4Dcktskpc0FffGp8TW5AFk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716934368; c=relaxed/simple; bh=1hLjs6HYpgIThhhT9gGLIDRvVYnA1jimJtxl/FXjhzA=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=Lh29IKxvkF9yG6EGNTPToN2I+szaKEmquXXXbO/macn/TiIKvnCjzw6+XDuAJ9M27LvOzYVYKmyWjsMO+IIqHtkn3jPNelJ6934TO+iuOCFM5g+/Gc9QuVsAZ9LddZ7edu/NgTqHARrfFtdca40ygQG3RE53upUXefNGCXg7brM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=10EpJ6hN; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=YNfAxC6G; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1716934363; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=GZub3AKGtf3m0WqaMw01L0KFTwsMRo2LAh+Cg+JRsv0=; b=10EpJ6hNlROGnXgv687DoZ4wgrzIzfraguXluq2TvrA53ZzXTtjgzsRZ2ntZiDyk/sEXUR Nry6KMgOAzHcXuIvuzVYZcK2ArTMU644G4vdAqqpMa7EGFNWM2+ykO/VkE0EUGG+baruFQ uARHLnUAFKHBRj9jbK+VoQMcupN5KkiEtvvBoEv8ucRCCaVzBClfoAllvSCUD624cvx3WL sYSpqCgfwZ0p/Wa2yVukWBPO8bTtq1QQYE6d7jaCm9Yw/QDsmUFZaX8zF5nzOnxkDInZD3 lwyfDDNg6YQpvUUtQYdcs4tpGO9dpGHdzYj+FURtGK+em5+saXBYVYImnLD1Sg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1716934363; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=GZub3AKGtf3m0WqaMw01L0KFTwsMRo2LAh+Cg+JRsv0=; b=YNfAxC6G+tEJE6ZRI9e2UBKdG84tVoH4nusCdtPdZV2AGpGAdsn5bo+qXBR8qrsZa1LCZY AFQ1L/bahS1cvKDg== To: Dave Hansen , Tony W Wang-oc , mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, keescook@chromium.org, tony.luck@intel.com, gpiccoli@igalia.com, mat.jonczyk@o2.pl, rdunlap@infradead.org, alexandre.belloni@bootlin.com, mario.limonciello@amd.com, yaolu@kylinos.cn, bhelgaas@google.com, justinstitt@google.com, linux-kernel@vger.kernel.org, linux-hardening@vger.kernel.org Cc: CobeChen@zhaoxin.com, TimGuo@zhaoxin.com, LeoLiu-oc@zhaoxin.com, Linus Torvalds Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress In-Reply-To: <50fc1bd3-909e-41c4-a991-9d81e32ef92c@intel.com> References: <20240528063836.5248-1-TonyWWang-oc@zhaoxin.com> <50fc1bd3-909e-41c4-a991-9d81e32ef92c@intel.com> Date: Wed, 29 May 2024 00:12:43 +0200 Message-ID: <87wmnda8mc.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain On Tue, May 28 2024 at 07:18, Dave Hansen wrote: > On 5/27/24 23:38, Tony W Wang-oc wrote: > ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c >> index c96ae8fee95e..ecadd0698d6a 100644 >> --- a/arch/x86/kernel/hpet.c >> +++ b/arch/x86/kernel/hpet.c >> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs) >> if (in_nmi()) >> return (u64)hpet_readl(HPET_COUNTER); >> >> + /* >> + * Read HPET directly if panic in progress. >> + */ >> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID)) >> + return (u64)hpet_readl(HPET_COUNTER); >> + > > There is literally one other piece of the code in the kernel doing > something similar: the printk() implementation. There's no other > clocksource or timekeeping code that does this on any architecture. > > Why doesn't this problem apply to any other clock sources? I principle it applies to any clocksource which needs a spinlock to serialize access. HPET is not the only insanity here. Think about i8253 :) Most real clocksources, like TSC and the majority of the preferred clock sources on other architectures are perfectly fine. They just read and be done. > Why should the problem be fixed in the clock sources themselves? Why > doesn't printk() deadlock on systems using the HPET? Because regular printk()s are deferred to irq work when in NMI and similar contexts, but that obviously does not apply to panic situations. Also NMI is treated special even in the HPET code. See below. > In other words, I think we should fix pstore to be more like printk > rather than hacking around this in each clock source. pstore is perfectly fine. It uses a NMI safe time accessor function which is then tripping over the HPET lock. That's really a HPET specific problem. Though what I read out of the changelog is that the MCE hits the same CPU 'x' which holds the lock. But that's fairy tale material as you can see in the patch above: if (in_nmi()) return (u64)hpet_readl(HPET_COUNTER); For that particular case the dead lock, which would actually be a live lock, cannot happen because in kernel MCEs are NMI class exceptions and therefore in_nmi() evaluates to true and that new voodoo can't be reached at all. Now there are two other scenarios which really can make that happen: 1) A non-NMI class exception within the lock held region CPU A acquire(hpet_lock); ... <- #PF, #GP, #DE, ... -> panic() If any of that happens within that lock held section then the live lock on the hpet_lock is the least of your worries. Seriously, I don't care about this at all. 2) The actual scenario is: CPU A CPU B lock(hpet_lock) MCE hits user space ... exc_machine_check_user() irqentry_enter_from_user_mode(regs); irqentry_enter_from_user_mode() obviously does not mark the exception as NMI class, so in_nmi() evaluates to false. That would actually dead lock if CPU A is not making progress and releases hpet_lock. Sounds unlikely to happen, right? But in reality it can because of MCE broadcast. Assume that both CPUs go into MCE: CPU A CPU B lock(hpet_lock) exc_machine_check_user() irqentry_enter_from_user_mode(); exc_machine_check_kernel() do_machine_check() irqentry_nmi_enter(); mce_panic() do_machine_check() if (atomic_inc_return(&mce_panicked) > 1) mce_panic() wait_for_panic(); <- Not taken if (atomic_inc_return(&mce_panicked) > 1) wait_for_panic(); <- Taken .... hpet_read() -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A obviously can't release the lock. So the proposed patch makes sense to some extent. But it only cures the symptom. The real underlying questions are: 1) Should we provide a panic mode read callback for clocksources which are affected by this? 2) Is it correct to claim that a MCE which hits user space and ends up in mce_panic() is still just a regular exception or should we upgrade to NMI class context when we enter mce_panic() or even go as far to upgrade to NMI class context for any panic() invocation? #1 Solves it at the clocksource level. It still needs HPET specific changes. #2 Solves a whole class of issues ... while potentially introducing new ones :) To me upgrading any panic() invocation to NMI class context makes a lot of sense because in that case all bets are off. in_nmi() is used in quite some places to avoid such problems. IOW, that would kill a whole class of issues instead of "curing" the HPET problem locally for the price of an extra conditional. Not that the extra conditional matters much if HPET is the clocksource as that's awfully slow anyway and I really don't care about that. But I very much care about avoiding to sprinkle panic_cpu checks all over the place. Thanks, tglx