Received: by 2002:a05:7208:3188:b0:7e:5202:c8b4 with SMTP id r8csp803308rbd; Fri, 23 Feb 2024 04:17:17 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCXnd4uTguoYJ8vOF7g5lAl8bf12GMHdp8450YtbmqoyxbjyT+LN/YSaN9IrfXyW8OFT+TxNVGj3Zjw5/XORKWQSjGFvH2LujbwnSNkf+g== X-Google-Smtp-Source: AGHT+IEQXZ01sBVB25DgecHh9FKNu/NbiCOP4ESQih9kfRWavEdWBJ0QpKAscwONMPujtRnWnJHM X-Received: by 2002:a92:c80e:0:b0:365:101e:b47d with SMTP id v14-20020a92c80e000000b00365101eb47dmr1665965iln.0.1708690637491; Fri, 23 Feb 2024 04:17:17 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1708690637; cv=pass; d=google.com; s=arc-20160816; b=lASQw6ndUI8W5t0VkKUKWsCcXrUL8rYfmAGFeyQoNoLei0NL2G+IvLPRZGaISSek8q 5afzO0zQS1zc403SSgM+3IHojjazKjiY+S8M5mS/9hVfSa5Ybs6v1y8iSmH4b612gCDH htutoTTEZlIyZ5QVvXyNqMeM+d8k09u88twJqsuDi5POIKWmPobv+AOLUK/vbsn/D8xC 5ATcnnETh4/gFTbr2b6Jj0i7bkKdJh8dOPJx7L0PGkBQYvjTzqznBeYtsFn1zCVOJkz1 iUGJuAVDzActJewTCbzj6aC1iOoSDGk+QFff+ZEnS3GF0XENYmKSYbbHqdY+zybsho8j g40Q== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:organization:references :in-reply-to:message-id:subject:cc:to:from:date; bh=FxG9J350XfUe1UvPl4JC3RkQgxuv89D9lsJ4xhWtfng=; fh=gYmxPOVWiMcJpDPMUcuXHaRGtBhvAgZW5JkMHQQ3lvA=; b=FiYLic6N+98f4uKwLetEuBgLY4WmD8d8zpg5oe722i9F75v7LEZ4MKwKHlBPuLl5SS JiEhzrCumLLDY8ThA00agd21pSQWWzRfKHWQPp2+dCFabVKDoA/XG/N6F5qU/V4Y548F hOgXhgkuAm2/gAItjgGAaNZLTGzcCzsUNZUc5Zif851yi7bnj2sm1NvCcy7KtAtcBzCJ 0uueSLWOX/fKPL6kPiopV7t8SkzzC3dRqaSDDUzqcK4lGnedHy80W/RK71V/jSQqABXG E9aCduX+4Eb7glxQ/RLsX/u6FM81PYZYGR5+lXkIQBEMQ9FZK/f+kL0N3SkszPnl5aLs +jEg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=huawei.com dmarc=pass fromdomain=huawei.com); spf=pass (google.com: domain of linux-kernel+bounces-78283-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-78283-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id e22-20020a656796000000b005d55aec96b2si11915842pgr.41.2024.02.23.04.17.17 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 23 Feb 2024 04:17:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-78283-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=huawei.com dmarc=pass fromdomain=huawei.com); spf=pass (google.com: domain of linux-kernel+bounces-78283-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-78283-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 2DB302867CB for ; Fri, 23 Feb 2024 12:17:17 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id DFEE17BAF8; Fri, 23 Feb 2024 12:17:09 +0000 (UTC) Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 360AC5C90C; Fri, 23 Feb 2024 12:17:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708690629; cv=none; b=haFEuefbmTpOXt8PAKVPc8QPQk1CqOxsUQQ0zvRCuf1XY+lucvRH0S9XfXtxlIAO/nastBFzPwEFUfItlQk68zOaSKFT4bmVeJgOW+962ckwLrLJKXDyQT5K0EID866kbAl/z64ciBQq2Co/n9ZivLt1dxyShS8w5eynuQwiAas= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1708690629; c=relaxed/simple; bh=qDrweQQqf8nzNthDbW29boU/th8Tx7nKWJxKyp/ks0w=; h=Date:From:To:CC:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=gfxIxTNfU69dKBvuG1MWJXrFVnl6rKlIqwVWqS/eQ9eDTVVjtY17R+DColcg0MwyFUSDsw4mUHBdwQ4MZItNI87DlNqXpngw4U248SaqJEg0ZID2sXqRx5iW/mZYMR2AoGv0ImR8n3AoDRAsYXZAxShlone84yOe8X+zDxbxTX0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=Huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=Huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4Th86H5SCgz6K64V; Fri, 23 Feb 2024 20:12:55 +0800 (CST) Received: from lhrpeml500005.china.huawei.com (unknown [7.191.163.240]) by mail.maildlp.com (Postfix) with ESMTPS id B5251140B54; Fri, 23 Feb 2024 20:17:04 +0800 (CST) Received: from localhost (10.202.227.76) by lhrpeml500005.china.huawei.com (7.191.163.240) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.35; Fri, 23 Feb 2024 12:17:03 +0000 Date: Fri, 23 Feb 2024 12:17:01 +0000 From: Jonathan Cameron To: Dan Williams CC: Shuai Xue , Borislav Petkov , Ira Weiny , "Luck, Tony" , "james.morse@arm.com" , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Message-ID: <20240223121701.00004bcf@Huawei.com> In-Reply-To: <20240223120813.00005d1f@Huawei.com> References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20240204080144.7977-2-xueshuai@linux.alibaba.com> <20240219092528.GTZdMeiDWIDz613VeT@fat_crate.local> <65d82c9352e78_24f3f294d5@dwillia2-mobl3.amr.corp.intel.com.notmuch> <20240223120813.00005d1f@Huawei.com> Organization: Huawei Technologies Research and Development (UK) Ltd. X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-ClientProxiedBy: lhrpeml100004.china.huawei.com (7.191.162.219) To lhrpeml500005.china.huawei.com (7.191.163.240) On Fri, 23 Feb 2024 12:08:13 +0000 Jonathan Cameron wrote: > On Thu, 22 Feb 2024 21:26:43 -0800 > Dan Williams wrote: > > > Shuai Xue wrote: > > > > > > > > > On 2024/2/19 17:25, Borislav Petkov wrote: > > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote: > > > >> Synchronous error was detected as a result of user-space process accessing > > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception > > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a > > > >> memory_failure() work which poisons the related page, unmaps the page, and > > > >> then sends a SIGBUS to the process, so that a system wide panic can be > > > >> avoided. > > > >> > > > >> However, no memory_failure() work will be queued when abnormal synchronous > > > >> errors occur. These errors can include situations such as invalid PA, > > > >> unexpected severity, no memory failure config support, invalid GUID > > > >> section, etc. In such case, the user-space process will trigger SEA again. > > > >> This loop can potentially exceed the platform firmware threshold or even > > > >> trigger a kernel hard lockup, leading to a system reboot. > > > >> > > > >> Fix it by performing a force kill if no memory_failure() work is queued > > > >> for synchronous errors. > > > >> > > > >> Signed-off-by: Shuai Xue > > > >> --- > > > >> drivers/acpi/apei/ghes.c | 9 +++++++++ > > > >> 1 file changed, 9 insertions(+) > > > >> > > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > > > >> index 7b7c605166e0..0892550732d4 100644 > > > >> --- a/drivers/acpi/apei/ghes.c > > > >> +++ b/drivers/acpi/apei/ghes.c > > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes, > > > >> } > > > >> } > > > >> > > > >> + /* > > > >> + * If no memory failure work is queued for abnormal synchronous > > > >> + * errors, do a force kill. > > > >> + */ > > > >> + if (sync && !queued) { > > > >> + pr_err("Sending SIGBUS to current task due to memory error not recovered"); > > > >> + force_sig(SIGBUS); > > > >> + } > > > > > > > > Except that there are a bunch of CXL GUIDs being handled there too and > > > > this will sigbus those processes now automatically. > > > > > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always > > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is > > > delivered as a synchronous notification. > > > > > > Will the CXL component trigger synchronous events for which we need to terminate the > > > current process by sending sigbus to process? > > > > None of the CXL component errors should be handled as synchronous > > events. They are either asynchronous protocol errors, or effectively > > equivalent to CPER_SEC_PLATFORM_MEM notifications. > > Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA. > Premature send.:( One example I can point at is how we do signaling of memory errors detected by the host into a VM on arm64. https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391 CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA). Right now we've only used async in QEMU for proposed CXL error CPER records signalling but your reference to them being similar to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be synchronous in some physical systems as it's one viable way to provide rich information for synchronous reception of poison. For the VM case my assumption today is we don't care about providing the VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for errors whether from CXL CPER records or not. Jonathan