Message-ID: <57bd6874-35df-48b0-90d8-45077396b44f@linux.alibaba.com>
Date: Tue, 21 Nov 2023 09:48:28 +0800
Subject: Re:
[PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code
To: rafael@kernel.org, wangkefeng.wang@huawei.com, tanxiaofei@huawei.com,
 mawupeng1@huawei.com, tony.luck@intel.com, linmiaohe@huawei.com,
 naoya.horiguchi@nec.com, james.morse@arm.com, gregkh@linuxfoundation.org,
 will@kernel.org, jarkko@kernel.org
Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 linux-edac@vger.kernel.org, acpica-devel@lists.linuxfoundation.org,
 stable@vger.kernel.org, x86@kernel.org, justin.he@arm.com, ardb@kernel.org,
 ying.huang@intel.com, ashish.kalra@amd.com, baolin.wang@linux.alibaba.com,
 bp@alien8.de, tglx@linutronix.de, mingo@redhat.com,
 dave.hansen@linux.intel.com, lenb@kernel.org, hpa@zytor.com,
 robert.moore@intel.com, lvying6@huawei.com, xiexiuqi@huawei.com,
 zhuo.song@linux.alibaba.com
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
 <20231007072818.58951-1-xueshuai@linux.alibaba.com>
From: Shuai Xue
In-Reply-To: <20231007072818.58951-1-xueshuai@linux.alibaba.com>

Hi, ALL,

Gentle ping.

Best Regards,
Shuai

On 2023/10/7 15:28, Shuai Xue wrote:
> Hi, ALL,
>
> I have rewritten the cover letter in the hope that the maintainers will
> understand the necessity of this patch set. Both Alibaba and Huawei have hit
> the same issue in their products, and we hope it can be fixed ASAP.
>
> ## Changes Log
>
> changes since v8:
> - remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
> - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
> - rewrite the return value comments of memory_failure (per Naoya Horiguchi)
>
> changes since v7:
> - rebase to Linux v6.6-rc2 (no code changed)
> - rewrite the cover letter to explain the motivation of this patchset
>
> changes since v6:
> - add more explicit error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
>   unexpected severity, OOM, etc
> - pick up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/
>
>
> ## Cover Letter
>
> There are two major types of uncorrected recoverable (UCR) errors:
>
> - Action Required (AR): The error is detected and the processor has already
>   consumed the memory. The OS is required to take action (for example, offline
>   the failed page or kill the failing thread) to recover from this error.
>
> - Action Optional (AO): The error is detected out of processor execution
>   context. Some data in the memory are corrupted. But the data have not
>   been consumed.
>   The OS may optionally take action to recover from this error.
>
> The main difference between AR and AO errors is that AR errors are synchronous
> events, while AO errors are asynchronous events. Synchronous exceptions, such as
> Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
> Arm64, are signaled by the hardware when an error is detected and the memory
> access has architecturally been executed.
>
> Currently, both synchronous and asynchronous errors are queued as AO errors and
> handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
> synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
> ensure that the corrupted page is unmapped and poisoned. Upon returning to
> user-space, the process resumes at the current instruction, triggering a page
> fault. As a result, the kernel sends a SIGBUS signal to the current process due
> to VM_FAULT_HWPOISON.
>
> However, this trick is not always effective. This patch set improves the
> recovery process in three specific aspects:
>
> 1. Handle synchronous exceptions with proper si_code
>
> ghes_handle_memory_failure() queues both synchronous and asynchronous errors
> with flag=0. The kernel then notifies the user-space process by sending SIGBUS
> from memory_failure() with the wrong si_code: BUS_MCEERR_AO instead of
> BUS_MCEERR_AR. User-space processes rely on the si_code to decide how to handle
> the memory failure.
>
>     For example, hwpoison-aware user-space processes use the si_code:
>     BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
>     for 'action required' synchronous/late notifications. Specifically, when a
>     SIGBUS with BUS_MCEERR_AR is delivered to QEMU, it injects a vSEA into the
>     guest kernel. In contrast, a SIGBUS with BUS_MCEERR_AO is ignored
>     by QEMU.[1]
>
> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous
> events. (PATCH 1)
>
> 2.
>    Handle memory_failure() abnormal failures to avoid an unnecessary reboot
>
> If a process maps the faulty page but memory_failure() returns abnormally
> before try_to_unmap(), for example because the faulty page is a KSM page,
> arm64 cannot use page fault handling to terminate the synchronous exception
> loop.[4]
>
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, the kernel has the
> capability to recover from this error.
>
> Fix it by performing a force kill when memory_failure() fails abnormally or when
> other abnormal synchronous errors occur. These errors can include situations
> such as invalid PA, unexpected severity, no memory failure config support,
> invalid GUID section, OOM, etc. (PATCH 2)
>
> 3. Handle memory_failure() in the context of the current process, which
>    consumed the poison
>
> When synchronous errors occur, memory_failure() assumes that the current
> process context is exactly the one that consumed the poison.
>
> For example, kill_accessing_process() holds the mmap lock of current->mm, does
> a pagetable walk to find the error virtual address, and sends SIGBUS to the
> current process with error info. However, the mm of a kworker is not valid,
> resulting in a null-pointer dereference. I have fixed this in [3].
>
>     commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process
>
> Another example is that collect_procs()/kill_procs() walk the task list and
> collect and send SIGBUS only to the tasks that consumed the poison. But
> memory_failure() is queued and handled by a dedicated kernel thread on the
> arm64 platform.
>
> Fix it by queuing memory_failure() as a task work which runs in the current
> execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)
>
> ** In summary, this patch set handles synchronous errors in task work with
> proper si_code so that hwpoison-aware processes can recover from errors, and
> fixes (potentially) abnormal cases.
> **
>
> Lv Ying and XiuQi from Huawei also proposed to address a similar problem [2][4].
> I acknowledge the discussions with them.
>
> ## Steps to Reproduce This Problem
>
> To reproduce this problem:
>
>     # STEP1: enable early kill mode
>     #sysctl -w vm.memory_failure_early_kill=1
>     vm.memory_failure_early_kill = 1
>
>     # STEP2: inject a UCE error and consume it to trigger a synchronous error
>     #einj_mem_uc single
>     0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>     injecting ...
>     triggering ...
>     signal 7 code 5 addr 0xffffb0d75000
>     page not present
>     Test passed
>
> The si_code (code 5) from einj_mem_uc indicates a BUS_MCEERR_AO error, which
> is incorrect for a synchronous error.
>
> After this patch set:
>
>     # STEP1: enable early kill mode
>     #sysctl -w vm.memory_failure_early_kill=1
>     vm.memory_failure_early_kill = 1
>
>     # STEP2: inject a UCE error and consume it to trigger a synchronous error
>     #einj_mem_uc single
>     0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>     injecting ...
>     triggering ...
>     signal 7 code 4 addr 0xffffb0d75000
>     page not present
>     Test passed
>
> The si_code (code 4) from einj_mem_uc indicates a BUS_MCEERR_AR error, as
> expected.
>
> [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
> [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
> [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
> [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/
>
> Shuai Xue (2):
>   ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>     synchronous events
>   ACPI: APEI: handle synchronous exceptions in task work
>
>  arch/x86/kernel/cpu/mce/core.c | 9 +--
>  drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++---------
>  include/acpi/ghes.h | 3 -
>  include/linux/mm.h | 1 -
>  mm/memory-failure.c | 22 ++-----
>  5 files changed, 82 insertions(+), 66 deletions(-)
>
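
For readers following the si_code discussion in the quoted cover letter, here is
a minimal user-space sketch of how a hwpoison-aware process might tell the two
notifications apart. It is not taken from the patch set: the enum and the
function names classify_sigbus() and classify_siginfo() are hypothetical, and
the fallback values 4 and 5 are assumed only because they match the
"code 4" / "code 5" lines printed by einj_mem_uc above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/*
 * Fallbacks for C libraries that do not expose the Linux-specific codes.
 * Assumption: 4 and 5 are the values shown in the einj_mem_uc output above.
 */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification used only for this illustration. */
enum poison_action {
    POISON_NOT_HWPOISON,    /* not a memory-failure notification */
    POISON_ACTION_REQUIRED, /* synchronous: the poison was consumed */
    POISON_ACTION_OPTIONAL  /* asynchronous: early notification */
};

/* Map a (signal number, si_code) pair to the action the process should take. */
static enum poison_action classify_sigbus(int signo, int si_code)
{
    if (signo != SIGBUS)
        return POISON_NOT_HWPOISON;

    switch (si_code) {
    case BUS_MCEERR_AR:
        return POISON_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return POISON_ACTION_OPTIONAL;
    default:
        /* Other SIGBUS causes, e.g. alignment or addressing errors. */
        return POISON_NOT_HWPOISON;
    }
}

/* Convenience wrapper for the siginfo_t passed to an SA_SIGINFO handler. */
static enum poison_action classify_siginfo(const siginfo_t *info)
{
    assert(info != NULL);
    return classify_sigbus(info->si_signo, info->si_code);
}
```

In a real program, classify_siginfo() would be called from a handler installed
with sigaction() and SA_SIGINFO, with info->si_addr giving the faulting address.
With the "signal 7 code 5" output shown before the patch set, such a process
would treat a consumed error as optional, which is the misclassification that
PATCH 1 is meant to correct.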