Subject: Re: [PATCH v6] ACPI / APEI: fix the regression of synchronous external aborts occur in user-mode
To: "Rafael J. Wysocki"
References: <1623218580-41912-1-git-send-email-tanxiaofei@huawei.com>
CC: James Morse, "Rafael J.
 Wysocki", Len Brown, Tony Luck, Borislav Petkov, Andrew Morton,
 Joerg Roedel, Peter Zijlstra, ACPI Devel Maling List,
 Linux Kernel Mailing List
From: Xiaofei Tan
Message-ID: <68ce6212-29f0-beb1-16fa-bd5e7d1e2806@huawei.com>
Date: Thu, 10 Jun 2021 14:44:14 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.7.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.40.192.162]
X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To dggpemm500001.china.huawei.com (7.185.36.107)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Rafael,

On 2021/6/9 21:22, Rafael J. Wysocki wrote:
> On Wed, Jun 9, 2021 at 8:06 AM Xiaofei Tan wrote:
>>
>> Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
>> synchronise with APEI's irq work"), do_sea() would unconditionally
>> signal the affected task from the arch code. Since that change,
>> the GHES driver sends the signals.
>
> Since this fixes a regression apparently introduced by the above
> commit, please add a Fixes tag pointing to that commit to it.
>

OK.

>> This exposes a problem as errors the GHES driver doesn't understand
>> or doesn't handle effectively are silently ignored. It will cause
>> the errors get taken again, and circulate endlessly. User-space task
>> get stuck in this loop.
>>
>> Existing firmware on Kunpeng9xx systems reports cache errors with the
>> 'ARM Processor Error' CPER records.
>>
>> Do memory failure handling for ARM Processor Error Section just like
>> for Memory Error Section.
>
> So why is this the right thing to do?
>
> I guess it doesn't address the problem entirely, but only in this
> particular case, so what if the firmware on some other platform
> reports errors with a new type unknown to the GHES driver? Will the
> problem show up again?

Yes.
The GHES driver should give the right feedback to the arch code. I mean
that apei_claim_sea() or ghes_notify_sea() should not return 0 if the
error is unknown. But this seems difficult to achieve with the current
architecture of the GHES driver.

>
>> Signed-off-by: Xiaofei Tan
>> Reviewed-by: James Morse
>>
>> ---
>> Changes since v5:
>> - Do some changes following James's suggestions: 1) optimize commit log
>> 2) use err_info->length instead of err_info++' 3) some coding style
>> advice.
>>
>> Changes since v4:
>> - 1. Change the patch name from " ACPI / APEI: do memory failure on the
>> physical address reported by ARM processor error section" to this
>> more proper one.
>> - 2. Add a comment in the code to tell why not filter out corrected
>> error in an uncorrected section.
>>
>> Changes since v3:
>> - Print unhandled error following James Morse's advice.
>>
>> Changes since v2:
>> - Updated commit log
>> ---
>> drivers/acpi/apei/ghes.c | 81 ++++++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 64 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index fce7ade..0c8330e 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -441,28 +441,35 @@ static void ghes_kick_task_work(struct callback_head *head)
>>  	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>>  }
>>
>> -static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> -				       int sev)
>> +static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>  {
>>  	unsigned long pfn;
>> -	int flags = -1;
>> -	int sec_sev = ghes_severity(gdata->error_severity);
>> -	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>>
>>  	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>  		return false;
>>
>> -	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>> -		return false;
>> -
>> -	pfn = mem_err->physical_addr >> PAGE_SHIFT;
>> +	pfn = PHYS_PFN(physical_addr);
>>  	if (!pfn_valid(pfn)) {
>>  		pr_warn_ratelimited(FW_WARN GHES_PFX
>>  		"Invalid address in generic error data: %#llx\n",
>> -		mem_err->physical_addr);
>> +		physical_addr);
>>  		return false;
>>  	}
>>
>> +	memory_failure_queue(pfn, flags);
>> +	return true;
>> +}
>> +
>> +static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> +				       int sev)
>> +{
>> +	int flags = -1;
>> +	int sec_sev = ghes_severity(gdata->error_severity);
>> +	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>> +
>> +	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>> +		return false;
>> +
>>  	/* iff following two events can be handled properly by now */
>>  	if (sec_sev == GHES_SEV_CORRECTED &&
>>  	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>> @@ -470,14 +477,56 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>  	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>>  		flags = 0;
>>
>> -	if (flags != -1) {
>> -		memory_failure_queue(pfn, flags);
>> -		return true;
>> -	}
>> +	if (flags != -1)
>> +		return ghes_do_memory_failure(mem_err->physical_addr, flags);
>>
>>  	return false;
>>  }
>>
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +{
>> +	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> +	bool queued = false;
>> +	int sec_sev, i;
>> +	char *p;
>> +
>> +	log_arm_hw_error(err);
>> +
>> +	sec_sev = ghes_severity(gdata->error_severity);
>> +	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
>> +		return false;
>> +
>> +	p = (char *)(err + 1);
>> +	for (i = 0; i < err->err_info_num; i++) {
>> +		struct cper_arm_err_info *err_info = (struct cper_arm_err_info *)p;
>> +		bool is_cache = (err_info->type == CPER_ARM_CACHE_ERROR);
>> +		bool has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
>> +		const char *error_type = "unknown error";
>> +
>> +		/*
>> +		 * The field (err_info->error_info & BIT(26)) is fixed to set to
>> +		 * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
>> +		 * firmware won't mix corrected errors in an uncorrected section,
>> +		 * and don't filter out 'corrected' error here.
>> +		 */
>> +		if (is_cache && has_pa) {
>> +			queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> +			p += err_info->length;
>> +			continue;
>> +		}
>> +
>> +		if (err_info->type < ARRAY_SIZE(cper_proc_error_type_strs))
>> +			error_type = cper_proc_error_type_strs[err_info->type];
>> +
>> +		pr_warn_ratelimited(FW_WARN GHES_PFX
>> +				    "Unhandled processor error type: %s\n",
>> +				    error_type);
>> +		p += err_info->length;
>> +	}
>> +
>> +	return queued;
>> +}
>> +
>>  /*
>>   * PCIe AER errors need to be sent to the AER driver for reporting and
>>   * recovery. The GHES severities map to the following AER severities and
>> @@ -605,9 +654,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>>  			ghes_handle_aer(gdata);
>>  		}
>>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> -
>> -			log_arm_hw_error(err);
>> +			queued = ghes_handle_arm_hw_error(gdata, sev);
>>  		} else {
>>  			void *err = acpi_hest_get_payload(gdata);
>>
>> --
>> 2.8.1
>>
>
> .
>
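
To make the "right feedback" idea concrete, here is a rough user-space sketch. It is not kernel code and not part of the patch. Everything in it (struct rec_hdr, REC_CACHE, handle_records(), claim_error()) is a hypothetical stand-in invented for illustration; the real structures are struct cper_sec_proc_arm and struct cper_arm_err_info. It only shows two things: stepping through variable-length records by each record's own length field, and returning 0 to the caller only when at least one record was actually handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; NOT the real struct cper_arm_err_info. */
struct rec_hdr {
	uint8_t type;    /* REC_CACHE, or any other value = not understood */
	uint8_t length;  /* total record size in bytes, header included */
	uint8_t has_pa;  /* non-zero if a physical address is recorded */
};

enum { REC_CACHE = 0 };

/*
 * Walk 'num' variable-length records stored in buf[0..size).
 * Returns the number of records handled, or -EINVAL on malformed input.
 */
static int handle_records(const uint8_t *buf, size_t size, unsigned int num)
{
	size_t off = 0;
	int handled = 0;

	assert(buf != NULL);

	for (unsigned int i = 0; i < num; i++) {
		struct rec_hdr hdr;

		if (size - off < sizeof(hdr))
			return -EINVAL;
		memcpy(&hdr, buf + off, sizeof(hdr));

		/* A record must cover its header and fit in the buffer. */
		if (hdr.length < sizeof(hdr) || hdr.length > size - off)
			return -EINVAL;

		if (hdr.type == REC_CACHE && hdr.has_pa)
			handled++;  /* stand-in for queueing memory failure work */

		off += hdr.length;  /* advance by the record's own length */
	}
	return handled;
}

/*
 * Feedback to the caller: 0 only when something was actually handled,
 * so the caller could fall back to signalling the task otherwise.
 */
static int claim_error(const uint8_t *buf, size_t size, unsigned int num)
{
	int ret = handle_records(buf, size, num);

	if (ret < 0)
		return ret;
	return ret > 0 ? 0 : -ENOENT;
}
```

In this toy model, a buffer holding only records of an unknown type makes claim_error() return -ENOENT instead of 0, which is the kind of feedback the arch code would need in order to avoid retaking the same error endlessly.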