From: Alexandru Gagniuc
To: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org
Cc: Alexandru Gagniuc, "Rafael J. Wysocki", Len Brown, Tony Luck,
 Borislav Petkov, Mauro Carvalho Chehab, Robert Moore, Erik Schmauss,
 Tyler Baicar, Will Deacon, James Morse, Shiju Jose,
 "Jonathan (Zhixiong) Zhang", Dongjiu Geng,
 linux-kernel@vger.kernel.org, devel@acpica.org
Subject: [RFC PATCH v3 2/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
Date: Wed, 25 Apr 2018 15:39:50 -0500
Message-Id: <20180425203957.18224-3-mr.nuke.me@gmail.com>
X-Mailer: git-send-email 2.14.3
In-Reply-To: <20180425203957.18224-1-mr.nuke.me@gmail.com>
References: <20180416215903.7318-1-mr.nuke.me@gmail.com> <20180425203957.18224-1-mr.nuke.me@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either unstable
or unusable. They don't indicate that the machine is on fire, and they
are not severe enough that we need to panic().
Instead of relying on crackmonkey firmware, evaluate the error severity
based on what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc
---
 drivers/acpi/apei/ghes.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f9b53a6f55f3..8ccb9cc10fc8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
+ * GHES_SEV_PANIC -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return CPER_SEV_RECOVERABLE;
+
+	return ghes_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ * - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_actual_severity(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	worst_sev = GHES_SEV_NO;
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_severity(gdata->error_severity);
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			sec_sev = ghes_sec_pcie_severity(gdata);
+
+		worst_sev = max(worst_sev, sec_sev);
+	}
+
+	return worst_sev;
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -932,7 +971,7 @@ static void __process_error(struct ghes *ghes)
 static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 {
 	struct ghes *ghes;
-	int sev, ret = NMI_DONE;
+	int sev, asev, ret = NMI_DONE;
 
 	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
 		return ret;
@@ -945,8 +984,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
+		asev = ghes_actual_severity(ghes);
 		sev = ghes_severity(ghes->estatus->error_severity);
-		if (sev >= GHES_SEV_PANIC) {
+		if ((sev >= GHES_SEV_PANIC) && (asev >= GHES_SEV_PANIC)) {
 			oops_begin();
 			ghes_print_queued_estatus();
 			__ghes_panic(ghes);
-- 
2.14.3
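
The decision the patch describes (take the worst per-section severity, cap handled PCIe sections at "recoverable", and panic only when both the status block and the sections call for it) can be tried outside the kernel with a small user-space sketch. Everything below is a hypothetical stand-in for illustration: the `sketch_*` names, the boolean fields, and the `aer_enabled` flag are assumptions that loosely mirror the GHES severity levels, the CPER validation bits, and CONFIG_ACPI_APEI_PCIEAER. None of it is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for the GHES severity levels, ordered so that a
 * larger value means a more severe error.
 */
enum sketch_sev {
	SKETCH_SEV_NO,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_PANIC,
};

/* Hypothetical, simplified view of one error section in a status block. */
struct sketch_section {
	bool is_pcie;			/* section describes a PCIe error */
	bool has_device_id;		/* device ID field is marked valid */
	bool has_aer_info;		/* AER info field is marked valid */
	enum sketch_sev reported;	/* severity firmware reported */
};

/*
 * A PCIe section that carries enough information for an AER-style handler
 * is treated as recoverable; anything else keeps its reported severity.
 */
enum sketch_sev sketch_section_severity(const struct sketch_section *sec,
					bool aer_enabled)
{
	if (sec->is_pcie && sec->has_device_id && sec->has_aer_info &&
	    aer_enabled)
		return SKETCH_SEV_RECOVERABLE;

	return sec->reported;
}

/* Worst severity over all sections, after the per-section adjustment. */
enum sketch_sev sketch_actual_severity(const struct sketch_section *secs,
				       size_t n, bool aer_enabled)
{
	enum sketch_sev worst = SKETCH_SEV_NO;
	size_t i;

	assert(secs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		enum sketch_sev s = sketch_section_severity(&secs[i],
							    aer_enabled);
		if (s > worst)
			worst = s;
	}

	return worst;
}

/* Panic only if the status block and the sections both demand it. */
bool sketch_should_panic(enum sketch_sev block_sev,
			 const struct sketch_section *secs, size_t n,
			 bool aer_enabled)
{
	return block_sev >= SKETCH_SEV_PANIC &&
	       sketch_actual_severity(secs, n, aer_enabled) >= SKETCH_SEV_PANIC;
}
```

With a status block marked fatal and a single well-formed PCIe section, `sketch_should_panic()` returns false. Adding any non-PCIe section reported as fatal, or passing `aer_enabled` as false, brings the result back to true.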