Date: Wed, 18 Apr 2018 19:54:15 +0200
From: Borislav Petkov
To: Alexandru Gagniuc
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
	tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
	shiju.jose@huawei.com, zjzhang@codeaurora.org,
	gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
	alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
	robert.moore@intel.com, erik.schmauss@intel.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei:
 Do not panic() when correctable errors are marked as fatal.
Message-ID: <20180418175415.GJ4795@pd.tnic>
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
	<20180416215903.7318-4-mr.nuke.me@gmail.com>
In-Reply-To: <20180416215903.7318-4-mr.nuke.me@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
> Firmware is evil:
> - ACPI was created to "try and make the 'ACPI' extensions somehow
>   Windows specific" in order to "work well with NT and not the others
>   even if they are open"
> - EFI was created to hide "secret" registers from the OS.
> - UEFI was created to allow compromising an otherwise secure OS.
>
> Never has firmware been created to solve a problem or simplify an
> otherwise cumbersome process. It is of no surprise then, that
> firmware nowadays intentionally crashes an OS.

I don't believe I'm saying this but, get rid of that rant. Even though I
agree, it doesn't belong in a commit message.

>
> One simple way to do that is to mark GHES errors as fatal. Firmware
> knows and even expects that an OS will crash in this case. And most
> OSes do.
>
> PCIe errors are notorious for having different definitions of "fatal".
> In ACPI, and other firmware standards, 'fatal' means the machine is
> about to explode and needs to be reset. In PCIe, on the other hand,
> fatal means that the link to a device has died. In the hotplug world
> of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
> loss of a link is a normal event. To allow a machine to crash in this
> case is downright idiotic.
>
> To solve this, implement an IRQ safe handler for AER. This makes sure
> we have enough information to invoke the full AER handler later down
> the road, and tells ghes_notify_nmi that "It's all cool".
> ghes_notify_nmi() then gets calmed down a little, and doesn't panic().
>
> Signed-off-by: Alexandru Gagniuc
> ---
>  drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2119c51b4a9e..e0528da4e8f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
>  	return ghes_severity(gdata->error_severity);
>  }
>
> +static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
> +					int sev)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	/* The system can always recover from AER errors. */
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +		pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_severity(gdata->error_severity);
> +}

Well, Tyler touched that AER error severity handling recently and we had
it all nicely documented in the comment above ghes_handle_aer().

Your ghes_handle_aer_irqsafe() graft basically bypasses
ghes_handle_aer() instead of being incorporated into it.

If all you wanna say is that the severity computation should go through
all the sections and look at each error's severity before making a
decision, then add that to ghes_severity() instead of doing that
"deferrable" severity dance.

And add the changes to the policy to the comment above
ghes_handle_aer(). I don't want any changes from people coming and going
and leaving us scratching our heads over why we did it this way.

And no need for those handlers and so on - make it simple first - then
we can talk about more complex handling.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
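
For illustration only, here is a rough user-space sketch of what "go
through all the sections and look at each error's severity before
making a decision" could look like. Nothing in it is kernel code: the
sketch_* types, constants and functions are hypothetical stand-ins for
the CPER/GHES definitions. The rule that only a "fatal" PCIe section is
downgraded to recoverable is an assumption made for the example, not
the policy agreed on in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the kernel's CPER definitions. */
enum sketch_sev {
	SKETCH_SEV_NO,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_PANIC,
};

enum sketch_section_type {
	SKETCH_SEC_OTHER,
	SKETCH_SEC_PCIE,
};

#define SKETCH_PCIE_VALID_DEVICE_ID	(1u << 0)
#define SKETCH_PCIE_VALID_AER_INFO	(1u << 1)

struct sketch_section {
	enum sketch_section_type type;
	enum sketch_sev reported_sev;	/* severity as reported by firmware */
	uint32_t validation_bits;
};

/*
 * Per-section policy (assumed for this sketch): a PCIe section that
 * firmware marked fatal, but which carries both a device ID and AER
 * info, is treated as recoverable. Everything else keeps the
 * severity firmware reported.
 */
static enum sketch_sev sketch_section_severity(const struct sketch_section *sec)
{
	const uint32_t need = SKETCH_PCIE_VALID_DEVICE_ID |
			      SKETCH_PCIE_VALID_AER_INFO;

	if (sec->type == SKETCH_SEC_PCIE &&
	    sec->reported_sev == SKETCH_SEV_PANIC &&
	    (sec->validation_bits & need) == need)
		return SKETCH_SEV_RECOVERABLE;

	return sec->reported_sev;
}

/* Walk every section and return the worst severity found. */
static enum sketch_sev sketch_status_severity(const struct sketch_section *secs,
					      size_t n)
{
	enum sketch_sev worst = SKETCH_SEV_NO;
	size_t i;

	for (i = 0; i < n; i++) {
		enum sketch_sev s = sketch_section_severity(&secs[i]);

		if (s > worst)
			worst = s;
	}

	return worst;
}
```

With this shape there is one place that decides the severity, so the
caller only panics when some section still comes out as fatal after
the per-section policy has been applied, and the policy itself can be
documented in a single comment.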