Subject: Re: [RFC PATCH v4 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
To: Borislav Petkov
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, "Rafael J. Wysocki", Len Brown, Tony Luck, Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar, Will Deacon, James Morse, Shiju Jose, "Jonathan (Zhixiong) Zhang", Dongjiu Geng, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org, devel@acpica.org
References: <20180430212836.7807-1-mr.nuke.me@gmail.com> <20180430213358.8319-1-mr.nuke.me@gmail.com> <20180430213358.8319-3-mr.nuke.me@gmail.com> <20180511154039.GD12705@pd.tnic> <8e3c0cc6-9c5c-85ce-650c-8f498f5907da@gmail.com> <20180511160253.GF12705@pd.tnic> <45b7be09-c9b3-8006-6ea0-36b4ff38607c@gmail.com> <20180511162951.GH12705@pd.tnic>
From: "Alex G."
Message-ID: <95bcbc2d-0f8c-e51a-f0fc-08ea8c5fca26@gmail.com>
Date: Fri, 11 May 2018 12:01:52 -0500
In-Reply-To: <20180511162951.GH12705@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/11/2018 11:29 AM, Borislav Petkov wrote:
> On Fri, May 11, 2018 at 11:12:25AM -0500, Alex G. wrote:
>>> I think *you* didn't get it: IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER) is not
>>> enough of a check to confirm that there actually *is* an AER driver to
>>> handle the errors. If you really want to make sure the driver is loaded
>>> and functioning, then you need an explicit registering mechanism or some
>>> other way of checking it really is there and handling errors.
>>
>> config ACPI_APEI_PCIEAER
>>         bool "APEI PCIe AER logging/recovering support"
>>         depends on ACPI_APEI && PCIEAER
>>         help
>>           PCIe AER errors may be reported via APEI firmware first mode.
>>           Turn on this option to enable the corresponding support.
>>
>> PCIAER is not modularizable. QED
>
> QED my ass.
>
> Read the f*ck my email again: the presence of the *code* is
> not enough of a check to confirm the error has been handled.
> aer_recover_work_func() can fail as that kfifo_put() in
> aer_recover_queue() can too.
>
> You need an *actual* confirmation that the error has been handled
> properly and *only* *then* not panic the system. Otherwise you are
> potentially leaving those errors unhandled.

"How is PCIe error severity dependent on whether the AER error reporting
driver is enabled (and possibly not even loaded) on the system?"

Little about confirmation of error being handled was talked about either
in your **** email, or previous versions of this series. And quite
frankly it's outside the scope of this patch.
The scope is to enable SURPRISE!!! removal of NVMe drives and PCIe
devices. For that purpose, we don't need confirmation that the error was
handled. Such a confirmation requires a rework of GHES handling, or at
least the interaction between GHES and AER, both of which I find to be
mostly satisfactory.

You can't at this point know if the error is going to be handled.
There's code further downstream to handle this. You also didn't like it
when I wanted to handle things downstream.

I understand your concern with unhandled AER errors evolving into MCEs.
That's extremely rare, but when it happens you still panic due to the
MCE. To give you an idea of the rarity, in several months of testing, I
was only able to reproduce MCEs once, and that was with a very defective
drive, and a very idiotic test case.

If you find this solution unacceptable, that's fine. We can fix it in
firmware. We can hide all the events from the OS, contain the downstream
ports, and simulate hot-remove interrupts. All in firmware, all the time.

Alex
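
For readers following the thread, the two positions argued above can be
sketched in plain userspace C. This is only an illustrative model, not
kernel code: HANDLER_BUILT_IN stands in for a build-time check such as
IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER), and err_queue_put() stands in for
a bounded hand-off like the kfifo_put() in aer_recover_queue() mentioned
in the quoted mail. Every name below (err_queue, decide_by_config,
decide_by_handoff, ACTION_DEFER, ACTION_ESCALATE) is hypothetical and
made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical build-time switch: it only says the handler code was compiled in. */
#define HANDLER_BUILT_IN 1

/* Deliberately tiny so the queue-full case is easy to reach. */
#define QUEUE_SLOTS 2

/* Hypothetical bounded queue of pending error records. */
struct err_queue {
	int entries[QUEUE_SLOTS];
	size_t count;
};

enum action {
	ACTION_DEFER,    /* leave the error to the downstream handler */
	ACTION_ESCALATE  /* treat the error as unhandled */
};

/* Returns true if the record was queued, false if the queue was full. */
static bool err_queue_put(struct err_queue *q, int err_id)
{
	assert(q != NULL);
	if (q->count == QUEUE_SLOTS)
		return false;
	q->entries[q->count++] = err_id;
	return true;
}

/* Policy A: decide from the build-time switch alone; the hand-off result is ignored. */
static enum action decide_by_config(struct err_queue *q, int err_id)
{
	(void)err_queue_put(q, err_id);
	return HANDLER_BUILT_IN ? ACTION_DEFER : ACTION_ESCALATE;
}

/* Policy B: defer only if the hand-off to the handler actually succeeded. */
static enum action decide_by_handoff(struct err_queue *q, int err_id)
{
	if (HANDLER_BUILT_IN && err_queue_put(q, err_id))
		return ACTION_DEFER;
	return ACTION_ESCALATE;
}
```

With a two-slot queue, the third error is where the two policies differ:
decide_by_config() still defers even though the record was dropped, while
decide_by_handoff() escalates. Which behaviour is appropriate for GHES is
exactly what the thread disputes; the sketch only makes the difference
concrete.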