Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
 rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
 tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
 shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
 linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
 austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org,
 mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com,
 Yazen Ghannam, Ard Biesheuvel
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
 <20180416215903.7318-4-mr.nuke.me@gmail.com>
 <20180418175415.GJ4795@pd.tnic>
 <20180419154006.GE3600@pd.tnic>
 <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
 <20180419164528.GD5635@pd.tnic>
 <20180419190323.GF5635@pd.tnic>
 <20180422104849.GA32754@pd.tnic>
From: "Alex G."
Message-ID: <70c43399-e8e5-5061-b5a5-451deb5f02fa@gmail.com>
Date: Mon, 23 Apr 2018 23:19:25 -0500
In-Reply-To: <20180422104849.GA32754@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/22/2018 05:48 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
>>> How does such an error look like, in detail?
>>
>> It's green on the soft side, with lots of red accents, as well as some
>> textured white shades:
>>
>> [ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
>> [ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
>> to correct
>> [ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
>> [ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
>> Hardware Error Source: 1
>> [ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>> [ 52.711616] {1}[Hardware Error]: event severity: fatal
>> [ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
>> [ 52.721891] {1}[Hardware Error]: section_type: PCIe error
>> [ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
>> [ 52.734075] {1}[Hardware Error]: version: 3.0
>> [ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
>> [ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
>> [ 52.750271] {1}[Hardware Error]: slot: 4
>> [ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
>> [ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
>> [ 52.766123] {1}[Hardware Error]: class_code: 000406
>> [ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000,
>> control: 0x0003
>> [ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
>> 0x01a10000
>> [ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
>> [ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
>> [ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
>> aer_agent=Requester ID
>> [ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
>> [ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f
>> e12023bc 01000000
>> [ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
>> [ 52.883895] pci 0000:b3:00.0: device has no driver
>> [ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
>> queued; currently getting powered on
>> [ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>
> Btw, from another discussion we're having with Yazen:
>
> @Yazen, do you see how this error record is worth shit?
>
> class_code: 000406
> command: 0x0407, status: 0x0010
> bridge: secondary_status: 0x0000, control: 0x0003
> aer_status: 0x00100000, aer_mask: 0x01a10000
> aer_uncor_severity: 0x004eb030

That tells you what FFS said about the error. Keep in mind that FFS has
cleared the hardware error bits, which the AER handler would normally
read from the PCI device.

> those above are only some of the fields which are purely useless
> undecoded. Makes me wonder what's worse for the user: dump the
> half-decoded error or not dump an error at all...

It's immediately obvious if there's a glaring FFS bug and if we get
bogus data. If you distrust firmware as much as I do, then you will find
great value in having such info in the logs. It's probably not too
useful to a casual user, but then neither is a majority of the system
log.
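
For illustration, a minimal sketch of decoding the raw AER values quoted
above, assuming the standard bit layout of the PCIe AER Uncorrectable
Error Status, Mask and Severity registers. The table and helper names
(AER_UNCOR_BITS, decode_aer_uncor, is_fatal) are hypothetical and are
not kernel APIs; the input values are copied from the log.

```python
# Assumed bit layout of the PCIe AER uncorrectable error registers
# (status, mask and severity share the same bit positions).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable register."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


def is_fatal(status, severity):
    """A reported error counts as fatal if its bit is also set in the severity register."""
    return bool(status & severity)


# Values copied from the error record in the log.
aer_status = 0x00100000
aer_mask = 0x01A10000
aer_uncor_severity = 0x004EB030

print("status:  ", decode_aer_uncor(aer_status))
print("mask:    ", decode_aer_uncor(aer_mask))
print("severity:", decode_aer_uncor(aer_uncor_severity))
print("fatal per severity register:", is_fatal(aer_status, aer_uncor_severity))
```

With these inputs, aer_status decodes to Unsupported Request, which
matches the "[20] Unsupported Request" line in the log. Bit 20 is clear
in aer_uncor_severity, so if the reported values reflect the register
contents, the device itself classifies this error as non-fatal even
though the record is marked fatal.
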

> Anyway, Alex, I see this in the logs:
>
> [ 66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [ 66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
> [ 66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present
>
> and that comes from that pciehp_isr() interrupt handler AFAICT.
>
> So there *is* a way to know that the card is not present anymore. So,
> theoretically, and ignoring the code layering for now, we can connect
> that error to the card not present event and then ignore the error...

You're missing the timing and assuming you will get the hotplug
interrupt. In this example, you have 22ms between the link down and
presence detect state change. This is a fairly fast removal.

Hotplug dependencies aside (you can have the kernel run without PCIe
hotplug support), I don't think you want to just linger in NMI for
dozens of milliseconds waiting for presence detect confirmation.

For enterprise SFF NVMe drives, the data lanes will disconnect before
the presence detect. FFS relies on presence detect, and these are two of
the reasons why slow removal is such a problem.

You might not get a presence detect interrupt at all. Presence detect is
optional for PCIe. PD is such a reliable heuristic that it guarantees
worse error handling than the crackmonkey firmware. I don't see how it
might be useful in a way which gives us better handling than firmware.

> Hmmm.

Hmmm

Anyway, heuristics about PCIe error recovery belong in the recovery
handler. I don't think it's smart to apply policy before we get there.

Alex