Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
    rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
    tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
    shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
    linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
    austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org,
    mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
    <20180416215903.7318-4-mr.nuke.me@gmail.com>
    <20180418175415.GJ4795@pd.tnic>
    <20180419154006.GE3600@pd.tnic>
    <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
    <20180419164528.GD5635@pd.tnic>
    <20180419190323.GF5635@pd.tnic>
From: "Alex G."
Date: Thu, 19 Apr 2018 17:55:08 -0500
In-Reply-To: <20180419190323.GF5635@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
>
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
>
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal?

I suppose you could make some inference, given the timing of other
events going on around the crash. It's not uncommon to see a "Card not
present" event around drive removal.

Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.

I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.

> How does such an error look like, in detail?

It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[ 52.711616] {1}[Hardware Error]: event severity: fatal
[ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
[ 52.721891] {1}[Hardware Error]: section_type: PCIe error
[ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 52.734075] {1}[Hardware Error]: version: 3.0
[ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
[ 52.750271] {1}[Hardware Error]: slot: 4
[ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
[ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
[ 52.766123] {1}[Hardware Error]: class_code: 000406
[ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask: 0x01a10000
[ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
[ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f e12023bc 01000000
[ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[ 52.883895] pci 0000:b3:00.0: device has no driver
[ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up

> Got error logs somewhere to dump?

Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.

Alex

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log
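
The AER fields in the dump above can be decoded by hand. What follows is
a minimal sketch, not anything taken from the kernel: it assumes the
standard PCIe AER uncorrectable-error bit layout (the log itself confirms
bit 20 as "Unsupported Request"), and it assumes that a set bit in the
severity register marks that error as fatal. The names AER_UNCOR_BITS and
decode_aer() are made up for illustration.

```python
# Names for a subset of bits in the PCIe AER Uncorrectable Error Status
# register. Assumption: standard AER layout; only bit 20 is confirmed by
# the log above ("[20] Unsupported Request").
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer(status, mask, severity):
    """Return one dict per error bit set in `status`.

    `masked` tells whether the same bit is set in the mask register;
    `fatal` tells whether it is set in the severity register
    (assumption: set bit means fatal, clear bit means non-fatal).
    """
    errors = []
    for bit in range(32):
        if (status >> bit) & 1:
            errors.append({
                "bit": bit,
                "name": AER_UNCOR_BITS.get(bit, "bit %d" % bit),
                "masked": bool((mask >> bit) & 1),
                "fatal": bool((severity >> bit) & 1),
            })
    return errors


# Values copied from the pcieport lines in the dump above.
for err in decode_aer(0x00100000, 0x01a10000, 0x004eb030):
    print(err)
```

With the values from the dump, the only bit set in aer_status is bit 20,
and that bit is clear in both aer_mask and aer_uncor_severity. Under the
assumptions stated above, the port's own AER severity setting would treat
this Unsupported Request as non-fatal, even though the GHES record
carries "event severity: fatal".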