From: Alexandru Gagniuc
To: linux-acpi@vger.kernel.org
Cc: rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com, bp@alien8.de,
    tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
    shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
    linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
    austin_bolen@dell.com, shyam_iyer@dell.com, Alexandru Gagniuc
Subject: [RFC PATCH 0/4] acpi: apei: Improve error handling with firmware-first
Date: Tue, 3 Apr 2018 12:08:26 -0500
Message-Id: <20180403170830.29282-1-mr.nuke.me@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

I'm helping Dell work through the issues related to PCIe and NVMe
hotplug. Although hotplug generally works, there are corner cases such
as pin bounce, drives failing and surprise removal that are not 100%
worked out. Because of this, NVMe is not yet at feature parity with SCSI
and SAS.

One of the interesting issues is that most server vendors like to use
firmware-first (FFS), for various reasons that I won't go into. The side
effect of that is that we oftentimes don't even get a stab at correcting
the problem.
This is especially troublesome for NVMe, which needs PCIe hotplug to
work correctly.

When we do get a stab, it's after FFS can't handle a fatal error, and
we're told of it through ACPI tables. On x86, this happens through an
NMI, and as soon as we see a "fatal" error, we panic().

One problem with this FFS approach is that AER never even gets notified
of the issue. And even if a PCIe drive were to stop responding, the nvme
driver or higher-level block drivers would notice something is wrong
even without AER. Unless there is a physical defect or silicon bug, AER
can recover the link.

Another issue we're seeing with FFS is that BIOSes assume that an OS
will crash on a fatal error reported through ACPI. Sometimes they will
leave hardware in a "kind of" working state, or will fail to re-arm the
errors. From what I've observed, this happens on hardware with silicon
bugs. For example, PCIe root ports are unaffected, but certain PCIe
switches may stop issuing hotplug interrupts. It's just another headache
with FFS.

While I don't expect server vendors to drop FFS in favor of native AER
control, I do think we can harden Linux against the idiosyncrasies of
such systems.

The scope of these patches is to protect against poorly designed
firmware, and to perform proper error handling when possible. It is not
to make FFS a first-class citizen in error handling.

Alexandru Gagniuc (4):
  acpi: apei: Return severity of GHES messages after handling
  acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq
  acpi: apei: Do not panic() in NMI because of GHES messages
  acpi: apei: Warn when GHES marks correctable errors as "fatal"

 drivers/acpi/apei/ghes.c | 100 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 64 insertions(+), 36 deletions(-)

--
2.14.3