Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
    rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
    tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
    shiju.jose@huawei.com, zjzhang@codeaurora.org,
    gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
    alex_gagniuc@dellteam.com, austin_bolen@dell.com,
    shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
    robert.moore@intel.com, erik.schmauss@intel.com
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
    <20180416215903.7318-4-mr.nuke.me@gmail.com>
    <20180418175415.GJ4795@pd.tnic>
    <20180419154006.GE3600@pd.tnic>
    <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
    <20180419164528.GD5635@pd.tnic>
From: "Alex G."
Date: Thu, 19 Apr 2018 12:40:54 -0500
In-Reply-To: <20180419164528.GD5635@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

SURPRISE!!!

On 04/19/2018 11:45 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 11:26:57AM -0500, Alex G. wrote:
>> At a very high level, I'm working with Dell on improving server
>> reliability, with a focus on NVME hotplug and surprise removal. One of
>> the features we don't support is surprise removal of NVME drives;
>> hotplug is supported with 'prepare to remove'. This is one of the
>> reasons NVME is not on feature parity with SAS and SATA.
>
> Ok, first question: is surprise removal something purely mechanical or
> do you need firmware support for it? In the sense that you need to tell
> the firmware that you will be removing the drive.

SURPRISE!!! removal only means that the system was not expecting the
drive to be yanked. An example is removing a USB flash drive without
first unmounting it and removing the USB device
(echo 0 > /sys/bus/usb/.../authorized).

PCIe removal and hotplug are fairly well spec'd, and NVMe rides on that
without issue. It's much easier and faster for an OS to just follow the
spec and handle things on its own.

Interference from firmware only comes in with EFI/ACPI and FFS. From a
purely technical point of view, firmware has nothing to do with this.
From a firmware-centric view, unfortunately, firmware wants the ability
to log errors to the BMC... and hotplug events.

Does firmware need to know that a drive will be removed? I'm not aware
of any such requirement. I think the main purpose of 'prepare to remove'
is to shut down any traffic on the link. This way, link removal does not
generate PCIe errors which may otherwise end up crashing the OS.

> I'm sceptical, though, as it has "surprise" in the name so I'm guessing
> the firmware doesn't know about it, the drive physically disappears and
> the FW starts spewing PCIe errors...

It's not the FW that spews out errors. It's the hardware. It's very
likely that a device which is actively used will have several DMA
transactions already queued up and lots of traffic going through the
link. When the link dies and the traffic can't be delivered, Unsupported
Request errors are very common.

On the r740xd, FW just hides those errors from the OS with no further
notification. On this machine, BIOS sets things up such that non-posted
requests report fatal (PCIe) errors. FW still tries very hard to hide
this from the OS, and I think the heuristic is that if the drive's
physical presence is gone, don't even report the error.

There are a lot of problems with this approach, but one thing to keep in
mind is that the FW was written at a time when OSes were more than happy
to crash at any PCIe error reported through GHES.

Alex

>> I'm not sure if this is the example you're looking for, but
>> take an r740xd server, and slowly unplug an Intel NVME drives at an
>> angle. You're likely to crash the machine.
>
> No no, that's actually a great example!
>
> Thx.
>