From: Jay Vosburgh
To: sathyanarayanan.kuppuswamy@linux.intel.com
Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, ashok.raj@intel.com, Yicong Yang
Subject: Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
In-reply-to: <25283.1591332444@famine>
References: <25283.1591332444@famine>
Comments: In-reply-to Jay Vosburgh message dated "Thu, 04 Jun 2020 21:47:24 -0700."
Date: Wed, 24 Jun 2020 11:52:58 -0700
Message-ID: <3400.1593024778@famine>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Jay Vosburgh wrote:

>sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>
>>From: Kuppuswamy Sathyanarayanan
>>
>>Fatal (DPC) error recovery is currently broken for non-hotplug
>>capable devices. With current implementation, after successful
>>fatal error recovery, non-hotplug capable device state won't be
>>restored properly. You can find related issues in following links.
>>
>>https://lkml.org/lkml/2020/5/27/290
>>https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>>https://lkml.org/lkml/2020/3/28/328
>>
>>Current fatal error recovery implementation relies on hotplug handler
>>for detaching/re-enumerating the affected devices/drivers on DLLSC
>>state changes. So when dealing with non-hotplug capable devices,
>>recovery code does not restore the state of the affected devices
>>correctly. Correct implementation should call report_slot_reset()
>>function after resetting the link to restore the state of the
>>device/driver.
>>
>>So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>>is called after reset link operation which will also fix the above
>>mentioned issue.
>>
>>[original patch is from jay.vosburgh@canonical.com]
>>[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>>Signed-off-by: Jay Vosburgh
>>Signed-off-by: Kuppuswamy Sathyanarayanan
>
> I've tested this patch set on one of our test machines, and it
>resolves the issue.
>I plan to test with other systems tomorrow.

	I've done testing on two different systems that exhibit the
original issue and this patch set appears to behave as expected.

	Has anyone else (Yicong?) had an opportunity to test this?

	Can this be considered for acceptance, or is additional feedback
or review needed?

	-J

>>---
>> drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
>> 1 file changed, 22 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>index 14bb8f54723e..5fe8561c7185 100644
>>--- a/drivers/pci/pcie/err.c
>>+++ b/drivers/pci/pcie/err.c
>>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> 	pci_dbg(dev, "broadcast error_detected message\n");
>> 	if (state == pci_channel_io_frozen) {
>> 		pci_walk_bus(bus, report_frozen_detected, &status);
>>-		status = reset_link(dev);
>>-		if (status != PCI_ERS_RESULT_RECOVERED) {
>>+		/*
>>+		 * After resetting the link using reset_link() call, the
>>+		 * possible value of error status is either
>>+		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
>>+		 * PCI_ERS_RESULT_NEED_RESET (success case).
>>+		 * So ignore the return value of report_error_detected()
>>+		 * call for fatal errors. Instead use
>>+		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
>>+		 *
>>+		 * Ignoring the status return value of report_error_detected()
>>+		 * call will also help in case of EDR mode based error
>>+		 * recovery. In EDR mode AER and DPC Capabilities are owned by
>>+		 * firmware and hence report_error_detected() call will possibly
>>+		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>>+		 * the return value of report_error_detected() then
>>+		 * pcie_do_recovery() would report incorrect status after
>>+		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>>+		 * in non EDR case should not have any functional impact.
>>+		 */
>>+		status = PCI_ERS_RESULT_NEED_RESET;
>>+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>>+			status = PCI_ERS_RESULT_DISCONNECT;
>> 			pci_warn(dev, "link reset failed\n");
>> 			goto failed;
>> 		}
>>--
>>2.17.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
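
For anyone following the thread, the status handling changed by the quoted hunk can be modelled outside the kernel. What follows is only a minimal user-space sketch, not kernel code: the enum, its values, and the three function names are hypothetical stand-ins invented for illustration. It also assumes, based on the patch description, that the later slot_reset() broadcast in pcie_do_recovery() runs only when status is PCI_ERS_RESULT_NEED_RESET.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space stand-ins for the kernel's pci_ers_result_t
 * values. The names mirror those in the quoted patch; the numeric
 * values are arbitrary and are NOT the kernel's.
 */
enum ers_result {
	ERS_RESULT_NEED_RESET,
	ERS_RESULT_DISCONNECT,
	ERS_RESULT_RECOVERED,
	ERS_RESULT_NO_AER_DRIVER,
};

/*
 * Pre-patch frozen-channel branch (the two removed lines): status is
 * simply whatever reset_link() returned, so a successful reset leaves
 * it at RECOVERED.
 */
enum ers_result frozen_status_old(enum ers_result detected,
				  enum ers_result link_result)
{
	(void)detected;		/* overwritten by the reset_link() result */
	return link_result;
}

/*
 * Post-patch frozen-channel branch: start from NEED_RESET and only
 * downgrade to DISCONNECT when the link reset fails. The value
 * gathered from the error_detected broadcast is deliberately ignored.
 */
enum ers_result frozen_status_new(enum ers_result detected,
				  enum ers_result link_result)
{
	(void)detected;		/* ignored for fatal errors, per the patch comment */
	if (link_result != ERS_RESULT_RECOVERED)
		return ERS_RESULT_DISCONNECT;
	return ERS_RESULT_NEED_RESET;
}

/*
 * Assumption (from the patch description): the slot_reset() callbacks
 * are broadcast only when status is NEED_RESET.
 */
bool slot_reset_runs(enum ers_result status)
{
	return status == ERS_RESULT_NEED_RESET;
}
```

Under these assumptions, slot_reset_runs(frozen_status_old(x, ERS_RESULT_RECOVERED)) is false for any x, which matches the reported symptom of device state not being restored, while the patched variant yields true even when the broadcast returned ERS_RESULT_NO_AER_DRIVER.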