Received: by 2002:a25:e74b:0:0:0:0:0 with SMTP id e72csp2415406ybh; Fri, 24 Jul 2020 12:10:32 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw/ZNynXzThpeLg6s/4X1QDG/qsCOXgH+17HBOcx+ScBfwp0u0uReUKT05ZSGUcAqeyQ53I X-Received: by 2002:a17:906:1c4b:: with SMTP id l11mr10301384ejg.307.1595617832465; Fri, 24 Jul 2020 12:10:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1595617832; cv=none; d=google.com; s=arc-20160816; b=nIia0TFSkTqXAvH14w5PN7NV6l295doQOkrvxkKMiS5Xg1XDkOxMOyqg4InFUAC1fE 7EOdakW8ZByoUy3Bekzyny3sJPerf1feGHtVSCzfSMDDgC5G1j951I8y2vXgFcAcyUDX 0+7Kw93/G1W5ENWZomUFla8BnCfoWkrTXeyVHom0n6ctWPa6siPJg+C6OHSU/wUiQGyq UBpCX1+eK15aK/aFX8qgUr5UW3JQ1hED1lgAuizy8XyzBZqfSMkWuvj4HnMoo3otOu57 1OrYU9Gz0pJ8tcJD4NirqkYgIB1Nr7Pwjm0noQ5KCwMne1g167w7Bp/WCu0TKWO183V5 pOTw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:reply-to:message-id:date:subject:cc:to :from:ironport-sdr:ironport-sdr; bh=3r+Z/qwDKPdufdwex36AU9hIcMVXaPmpyGOjLujEqYc=; b=YrZeGyLDzdhnB3E283XL7X9MYx2xylH2X8/pNa1U3UGJodEVOy8zrVTub96ODBbcme rm2y8n8/DTp7FXHf4sM4I61uL7GXoouD1NrOyIb9K3l9xDblGwxihaLfRs03KhgLF/ui Yz5Vv1er+vubNjwQfls0RLvEe2B/RSK5lVkws34KCoaNwO81D4zWHQcht0DVWKYw1XHS gp6bdEGEy/6ru776yVU81JK6m/NOoHhhlzrfNqo5IoHaAZcvos5ks+IemwbvhoLBjhl/ dgOH+glG6etqYp5BW60Awg+okOX+Jd1aTFjIVg5Azw3tcYQSkA77zG46Rtn9dD+8OFQv OuGQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id g17si1065133ejw.454.2020.07.24.12.10.10; Fri, 24 Jul 2020 12:10:32 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726701AbgGXTIC (ORCPT + 99 others); Fri, 24 Jul 2020 15:08:02 -0400 Received: from mga12.intel.com ([192.55.52.136]:45715 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726381AbgGXTIC (ORCPT ); Fri, 24 Jul 2020 15:08:02 -0400 IronPort-SDR: GBR8bqbgV9z/+bVnCf3ee0KKf3btc5wPpUf0bCsAOQukn3YOlQVmOfuSIQDLj28blvY4ZJZ8FV pMH3jsWiZPyA== X-IronPort-AV: E=McAfee;i="6000,8403,9692"; a="130328294" X-IronPort-AV: E=Sophos;i="5.75,391,1589266800"; d="scan'208";a="130328294" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Jul 2020 12:08:01 -0700 IronPort-SDR: Owy6CufN3MA4RtuWmHr1DJ/EIMsQB/j95gTNub98sZpRdKxfWYj17mSesXnaoj8biylRxpgJ+o xE8l2Qu/S5yQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.75,391,1589266800"; d="scan'208";a="272659690" Received: from pittner-mobl1.amr.corp.intel.com (HELO localhost.localdomain) ([10.254.77.166]) by fmsmga008.fm.intel.com with ESMTP; 24 Jul 2020 12:08:00 -0700 From: sathyanarayanan.kuppuswamy@linux.intel.com To: bhelgaas@google.com Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, ashok.raj@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, Jay Vosburgh Subject: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call Date: Fri, 24 Jul 2020 12:07:55 -0700 Message-Id: X-Mailer: git-send-email 2.17.1 Reply-To: 20200714230803.GA92891@bjorn-Precision-5520 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Kuppuswamy Sathyanarayanan Current pcie_do_recovery() implementation has following two issues: 1. Fatal (DPC) error recovery is currently broken for non-hotplug capable devices. Current fatal error recovery implementation relies on PCIe hotplug (pciehp) handler for detaching and re-enumerating the affected devices/drivers. pciehp handler listens for DLLSC state changes and handles device/driver detachment on DLLSC_LINK_DOWN event and re-enumeration on DLLSC_LINK_UP event. So when dealing with non-hotplug capable devices, recovery code does not restore the state of the affected devices correctly. Correct implementation should restore the device state and call report_slot_reset() function after resetting the link to restore the state of the device/driver. You can find fatal non-hotplug related issues reported in following links: https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/ https://lore.kernel.org/linux-pci/12115.1588207324@famine/ https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/ 2. For non-fatal errors if report_error_detected() or report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then current pcie_do_recovery() implementation does not do the requested explicit device reset, instead just calls the report_slot_reset() on all affected devices. Notifying about the reset via report_slot_reset() without doing the actual device reset is incorrect. To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after successful reset_link() operation. This will ensure ->slot_reset() be called after reset_link() operation for fatal errors. Also call pci_bus_reset() to do slot/bus reset() before triggering device specific ->slot_reset() callback. Also, using pci_bus_reset() will restore the state of the devices after performing the reset operation. Even though using pci_bus_reset() will do redundant reset operation after ->reset_link() for fatal errors, it should should affect the functional behavior. [original patch is from jay.vosburgh@canonical.com] [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/] Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") Signed-off-by: Jay Vosburgh Signed-off-by: Kuppuswamy Sathyanarayanan --- Changes since v2: * Changed the subject of patch to "PCI/ERR: Fix reset logic in pcie_do_recovery() call". v2 patch link is, https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppuswamy@linux.intel.com/ * Squashed "PCI/ERR: Add reset support for non fatal errors" patch. drivers/pci/pcie/err.c | 41 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 37 insertions(+), 4 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 14bb8f54723e..b5eb6ba65be1 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -165,8 +165,29 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, &status); + /* + * After resetting the link using reset_link() call, the + * possible value of error status is either + * PCI_ERS_RESULT_DISCONNECT (failure case) or + * PCI_ERS_RESULT_NEED_RESET (success case). + * So ignore the return value of report_error_detected() + * call for fatal errors. + * + * In EDR mode, since AER and DPC Capabilities are owned by + * firmware, reported_error_detected() will return error + * status PCI_ERS_RESULT_NO_AER_DRIVER. Continuing + * pcie_do_recovery() with error status as + * PCI_ERS_RESULT_NO_AER_DRIVER will report recovery failure + * irrespective of recovery status. But successful reset_link() + * call usually recovers all fatal errors. So ignoring the + * status result of report_error_detected() also helps EDR based + * error recovery. + */ status = reset_link(dev); - if (status != PCI_ERS_RESULT_RECOVERED) { + if (status == PCI_ERS_RESULT_RECOVERED) { + status = PCI_ERS_RESULT_NEED_RESET; + } else { + status = PCI_ERS_RESULT_DISCONNECT; pci_warn(dev, "link reset failed\n"); goto failed; } @@ -182,10 +203,22 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, if (status == PCI_ERS_RESULT_NEED_RESET) { /* - * TODO: Should call platform-specific - * functions to reset slot before calling - * drivers' slot_reset callbacks? + * TODO: Optimize the call to pci_reset_bus() + * + * There are two components to pci_reset_bus(). + * + * 1. Do platform specific slot/bus reset. + * 2. Save/Restore all devices in the bus. + * + * For hotplug capable devices and fatal errors, + * device is already in reset state due to link + * reset. So repeating platform specific slot/bus + * reset via pci_reset_bus() call is redundant. So + * can optimize this logic and conditionally call + * pci_reset_bus(). */ + pci_reset_bus(dev); + status = PCI_ERS_RESULT_RECOVERED; pci_dbg(dev, "broadcast slot_reset message\n"); pci_walk_bus(bus, report_slot_reset, &status); -- 2.17.1