Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp1555583imm; Tue, 15 May 2018 22:52:09 -0700 (PDT) X-Google-Smtp-Source: AB8JxZqBEZ3EWkzqabvpAnEM7S6cAbijycskP38RkHiF+wlPzoqSvxyiwaoSiDyPTEy8ouL8S88x X-Received: by 2002:a65:5a4a:: with SMTP id z10-v6mr10466695pgs.243.1526449929769; Tue, 15 May 2018 22:52:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1526449929; cv=none; d=google.com; s=arc-20160816; b=D7p8WdH+NWMN77LwIhqHsOFlF6rvILfx4igpFGJfeMVBbZz/fIBfAcsIjvFkuqml4o u/EicMFWTEqQcK81XATi0vzKzW78WE4HXmfcAoblF/AukS4GRh6WXQCLv65GE4vp5ghp juFkZn7xLxsENVk/MImpBKHC5JSM7dFpBcUl2Oe5ClAYM9uglJXJskiFUStft7KzQGVc sx7zeb6ahqri+xiVzhEvkHrrZMnX3CEAo+RK/MIhBtZ7Y16bgCtdH+tKrIUqXOALgyG0 n5eIrrb7Lf0W/LH1M/Bd5KSM1PY2KbzyUJUga4hxWxQNZh3sXOVh0Z/oV4qLx+Zoq1XL Fi3Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:message-id:references :in-reply-to:subject:cc:to:from:date:content-transfer-encoding :mime-version:dkim-signature:dkim-signature :arc-authentication-results; bh=I4w5oTgQhCW92T25VYIfkDzI+bXlf9+URyQQzIwPvqQ=; b=DHgd5eVZoH1LUc6mxH6RNFE++FfCDf9fYGLzoJLhzH+7SWDRC4tON228d48AMa5W0J dYrQlzpdbIQF84EdaV4ZaKzPy9zCKjJWY9euL40lEcR6FWrkZ9XluKigiy52MQ/ccMl5 QYnSnMSd9r9fFwQOMAUHEVZetuX9OfsljFUKHu1A1hY29DYmuGsqPcRmbHCMpGtJLS4m 0Xbk4qzkb/u+WjjHPoM8xu9bSuajZ3DaDNy6sWj/608zbHcrRSm8XPvqg2YH4UfxMgG7 w8wMaWpILQclVoQ7udIImNRzefmlN4PJd/HK+pNBhAB5Z0ndyGY/tdAH2yWB9SfBDNW1 davQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@codeaurora.org header.s=default header.b=l19LN1Rx; dkim=pass header.i=@codeaurora.org header.s=default header.b=XMUImCRl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a189-v6si1489946pgc.296.2018.05.15.22.51.55; Tue, 15 May 2018 22:52:09 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@codeaurora.org header.s=default header.b=l19LN1Rx; dkim=pass header.i=@codeaurora.org header.s=default header.b=XMUImCRl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752792AbeEPFuE (ORCPT + 99 others); Wed, 16 May 2018 01:50:04 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:49652 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752411AbeEPFuA (ORCPT ); Wed, 16 May 2018 01:50:00 -0400 Received: by smtp.codeaurora.org (Postfix, from userid 1000) id 6EAFC6085F; Wed, 16 May 2018 05:49:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1526449799; bh=0JlGcd1eJRg6odwaaa7X/a2C8cRJ0riApnH04vhUVWQ=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=l19LN1RxuC/egO2qmMahMtoKQ6zd77wrE3gcpSufEsjVXQ8G1VfdBhLPryOLQtrdy TQ7OeGXX8qtsXnOeTO2ClTuREMDS13HbrRTFjIvr0+dAZ9nly6KHiNMa9HNdwlRdZm 0lr8vKlLO+BwD20E4DVQIFmNJpdkcKJ4JILLnkKo= X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on pdx-caf-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.8 required=2.0 tests=ALL_TRUSTED,BAYES_00, DKIM_SIGNED,T_DKIM_INVALID autolearn=no autolearn_force=no version=3.4.0 Received: from mail.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.codeaurora.org (Postfix) with ESMTP id D6E296021A; Wed, 16 May 2018 05:49:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1526449797; bh=0JlGcd1eJRg6odwaaa7X/a2C8cRJ0riApnH04vhUVWQ=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=XMUImCRl8H7P4I6815l7PmS3w5CnNMSZXwIWSKic7WrlVsGNk7Y7Gu34e7QCQpXCo cckYbFD1DRz183xTyTZTksuyVg+v+/alu5xt3+zf0C9OZTRdRjZox4onLt/XY25WfS pnkzScUVUohDc8hiFzcUvq/Bx68ccu3nb257Z550= MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Wed, 16 May 2018 11:19:57 +0530 From: poza@codeaurora.org To: Bjorn Helgaas Cc: Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Keith Busch , Wei Zhang , Sinan Kaya , Timur Tabi , linux-pci-owner@vger.kernel.org Subject: Re: [PATCH v16 3/9] PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices In-Reply-To: <20180515235952.GC11156@bhelgaas-glaptop.roam.corp.google.com> References: <1526035408-31328-1-git-send-email-poza@codeaurora.org> <1526035408-31328-4-git-send-email-poza@codeaurora.org> <20180515235952.GC11156@bhelgaas-glaptop.roam.corp.google.com> Message-ID: X-Sender: poza@codeaurora.org User-Agent: Roundcube Webmail/1.2.5 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2018-05-16 05:29, Bjorn Helgaas wrote: > On Fri, May 11, 2018 at 06:43:22AM -0400, Oza Pawandeep wrote: >> This patch alters the behavior of handling of ERR_FATAL, where removal >> of devices is initiated, followed by reset link, followed by >> re-enumeration. >> >> So the errors are handled in a different way as follows: >> ERR_NONFATAL => call driver recovery entry points >> ERR_FATAL => remove and re-enumerate >> >> please refer to Documentation/PCI/pci-error-recovery.txt for more >> details. >> >> Signed-off-by: Oza Pawandeep >> Reviewed-by: Keith Busch >> >> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c >> b/drivers/pci/pcie/aer/aerdrv_core.c >> index 0ea5acc..649dd1f 100644 >> --- a/drivers/pci/pcie/aer/aerdrv_core.c >> +++ b/drivers/pci/pcie/aer/aerdrv_core.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include "aerdrv.h" >> +#include "../../pci.h" >> >> #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE >> | \ >> PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE) >> @@ -475,35 +476,84 @@ static pci_ers_result_t reset_link(struct >> pci_dev *dev) >> } >> >> /** >> - * do_recovery - handle nonfatal/fatal error recovery process >> + * do_fatal_recovery - handle fatal error recovery process >> + * @dev: pointer to a pci_dev data structure of agent detecting an >> error >> + * >> + * Invoked when an error is fatal. Once being invoked, removes the >> devices >> + * benetah this AER agent, followed by reset link e.g. secondary bus >> reset >> + * followed by re-enumeration of devices. >> + */ >> + >> +static void do_fatal_recovery(struct pci_dev *dev) >> +{ >> + struct pci_dev *udev; >> + struct pci_bus *parent; >> + struct pci_dev *pdev, *temp; >> + pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED; >> + struct aer_broadcast_data result_data; >> + >> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) >> + udev = dev; >> + else >> + udev = dev->bus->self; >> + >> + parent = udev->subordinate; >> + pci_lock_rescan_remove(); >> + list_for_each_entry_safe_reverse(pdev, temp, &parent->devices, >> + bus_list) { >> + pci_dev_get(pdev); >> + pci_dev_set_disconnected(pdev, NULL); >> + if (pci_has_subordinate(pdev)) >> + pci_walk_bus(pdev->subordinate, >> + pci_dev_set_disconnected, NULL); >> + pci_stop_and_remove_bus_device(pdev); >> + pci_dev_put(pdev); >> + } >> + >> + result = reset_link(udev); > > I don't like the fact that for DPC, the link reset happens before we > call > the driver .remove() methods, while for AER, the reset happens *after* > the > .remove() methods. That means the .remove() methods may work > differently > for AER vs. DPC, e.g., they may be able to access the device if AER is > handling the error, but they would not be able to access it if DPC is > handling it. > > I don't know how to fix this, and I think we can keep this patch as it > is > for now, but I think we should fix it eventually. point noted, will see to this eventually. > >> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { >> + /* >> + * If the error is reported by a bridge, we think this error >> + * is related to the downstream link of the bridge, so we >> + * do error recovery on all subordinates of the bridge instead >> + * of the bridge and clear the error status of the bridge. >> + */ >> + pci_walk_bus(dev->subordinate, report_resume, &result_data); >> + pci_cleanup_aer_uncorrect_error_status(dev); >> + } >> + >> + if (result == PCI_ERS_RESULT_RECOVERED) { >> + if (pcie_wait_for_link(udev, true)) >> + pci_rescan_bus(udev->bus); >> + } else { >> + pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); >> + pci_info(dev, "AER: Device recovery failed\n"); >> + } >> + >> + pci_unlock_rescan_remove(); >> +} >> + >> +/** >> + * do_nonfatal_recovery - handle nonfatal error recovery process >> * @dev: pointer to a pci_dev data structure of agent detecting an >> error >> - * @severity: error severity type >> * >> * Invoked when an error is nonfatal/fatal. Once being invoked, >> broadcast >> * error detected message to all downstream drivers within a >> hierarchy in >> * question and return the returned code. >> */ >> -static void do_recovery(struct pci_dev *dev, int severity) >> +static void do_nonfatal_recovery(struct pci_dev *dev) >> { >> - pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED; >> + pci_ers_result_t status; >> enum pci_channel_state state; >> >> - if (severity == AER_FATAL) >> - state = pci_channel_io_frozen; >> - else >> - state = pci_channel_io_normal; >> + state = pci_channel_io_normal; >> >> status = broadcast_error_message(dev, >> state, >> "error_detected", >> report_error_detected); >> >> - if (severity == AER_FATAL) { >> - result = reset_link(dev); >> - if (result != PCI_ERS_RESULT_RECOVERED) >> - goto failed; >> - } >> - >> if (status == PCI_ERS_RESULT_CAN_RECOVER) >> status = broadcast_error_message(dev, >> state, >> @@ -562,8 +612,10 @@ static void handle_error_source(struct >> pcie_device *aerdev, >> if (pos) >> pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS, >> info->status); >> - } else >> - do_recovery(dev, info->severity); >> + } else if (info->severity == AER_NONFATAL) >> + do_nonfatal_recovery(dev); >> + else if (info->severity == AER_FATAL) >> + do_fatal_recovery(dev); >> } >> >> #ifdef CONFIG_ACPI_APEI_PCIEAER >> @@ -627,8 +679,10 @@ static void aer_recover_work_func(struct >> work_struct *work) >> continue; >> } >> cper_print_aer(pdev, entry.severity, entry.regs); >> - if (entry.severity != AER_CORRECTABLE) >> - do_recovery(pdev, entry.severity); >> + if (entry.severity == AER_NONFATAL) >> + do_nonfatal_recovery(pdev); >> + else if (entry.severity == AER_FATAL) >> + do_fatal_recovery(pdev); >> pci_dev_put(pdev); >> } >> } >> -- >> 2.7.4 >>