Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752992Ab3EHIez (ORCPT ); Wed, 8 May 2013 04:34:55 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:42513 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751190Ab3EHIew (ORCPT ); Wed, 8 May 2013 04:34:52 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.7.4 Message-ID: <518A0E1D.60104@jp.fujitsu.com> Date: Wed, 08 May 2013 17:34:37 +0900 From: Takao Indoh User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130328 Thunderbird/17.0.5 MIME-Version: 1.0 To: alex.williamson@redhat.com CC: ddutile@redhat.com, bhelgaas@google.com, linux-pci@vger.kernel.org, iommu@lists.linux-foundation.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA References: <1366779539-3584-1-git-send-email-indou.takao@jp.fujitsu.com> <1367944782.2868.106.camel@ul30vt.home> <51895FAB.5010805@redhat.com> <1367964279.2868.128.camel@ul30vt.home> In-Reply-To: <1367964279.2868.128.camel@ul30vt.home> Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11598 Lines: 295 (2013/05/08 7:04), Alex Williamson wrote: > On Tue, 2013-05-07 at 16:10 -0400, Don Dutile wrote: >> On 05/07/2013 12:39 PM, Alex Williamson wrote: >>> On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote: >>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>> PCIe root port and downstream port to reset its downstream endpoint. >>>> >>>> Problem: >>>> This patch solves the problem that kdump can fail when intel_iommu=on is >>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>> occur in second kernel and it causes problems like driver error or PCI >>>> SERR, at last kdump fails. This problem is caused as follows. >>>> 1) Devices are working on first kernel. >>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>> and its DMA continues during this switch. >>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes >>>> dma-remapping errors. >>>> >>>> Solution: >>>> All DMA transactions have to be stopped before iommu is initialized. By >>>> this patch devices are reset and in-flight DMA is stopped before >>>> pci_iommu_init. >>>> >>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>> between them. If the endpoint is VGA device, it is skipped because the >>>> monitor blacks out if VGA controller is reset. >>>> >>>> Actually this is v8 patch but quite different from v7 and it's been so >>>> long since previous post, so I start over again. >>>> Previous post: >>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>> https://lkml.org/lkml/2012/11/26/814 >>>> >>>> Signed-off-by: Takao Indoh >>>> --- >>>> Documentation/kernel-parameters.txt | 2 + >>>> drivers/pci/pci.c | 103 +++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >>>> index 4609e81..2a31ade 100644 >>>> --- a/Documentation/kernel-parameters.txt >>>> +++ b/Documentation/kernel-parameters.txt >>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. >>>> any pair of devices, possibly at the cost of >>>> reduced performance. This also guarantees >>>> that hot-added devices will work. >>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>> + reset >>>> cbiosize=nn[KMG] The fixed amount of bus space which is >>>> reserved for the CardBus bridge's IO window. >>>> The default value is 256 bytes. >>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>> index b099e00..42385c9 100644 >>>> --- a/drivers/pci/pci.c >>>> +++ b/drivers/pci/pci.c >>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >>>> } >>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>> >>>> +/* >>>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe >>>> + * endpoint except VGA device. >>>> + */ >>>> +static int __init need_reset(struct pci_dev *dev) >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>> + list_empty(&dev->subordinate->devices) || >>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>> + return 0; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>> + /* Don't reset switch, bridge, VGA device */ >>>> + return 0; >>>> + } >>>> + >>>> + return 1; >>>> +} >>>> + >>>> +static void __init save_config(struct pci_dev *dev) >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + dev_info(&child->dev, "save state\n"); >>>> + pci_save_state(child); >>>> + } >>>> +} >>>> + >>>> +static void __init restore_config(struct pci_dev *dev) >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + dev_info(&child->dev, "restore state\n"); >>>> + pci_restore_state(child); >>>> + } >>>> +} >>>> + >>>> +static void __init do_device_reset(struct pci_dev *dev) >>>> +{ >>>> + u16 ctrl; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>> + >>>> + /* Assert Secondary Bus Reset */ >>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>> + >>>> + msleep(2); >>>> + >>>> + /* De-assert Secondary Bus Reset */ >>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>> +} >>>> + >>>> +static int __initdata pcie_reset_devices; >>>> +static int __init reset_pcie_devices(void) >>>> +{ >>>> + struct pci_dev *dev = NULL; >>>> + >>>> + if (!pcie_reset_devices) >>>> + return 0; >>>> + >>>> + for_each_pci_dev(dev) >>>> + save_config(dev); >>>> + >>>> + for_each_pci_dev(dev) >>>> + do_device_reset(dev); >>> >>> So we do a reset on each root port or downstream port, but the PCI to >>> PCI bridge spec (1.2) states: >>> >>> 11.1.1. Secondary Reset Signal >>> The secondary reset signal, RST#, is a logical OR of the >>> primary interface RST# signal and >>> the state of the Secondary Bus Reset bit of the Bridge >>> Control register (see Section 3.2.5.18). >>> >>> I read that to mean that in a legacy topology, doing a reset on the top >>> level bridge triggers a reset down the entire hierarchy. How does this >>> apply to a PCIe topology? Can we just do a reset on the top level root >>> ports? If so, that would also imply that a reset propagates down >> On PCIe, a reset does not propagate down the tree (from upstream link to >> downstream link of a bridge/switch). > > I don't think this is actually true. Section 6.6.1 of the spec defines > 3 types of conventional resets, cold, warm and hot. I believe the one > we're interested in is hot, which is an in-band mechanism for > propagating a conventional reset across a link. Section 4.2.5.11 > defines hot reset: > > Hot Reset uses bit 0 (Hot Reset) in the Training Control field > (see Table 4-5 and Table 4-6) within the TS1 and TS2 Ordered > Sets. > > A Link can enter Hot Reset if directed by a higher Layer. A Link > can also reach the Hot Reset state by receiving two consecutive > TS1 Ordered Sets with the Hot Reset bit asserted (see Section > 4.2.6.11). > > Section 4.2.6.11 then goes on as: > > * Lanes that were not directed by a higher Layer to initiate Hot > Reset (i.e., received two consecutive TS1 Ordered Sets with the > Hot Reset bit asserted on any configured Lanes): > * If any Lane of an Upstream Port of a Switch receives two > consecutive TS1 Ordered Sets with the Hot Reset bit > asserted, all configured Downstream Ports must > transition to Hot Reset as soon as possible. > * All Lanes in the configured Link transmit TS1 Ordered > Sets with the Hot Reset bit asserted and the configured > Link and Lane numbers. > > Also included is the footnote: > > Note: Generally, Lanes of a Downstream or optional crosslink > Port will be directed to Hot Reset, and Lanes of an Upstream or > optional crosslink Port will enter Hot Reset by receiving two > consecutive TS1 Ordered Sets with the Hot Reset bit asserted on > any configured Lanes... > > So I think that means that if a hot reset is received by an upstream > port, it's propagated to the downstream ports and on to the downstream > links. The latter point is where there's potential for interpretation. > > Going back to sections 6.6.1, we can generate a hot reset via: > > * For any Root or Switch Downstream Port, setting the Secondary > Bus Reset bit of the Bridge Control register associated with the > Port must cause a hot reset to be sent. > * For a Switch, the following must cause a hot reset to be sent on > all Downstream Ports: > * Setting the Secondary Bus Reset bit of the Bridge > Control register associated with the Upstream Port > * Receiving a hot reset on the Upstream Port > > Sounds like it propagates to me. But, re-reading this patch, I think > the goal is to attempt a bus reset on the most downstream > root/downstream port, but it's pretty confusing. Yes, I also think a hot reset is propagated to all downstream link. This patch does bus reset only on root port/downstream port whose children are endpoints since I need to skip display class devices to avoid monitor blacks out. Thanks, Takao Indoh > > I also have a need to add a bus reset interface, in my case though the > end goal is specifically to reset display class devices to get them into > a usable state. I just want to provide the kernel interfaces though, > it's up to the drivers how to use them. Thanks, > > Alex > >>> through PCIe-to-PCI bridges, so legacy PCI devices may be reset and the >>> option name is misleading. >>> >> In earlier reply, I stated that the functions ought to be renamed to >> reflect their intentions: endpoint reset. >> It is not a reset-all-pci-devices interface, as it implies >> >>> Even so, you still have root complex endpoints, which are not getting >>> reset through this, so it's really not a complete solution. Thanks, >>> >>> Alex >>> >>>> + >>>> + msleep(1000); >>>> + >>>> + for_each_pci_dev(dev) >>>> + restore_config(dev); >>>> + >>>> + return 0; >>>> +} >>>> +fs_initcall_sync(reset_pcie_devices); >>>> + >>>> static int __init pci_setup(char *str) >>>> { >>>> while (str) { >>>> @@ -3920,6 +4021,8 @@ static int __init pci_setup(char *str) >>>> pcie_bus_config = PCIE_BUS_PEER2PEER; >>>> } else if (!strncmp(str, "pcie_scan_all", 13)) { >>>> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); >>>> + } else if (!strncmp(str, "pcie_reset_devices", 18)) { >>>> + pcie_reset_devices = 1; >>>> } else { >>>> printk(KERN_ERR "PCI: Unknown option `%s'\n", >>>> str); >>> >>> >>> >>> _______________________________________________ >>> iommu mailing list >>> iommu@lists.linux-foundation.org >>> https://lists.linuxfoundation.org/mailman/listinfo/iommu >> > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/