Subject: Re: [PATCH] pci-error-recover: doc cleanup
From: Cao jin
Date: Fri, 9 Dec 2016 15:59:44 +0800
CC: Jonathan Corbet, "linux-pci@vger.kernel.org", "linux-kernel@vger.kernel.org", Bjorn Helgaas
Message-ID: <584A6470.60502@cn.fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the PCI adapter be completely
>>> reset, any adapter firmware to be reloaded from scratch, the device
>>> driver has to kill all device state and start from scratch. It's huge.
>>> If the fatal error is on a PCI device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared a few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFOs, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am a little confused too, and not even sure we are talking about the
>> same *fatal error*. I am talking about the fatal error defined in the
>> PCI Express spec, chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means setting the *secondary bus reset* bit in the PCI bridge
>> config space, which resets the link and the device simultaneously; it is
>> the strongest kind of reset as far as I know.
>
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
>

At least I don't find the exact words saying that.

-- 
Sincerely,
Cao jin

> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.
> Since they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
>
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.
>
> --linas
>
>>
>>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin wrote:
>>>>
>>>>
>>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>>> Cao jin wrote:
>>>>>
>>>>>> The platform resets the link, and then calls the link_reset() callback
>>>>>> on all affected device drivers. This is a PCI-Express specific state
>>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>> "solved" by resetting the link. This call informs the driver of the
>>>>>
>>>>> As far as I can tell, the original text was correct here; why do you
>>>>> think this change needs to be made?
>>>>>
>>>>
>>>> See do_recovery() in the AER core: reset_link() is called only when a
>>>> fatal error is seen.
>>>>
>>>> --
>>>> Sincerely,
>>>> Cao jin
>>>>
>>
>> --
>> Sincerely,
>> Cao jin
>>