Date: Wed, 14 Dec 2016 13:39:50 +1100
From: Gavin Shan
To: Andrew Donnellan
Cc: linasvepstas@gmail.com, Cao jin, Jonathan Corbet, "linux-pci@vger.kernel.org", linux-doc@vger.kernel.org, "linux-kernel@vger.kernel.org", Bjorn Helgaas
Subject: Re: [PATCH] pci-error-recover: doc cleanup
Message-Id: <20161214023949.GA9896@gwshan>
In-Reply-To: <3ed3151c-eeef-940c-8a9c-49cf53a51d49@au1.ibm.com>

On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
>On 09/12/16 17:24, Linas Vepstas wrote:
>>I suppose I'm confused, but I recall that link resets are non-fatal.
>>Fatal errors typically require that the pci adapter be completely
>>reset, any adapter firmware to be reloaded from scratch, the device
>>driver has to kill all device state and start from scratch. It's huge.
>
>Is there a difference in terminology between an AER fatal error and what
>EEH/IBM people think of as a fatal error?
>

They are different things. Depending on the PHB configuration, an AER
fatal error can lead to a frozen PE error, but not a fenced PHB error.

>>If the fatal error is on pci device that is under a block device
>>holding a file system, then (usually) there is no way to recover,
>>because the block layer (and file system) cannot deal with a block
>>device that disappeared and then reappeared some few seconds later.
>>(maybe some future zfs or lvm or btrfs might be able to deal with
>>this, but not today)
>
>Is this still true? I'm not at all familiar with the block device side of it,
>but the cxlflash driver has reasonably full EEH support, including surviving
>a full PHB fence and complete reset.
>

It's still true, especially when the recovery affects the rootfs. Once
error recovery completes, the driver (if necessary) and the filesystem
need to be reloaded, which depends on a script or daemon, and those are
unavailable in this scenario.

Thanks,
Gavin