Received: by 10.192.165.156 with SMTP id m28csp678244imm; Mon, 16 Apr 2018 07:03:09 -0700 (PDT) X-Google-Smtp-Source: AIpwx4+o/MS5Z5ChtoCClrw0HzouxKd731us6PbiLqJcWlNb8TekDQ71qZlLO0P8WrcY6Bm+t1wE X-Received: by 2002:a17:902:3c5:: with SMTP id d63-v6mr13934532pld.163.1523887389178; Mon, 16 Apr 2018 07:03:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1523887389; cv=none; d=google.com; s=arc-20160816; b=aowaygk1OffLyIvAXCI7S9/SLO6mQMBcJSljMrJNFMfENb/L6XxFG7KFCGT8XO1YSb jUeXp2yH5ua+a1jjnxjqEuUtAljWYbFnqqE79NTg/cdz+7jkpOJKYDdM4pTlIeGX1/zt xzLGgoMV2PiDAwRMKhowWwQeLFmJ0hm4/ZZl3xH6j7Noz5jPx1DKISP210GE/0otIwIV o5bCV2lw5W0CPVBEIZ0KrMiAStMPKL7Knj1ov5mrveKOFvOHqhvdMeUgx9UDqrKl0O4b DPlW4W1vclST3Ns/WFfCinb48oQsCjbRziZg/o97Q63/aubCx4NIcPXb26Ac0kL6aJgT SiJQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dmarc-filter:arc-authentication-results; bh=/uS7OzsRGjkZoy4ZHl9ExmbwOG77PG4tRv7eEMD1oy4=; b=QhjYdpzAku+4FHfG+kziF9ltpDlkSJsaxU8RSF61HZ+lvuAygaDu5FSGHoIq+Zb2kG JTeLXUt/rK6kqEDkOyyn7ZxIOQNki1S/vEBnNHjnjjlYq1AwKCOCTp0LoNiUpdmleinc WuZ94T10+RW58SD+86DxwC3YimR5p6qIlYfUXwpcmNQcllTipjmV46az6JWKwlqiU2Hu uNUShhz+KQX2Oy6jCk/HDCmIJRxMDTsqUIc8efu/EDzWB7o3vfvC1Cxf6smZnIb/F1kC ZDsE3nes8gsdEoB2djYKPjnWHZNgpa1TYH53tI+Z+7TU4C1hKLgntoKvvRLTi6s1a43t oJtg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id q5si9846708pgc.620.2018.04.16.07.02.52; Mon, 16 Apr 2018 07:03:09 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755399AbeDPOBJ (ORCPT + 99 others); Mon, 16 Apr 2018 10:01:09 -0400 Received: from mail.kernel.org ([198.145.29.99]:58140 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755108AbeDPOBI (ORCPT ); Mon, 16 Apr 2018 10:01:08 -0400 Received: from localhost (50-81-22-222.client.mchsi.com [50.81.22.222]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id ADD7D217D9; Mon, 16 Apr 2018 14:01:07 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org ADD7D217D9 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=helgaas@kernel.org Date: Mon, 16 Apr 2018 09:01:06 -0500 From: Bjorn Helgaas To: poza@codeaurora.org Cc: Sinan Kaya , Keith Busch , Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Wei Zhang , Timur Tabi , Alex Williamson , linux-pci-owner@vger.kernel.org Subject: Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system Message-ID: <20180416140106.GB28657@bhelgaas-glaptop.roam.corp.google.com> References: <20180412143954.GB4810@localhost.localdomain> <20180412150231.GD4810@localhost.localdomain> <20180412170911.GA6424@localhost.localdomain> <20180416031726.GB158153@bhelgaas-glaptop.roam.corp.google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 16, 2018 at 11:21:15AM +0530, poza@codeaurora.org wrote: > On 2018-04-16 11:03, poza@codeaurora.org wrote: > > On 2018-04-16 08:47, Bjorn Helgaas wrote: > > > On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote: > > > > > > > You indicated that you want to unify the AER and DPC behavior. Let's > > > > settle on what we want to do one more time. We have been going forth > > > > and back on the direction. > > > > > > My thinking is that as much as possible, similar events should be > > > handled similarly, whether the mechanism is AER, DPC, EEH, etc. > > > Ideally, drivers shouldn't have to be aware of which mechanism is in > > > use. > > > > > > Error recovery includes conventional PCI as well, but right now I > > > think we're only concerned with PCIe. The following error types are > > > from PCIe r4.0, sec 6.2.2: > > > > > > ERR_COR > > > Corrected by hardware with no software intervention. Software > > > involved for logging only. > > > > > > Handled by AER via pci_error_handlers; DPC is never involved. > > > > > > Link is unaffected. > > > > > > ERR_NONFATAL > > > A transaction is unreliable but the link is fully functional. > > > > > > If DPC is not supported, handled by AER via pci_error_handlers and > > > the link is unaffected. > > > > > > If DPC supported, handled by DPC (because we set > > > PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate. > > > > > > ERR_FATAL > > > The link is unreliable. > > > > > > If DPC is not supported, handled by AER via pci_error_handlers and > > > the link is reset. > > > > > > If DPC supported, handled by DPC via remove/re-enumerate. > > > > > > It doesn't seem right to me that we handle both ERR_NONFATAL and > > > ERR_FATAL events differently if we happen to have DPC support in a > > > switch. > > > > > > Maybe we should consider triggering DPC only on ERR_FATAL? That would > > > keep DPC out of the ERR_NONFATAL cases. > > > > > > For ERR_FATAL, maybe we should bite the bullet and use > > > remove/re-enumerate for AER as well as for DPC. That would be painful > > > for higher-level software, but if we're willing to accept that pain > > > for new systems that support DPC, maybe life would be better overall > > > if it worked the same way on systems without DPC? > > > > > > Bjorn > > > > This had crossed my mind when I first looked at the code. > > DPC is getting triggered for both ERR_NONFATAL and ERR_FATAL case. > > I thought the primary purpose of DPC to recover fatal errors, by > > triggering HW recovery. > > but what if some platform wants to handle both FATAL and NON_FATAL > > with DPC ? AER usage is coordinated between the platform and the OS via _OSC (PCI Firmware spec r3.2, sec 4.5.1) My understanding is that if the platform grants the OS permission to control AER, the OS is free to set the AER control registers however it wishes. If the platform depends on certain AER control settings, it must decline to grant AER control to the OS. As far as I know, there's nothing new in _OSC related to DPC, but based on the implementation note in PCIe r4.0, sec 6.2.10, we currently assume the OS controls DPC if and only if it controls AER. Therefore, if a platform depends on certain DPC control settings, e.g., if it wants to handle both ERR_FATAL and ERR_NONFATAL with DPC, I would argue that the platform needs to retain control over AER. If the platform grants AER control to the OS, I think the OS is free to set both AER and DPC control registers as it desires. > > As you said AER FATAL cases and DPC FATAL cases should be handled > > similarly. e.g. remove/re-enumerate the devices. > > > > while NON_FATAL case; only AER would come into picture. if some > > platform would like to handle DPC NON_FATAL then it should follow > > AER NON_FATAL path (where it does not do remove/re-enumerate) > > > > And the case where hotplug is enabled, remove/re-enumerate more > > sense in case of ERR_FATAL. And the case where hotplug is > > disabled, only re-enumeration is required. (no need to remove the > > devices) but then do we need to handle this case specifically, > > what is the harm in removing the devices in all the cases followed > > by re-enumerate ? > > To Clarify the last line, what I meant here was, in case of > ERR_FATAL we can always remove/re-enumerate the devices irrespective > of hotplug is enabled or not. ERR_FATAL means the link is unreliable. A hotplug event may have occurred. In this case, I think remove/re-enumerate is a safe option. It might be more than is needed in some cases, but I'm not sure we can reliably determine when we don't need it, and if we remove/re-enumerate for some ERR_FATAL errors but not others, I think it complicates the situation for drivers and other higher-level software. > and in case of ERR_NONFATAL, DPC will follow AER path (where it just > tries to recover) although I am not very sure that how to handle > ERR_NONFATAL case if hotplug is enabled. Because as Keith suggested > device might have been changed run-time. ERR_NONFATAL means a particular transaction might be unreliable but the link itself is fully functional. No hotplug is possible if the link has stayed fully functional. Bjorn