Received: by 10.213.65.68 with SMTP id h4csp256422imn; Mon, 12 Mar 2018 12:49:06 -0700 (PDT) X-Google-Smtp-Source: AG47ELtKzFh0VvZqJk1UretsnzGuMYiS4VJbjoQ+XmxEpw+aYJrYwJAf9p+zjo5hDYql+fAtrsYT X-Received: by 10.98.150.82 with SMTP id c79mr9108952pfe.88.1520884146861; Mon, 12 Mar 2018 12:49:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1520884146; cv=none; d=google.com; s=arc-20160816; b=ftY9GPREKoGJJ8Mt+AFFZKiGm9IaSI3eNfQjEpAJiBN3UF90opleXi/ql5qa3LrAoT Rw8k0RrYlgiwqHHKEdMt5HEKJtBzOJCnN2zvbxAozPJI2At6atbysXMi3XwBwZJOkxN0 CS0Q45kzzhMbPhYbqtDnNoG+m2f8uVnWBWtHzkQ1rtv2QrCOOSuJzcHmEcZKQJzg39k8 Tzg3EJrJSJdoIPHw/Jstlo+fTd/3CFr7KK1Qgf5O8M4czhO9P2PZoAyC022AoU12nkZ6 Di7e0A7pR1I8QUH3r4frw39UUsjCIWzV9L03tZFwIVd2lvydcVdVq5AFCmoZH9V0YJqk iI7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dmarc-filter:arc-authentication-results; bh=l91DL9QyVeN1u1OWUcsCkaRn3pjAMO+3IEDOSRGRbdw=; b=o3dHPpDYrl/Eh84xAyczOKYG2DSHJx8chqzI/+NlOOiqvlRcY3DebzViAdn52KLmdw syr1yXmjOth8aOqg5IK26MIbVWso6QpPQZCxfTKWmuB0SjoBJEPNfB6X3X8HNGu4Ffp7 rcx6ELSds2kiKSf54SU+cBwoNi3W+nb6W7Qq4iADCeyaRqoUgQMaQszrLqop+ZajqZ8V ztGbN7sG5Z7PvvcUo4jiNhbA/TAvBtcjNLKhvV7FDr9FpnX3TGZQPIUwog4saXGYxmAu bnshHDrsxsHTdpEtXCAz20dZE5TojzAJc9I86iddSiEnIiD+xA80KClKLcqRX1a0xVHz y/MQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r7si5387039pgf.433.2018.03.12.12.48.52; Mon, 12 Mar 2018 12:49:06 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932271AbeCLTre (ORCPT + 99 others); Mon, 12 Mar 2018 15:47:34 -0400 Received: from mail.kernel.org ([198.145.29.99]:37038 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932242AbeCLTrc (ORCPT ); Mon, 12 Mar 2018 15:47:32 -0400 Received: from localhost (13.sub-174-234-141.myvzw.com [174.234.141.13]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id E589221770; Mon, 12 Mar 2018 19:47:31 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E589221770 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=helgaas@kernel.org Date: Mon, 12 Mar 2018 14:47:30 -0500 From: Bjorn Helgaas To: Keith Busch Cc: Sinan Kaya , Oza Pawandeep , Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Wei Zhang , Timur Tabi , Alex Williamson Subject: Re: [PATCH v12 0/6] Address error and recovery for AER and DPC Message-ID: <20180312194236.GA12195@bhelgaas-glaptop.roam.corp.google.com> References: <1519837457-3596-1-git-send-email-poza@codeaurora.org> <20180311220337.GA194000@bhelgaas-glaptop.roam.corp.google.com> <04ade52e-d1ea-fe67-bb26-246621d159e6@codeaurora.org> <20180312142551.GB18494@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180312142551.GB18494@localhost.localdomain> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [+cc Alex] On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote: > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote: > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote: > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote: > > > > > That difference has been there since the beginning of DPC, so it has > > > nothing to do with *this* series EXCEPT for the fact that it really > > > complicates the logic you're adding to reset_link() and > > > broadcast_error_message(). > > > > > > We ought to be able to simplify that somehow because the only real > > > difference between AER and DPC should be that DPC automatically > > > disables the link and AER does it in software. > > > > I agree this should be possible. Code execution path should be almost > > identical to fatal error case. > > > > Is there any reason why you went to stop driver path, Keith? > > The fact is the link is truly down during a DPC event. When the link > is enabled again, you don't know at that point if the device(s) on the > other side have changed. When DPC is triggered, the port takes the link down. When we handle an uncorrectable (nonfatal or fatal) AER event, software takes the link down. In both cases, devices on the other side are at least reset. Whenever the link goes down, it's conceivable the device could be replaced with a different one before the link comes back up. Is this why you remove and re-enumerate? (See tangent [1] below.) The point is that from the device's hardware perspective, these scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message and it sees the link go down). I think we should make them the same on the software side, too: the driver should see the same callbacks, in the same order, whether we're doing AER or DPC. If removing and re-enumerating is the right thing for DPC, I think that means it's also the right thing for AER. Along this line, we had a question a while back about logging AER information after a DPC trigger. Obviously we can't collect any logged information from the downstream devices while link is down, but I noticed the AER status bits are RW1CS, which means they're sticky and are not modified by hot reset or FLR. So we may be able to read and log the AER information after the DPC driver brings the link back up. We may want to do the same with AER, i.e., reset the downstream devices first, then collect the AER status bits after the link comes back up. > Calling a driver's error handler for the wrong device in an unknown > state may have undefined results. Enumerating the slot from scratch > should be safe, and will assign resources, tune bus settings, and > bind to the matching driver. I agree with this; I think this is heading toward doing remove/re-enumerate on AER errors as well because it has also reset the device. > Per spec, DPC is the recommended way for handling surprise removal > events and even recommends DPC capable slots *not* set 'Surprise' > in Slot Capabilities so that removals are always handled by DPC. This > service driver was developed with that use in mind. Thanks for this tip. The only thing I've found so far is the mention of Surprise Down triggering DPC in the last paragraph of sec 6.7.5. Are there other references I should look at? I haven't found the recommendation about not setting 'Surprise' in Slot Capabilities yet. Bjorn [1] Tangent: I have similar concerns with the device reset paths. We currently save some config state, reset the device, and restore the config state. It's theoretically possible that the device was replaced or came up with different firmware after the reset, so I would really prefer to remove and re-enumerate there, too. But that would make it hard for things up the stack that want to reset the device but not re-setup the whole stack. I wonder if DPC is going to cause trouble for that scenario. That's not an argument against DPC, but it might be a stronger reason to figure out how to deal with remove/re-enumerate in those stacks.