Date: Wed, 16 May 2018 15:02:56 -0500
From: Bjorn Helgaas
To: poza@codeaurora.org
Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
 Greg Kroah-Hartman, Kate Stewart, linux-pci@vger.kernel.org,
 linux-kernel@vger.kernel.org, Dongdong Liu, Keith Busch, Wei Zhang,
 Sinan Kaya, Timur Tabi, linux-pci-owner@vger.kernel.org
Subject: Re: [PATCH v16 8/9] PCI/DPC: Unify and
 plumb error handling into DPC
Message-ID: <20180516200256.GB236884@bhelgaas-glaptop.roam.corp.google.com>
References: <1526035408-31328-1-git-send-email-poza@codeaurora.org>
 <1526035408-31328-9-git-send-email-poza@codeaurora.org>
 <20180515235632.GB11156@bhelgaas-glaptop.roam.corp.google.com>
 <2ca65f6b38668cfcf553833409ee38e3@codeaurora.org>
 <20180516105236.GA217390@bhelgaas-glaptop.roam.corp.google.com>
 <3822b5ea0b05f9836b991ce18f16a30f@codeaurora.org>
 <20180516130455.GA233993@bhelgaas-glaptop.roam.corp.google.com>
 <33496e8d8c663240c9db51b98ef6dd52@codeaurora.org>
In-Reply-To: <33496e8d8c663240c9db51b98ef6dd52@codeaurora.org>

On Wed, May 16, 2018 at 08:28:39PM +0530, poza@codeaurora.org wrote:
> On 2018-05-16 18:34, Bjorn Helgaas wrote:
> > On Wed, May 16, 2018 at 05:45:58PM +0530, poza@codeaurora.org wrote:
> > > On 2018-05-16 16:22, Bjorn Helgaas wrote:
> > > > On Wed, May 16, 2018 at 01:46:25PM +0530, poza@codeaurora.org wrote:
> > > > > I am sorry I pasted the wrong snippet.
> > > > > following needs to be fixed in v17.
> > > > > from:
> > > > >     if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > > > >         /*
> > > > >          * If the error is reported by a bridge, we think this error
> > > > >          * is related to the downstream link of the bridge, so we
> > > > >          * do error recovery on all subordinates of the bridge instead
> > > > >          * of the bridge and clear the error status of the bridge.
> > > > >          */
> > > > >         pci_walk_bus(dev->subordinate, report_resume, &result_data);
> > > > >         pci_cleanup_aer_uncorrect_error_status(dev);
> > > > >     }
> > > > >
> > > > >
> > > > > to
> > > > >
> > > > >     if (service==AER && dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > > > >         /*
> > > > >          * If the error is reported by a bridge, we think this error
> > > > >          * is related to the downstream link of the bridge, so we
> > > > >          * do error recovery on all subordinates of the bridge instead
> > > > >          * of the bridge and clear the error status of the bridge.
> > > > >          */
> > > > >         pci_walk_bus(dev->subordinate, report_resume, &result_data);
> > > > >         pci_cleanup_aer_uncorrect_error_status(dev);
> > > > >     }
> > > > >
> > > > > this is only needed in case of AER.
> > > >
> > > > Oh, I missed this before.  It makes sense to clear the AER status
> > > > here, but why do we need to call report_resume()?  We just called all
> > > > the driver .remove() methods and detached the drivers from the
> > > > devices.  So I don't think report_resume() will do anything
> > > > ("dev->driver" should be NULL) except set the dev->error_state to
> > > > pci_channel_io_normal.  We probably don't need that because we didn't
> > > > change error_state in this fatal error path.
> > >
> > > if you remember, the path ends up calling
> > > aer_error_resume
> > >
> > > the existing code ends up calling aer_error_resume as follows.
> > >
> > >   do_recovery(pci_dev)
> > >     broadcast_error_message(..., error_detected, ...)
> > >     if (AER_FATAL)
> > >       reset_link(pci_dev)
> > >         udev = BRIDGE ? pci_dev : pci_dev->bus->self
> > >         driver->reset_link(udev)
> > >           aer_root_reset(udev)
> > >     if (CAN_RECOVER)
> > >       broadcast_error_message(..., mmio_enabled, ...)
> > >     if (NEED_RESET)
> > >       broadcast_error_message(..., slot_reset, ...)
> > >     broadcast_error_message(dev, ..., report_resume, ...)
> > >       if (BRIDGE)
> > >         report_resume
> > >           driver->resume
> > >             pcie_portdrv_err_resume
> > >               device_for_each_child(..., resume_iter)
> > >                 resume_iter
> > >                   driver->error_resume
> > >                     aer_error_resume
> > >       pci_cleanup_aer_uncorrect_error_status(pci_dev)  # only if BRIDGE
> > >         pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
> > >
> > > hence I think we have to call it in order to clear the root port
> > > PCI_ERR_UNCOR_STATUS and PCI_EXP_DEVSTA.
> > > makes sense ?
> >
> > I know I sent you the call graph above, so you would think I might
> > understand it, but you would be mistaken :)  This still doesn't make
> > sense to me.
> >
> > I think your point is that we need to call aer_error_resume().  That
> > is the aerdriver.error_resume() method.  The AER driver only binds to
> > root ports.
> >
> > This path:
> >
> >   pcie_do_fatal_recovery
> >     pci_walk_bus(dev->subordinate, report_resume, &result_data)
> >
> > calls report_resume() for every device on the dev->subordinate bus
> > (and for anything below those devices).  There are no root ports on
> > that dev->subordinate bus, because root ports are always on a root
> > bus, never on a subordinate bus.
> >
> > So I don't see how report_resume() can ever get to aer_error_resume().
> > Can you instrument that path and verify that it actually does get
> > there somehow?
>
> you are right....the call
> pci_walk_bus(dev->subordinate, report_resume, &result_data);
> does not call aer_error_resume()
>
> but
> pci_walk_bus(udev->bus, report_resume, &result_data);
> does call aer_error_resume()
>
> now if you look at the comment of the function:
> /**
>  * aer_error_resume - clean up corresponding error status bits
>  * @dev: pointer to Root Port's pci_dev data structure
>  *
>  * Invoked by Port Bus driver during nonfatal recovery.
>  */
>
> it is invoked during nonfatal recovery.
> but the code handles both fatal and nonfatal clearing of error bits.
>
>     if (dev->error_state == pci_channel_io_normal)
>         status &= ~mask; /* Clear corresponding nonfatal bits */
>     else
>         status &= mask; /* Clear corresponding fatal bits */
>     pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
>
>
> so the question is, should we not call aer_error_resume during fatal
> recovery ?
> so that it clears the root port status, if of course the error is triggered
> by AER running agent (RP, Switch)

I'm sure we *should* clear AER status bits somewhere during ERR_FATAL
recovery.  As far as I can tell, the current code (before your
patches) never calls aer_error_resume().  That might be a bug, but
even if it is, it's something that should be fixed separately from
*this* series.

I think in this series, you should probably adjust the patch that adds
do_fatal_recovery() so it doesn't call pci_walk_bus(..., report_resume).
I don't think that does anything useful anyway, and that patch already
changes AER so it doesn't call the pci_error_handlers callbacks
(except .resume()).  I think it would be cleaner to remove the
ERR_FATAL use of .resume() at the same time you remove the others.
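
For illustration only, here is a minimal user-space sketch of the
argument that walking the subordinate bus with report_resume() is a
no-op once the drivers have been detached.  It is not kernel code:
every toy_* name is hypothetical, and the model only assumes the
behavior described in this thread, namely that report_resume() sets
error_state to normal and calls the driver's .resume() callback only
when a driver with that callback is still bound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum toy_channel_state { TOY_IO_NORMAL, TOY_IO_FROZEN };

struct toy_dev;

struct toy_err_handlers {
	void (*resume)(struct toy_dev *dev);
};

struct toy_driver {
	const struct toy_err_handlers *err_handler;
};

struct toy_dev {
	const struct toy_driver *driver;	/* NULL once the driver is detached */
	enum toy_channel_state error_state;
	int resume_calls;			/* how often .resume() ran */
};

/* Toy .resume() callback: just records that it was invoked. */
static void toy_resume(struct toy_dev *dev)
{
	dev->resume_calls++;
}

static const struct toy_err_handlers toy_handlers = { .resume = toy_resume };
static const struct toy_driver toy_bound_driver = { .err_handler = &toy_handlers };

/*
 * Assumed shape of report_resume(): always mark the channel normal,
 * then call .resume() only if a driver with that callback is bound.
 */
static int toy_report_resume(struct toy_dev *dev)
{
	dev->error_state = TOY_IO_NORMAL;

	if (!dev->driver || !dev->driver->err_handler ||
	    !dev->driver->err_handler->resume)
		return 0;

	dev->driver->err_handler->resume(dev);
	return 0;
}

/*
 * Walk an array standing in for the devices on a subordinate bus and
 * return the total number of .resume() callbacks recorded on them.
 */
static int toy_walk(struct toy_dev *devs, size_t n)
{
	int calls = 0;

	for (size_t i = 0; i < n; i++) {
		toy_report_resume(&devs[i]);
		calls += devs[i].resume_calls;
	}
	return calls;
}
```

In this model, calling toy_walk() on devices whose driver pointer is
NULL runs no callbacks at all, while a device that still has
toy_bound_driver attached gets exactly one .resume() call.  The only
effect left in the detached case is the write to error_state, which
the fatal path never changed in the first place.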