Received: by 2002:ac0:a5a7:0:0:0:0:0 with SMTP id m36-v6csp3891335imm; Tue, 17 Jul 2018 12:04:30 -0700 (PDT) X-Google-Smtp-Source: AAOMgpePoArXrMnxNV1TXd6MPM6q+5J2BxEDTnMQywn1I1TFd65ho9XivnAL9zJd6fLe4eRhdAAr X-Received: by 2002:a62:47c4:: with SMTP id p65-v6mr1955072pfi.170.1531854270563; Tue, 17 Jul 2018 12:04:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1531854270; cv=none; d=google.com; s=arc-20160816; b=Qnp3r4C3OxeMIqMzBCuARZjtnQ81RpNGUKBtcSsnCXhn3EI8U3yuuDERCpwqddBRSS 8sY1/dvnqMuAXWPPk3O8qapIKk9fNhia/hQb9q+ov6CUUa/59udrIFQyFopviwF05MVN q8I34AHC1TJvSJClB/G6eC3OwYp6DTQojvmh8J5lX5Y92zSVk+igltlpgXveTtnMdjxA gsJ3pTCG7BMwWyHfX1VI0Mth6tkcoeqlGQWC7E38WiBWNzenjdQ2F5wxmtEn88ID+3qK dbzLw4vU0k+HvVg0WR9z89P6jXcJbO7kSLXqZm5CItKZ++SPtcYNfqanw1WPobxAxLqG U/Zw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=76kcQ00fh4oCntLN1JA9pRspVy/5R49cM5yPiOeOjF4=; b=Xwz4DtKWY1wjSqtWUiFjuZAJH92zD4HdVYXQHJ+Lv31hH8SZOy9iViAWSQNFrOtcxH sXLS4V+Ckn1U2dKZ27/0fnMxBGMKobqjppsR9z90E9blhYL0xhZZjXcRlReF9b3Z+GQ5 9Dwn2CuCQI/hj3FOVgIw8SRqePp0QNF7ssH9AWlzpzXWSSlt72uvz2gjTbwljYk1c68j W+Wh+w8Glp4xr9znGQS1S5qtyRc+0giVBepVRJ+etIS5yQg8ok5El8ojRr5JGheTlLOQ KPOnYnHIaO59RNpTSxe18O2c/ceKYCuoK1r7TQ5/25nimngSijJxk8j4BcQoEAtNx8TB 9LIQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=Y03qnqy5; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id m3-v6si1407894plt.71.2018.07.17.12.04.14; Tue, 17 Jul 2018 12:04:30 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=Y03qnqy5; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730128AbeGQThb (ORCPT + 99 others); Tue, 17 Jul 2018 15:37:31 -0400 Received: from mail.kernel.org ([198.145.29.99]:39156 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729720AbeGQThb (ORCPT ); Tue, 17 Jul 2018 15:37:31 -0400 Received: from localhost (unknown [69.71.4.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 16C832075E; Tue, 17 Jul 2018 19:03:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1531854211; bh=Csz2exf9vRrmukGiNgsm/DL/9FfGw2qAFis6UsOjo8Y=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Y03qnqy530GYAvpK2KZKdyFxRZQ7eYnouZOikZ/XJtCLSQcz8UvoJd5muN2479+mf TfE+K95DNUowFJy4vCptbQ2WhthT3Onhch9SLzxK9vRjB8PvJu/fQ9nBEZ0P1h6heO yZ59k9LvWCeoOk15oQ7chg42GE/vHiSQ9wYHqjQ4= Date: Tue, 17 Jul 2018 14:03:29 -0500 From: Bjorn Helgaas To: Oza Pawandeep Cc: Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Keith Busch , Wei Zhang , Sinan Kaya , Timur Tabi Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Message-ID: <20180717190329.GA128988@bhelgaas-glaptop.roam.corp.google.com> References: <1529661494-20936-1-git-send-email-poza@codeaurora.org> <1529661494-20936-2-git-send-email-poza@codeaurora.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1529661494-20936-2-git-send-email-poza@codeaurora.org> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Oza, Thanks for doing this! On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote: > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset > callbacks in case of ERR_NONFATAL. IIRC, the current strategy is: ERR_COR: log only ERR_NONFATAL: call driver callbacks (pci_error_handlers) ERR_FATAL: remove device and re-enumerate So these slot_reset callbacks are only used for ERR_NONFATAL, which are all uncorrectable errors, of course. This patch makes it so that when the slot_reset callbacks call pci_cleanup_aer_uncorrect_error_status(), we only clear the bits set by ERR_NONFATAL events (this is determined by PCI_ERR_UNCOR_SEVER). That makes good sense to me. All these status bits are RW1CS, so they will be preserved across a reset but will be cleared when we re-enumerate, in this path: pci_init_capabilities pci_aer_init pci_cleanup_aer_error_status_regs # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits > AER uncorrectable error status should take severity into account in order > to clear the bits, so that ERR_NONFATAL path does not clear the bit which > are marked with severity fatal. Two comments: 1) Can you split this into two patches, one that changes pci_cleanup_aer_uncorrect_error_status() so it looks like the error clearing code in aer_error_resume(), and a second that factors out the duplicate code? 2) Maybe use "sev" or "sever" instead of "mask" for the local variable, since there is also a separate PCI_ERR_UNCOR_MASK register, which is not involved here. 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer really describes what this does. Something like "pci_aer_clear_nonfatal_status()" would be more descriptive. But I see you have a subsequent patch (which I haven't looked at yet) that is related to this. 4) I don't think the driver slot_reset callbacks should be responsible for clearing these AER status bits. Can we clear them somewhere in the pcie_do_nonfatal_recovery() path and remove these calls from the drivers? 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per device when handling an error. We currently read it three times: aer_isr aer_isr_one_error find_source_device find_device_iter is_error_source read PCI_ERR_UNCOR_STATUS # 1 aer_process_err_devices get_device_error_info(e_info->dev[i]) read PCI_ERR_UNCOR_STATUS # 2 handle_error_source pcie_do_nonfatal_recovery ... report_slot_reset driver->err_handler->slot_reset pci_cleanup_aer_uncorrect_error_status read PCI_ERR_UNCOR_STATUS # 3 OK, that was more than two comments :) > Signed-off-by: Oza Pawandeep > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index a2e8838..d6cb1f0 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -360,13 +360,16 @@ EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting); > int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev) > { > int pos; > - u32 status; > + u32 status, mask; > > pos = dev->aer_cap; > if (!pos) > return -EIO; > > + /* Clean AER Root Error Status */ s/Clean/Clear/ (I see you copied this from aer_error_resume(), and maybe the second patch could remove or update that comment, too) > pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status); > + pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask); > + status &= ~mask; /* Clear corresponding nonfatal bits */ > if (status) > pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status); > > @@ -1336,8 +1339,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev) > */ > static void aer_error_resume(struct pci_dev *dev) > { > - int pos; > - u32 status, mask; > u16 reg16; > > /* Clean up Root device status */ > @@ -1345,11 +1346,7 @@ static void aer_error_resume(struct pci_dev *dev) > pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16); > > /* Clean AER Root Error Status */ > - pos = dev->aer_cap; > - pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status); > - pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask); > - status &= ~mask; /* Clear corresponding nonfatal bits */ > - pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status); > + pci_cleanup_aer_uncorrect_error_status(dev); > } > > static struct pcie_port_service_driver aerdriver = { > -- > 2.7.4 >