Date: Thu, 21 Jun 2018 14:49:17 +0530
From: poza@codeaurora.org
To: rajatxjain@gmail.com
Cc: Rajat Jain, Bjorn Helgaas, Jonathan Corbet, Philippe Ombredanne, Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman, Frederick Lawler, "Busch, Keith", Gabriele Paoloni, Alexandru Gagniuc, Thomas Tai, Steven Rostedt, linux-pci, linux-doc, Linux Kernel Mailing List, Jes Sorensen, Kyle McMartin
Subject: Re: [PATCH v2 5/5] Documentation/ABI: Add details of PCI AER statistics
References: <20180522222805.80314-1-rajatja@google.com> <20180523175808.28030-1-rajatja@google.com> <20180523175808.28030-6-rajatja@google.com> <7e146f62d1fa82a6f37848b22efc1b97@codeaurora.org>
Message-ID: <52d480b22e70713966039443931c2697@codeaurora.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2018-06-19 22:01, Rajat Jain wrote:
> On Mon, Jun 18, 2018 at 11:03 PM, wrote:
>> On 2018-06-19 05:41, Rajat Jain wrote:
>>>
>>> Hello,
>>>
>>> On Sat, Jun 16, 2018 at 10:24 PM wrote:
>>>>
>>>>
>>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>>> > Add the PCI AER statistics details to
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > and provide a pointer to it in
>>>> > Documentation/PCI/pcieaer-howto.txt
>>>> >
>>>> > Signed-off-by: Rajat Jain
>>>> > ---
>>>> > v2: Move the documentation to Documentation/ABI/
>>>> >
>>>> >  .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>>>> >  Documentation/PCI/pcieaer-howto.txt | 5 +
>>>> >  2 files changed, 108 insertions(+)
>>>> >  create mode 100644
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > new file mode 100644
>>>> > index 000000000000..f55c389290ac
>>>> > --- /dev/null
>>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > @@ -0,0 +1,103 @@
>>>> > +==========================
>>>> > +PCIe Device AER statistics
>>>> > +==========================
>>>> > +These attributes show up under all the devices that are AER capable.
>>>> > These
>>>> > +statistical counters indicate the errors "as seen/reported by the
>>>> > device".
>>>> > +Note that this may mean that if an end point is causing problems, the
>>>> > AER
>>>> > +counters may increment at its link partner (e.g. root port) because
>>>> > the
>>>> > +errors will be "seen" / reported by the link partner and not the the
>>>> > +problematic end point itself (which may report all counters as 0 as it
>>>> > never
>>>> > +saw any problems).
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/dev_total_cor_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of correctable errors seen and reported by
>>>> > this
>>>> > +               PCI device using ERR_COR.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/dev_total_fatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable fatal errors seen and
>>>> > reported
>>>> > +               by this PCI device using ERR_FATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/dev_total_nonfatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>>> > reported
>>>> > +               by this PCI device using ERR_NONFATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/dev_breakdown_correctable
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > +               PCI device using ERR_COR. A sample result looks like
>>>> > this:
>>>> > +-----------------------------------------
>>>> > +Receiver Error = 0x174
>>>> > +Bad TLP = 0x19
>>>> > +Bad DLLP = 0x3
>>>> > +RELAY_NUM Rollover = 0x0
>>>> > +Replay Timer Timeout = 0x1
>>>> > +Advisory Non-Fatal = 0x0
>>>> > +Corrected Internal Error = 0x0
>>>> > +Header Log Overflow = 0x0
>>>> > +-----------------------------------------
>>>> Why hex display? Decimal is easier to read, as these are counters.
>>>
>>>
>>> Have no particular preference. Since these can be potentially large
>>> numbers, just had a random thought that hex might make it more
>>> concise. I can change to decimal if that is preferable.
>>>
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/dev_breakdown_uncorrectable
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > +               PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>>> > result
>>>> > +               looks like this:
>>>> > +-----------------------------------------
>>>> > +Undefined = 0x0
>>>> > +Data Link Protocol = 0x0
>>>> > +Surprise Down Error = 0x0
>>>> > +Poisoned TLP = 0x0
>>>> > +Flow Control Protocol = 0x0
>>>> > +Completion Timeout = 0x0
>>>> > +Completer Abort = 0x0
>>>> > +Unexpected Completion = 0x0
>>>> > +Receiver Overflow = 0x0
>>>> > +Malformed TLP = 0x0
>>>> > +ECRC = 0x0
>>>> > +Unsupported Request = 0x0
>>>> > +ACS Violation = 0x0
>>>> > +Uncorrectable Internal Error = 0x0
>>>> > +MC Blocked TLP = 0x0
>>>> > +AtomicOp Egress Blocked = 0x0
>>>> > +TLP Prefix Blocked Error = 0x0
>>>> > +-----------------------------------------
>>>> > +
>>>> > +============================
>>>> > +PCIe Rootport AER statistics
>>>> > +============================
>>>> > +These attributes showup under only the rootports that are AER capable.
>>>> > These
>>>> > +indicate the number of error messages as "reported to" the rootport.
>>>> > Please note
>>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>>> > errors seen
>>>> > +by the internal rootport PCI device, so these counters includes them
>>>> > and are
>>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>>> > originating
>>>> > +at that root port.
>>>>
>>>> What about switches and bridges?
>>>
>>>
>>> What about them? AIUI, the switches forward the ERR_ messages from
>>> downstream devices to the rootport, like they do with standard
>>> messages. They can potentially generate their own ERR_ message and
>>> that would be reported no different than other end point devices.
>>
>>
>>
>> Yes. What I meant to ask is: the ERR_FATAL message coming from the EP can
>> be contained by the switch, and the error handling code treats the error
>> as contained by the switch irrespective of AER or DPC, and it will assume
>> that the problem could be with the switch/bridge upstream link.
>>
>> Hence the pci_dev of the switch is where you should be incrementing your
>> counters. Of course ERR_FATAL would have traversed up to the RP, but that
>> doesn't mean that you account for the error there.
>
> In this case, for the pci_dev for the rootport:
> - rootport_total_fatal_errors will be incremented (since it will get
> ERR_FATAL)
> - dev_total_fatal_errors will not be incremented.

OK, but my confusion is: should you not be incrementing the counter against
the pci_dev of the switch, and not the RP? The problem was with the
upstream link of the EP (e.g. the switch).

>
> The dev_total_fatal_errors will be incremented only for the pci device
> identified by the "Error Source Identification Register" in the PCIe
> spec. Does this help clarify?
>
>>
>>
>>>
>>>> Also, can you give some idea of, e.g., what the difference is between
>>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
>>>> are the same pci_dev)?
>>>
>>>
>>> For a pci_dev representing the rootport:
>>>
>>> dev_total_fatal_errors = how many times this PCI device *experienced*
>>> a fatal problem on its own (i.e. either link issues while talking to
>>> its link partner, or some internal errors).
>>>
>>> rootport_total_fatal_errors = how many times this rootport was
>>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>>> that originates at it (can be any link further downstream). This
>>> includes the dev_total_fatal_errors also, because any errors detected
>>> by the rootport are also "informed" to itself via ERR_* messages. In
>>> reality, this is just the total number of ERR_FATAL messages received
>>> at the rootport. This sysfs attribute will only exist for root ports.
>>>
>>>>
>>>> rootport_total_fatal_errs gives me an idea of how many times things
>>>> have failed under this pci_dev?
>>>
>>>
>>> Yes, as above.
>>>
>>>> which means the number of downstream link problems. But I am still
>>>> trying to make sense of how it could be used, since we don't have BDF
>>>> information associated with the number of errors anywhere (except
>>>> these AER print messages).
>>>>
>>>
>>> Agree. That is a limitation. The challenges being more record keeping,
>>> more complicated sysfs representation, and given that PCI devices may
>>> come and go, how do we know it is the same device before we collate
>>> their stats etc.
>>>
>>>>
>>>> And for dev_total_fatal_errs, as you mentioned above for the
>>>> problematic EP: say the root port reports it and increments
>>>> dev_total_fatal_errs++. Does it also increment
>>>> root-port_total_fatal_errs++ in the above scenario?
>>>
>>>
>>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
>>> port of that hierarchy.
>>>
>>> Thanks,
>>>
>>> Rajat
>>>
>>>>
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/rootport_total_cor_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/rootport_total_fatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices//aer_stats/rootport_total_nonfatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>>> > rootport.
>>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>>> > b/Documentation/PCI/pcieaer-howto.txt
>>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>>> > device who sends
>>>> > the error message to root port. Pls. refer to pci express specs for
>>>> > other fields.
>>>> >
>>>> > +2.4 AER Statistics / Counters
>>>> > +
>>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>>> > exposed
>>>> > +in form of sysfs attributes which are documented at
>>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > 3. Developer Guide
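
To make the counter semantics discussed in this thread concrete, here is a
minimal Python sketch; it is a toy model, not kernel code. It assumes the
behaviour Rajat describes above: the device identified by the "Error Source
Identification Register" gets its dev_total_fatal_errs incremented, and the
root port of that hierarchy gets rootport_total_fatal_errs incremented for
every ERR_FATAL message it receives, including its own. The names AerStats
and report_fatal are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AerStats:
    """Toy stand-in for one device's fatal-error counters (hypothetical)."""
    name: str
    dev_total_fatal_errs: int = 0
    rootport_total_fatal_errs: int = 0  # only meaningful for a root port


def report_fatal(source: AerStats, root_port: AerStats) -> None:
    """Account one ERR_FATAL message as described in the thread.

    `source` is the device identified as the error source; `root_port`
    is the root port at the top of that device's hierarchy.
    """
    source.dev_total_fatal_errs += 1
    root_port.rootport_total_fatal_errs += 1


rp = AerStats("root port")
sw = AerStats("switch port")
ep = AerStats("end point")

# A misbehaving end point: the error is seen and reported by its link
# partner (here the switch port), so the end point's own counter stays 0.
report_fatal(sw, rp)

# The root port detects an error on its own link: it counts both as a
# device error and as a message received at the root port.
report_fatal(rp, rp)

for dev in (rp, sw, ep):
    print(dev)
```

In this model the switch port ends up with dev_total_fatal_errs == 1 while
the root port's rootport_total_fatal_errs counts both messages, which is
the distinction being asked about above.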