Reply-To: rajatxjain@gmail.com
In-Reply-To: <7e146f62d1fa82a6f37848b22efc1b97@codeaurora.org>
References: <20180522222805.80314-1-rajatja@google.com>
 <20180523175808.28030-1-rajatja@google.com>
 <20180523175808.28030-6-rajatja@google.com>
 <7e146f62d1fa82a6f37848b22efc1b97@codeaurora.org>
From: Rajat Jain
Date: Tue, 19 Jun 2018 09:31:20 -0700
Subject: Re: [PATCH v2 5/5] Documentation/ABI: Add details of PCI AER statistics
To: Oza Pawandeep
Cc: Rajat Jain, Bjorn Helgaas, Jonathan Corbet, Philippe Ombredanne,
 Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman, Frederick Lawler,
 "Busch, Keith", Gabriele Paoloni, Alexandru Gagniuc, Thomas Tai,
 Steven Rostedt, linux-pci, linux-doc, Linux Kernel Mailing List,
 Jes Sorensen, Kyle McMartin
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 18, 2018 at 11:03 PM, wrote:
> On 2018-06-19 05:41, Rajat Jain wrote:
>>
>> Hello,
>>
>> On Sat, Jun 16, 2018 at 10:24 PM wrote:
>>>
>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>> > Add the PCI AER statistics details to
>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > and provide a pointer to it in
>>> > Documentation/PCI/pcieaer-howto.txt
>>> >
>>> > Signed-off-by: Rajat Jain
>>> > ---
>>> > v2: Move the
>>> > documentation to Documentation/ABI/
>>> >
>>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>>> >  2 files changed, 108 insertions(+)
>>> >  create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > new file mode 100644
>>> > index 000000000000..f55c389290ac
>>> > --- /dev/null
>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > @@ -0,0 +1,103 @@
>>> > +==========================
>>> > +PCIe Device AER statistics
>>> > +==========================
>>> > +These attributes show up under all the devices that are AER capable. These
>>> > +statistical counters indicate the errors "as seen/reported by the device".
>>> > +Note that this may mean that if an end point is causing problems, the AER
>>> > +counters may increment at its link partner (e.g. root port) because the
>>> > +errors will be "seen" / reported by the link partner and not the
>>> > +problematic end point itself (which may report all counters as 0 as it never
>>> > +saw any problems).
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_cor_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of correctable errors seen and reported by this
>>> > +                PCI device using ERR_COR.
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_fatal_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of uncorrectable fatal errors seen and reported
>>> > +                by this PCI device using ERR_FATAL.
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_nonfatal_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of uncorrectable non-fatal errors seen and reported
>>> > +                by this PCI device using ERR_NONFATAL.
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_breakdown_correctable
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Breakdown of correctable errors seen and reported by this
>>> > +                PCI device using ERR_COR. A sample result looks like this:
>>> > +-----------------------------------------
>>> > +Receiver Error = 0x174
>>> > +Bad TLP = 0x19
>>> > +Bad DLLP = 0x3
>>> > +RELAY_NUM Rollover = 0x0
>>> > +Replay Timer Timeout = 0x1
>>> > +Advisory Non-Fatal = 0x0
>>> > +Corrected Internal Error = 0x0
>>> > +Header Log Overflow = 0x0
>>> > +-----------------------------------------
>>> why hex display? decimal is easy to read as these are counters.
>>
>> Have no particular preference. Since these can be potentially large
>> numbers, just had a random thought that hex might make it more
>> concise. I can change to decimal if that is preferable.
>>
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_breakdown_uncorrectable
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Breakdown of uncorrectable errors seen and reported by this
>>> > +                PCI device using ERR_FATAL or ERR_NONFATAL.
>>> > A sample result
>>> > +                looks like this:
>>> > +-----------------------------------------
>>> > +Undefined = 0x0
>>> > +Data Link Protocol = 0x0
>>> > +Surprise Down Error = 0x0
>>> > +Poisoned TLP = 0x0
>>> > +Flow Control Protocol = 0x0
>>> > +Completion Timeout = 0x0
>>> > +Completer Abort = 0x0
>>> > +Unexpected Completion = 0x0
>>> > +Receiver Overflow = 0x0
>>> > +Malformed TLP = 0x0
>>> > +ECRC = 0x0
>>> > +Unsupported Request = 0x0
>>> > +ACS Violation = 0x0
>>> > +Uncorrectable Internal Error = 0x0
>>> > +MC Blocked TLP = 0x0
>>> > +AtomicOp Egress Blocked = 0x0
>>> > +TLP Prefix Blocked Error = 0x0
>>> > +-----------------------------------------
>>> > +
>>> > +============================
>>> > +PCIe Rootport AER statistics
>>> > +============================
>>> > +These attributes show up under only the rootports that are AER capable. These
>>> > +indicate the number of error messages as "reported to" the rootport. Please note
>>> > +that the rootports also transmit (internally) the ERR_* messages for errors seen
>>> > +by the internal rootport PCI device, so these counters include them and are
>>> > +thus cumulative of all the error messages on the PCI hierarchy originating
>>> > +at that root port.
>>>
>>> what about switches and bridges?
>>
>> What about them? AIUI, the switches forward the ERR_ messages from
>> downstream devices to the rootport, like they do with standard
>> messages. They can potentially generate their own ERR_ message and
>> that would be reported no different than other end point devices.
>
> yes, what I meant to ask is: the ERR_FATAL msg coming from the EP can be
> contained by the switch, and the error handling code thinks that the error
> is contained by the switch irrespective of AER or DPC, and it will think
> that the problem could be with the Switch/bridge upstream link.
>
> hence the pci_dev of the switch is where you should increment your counters.
> of course ERR_FATAL would have traversed till the RP, but that doesn't mean
> that you account the error there.

In this case, for the pci_dev for the rootport:

- rootport_total_fatal_errs will be incremented (since it will get ERR_FATAL)
- dev_total_fatal_errs will not be incremented.

The dev_total_fatal_errs will be incremented only for the pci device
identified by the "Error Source Identification Register" in the PCIe
spec.

Does this help clarify?

>
>>
>>> Also can you give some idea as to what the difference is between
>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
>>> are the same pci_dev)?
>>
>> For a pci_dev representing the rootport:
>>
>> dev_total_fatal_errs = how many times this PCI device *experienced*
>> a fatal problem on its own (i.e. either link issues while talking to
>> its link partner, or some internal errors).
>>
>> rootport_total_fatal_errs = how many times this rootport was
>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>> that originates at it (can be any link further downstream). This
>> includes the dev_total_fatal_errs also, because any errors detected
>> by the rootport are also "informed" to itself via ERR_* messages. In
>> reality, this is just the total number of ERR_FATAL messages received
>> at the rootport. This sysfs attribute will only exist for root ports.
>>
>>>
>>> rootport_total_fatal_errs gives me an idea of how many times things
>>> have failed under this pci_dev?
>>
>> Yes, as above.
>>
>>> which means num of downstream link problems. but I am still trying to
>>> make sense of how it could be used,
>>> since we don't have BDF information associated with the number of errors
>>> anywhere (except these AER print messages)
>>>
>>
>> Agree. That is a limitation.
>> The challenges being more record keeping,
>> more complicated sysfs representation, and given that PCI devices may
>> come and go, how do we know it is the same device before we collate
>> their stats etc.
>>
>>>
>>> and dev_total_fatal_errs, as you mentioned above for the problematic EP:
>>> then say the root port will report it and increment
>>> dev_total_fatal_errs++
>>> does it also increment rootport_total_fatal_errs++ in the above scenario?
>>
>> Yes, as above, it will also do rootport_total_fatal_errs++ for the root
>> port of that hierarchy.
>>
>> Thanks,
>>
>> Rajat
>>
>>>
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_cor_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of ERR_COR messages reported to rootport.
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_fatal_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of ERR_FATAL messages reported to rootport.
>>> > +
>>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_nonfatal_errs
>>> > +Date:           May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description:    Total number of ERR_NONFATAL messages reported to rootport.
>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
>>> >  the error message to root port. Pls. refer to pci express specs for other fields.
>>> >
>>> > +2.4 AER Statistics / Counters
>>> > +
>>> > +When PCIe AER errors are captured, the counters / statistics are also exposed
>>> > +in form of sysfs attributes which are documented at
>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> >  3. Developer Guide
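
For anyone scripting against the proposed breakdown attributes, here is a
minimal Python sketch of a parser for the "Name = value" layout shown in the
sample output quoted above. It is not part of the patch: the function name
parse_aer_breakdown and the SAMPLE text are made up for illustration, and the
layout is assumed to stay as in the quoted sample. Values go through
int(value, 0), so the sketch accepts hex (0x174) as posted as well as plain
decimal, should the format change as discussed in the thread.

```python
def parse_aer_breakdown(text):
    """Parse 'Name = value' lines into a dict of integer counters.

    Hypothetical helper, not kernel code. Blank lines and separator
    lines made only of dashes are skipped.
    """
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            continue
        # Split on the last '=' so the error name is kept whole.
        name, sep, value = line.rpartition("=")
        if not sep:
            continue  # not a 'Name = value' line
        # Base 0 lets int() accept both '0x174' and '372'.
        counters[name.strip()] = int(value.strip(), 0)
    return counters


# Illustrative input in the layout of the sample quoted from the patch.
SAMPLE = """\
-----------------------------------------
Receiver Error = 0x174
Bad TLP = 0x19
Bad DLLP = 0x3
Replay Timer Timeout = 0x1
Header Log Overflow = 0x0
-----------------------------------------
"""

if __name__ == "__main__":
    stats = parse_aer_breakdown(SAMPLE)
    for name, count in stats.items():
        print(f"{name}: {count}")
    print("total:", sum(stats.values()))
```

In practice the text would be read from one of the breakdown attributes of an
AER-capable device instead of SAMPLE. The sketch only covers the per-type
breakdown files; the dev_total_* and rootport_total_* attributes are not
handled here.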