From: Rajat Jain
Date: Mon, 18 Jun 2018 17:32:16 -0700
Subject: Re: [PATCH v2 5/5] Documentation/ABI: Add details of PCI AER statistics
To: Rajat Jain
Cc: Oza Pawandeep, Bjorn Helgaas, Jonathan Corbet, Philippe Ombredanne,
    Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman, Frederick Lawler,
    "Busch, Keith", Gabriele Paoloni, Alexandru Gagniuc, Thomas Tai,
    Steven Rostedt, linux-pci, linux-doc, Linux Kernel Mailing List,
    Jes Sorensen, Kyle McMartin
Reply-To: rajatxjain@gmail.com
References: <20180522222805.80314-1-rajatja@google.com> <20180523175808.28030-1-rajatja@google.com> <20180523175808.28030-6-rajatja@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Sorry, correction needed in my statement below:

On Mon, Jun 18, 2018 at 5:11 PM, Rajat Jain wrote:
> Hello,
>
> On Sat, Jun 16, 2018 at 10:24 PM wrote:
>>
>> On 2018-05-23 23:28, Rajat Jain wrote:
>> > Add the PCI AER statistics details to
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > and provide a pointer to it in
>> > Documentation/PCI/pcieaer-howto.txt
>> >
>> > Signed-off-by: Rajat Jain
>> > ---
>> > v2: Move the documentation to Documentation/ABI/
>> >
>> >  .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>> >  Documentation/PCI/pcieaer-howto.txt | 5 +
>> >  2 files changed, 108 insertions(+)
>> >  create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > new file mode 100644
>> > index 000000000000..f55c389290ac
>> > --- /dev/null
>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > @@ -0,0 +1,103 @@
>> > +==========================
>> > +PCIe Device AER statistics
>> > +==========================
>> > +These attributes show up under all the devices that are AER capable. These
>> > +statistical counters indicate the errors "as seen/reported by the device".
>> > +Note that this may mean that if an end point is causing problems, the AER
>> > +counters may increment at its link partner (e.g. root port) because the
>> > +errors will be "seen" / reported by the link partner and not the
>> > +problematic end point itself (which may report all counters as 0 as it never
>> > +saw any problems).
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_cor_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of correctable errors seen and reported by this
>> > +                PCI device using ERR_COR.
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_fatal_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of uncorrectable fatal errors seen and reported
>> > +                by this PCI device using ERR_FATAL.
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_total_nonfatal_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of uncorrectable non-fatal errors seen and reported
>> > +                by this PCI device using ERR_NONFATAL.
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_breakdown_correctable
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Breakdown of correctable errors seen and reported by this
>> > +                PCI device using ERR_COR. A sample result looks like this:
>> > +-----------------------------------------
>> > +Receiver Error = 0x174
>> > +Bad TLP = 0x19
>> > +Bad DLLP = 0x3
>> > +RELAY_NUM Rollover = 0x0
>> > +Replay Timer Timeout = 0x1
>> > +Advisory Non-Fatal = 0x0
>> > +Corrected Internal Error = 0x0
>> > +Header Log Overflow = 0x0
>> > +-----------------------------------------
>> Why hex display? Decimal is easier to read, as these are counters.
>
> I have no particular preference. Since these can potentially be large
> numbers, I just thought that hex might make them more concise. I can
> change to decimal if that is preferable.
>
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/dev_breakdown_uncorrectable
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Breakdown of uncorrectable errors seen and reported by this
>> > +                PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
>> > +                looks like this:
>> > +-----------------------------------------
>> > +Undefined = 0x0
>> > +Data Link Protocol = 0x0
>> > +Surprise Down Error = 0x0
>> > +Poisoned TLP = 0x0
>> > +Flow Control Protocol = 0x0
>> > +Completion Timeout = 0x0
>> > +Completer Abort = 0x0
>> > +Unexpected Completion = 0x0
>> > +Receiver Overflow = 0x0
>> > +Malformed TLP = 0x0
>> > +ECRC = 0x0
>> > +Unsupported Request = 0x0
>> > +ACS Violation = 0x0
>> > +Uncorrectable Internal Error = 0x0
>> > +MC Blocked TLP = 0x0
>> > +AtomicOp Egress Blocked = 0x0
>> > +TLP Prefix Blocked Error = 0x0
>> > +-----------------------------------------
>> > +
>> > +============================
>> > +PCIe Rootport AER statistics
>> > +============================
>> > +These attributes show up under only the rootports that are AER capable. These
>> > +indicate the number of error messages as "reported to" the rootport. Please note
>> > +that the rootports also transmit (internally) the ERR_* messages for errors seen
>> > +by the internal rootport PCI device, so these counters include them and are
>> > +thus cumulative of all the error messages on the PCI hierarchy originating
>> > +at that root port.
>>
>> What about switches and bridges?
>
> What about them? AIUI, the switches forward the ERR_* messages from
> downstream devices to the rootport, like they do with standard
> messages. They can potentially generate their own ERR_* messages, and
> those would be reported no differently than for other end point devices.
>
>> Also, can you give some idea of, e.g., the difference between
>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
>> are on the same pci_dev)?
>
> For a pci_dev representing the rootport:
>
> dev_total_fatal_errs = how many times this PCI device *experienced*
> a fatal problem on its own (i.e. either link issues while talking to
> its link partner, or some internal errors).
>
> rootport_total_fatal_errs = how many times this rootport was
> *informed* about a problem (via ERR_* messages) in the PCI hierarchy

Read the above sentence as:

"rootport_total_fatal_errs = how many times this rootport was
*informed* about a FATAL problem (via ERR_FATAL messages) in the PCI
hierarchy"

> that originates at it (can be any link further downstream). This
> includes dev_total_fatal_errs as well, because any errors detected
> by the rootport are also "informed" to itself via ERR_* messages. In
> reality, this is just the total number of ERR_FATAL messages received
> at the rootport. This sysfs attribute will only exist for root ports.
>
>>
>> rootport_total_fatal_errs gives me an idea of how many times things
>> have failed under this pci_dev?
>
> Yes, as above.
>
>> which means the number of downstream link problems. But I am still
>> trying to make sense of how it could be used, since we don't have BDF
>> information associated with the number of errors anywhere (except
>> these AER print messages).
>>
>
> Agreed. That is a limitation. The challenges are more record keeping,
> a more complicated sysfs representation, and, given that PCI devices
> may come and go, knowing that it is the same device before we collate
> its stats.
>
>>
>> And for dev_total_fatal_errs, as you mentioned above for a problematic
>> EP: say the root port reports it and increments dev_total_fatal_errs.
>> Does it also increment rootport_total_fatal_errs in that scenario?
>
> Yes, as above, it will also increment rootport_total_fatal_errs for
> the root port of that hierarchy.
>
> Thanks,
>
> Rajat
>
>>
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_cor_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of ERR_COR messages reported to rootport.
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_fatal_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of ERR_FATAL messages reported to rootport.
>> > +
>> > +Where:          /sys/bus/pci/devices//aer_stats/rootport_total_nonfatal_errs
>> > +Date:           May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:        linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description:    Total number of ERR_NONFATAL messages reported to rootport.
>> > diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
>> > index acd0dddd6bb8..91b6e677cb8c 100644
>> > --- a/Documentation/PCI/pcieaer-howto.txt
>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
>> >  the error message to root port. Pls. refer to pci express specs for
>> >  other fields.
>> >
>> > +2.4 AER Statistics / Counters
>> > +
>> > +When PCIe AER errors are captured, the counters / statistics are also exposed
>> > +in form of sysfs attributes which are documented at
>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> >  3. Developer Guide
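
As an illustration of how the dev_breakdown_* attributes discussed in this thread might be consumed from user space, here is a minimal Python sketch. It assumes the "Name = value" line format from the v2 sample output quoted in the patch. That format was still under review here (hex versus decimal), so the parser accepts either. The names parse_breakdown and SAMPLE are hypothetical and exist only for this example.

```python
from typing import Dict

# Sample text copied from the dev_breakdown_correctable example in the
# quoted patch; the dashed separator lines are included on purpose.
SAMPLE = """\
-----------------------------------------
Receiver Error = 0x174
Bad TLP = 0x19
Bad DLLP = 0x3
RELAY_NUM Rollover = 0x0
Replay Timer Timeout = 0x1
Advisory Non-Fatal = 0x0
Corrected Internal Error = 0x0
Header Log Overflow = 0x0
-----------------------------------------
"""


def parse_breakdown(text: str) -> Dict[str, int]:
    """Parse 'Name = value' lines into a {name: count} dictionary.

    Assumption: one counter per line, separated by '='. Lines without
    '=' (such as the dashed separators) are skipped.
    """
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition("=")
        if not sep:
            continue  # no '=' on this line: separator or blank
        # Base 0 lets int() accept both "0x174" and plain decimal "372".
        counters[name.strip()] = int(value.strip(), 0)
    return counters


if __name__ == "__main__":
    for name, count in parse_breakdown(SAMPLE).items():
        print(f"{name}: {count}")
```

On a real system the text would come from reading the attribute file in the device's aer_stats directory rather than from a literal string. The device-specific part of that path is not shown here.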