This patchset exposes the AER stats via the sysfs attributes.
Rajat Jain (5):
PCI/AER: Define and allocate aer_stats structure for AER capable
devices
PCI/AER: Add sysfs stats for AER capable devices
PCP/AER: Add sysfs attributes to provide breakdown of AERs
PCI/AER: Add sysfs attributes for rootport cumulative stats
Documentation/PCI: Add details of PCI AER statistics
Documentation/PCI/pcieaer-howto.txt | 35 +++++
drivers/pci/pci-sysfs.c | 3 +
drivers/pci/pci.h | 4 +-
drivers/pci/pcie/aer/Makefile | 2 +-
drivers/pci/pcie/aer/aerdrv.h | 15 ++
drivers/pci/pcie/aer/aerdrv_core.c | 11 ++
drivers/pci/pcie/aer/aerdrv_errprint.c | 7 +-
drivers/pci/pcie/aer/aerdrv_stats.c | 192 +++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 +
10 files changed, 269 insertions(+), 4 deletions(-)
create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
--
2.17.0.441.gb46fe60e1d-goog
Add the PCI AER statistics details to
Documentation/PCI/pcieaer-howto.txt
Signed-off-by: Rajat Jain <[email protected]>
---
Documentation/PCI/pcieaer-howto.txt | 35 +++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd0dddd6bb8..86ee9f9ff5e1 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for
other fields.
+2.4 AER statistics
+
+When AER messages are captured, the statistics are exposed via the following
+sysfs attributes under the "aer_stats" folder for the device:
+
+2.4.1 Device sysfs Attributes
+
+These attributes show up under all the devices that are AER capable. These
+indicate the errors "as seen by the device". Note that this may mean that if
+an end point is causing problems, the AER counters may increment at its link
+partner (e.g. root port) because the errors will be "seen" by the link partner
+and not the the problematic end point itself (which may report all counters
+as 0 as it never saw any problems).
+
+ * dev_total_cor_errs: number of correctable errors seen by the device.
+ * dev_total_fatal_errs: number of fatal uncorrectable errors seen by the device.
+ * dev_total_nonfatal_errs: number of nonfatal uncorr errors seen by the device.
+ * dev_breakdown_correctable: Provides a breakdown of different type of
+ correctable errors seen.
+ * dev_breakdown_uncorrectable: Provides a breakdown of different type of
+ uncorrectable errors seen.
+
+2.4.1 Rootport sysfs Attributes
+
+These attributes showup under only the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters includes them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+ * rootport_total_cor_errs: number of ERR_COR messages reported to rootport.
+ * rootport_total_fatal_errs: number of ERR_FATAL messages reported to rootport.
+ * rootport_total_nonfatal_errs: number of ERR_NONFATAL messages reporeted to
+ rootport.
3. Developer Guide
--
2.17.0.441.gb46fe60e1d-goog
Add sysfs attributes to provide breakdown of the AERs seen,
into different type of correctable or uncorrectable errors:
dev_breakdown_correctable
dev_breakdown_uncorrectable
Signed-off-by: Rajat Jain <[email protected]>
---
drivers/pci/pcie/aer/aerdrv.h | 6 ++++++
drivers/pci/pcie/aer/aerdrv_errprint.c | 6 ++++--
drivers/pci/pcie/aer/aerdrv_stats.c | 25 +++++++++++++++++++++++++
3 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index b5d5ad6f2c03..048fbd7c9633 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -89,6 +89,12 @@ int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
+extern const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+
+extern const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
#else
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 5e8b98deda08..5585f309f1a8 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -68,7 +68,8 @@ static const char *aer_error_layer[] = {
"Transaction Layer"
};
-static const char *aer_correctable_error_string[] = {
+const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS] = {
"Receiver Error", /* Bit Position 0 */
NULL,
NULL,
@@ -87,7 +88,8 @@ static const char *aer_correctable_error_string[] = {
"Header Log Overflow", /* Bit Position 15 */
};
-static const char *aer_uncorrectable_error_string[] = {
+const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS] = {
"Undefined", /* Bit Position 0 */
NULL,
NULL,
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index 87b7119d0a86..5f0a6e144f56 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -61,10 +61,35 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ unsigned int i; \
+ char *str = buf; \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ u64 *stats = pdev->aer_stats->stats_array; \
+ for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
+ if (strings_array[i]) \
+ str += sprintf(str, "%s = 0x%llx\n", \
+ strings_array[i], stats[i]); \
+ } \
+ return str-buf; \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
+ aer_correctable_error_string);
+aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
+ aer_uncorrectable_error_string);
+
static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_cor_errs.attr,
&dev_attr_dev_total_fatal_errs.attr,
&dev_attr_dev_total_nonfatal_errs.attr,
+ &dev_attr_dev_breakdown_correctable.attr,
+ &dev_attr_dev_breakdown_uncorrectable.attr,
NULL
};
--
2.17.0.441.gb46fe60e1d-goog
Add sysfs attributes for rootport statistics (that are cumulative
of all the ERR_* messages seen on this PCI hierarchy).
Signed-off-by: Rajat Jain <[email protected]>
---
drivers/pci/pcie/aer/aerdrv.h | 2 ++
drivers/pci/pcie/aer/aerdrv_core.c | 2 ++
drivers/pci/pcie/aer/aerdrv_stats.c | 31 +++++++++++++++++++++++++++++
3 files changed, 35 insertions(+)
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 048fbd7c9633..77d8355551d9 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -88,6 +88,8 @@ irqreturn_t aer_irq(int irq, void *context);
int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
+void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src);
extern const char
*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 42a6f913069a..0f70e22563f3 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -424,6 +424,8 @@ static void aer_isr_one_error(struct pcie_device *p_device,
struct aer_rpc *rpc = get_service_data(p_device);
struct aer_err_info *e_info = &rpc->e_info;
+ pci_rootport_aer_stats_incr(p_device->port, e_src);
+
/*
* There is a possibility that both correctable error and
* uncorrectable error being logged. Report correctable error first.
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index 5f0a6e144f56..a526e26c8683 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -60,6 +60,9 @@ static DEVICE_ATTR_RO(field)
aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+aer_stats_aggregate_attr(rootport_total_cor_errs);
+aer_stats_aggregate_attr(rootport_total_fatal_errs);
+aer_stats_aggregate_attr(rootport_total_nonfatal_errs);
#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
static ssize_t \
@@ -90,6 +93,9 @@ static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_nonfatal_errs.attr,
&dev_attr_dev_breakdown_correctable.attr,
&dev_attr_dev_breakdown_uncorrectable.attr,
+ &dev_attr_rootport_total_cor_errs.attr,
+ &dev_attr_rootport_total_fatal_errs.attr,
+ &dev_attr_rootport_total_nonfatal_errs.attr,
NULL
};
@@ -102,6 +108,12 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
if (!pdev->aer_stats)
return 0;
+ if ((a == &dev_attr_rootport_total_cor_errs.attr ||
+ a == &dev_attr_rootport_total_fatal_errs.attr ||
+ a == &dev_attr_rootport_total_nonfatal_errs.attr) &&
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT)
+ return 0;
+
return a->mode;
}
@@ -144,6 +156,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
counter[i]++;
}
+void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src)
+{
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (unlikely(!aer_stats))
+ return;
+
+ if (e_src->status & PCI_ERR_ROOT_COR_RCV)
+ aer_stats->rootport_total_cor_errs++;
+
+ if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
+ if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
+ aer_stats->rootport_total_fatal_errs++;
+ else
+ aer_stats->rootport_total_nonfatal_errs++;
+ }
+}
+
int pci_aer_stats_init(struct pci_dev *pdev)
{
pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
--
2.17.0.441.gb46fe60e1d-goog
Define a structure to hold the AER statistics. There are 2 groups
of statistics: dev_* counters that are to be collected for all AER
capable devices and rootport_* counters that are collected for all
(AER capable) rootports only. Allocate and free this structure when
device is added or released (thus counters survive the lifetime of the
device).
Add a new file aerdrv_stats.c to hold the AER stats collection logic.
Signed-off-by: Rajat Jain <[email protected]>
---
drivers/pci/pcie/aer/Makefile | 2 +-
drivers/pci/pcie/aer/aerdrv.h | 6 +++
drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++
drivers/pci/pcie/aer/aerdrv_stats.c | 64 +++++++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 ++
6 files changed, 84 insertions(+), 1 deletion(-)
create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
diff --git a/drivers/pci/pcie/aer/Makefile b/drivers/pci/pcie/aer/Makefile
index 09bd890875a3..a06f9cc2bde5 100644
--- a/drivers/pci/pcie/aer/Makefile
+++ b/drivers/pci/pcie/aer/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_PCIEAER) += aerdriver.o
obj-$(CONFIG_PCIE_ECRC) += ecrc.o
-aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o
+aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o aerdrv_stats.o
aerdriver-$(CONFIG_ACPI) += aerdrv_acpi.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index b4c950683cc7..d8b9fba536ed 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -33,6 +33,10 @@
PCI_ERR_UNC_MALF_TLP)
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+
+#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
+#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
+
struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
@@ -81,6 +85,8 @@ void aer_isr(struct work_struct *work);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
irqreturn_t aer_irq(int irq, void *context);
+int pci_aer_stats_init(struct pci_dev *pdev);
+void pci_aer_stats_exit(struct pci_dev *pdev);
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 36e622d35c48..42a6f913069a 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -95,9 +95,18 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
int pci_aer_init(struct pci_dev *dev)
{
dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+
+ if (!dev->aer_cap || pci_aer_stats_init(dev))
+ return -EIO;
+
return pci_cleanup_aer_error_status_regs(dev);
}
+void pci_aer_exit(struct pci_dev *dev)
+{
+ pci_aer_stats_exit(dev);
+}
+
/**
* add_error_device - list device to be handled
* @e_info: pointer to error info
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
new file mode 100644
index 000000000000..b9f251992209
--- /dev/null
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Google Inc, All Rights Reserved.
+ * Rajat Jain ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * AER Statistics - exposed to userspace via /sysfs attributes.
+ */
+
+#include <linux/pci.h>
+#include "aerdrv.h"
+
+/* AER stats for the device */
+struct aer_stats {
+
+ /*
+ * Fields for all AER capable devices. They indicate the errors
+ * "as seen by this device". Note that this may mean that if an
+ * end point is causing problems, the AER counters may increment
+ * at its link partner (e.g. root port) because the errors will be
+ * "seen" by the link partner and not the the problematic end point
+ * itself (which may report all counters as 0 as it never saw any
+ * problems).
+ */
+ /* Individual counters for different type of correctable errors */
+ u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+ /* Individual counters for different type of uncorrectable errors */
+ u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+ /* Total number of correctable errors seen by this device */
+ u64 dev_total_cor_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_fatal_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_nonfatal_errs;
+
+ /*
+ * Fields for Root ports only, these indicate the total number of
+ * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
+ * rootport, INCLUDING the ones that are generated internally (by
+ * the rootport itself)
+ */
+ u64 rootport_total_cor_errs;
+ u64 rootport_total_fatal_errs;
+ u64 rootport_total_nonfatal_errs;
+};
+
+int pci_aer_stats_init(struct pci_dev *pdev)
+{
+ pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ if (!pdev->aer_stats) {
+ dev_err(&pdev->dev, "No memory for aer_stats\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void pci_aer_stats_exit(struct pci_dev *pdev)
+{
+ kfree(pdev->aer_stats);
+ pdev->aer_stats = NULL;
+}
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 384020757b81..dd662c241373 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
+ pci_aer_exit(dev);
pci_vpd_release(dev);
pci_iov_release(dev);
pci_free_cap_save_buffers(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 21965e0dbe62..5c84b1304de7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -299,6 +299,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
+ struct aer_stats *aer_stats; /* AER stats for this device */
#endif
u8 pcie_cap; /* PCIe capability offset */
u8 msi_cap; /* MSI capability offset */
@@ -1470,10 +1471,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
void pci_no_aer(void);
bool pci_aer_available(void);
int pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
#else
static inline void pci_no_aer(void) { }
static inline bool pci_aer_available(void) { return false; }
static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
#endif
#ifdef CONFIG_PCIE_ECRC
--
2.17.0.441.gb46fe60e1d-goog
Add the following AER sysfs stats to represent the counters for each
kind of error as seen by the device:
dev_total_cor_errs
dev_total_fatal_errs
dev_total_nonfatal_errs
Signed-off-by: Rajat Jain <[email protected]>
---
drivers/pci/pci-sysfs.c | 3 ++
drivers/pci/pci.h | 4 +-
drivers/pci/pcie/aer/aerdrv.h | 1 +
drivers/pci/pcie/aer/aerdrv_errprint.c | 1 +
drivers/pci/pcie/aer/aerdrv_stats.c | 72 ++++++++++++++++++++++++++
5 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 366d93af051d..730f985a3dc9 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1743,6 +1743,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
#endif
&pci_bridge_attr_group,
&pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+ &aer_stats_attr_group,
+#endif
NULL,
};
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c358e7a07f3f..9a28ec600225 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,7 +181,9 @@ extern const struct attribute_group *pci_dev_groups[];
extern const struct attribute_group *pcibus_groups[];
extern const struct device_type pci_dev_type;
extern const struct attribute_group *pci_bus_groups[];
-
+#ifdef CONFIG_PCIEAER
+extern const struct attribute_group aer_stats_attr_group;
+#endif
/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index d8b9fba536ed..b5d5ad6f2c03 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
irqreturn_t aer_irq(int irq, void *context);
int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 21ca5e1b0ded..5e8b98deda08 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_err(dev, " [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
}
+ pci_dev_aer_stats_incr(dev, info);
}
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index b9f251992209..87b7119d0a86 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -47,6 +47,78 @@ struct aer_stats {
u64 rootport_total_nonfatal_errs;
};
+#define aer_stats_aggregate_attr(field) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "0x%llx\n", pdev->aer_stats->field); \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_aggregate_attr(dev_total_cor_errs);
+aer_stats_aggregate_attr(dev_total_fatal_errs);
+aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+
+static struct attribute *aer_stats_attrs[] __ro_after_init = {
+ &dev_attr_dev_total_cor_errs.attr,
+ &dev_attr_dev_total_fatal_errs.attr,
+ &dev_attr_dev_total_nonfatal_errs.attr,
+ NULL
+};
+
+static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_stats)
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group aer_stats_attr_group = {
+ .name = "aer_stats",
+ .attrs = aer_stats_attrs,
+ .is_visible = aer_stats_attrs_are_visible,
+};
+
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ int status, i, max = -1;
+ u64 *counter = NULL;
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (unlikely(!aer_stats))
+ return;
+
+ switch (info->severity) {
+ case AER_CORRECTABLE:
+ aer_stats->dev_total_cor_errs++;
+ counter = &aer_stats->dev_cor_errs[0];
+ max = AER_MAX_TYPEOF_CORRECTABLE_ERRS;
+ break;
+ case AER_NONFATAL:
+ aer_stats->dev_total_nonfatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ case AER_FATAL:
+ aer_stats->dev_total_fatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ }
+
+ status = (info->status & ~info->mask);
+ for (i = 0; i < max; i++)
+ if (status & (1 << i))
+ counter[i]++;
+}
+
int pci_aer_stats_init(struct pci_dev *pdev)
{
pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
--
2.17.0.441.gb46fe60e1d-goog
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the following AER sysfs stats to represent the counters for each
> kind of error as seen by the device:
>
> dev_total_cor_errs
> dev_total_fatal_errs
> dev_total_nonfatal_errs
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> drivers/pci/pci-sysfs.c | 3 ++
> drivers/pci/pci.h | 4 +-
> drivers/pci/pcie/aer/aerdrv.h | 1 +
> drivers/pci/pcie/aer/aerdrv_errprint.c | 1 +
> drivers/pci/pcie/aer/aerdrv_stats.c | 72 ++++++++++++++++++++++++++
> 5 files changed, 80 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 366d93af051d..730f985a3dc9 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1743,6 +1743,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
> #endif
> &pci_bridge_attr_group,
> &pcie_dev_attr_group,
> +#ifdef CONFIG_PCIEAER
> + &aer_stats_attr_group,
> +#endif
> NULL,
> };
So if the device is removed as part of recovery, then these get reset,
right? So if the device fails intermittently, these counters would keep
getting reset. Is this the intent?
(snip)
> /**
> * pci_match_one_device - Tell if a PCI device structure has a matching
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index d8b9fba536ed..b5d5ad6f2c03 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
> irqreturn_t aer_irq(int irq, void *context);
> int pci_aer_stats_init(struct pci_dev *pdev);
> void pci_aer_stats_exit(struct pci_dev *pdev);
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>
> #ifdef CONFIG_ACPI_APEI
> int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 21ca5e1b0ded..5e8b98deda08 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
> pci_err(dev, " [%2d] Unknown Error Bit%s\n",
> i, info->first_error == i ? " (First)" : "");
> }
> + pci_dev_aer_stats_incr(dev, info);
What about AER errors that are contained by DPC?
Alex
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> Documentation/PCI/pcieaer-howto.txt | 35 +++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
> index acd0dddd6bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
> the error message to root port. Pls. refer to pci express specs for
> other fields.
>
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link partner
> +and not the the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).
I was afraid of that. Is there a way to look at the requester ID to log
AER errors to the correct device?
Alex
Hi,
On Tue, May 22, 2018 at 3:52 PM, Alex G. <[email protected]> wrote:
> On 05/22/2018 05:28 PM, Rajat Jain wrote:
>> Add the PCI AER statistics details to
>> Documentation/PCI/pcieaer-howto.txt
>>
>> Signed-off-by: Rajat Jain <[email protected]>
>> ---
>> Documentation/PCI/pcieaer-howto.txt | 35 +++++++++++++++++++++++++++++
>> 1 file changed, 35 insertions(+)
>>
>> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
>> index acd0dddd6bb8..86ee9f9ff5e1 100644
>> --- a/Documentation/PCI/pcieaer-howto.txt
>> +++ b/Documentation/PCI/pcieaer-howto.txt
>> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
>> the error message to root port. Pls. refer to pci express specs for
>> other fields.
>>
>> +2.4 AER statistics
>> +
>> +When AER messages are captured, the statistics are exposed via the following
>> +sysfs attributes under the "aer_stats" folder for the device:
>> +
>> +2.4.1 Device sysfs Attributes
>> +
>> +These attributes show up under all the devices that are AER capable. These
>> +indicate the errors "as seen by the device". Note that this may mean that if
>> +an end point is causing problems, the AER counters may increment at its link
>> +partner (e.g. root port) because the errors will be "seen" by the link partner
>> +and not the the problematic end point itself (which may report all counters
>> +as 0 as it never saw any problems).
>
> I was afraid of that. Is there a way to look at the requester ID to log
> AER errors to the correct device?
I do not think it is possible to pin point the source of the problem.
Errors may be caused due to sub optimal link tuning, or signal
integrity, or either of the link partners. Both the link partners will
detect and report the errors that they "see".
The bits and errors defined by the PCIe spec, follow the same semantics i.e.
=> the spec defines the different error conditions "as
seen/encountered by the device",
=> Thus the device reports those errors to the root port
=> which is what we are counting and reporting here.
IMHO, any interpretation / analysis of this error data / counters
should be left to the user so that he can look at different devices
and the errors they see, and then conclude on what might be the
problem.
Thanks,
Rajat
>
> Alex
On Tue, May 22, 2018 at 3:50 PM, Alex G. <[email protected]> wrote:
>
>
> On 05/22/2018 05:28 PM, Rajat Jain wrote:
>> Add the following AER sysfs stats to represent the counters for each
>> kind of error as seen by the device:
>>
>> dev_total_cor_errs
>> dev_total_fatal_errs
>> dev_total_nonfatal_errs
>>
>> Signed-off-by: Rajat Jain <[email protected]>
>> ---
>> drivers/pci/pci-sysfs.c | 3 ++
>> drivers/pci/pci.h | 4 +-
>> drivers/pci/pcie/aer/aerdrv.h | 1 +
>> drivers/pci/pcie/aer/aerdrv_errprint.c | 1 +
>> drivers/pci/pcie/aer/aerdrv_stats.c | 72 ++++++++++++++++++++++++++
>> 5 files changed, 80 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index 366d93af051d..730f985a3dc9 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -1743,6 +1743,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
>> #endif
>> &pci_bridge_attr_group,
>> &pcie_dev_attr_group,
>> +#ifdef CONFIG_PCIEAER
>> + &aer_stats_attr_group,
>> +#endif
>> NULL,
>> };
>
> So if the device is removed as part of recovery, then these get reset,
> right? So if the device fails intermittently, these counters would keep
> getting reset. Is this the intent?
Umm, kind of.
* One argument is that if a PCI device is removed and then
re-enumerated, how do we know it is the same device and has not been
replaced by another device for e.g.? Note that the root port counters
that have the cumulative counters for all the errors seen will still
have them logged in the situation you describe.
>
> (snip)
>
>> /**
>> * pci_match_one_device - Tell if a PCI device structure has a matching
>> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
>> index d8b9fba536ed..b5d5ad6f2c03 100644
>> --- a/drivers/pci/pcie/aer/aerdrv.h
>> +++ b/drivers/pci/pcie/aer/aerdrv.h
>> @@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
>> irqreturn_t aer_irq(int irq, void *context);
>> int pci_aer_stats_init(struct pci_dev *pdev);
>> void pci_aer_stats_exit(struct pci_dev *pdev);
>> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>>
>> #ifdef CONFIG_ACPI_APEI
>> int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
>> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
>> index 21ca5e1b0ded..5e8b98deda08 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
>> @@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
>> pci_err(dev, " [%2d] Unknown Error Bit%s\n",
>> i, info->first_error == i ? " (First)" : "");
>> }
>> + pci_dev_aer_stats_incr(dev, info);
>
> What about AER errors that are contained by DPC?
Thanks, You are right, this patch does not take care of the DPC. I'll
try to read up on DPC and can integrate it if it turns out to be easy
enough.
Thanks,
Rajat
>
> Alex
On 5/22/2018 7:27 PM, Rajat Jain wrote:
>> What about AER errors that are contained by DPC?
> Thanks, You are right, this patch does not take care of the DPC. I'll
> try to read up on DPC and can integrate it if it turns out to be easy
> enough.
>
I'd focus on AER for the moment. DPC is going through major restructuring
and will only handle AER_FATAL errors moving forward.
> Thanks,
>
> Rajat
>
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
On Tue, May 22, 2018 at 03:28:02PM -0700, Rajat Jain wrote:
> Add the following AER sysfs stats to represent the counters for each
> kind of error as seen by the device:
>
> dev_total_cor_errs
> dev_total_fatal_errs
> dev_total_nonfatal_errs
You need Documentation/ABI/ updates for new sysfs files please.
thanks,
greg k-h
On Tue, May 22, 2018 at 03:28:05PM -0700, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> Documentation/PCI/pcieaer-howto.txt | 35 +++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
> index acd0dddd6bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
> the error message to root port. Pls. refer to pci express specs for
> other fields.
>
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link partner
> +and not the the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).
> +
> + * dev_total_cor_errs: number of correctable errors seen by the device.
> + * dev_total_fatal_errs: number of fatal uncorrectable errors seen by the device.
> + * dev_total_nonfatal_errs: number of nonfatal uncorr errors seen by the device.
> + * dev_breakdown_correctable: Provides a breakdown of different type of
> + correctable errors seen.
> + * dev_breakdown_uncorrectable: Provides a breakdown of different type of
> + uncorrectable errors seen.
> +
> +2.4.1 Rootport sysfs Attributes
> +
> +These attributes showup under only the rootports that are AER capable. These
> +indicate the number of error messages as "reported to" the rootport. Please note
> +that the rootports also transmit (internally) the ERR_* messages for errors seen
> +by the internal rootport PCI device, so these counters includes them and are
> +thus cumulative of all the error messages on the PCI hierarchy originating
> +at that root port.
> +
> + * rootport_total_cor_errs: number of ERR_COR messages reported to rootport.
> + * rootport_total_fatal_errs: number of ERR_FATAL messages reported to rootport.
> + * rootport_total_nonfatal_errs: number of ERR_NONFATAL messages reporeted to
> + rootport.
These all belong in Documentation/ABI/ please.
thanks,
greg k-h
On Tue, May 22, 2018 at 03:28:02PM -0700, Rajat Jain wrote:
> +#define aer_stats_aggregate_attr(field) \
> + static ssize_t \
> + field##_show(struct device *dev, struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + return sprintf(buf, "0x%llx\n", pdev->aer_stats->field); \
> +} \
Use tabs at the end please, otherwise your trailing \ look horrid.
> +static DEVICE_ATTR_RO(field)
> +
> +aer_stats_aggregate_attr(dev_total_cor_errs);
> +aer_stats_aggregate_attr(dev_total_fatal_errs);
> +aer_stats_aggregate_attr(dev_total_nonfatal_errs);
> +
> +static struct attribute *aer_stats_attrs[] __ro_after_init = {
> + &dev_attr_dev_total_cor_errs.attr,
> + &dev_attr_dev_total_fatal_errs.attr,
> + &dev_attr_dev_total_nonfatal_errs.attr,
> + NULL
> +};
> +
> +static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
> + struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + if (!pdev->aer_stats)
> + return 0;
> +
> + return a->mode;
> +}
> +
> +const struct attribute_group aer_stats_attr_group = {
> + .name = "aer_stats",
> + .attrs = aer_stats_attrs,
> + .is_visible = aer_stats_attrs_are_visible,
> +};
> +
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> + int status, i, max = -1;
> + u64 *counter = NULL;
> + struct aer_stats *aer_stats = pdev->aer_stats;
> +
> + if (unlikely(!aer_stats))
> + return;
Can you measure the speed difference with and without that unlikely()
macro? If not, please don't use it. Hint, the cpu and compiler are
always always better at this than we are...
thanks,
greg k-h
On Tue, May 22, 2018 at 03:28:03PM -0700, Rajat Jain wrote:
> Add sysfs attributes to provide breakdown of the AERs seen,
> into different type of correctable or uncorrectable errors:
>
> dev_breakdown_correctable
> dev_breakdown_uncorrectable
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> drivers/pci/pcie/aer/aerdrv.h | 6 ++++++
> drivers/pci/pcie/aer/aerdrv_errprint.c | 6 ++++--
> drivers/pci/pcie/aer/aerdrv_stats.c | 25 +++++++++++++++++++++++++
> 3 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index b5d5ad6f2c03..048fbd7c9633 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -89,6 +89,12 @@ int pci_aer_stats_init(struct pci_dev *pdev);
> void pci_aer_stats_exit(struct pci_dev *pdev);
> void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
>
> +extern const char
> +*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
> +
> +extern const char
> +*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
> +
> #ifdef CONFIG_ACPI_APEI
> int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> #else
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 5e8b98deda08..5585f309f1a8 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -68,7 +68,8 @@ static const char *aer_error_layer[] = {
> "Transaction Layer"
> };
>
> -static const char *aer_correctable_error_string[] = {
> +const char
> +*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS] = {
> "Receiver Error", /* Bit Position 0 */
> NULL,
> NULL,
> @@ -87,7 +88,8 @@ static const char *aer_correctable_error_string[] = {
> "Header Log Overflow", /* Bit Position 15 */
> };
>
> -static const char *aer_uncorrectable_error_string[] = {
> +const char
> +*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS] = {
> "Undefined", /* Bit Position 0 */
> NULL,
> NULL,
> diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
> index 87b7119d0a86..5f0a6e144f56 100644
> --- a/drivers/pci/pcie/aer/aerdrv_stats.c
> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
> @@ -61,10 +61,35 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
> aer_stats_aggregate_attr(dev_total_fatal_errs);
> aer_stats_aggregate_attr(dev_total_nonfatal_errs);
>
> +#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
> + static ssize_t \
> + field##_show(struct device *dev, struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + unsigned int i; \
> + char *str = buf; \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + u64 *stats = pdev->aer_stats->stats_array; \
> + for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
> + if (strings_array[i]) \
> + str += sprintf(str, "%s = 0x%llx\n", \
> + strings_array[i], stats[i]); \
> + } \
> + return str-buf; \
> +} \
Again with the tabs instead of spaces please.
thanks,
greg k-h
On Tue, May 22, 2018 at 03:28:01PM -0700, Rajat Jain wrote:
> --- /dev/null
> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018 Google Inc, All Rights Reserved.
> + * Rajat Jain ([email protected])
Google has the copyright, not you, right? You might want to make that a
bit more explicit by putting a blank line somewhere here...
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
If you have a SPDX line, you do not need this paragraph. Please drop it
so we don't have to delete it later on.
thanks,
greg k-h
On 05/22/2018 06:28 PM, Rajat Jain wrote:
> Define a structure to hold the AER statistics. There are 2 groups
> of statistics: dev_* counters that are to be collected for all AER
> capable devices and rootport_* counters that are collected for all
> (AER capable) rootports only. Allocate and free this structure when
> device is added or released (thus counters survive the lifetime of the
> device).
>
> Add a new file aerdrv_stats.c to hold the AER stats collection logic.
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> drivers/pci/pcie/aer/Makefile | 2 +-
> drivers/pci/pcie/aer/aerdrv.h | 6 +++
> drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++
> drivers/pci/pcie/aer/aerdrv_stats.c | 64 +++++++++++++++++++++++++++++
> drivers/pci/probe.c | 1 +
> include/linux/pci.h | 3 ++
> 6 files changed, 84 insertions(+), 1 deletion(-)
> create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
>
> diff --git a/drivers/pci/pcie/aer/Makefile b/drivers/pci/pcie/aer/Makefile
>
> -aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o
> +aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o aerdrv_stats.o
> aerdriver-$(CONFIG_ACPI) += aerdrv_acpi.o
>
> obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index b4c950683cc7..d8b9fba536ed 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -33,6 +33,10 @@
> PCI_ERR_UNC_MALF_TLP)
>
> #define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
> +
> +#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
> +#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
> +
> struct aer_err_info {
> struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
> int error_dev_num;
> @@ -81,6 +85,8 @@ void aer_isr(struct work_struct *work);
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
> void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
> irqreturn_t aer_irq(int irq, void *context);
> +int pci_aer_stats_init(struct pci_dev *pdev);
> +void pci_aer_stats_exit(struct pci_dev *pdev);
>
> #ifdef CONFIG_ACPI_APEI
> int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 36e622d35c48..42a6f913069a 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -95,9 +95,18 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
> int pci_aer_init(struct pci_dev *dev)
> {
> dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
> +
> + if (!dev->aer_cap || pci_aer_stats_init(dev))
> + return -EIO;
> +
> return pci_cleanup_aer_error_status_regs(dev);
> }
>
> +void pci_aer_exit(struct pci_dev *dev)
> +{
> + pci_aer_stats_exit(dev);
> +}
> +
> /**
> * add_error_device - list device to be handled
> * @e_info: pointer to error info
> diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
> new file mode 100644
> index 000000000000..b9f251992209
> --- /dev/null
> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0
Fix the formatting please - that gross // gibberish doesn't belong there.
Jes
On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
> > +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
> > @@ -0,0 +1,64 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> Fix the formatting please - that gross // gibberish doesn't belong there.
Sorry, Jes. The Chief Penguin has Spoken, and that's the preferred
syntax:
2. Style:
The SPDX license identifier is added in form of a comment. The comment
style depends on the file type::
C source: // SPDX-License-Identifier: <SPDX License Expression>
(you can dig up the discussion around this on the mailing list if you
like. Linus actually thinks that C++ single-line comments are one of
the few things that language got right)
On 05/23/2018 09:20 AM, Jes Sorensen wrote:
> On 05/22/2018 06:28 PM, Rajat Jain wrote:
>> new file mode 100644
>> index 000000000000..b9f251992209
>> --- /dev/null
>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>> @@ -0,0 +1,64 @@
>> +// SPDX-License-Identifier: GPL-2.0
>
> Fix the formatting please - that gross // gibberish doesn't belong there.
Deep breath in. Deep breath out.
git grep SPDX
Although I don't like it, this format is already too common.
Cheers,
Alex
On 05/23/2018 10:26 AM, Alex G. wrote:
> On 05/23/2018 09:20 AM, Jes Sorensen wrote:
>> On 05/22/2018 06:28 PM, Rajat Jain wrote:
>>> new file mode 100644
>>> index 000000000000..b9f251992209
>>> --- /dev/null
>>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>>> @@ -0,0 +1,64 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>
>> Fix the formatting please - that gross // gibberish doesn't belong there.
>
> Deep breath in. Deep breath out.
>
> git grep SPDX
>
> Although I don't like it, this format is already too common.
So? Just because some people did something wrong doesn't mean you should
continue to do it.
Jes
On 05/23/2018 09:32 AM, Jes Sorensen wrote:
> On 05/23/2018 10:26 AM, Matthew Wilcox wrote:
>> On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
>>>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>>>> @@ -0,0 +1,64 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>
>>> Fix the formatting please - that gross // gibberish doesn't belong there.
>>
>> Sorry, Jes. The Chief Penguin has Spoken, and that's the preferred
>> syntax:
>>
>> 2. Style:
>>
>> The SPDX license identifier is added in form of a comment. The comment
>> style depends on the file type::
>>
>> C source: // SPDX-License-Identifier: <SPDX License Expression>
>>
>> (you can dig up the discussion around this on the mailing list if you
>> like. Linus actually thinks that C++ single-line comments are one of
>> the few things that language got right)
>
> Well I'll agree to disagree with Linus on this one. It's ugly as fsck
> and allows for ambiguous statements in the code.
You misspelled "fuck".
Alex
On 05/23/2018 10:26 AM, Matthew Wilcox wrote:
> On Wed, May 23, 2018 at 10:20:10AM -0400, Jes Sorensen wrote:
>>> +++ b/drivers/pci/pcie/aer/aerdrv_stats.c
>>> @@ -0,0 +1,64 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>
>> Fix the formatting please - that gross // gibberish doesn't belong there.
>
> Sorry, Jes. The Chief Penguin has Spoken, and that's the preferred
> syntax:
>
> 2. Style:
>
> The SPDX license identifier is added in form of a comment. The comment
> style depends on the file type::
>
> C source: // SPDX-License-Identifier: <SPDX License Expression>
>
> (you can dig up the discussion around this on the mailing list if you
> like. Linus actually thinks that C++ single-line comments are one of
> the few things that language got right)
Well I'll agree to disagree with Linus on this one. It's ugly as fsck
and allows for ambiguous statements in the code.
Jes
On Wed, 23 May 2018 09:33:30 -0500
"Alex G." <[email protected]> wrote:
> > Well I'll agree to disagree with Linus on this one. It's ugly as fsck
> > and allows for ambiguous statements in the code.
>
> You misspelled "fuck".
No, Jes is Danish. That's how they spell it.
-- Steve
Add the PCI AER statistics details to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
and provide a pointer to it in
Documentation/PCI/pcieaer-howto.txt
Signed-off-by: Rajat Jain <[email protected]>
---
v2: Move the documentation to Documentation/ABI/
.../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
Documentation/PCI/pcieaer-howto.txt | 5 +
2 files changed, 108 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
new file mode 100644
index 000000000000..f55c389290ac
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -0,0 +1,103 @@
+==========================
+PCIe Device AER statistics
+==========================
+These attributes show up under all the devices that are AER capable. These
+statistical counters indicate the errors "as seen/reported by the device".
+Note that this may mean that if an end point is causing problems, the AER
+counters may increment at its link partner (e.g. root port) because the
+errors will be "seen" / reported by the link partner and not the the
+problematic end point itself (which may report all counters as 0 as it never
+saw any problems).
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of correctable errors seen and reported by this
+ PCI device using ERR_COR.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable fatal errors seen and reported
+ by this PCI device using ERR_FATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable non-fatal errors seen and reported
+ by this PCI device using ERR_NONFATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of of correctable errors seen and reported by this
+ PCI device using ERR_COR. A sample result looks like this:
+-----------------------------------------
+Receiver Error = 0x174
+Bad TLP = 0x19
+Bad DLLP = 0x3
+RELAY_NUM Rollover = 0x0
+Replay Timer Timeout = 0x1
+Advisory Non-Fatal = 0x0
+Corrected Internal Error = 0x0
+Header Log Overflow = 0x0
+-----------------------------------------
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of of correctable errors seen and reported by this
+ PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
+ looks like this:
+-----------------------------------------
+Undefined = 0x0
+Data Link Protocol = 0x0
+Surprise Down Error = 0x0
+Poisoned TLP = 0x0
+Flow Control Protocol = 0x0
+Completion Timeout = 0x0
+Completer Abort = 0x0
+Unexpected Completion = 0x0
+Receiver Overflow = 0x0
+Malformed TLP = 0x0
+ECRC = 0x0
+Unsupported Request = 0x0
+ACS Violation = 0x0
+Uncorrectable Internal Error = 0x0
+MC Blocked TLP = 0x0
+AtomicOp Egress Blocked = 0x0
+TLP Prefix Blocked Error = 0x0
+-----------------------------------------
+
+============================
+PCIe Rootport AER statistics
+============================
+These attributes showup under only the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters includes them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_COR messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_FATAL messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_NONFATAL messages reported to rootport.
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd0dddd6bb8..91b6e677cb8c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for
other fields.
+2.4 AER Statistics / Counters
+
+When PCIe AER errors are captured, the counters / statistics are also exposed
+in form of sysfs attributes which are documented at
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
3. Developer Guide
--
2.17.0.441.gb46fe60e1d-goog
Add sysfs attributes to provide breakdown of the AERs seen,
into different type of correctable or uncorrectable errors:
dev_breakdown_correctable
dev_breakdown_uncorrectable
Signed-off-by: Rajat Jain <[email protected]>
---
v2: Use tabs instead of spaces, fix the subject, and print
all non zero counters.
drivers/pci/pcie/aer/aerdrv.h | 6 ++++++
drivers/pci/pcie/aer/aerdrv_errprint.c | 6 ++++--
drivers/pci/pcie/aer/aerdrv_stats.c | 28 ++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index b5d5ad6f2c03..048fbd7c9633 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -89,6 +89,12 @@ int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
+extern const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+
+extern const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
#else
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 5e8b98deda08..5585f309f1a8 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -68,7 +68,8 @@ static const char *aer_error_layer[] = {
"Transaction Layer"
};
-static const char *aer_correctable_error_string[] = {
+const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS] = {
"Receiver Error", /* Bit Position 0 */
NULL,
NULL,
@@ -87,7 +88,8 @@ static const char *aer_correctable_error_string[] = {
"Header Log Overflow", /* Bit Position 15 */
};
-static const char *aer_uncorrectable_error_string[] = {
+const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS] = {
"Undefined", /* Bit Position 0 */
NULL,
NULL,
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index 5555beffef2b..e47321b267f6 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -58,10 +58,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ unsigned int i; \
+ char *str = buf; \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ u64 *stats = pdev->aer_stats->stats_array; \
+ for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
+ if (strings_array[i]) \
+ str += sprintf(str, "%s = 0x%llx\n", \
+ strings_array[i], stats[i]); \
+ else if (stats[i]) \
+ str += sprintf(str, #stats_array "bit[%d] = 0x%llx\n",\
+ i, stats[i]); \
+ } \
+ return str-buf; \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
+ aer_correctable_error_string);
+aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
+ aer_uncorrectable_error_string);
+
static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_cor_errs.attr,
&dev_attr_dev_total_fatal_errs.attr,
&dev_attr_dev_total_nonfatal_errs.attr,
+ &dev_attr_dev_breakdown_correctable.attr,
+ &dev_attr_dev_breakdown_uncorrectable.attr,
NULL
};
--
2.17.0.441.gb46fe60e1d-goog
This patchset exposes the AER stats via the sysfs attributes.
Patchset v2 has minor changes to v1 based on the review comments,
no functional change.
Primarily:
* Fix license header
* Use tabs instead of spaces
* Remove use on unlikely() etc
* Move documentation to Documentation/ABI/
Rajat Jain (5):
PCI/AER: Define and allocate aer_stats structure for AER capable
devices
PCI/AER: Add sysfs stats for AER capable devices
PCI/AER: Add sysfs attributes to provide breakdown of AERs
PCI/AER: Add sysfs attributes for rootport cumulative stats
Documentation/ABI: Add details of PCI AER statistics
.../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++
Documentation/PCI/pcieaer-howto.txt | 5 +
drivers/pci/pci-sysfs.c | 3 +
drivers/pci/pci.h | 4 +-
drivers/pci/pcie/aer/Makefile | 2 +-
drivers/pci/pcie/aer/aerdrv.h | 15 ++
drivers/pci/pcie/aer/aerdrv_core.c | 11 +
drivers/pci/pcie/aer/aerdrv_errprint.c | 7 +-
drivers/pci/pcie/aer/aerdrv_stats.c | 192 ++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 +
11 files changed, 342 insertions(+), 4 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
--
2.17.0.441.gb46fe60e1d-goog
Add the following AER sysfs stats to represent the counters for each
kind of error as seen by the device:
dev_total_cor_errs
dev_total_fatal_errs
dev_total_nonfatal_errs
Signed-off-by: Rajat Jain <[email protected]>
---
v2: Use tabs instead of spaces at the end of macro lines, and remove
the use of unlikely() as per Greg's suggestion.
drivers/pci/pci-sysfs.c | 3 ++
drivers/pci/pci.h | 4 +-
drivers/pci/pcie/aer/aerdrv.h | 1 +
drivers/pci/pcie/aer/aerdrv_errprint.c | 1 +
drivers/pci/pcie/aer/aerdrv_stats.c | 72 ++++++++++++++++++++++++++
5 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 366d93af051d..730f985a3dc9 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1743,6 +1743,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
#endif
&pci_bridge_attr_group,
&pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+ &aer_stats_attr_group,
+#endif
NULL,
};
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c358e7a07f3f..9a28ec600225 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,7 +181,9 @@ extern const struct attribute_group *pci_dev_groups[];
extern const struct attribute_group *pcibus_groups[];
extern const struct device_type pci_dev_type;
extern const struct attribute_group *pci_bus_groups[];
-
+#ifdef CONFIG_PCIEAER
+extern const struct attribute_group aer_stats_attr_group;
+#endif
/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index d8b9fba536ed..b5d5ad6f2c03 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -87,6 +87,7 @@ void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
irqreturn_t aer_irq(int irq, void *context);
int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 21ca5e1b0ded..5e8b98deda08 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -155,6 +155,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_err(dev, " [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
}
+ pci_dev_aer_stats_incr(dev, info);
}
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index 2f48d6bc81f1..5555beffef2b 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -44,6 +44,78 @@ struct aer_stats {
u64 rootport_total_nonfatal_errs;
};
+#define aer_stats_aggregate_attr(field) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "0x%llx\n", pdev->aer_stats->field); \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_aggregate_attr(dev_total_cor_errs);
+aer_stats_aggregate_attr(dev_total_fatal_errs);
+aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+
+static struct attribute *aer_stats_attrs[] __ro_after_init = {
+ &dev_attr_dev_total_cor_errs.attr,
+ &dev_attr_dev_total_fatal_errs.attr,
+ &dev_attr_dev_total_nonfatal_errs.attr,
+ NULL
+};
+
+static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_stats)
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group aer_stats_attr_group = {
+ .name = "aer_stats",
+ .attrs = aer_stats_attrs,
+ .is_visible = aer_stats_attrs_are_visible,
+};
+
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ int status, i, max = -1;
+ u64 *counter = NULL;
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ switch (info->severity) {
+ case AER_CORRECTABLE:
+ aer_stats->dev_total_cor_errs++;
+ counter = &aer_stats->dev_cor_errs[0];
+ max = AER_MAX_TYPEOF_CORRECTABLE_ERRS;
+ break;
+ case AER_NONFATAL:
+ aer_stats->dev_total_nonfatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ case AER_FATAL:
+ aer_stats->dev_total_fatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ }
+
+ status = (info->status & ~info->mask);
+ for (i = 0; i < max; i++)
+ if (status & (1 << i))
+ counter[i]++;
+}
+
int pci_aer_stats_init(struct pci_dev *pdev)
{
pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
--
2.17.0.441.gb46fe60e1d-goog
Define a structure to hold the AER statistics. There are 2 groups
of statistics: dev_* counters that are to be collected for all AER
capable devices and rootport_* counters that are collected for all
(AER capable) rootports only. Allocate and free this structure when
device is added or released (thus counters survive the lifetime of the
device).
Add a new file aerdrv_stats.c to hold the AER stats collection logic.
Signed-off-by: Rajat Jain <[email protected]>
---
v2: Fix the license header as per Greg's suggestions
(Since there is disagreement with using "//" vs "/* */" for license
I decided to keep the one preferred by Linus, also used by others
in this directory)
drivers/pci/pcie/aer/Makefile | 2 +-
drivers/pci/pcie/aer/aerdrv.h | 6 +++
drivers/pci/pcie/aer/aerdrv_core.c | 9 +++++
drivers/pci/pcie/aer/aerdrv_stats.c | 61 +++++++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 ++
6 files changed, 81 insertions(+), 1 deletion(-)
create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
diff --git a/drivers/pci/pcie/aer/Makefile b/drivers/pci/pcie/aer/Makefile
index 09bd890875a3..a06f9cc2bde5 100644
--- a/drivers/pci/pcie/aer/Makefile
+++ b/drivers/pci/pcie/aer/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_PCIEAER) += aerdriver.o
obj-$(CONFIG_PCIE_ECRC) += ecrc.o
-aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o
+aerdriver-objs := aerdrv_errprint.o aerdrv_core.o aerdrv.o aerdrv_stats.o
aerdriver-$(CONFIG_ACPI) += aerdrv_acpi.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index b4c950683cc7..d8b9fba536ed 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -33,6 +33,10 @@
PCI_ERR_UNC_MALF_TLP)
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+
+#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
+#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
+
struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
@@ -81,6 +85,8 @@ void aer_isr(struct work_struct *work);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
irqreturn_t aer_irq(int irq, void *context);
+int pci_aer_stats_init(struct pci_dev *pdev);
+void pci_aer_stats_exit(struct pci_dev *pdev);
#ifdef CONFIG_ACPI_APEI
int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 36e622d35c48..42a6f913069a 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -95,9 +95,18 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
int pci_aer_init(struct pci_dev *dev)
{
dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+
+ if (!dev->aer_cap || pci_aer_stats_init(dev))
+ return -EIO;
+
return pci_cleanup_aer_error_status_regs(dev);
}
+void pci_aer_exit(struct pci_dev *dev)
+{
+ pci_aer_stats_exit(dev);
+}
+
/**
* add_error_device - list device to be handled
* @e_info: pointer to error info
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
new file mode 100644
index 000000000000..2f48d6bc81f1
--- /dev/null
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 Google Inc, All Rights Reserved.
+ *
+ * Rajat Jain ([email protected])
+ *
+ * AER Statistics - exposed to userspace via /sysfs attributes.
+ */
+
+#include <linux/pci.h>
+#include "aerdrv.h"
+
+/* AER stats for the device */
+struct aer_stats {
+
+ /*
+ * Fields for all AER capable devices. They indicate the errors
+ * "as seen by this device". Note that this may mean that if an
+ * end point is causing problems, the AER counters may increment
+ * at its link partner (e.g. root port) because the errors will be
+ * "seen" by the link partner and not the the problematic end point
+ * itself (which may report all counters as 0 as it never saw any
+ * problems).
+ */
+ /* Individual counters for different type of correctable errors */
+ u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+ /* Individual counters for different type of uncorrectable errors */
+ u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+ /* Total number of correctable errors seen by this device */
+ u64 dev_total_cor_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_fatal_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_nonfatal_errs;
+
+ /*
+ * Fields for Root ports only, these indicate the total number of
+ * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
+ * rootport, INCLUDING the ones that are generated internally (by
+ * the rootport itself)
+ */
+ u64 rootport_total_cor_errs;
+ u64 rootport_total_fatal_errs;
+ u64 rootport_total_nonfatal_errs;
+};
+
+int pci_aer_stats_init(struct pci_dev *pdev)
+{
+ pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ if (!pdev->aer_stats) {
+ dev_err(&pdev->dev, "No memory for aer_stats\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void pci_aer_stats_exit(struct pci_dev *pdev)
+{
+ kfree(pdev->aer_stats);
+ pdev->aer_stats = NULL;
+}
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 384020757b81..dd662c241373 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
+ pci_aer_exit(dev);
pci_vpd_release(dev);
pci_iov_release(dev);
pci_free_cap_save_buffers(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 21965e0dbe62..5c84b1304de7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -299,6 +299,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
+ struct aer_stats *aer_stats; /* AER stats for this device */
#endif
u8 pcie_cap; /* PCIe capability offset */
u8 msi_cap; /* MSI capability offset */
@@ -1470,10 +1471,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
void pci_no_aer(void);
bool pci_aer_available(void);
int pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
#else
static inline void pci_no_aer(void) { }
static inline bool pci_aer_available(void) { return false; }
static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
#endif
#ifdef CONFIG_PCIE_ECRC
--
2.17.0.441.gb46fe60e1d-goog
Add sysfs attributes for rootport statistics (that are cumulative
of all the ERR_* messages seen on this PCI hierarchy).
Signed-off-by: Rajat Jain <[email protected]>
---
v2: same as v1
drivers/pci/pcie/aer/aerdrv.h | 2 ++
drivers/pci/pcie/aer/aerdrv_core.c | 2 ++
drivers/pci/pcie/aer/aerdrv_stats.c | 31 +++++++++++++++++++++++++++++
3 files changed, 35 insertions(+)
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 048fbd7c9633..77d8355551d9 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -88,6 +88,8 @@ irqreturn_t aer_irq(int irq, void *context);
int pci_aer_stats_init(struct pci_dev *pdev);
void pci_aer_stats_exit(struct pci_dev *pdev);
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
+void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src);
extern const char
*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 42a6f913069a..0f70e22563f3 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -424,6 +424,8 @@ static void aer_isr_one_error(struct pcie_device *p_device,
struct aer_rpc *rpc = get_service_data(p_device);
struct aer_err_info *e_info = &rpc->e_info;
+ pci_rootport_aer_stats_incr(p_device->port, e_src);
+
/*
* There is a possibility that both correctable error and
* uncorrectable error being logged. Report correctable error first.
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index e47321b267f6..898c9bc02ec2 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -57,6 +57,9 @@ static DEVICE_ATTR_RO(field)
aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+aer_stats_aggregate_attr(rootport_total_cor_errs);
+aer_stats_aggregate_attr(rootport_total_fatal_errs);
+aer_stats_aggregate_attr(rootport_total_nonfatal_errs);
#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
static ssize_t \
@@ -90,6 +93,9 @@ static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_nonfatal_errs.attr,
&dev_attr_dev_breakdown_correctable.attr,
&dev_attr_dev_breakdown_uncorrectable.attr,
+ &dev_attr_rootport_total_cor_errs.attr,
+ &dev_attr_rootport_total_fatal_errs.attr,
+ &dev_attr_rootport_total_nonfatal_errs.attr,
NULL
};
@@ -102,6 +108,12 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
if (!pdev->aer_stats)
return 0;
+ if ((a == &dev_attr_rootport_total_cor_errs.attr ||
+ a == &dev_attr_rootport_total_fatal_errs.attr ||
+ a == &dev_attr_rootport_total_nonfatal_errs.attr) &&
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT)
+ return 0;
+
return a->mode;
}
@@ -144,6 +156,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
counter[i]++;
}
+void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src)
+{
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ if (e_src->status & PCI_ERR_ROOT_COR_RCV)
+ aer_stats->rootport_total_cor_errs++;
+
+ if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
+ if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
+ aer_stats->rootport_total_fatal_errs++;
+ else
+ aer_stats->rootport_total_nonfatal_errs++;
+ }
+}
+
int pci_aer_stats_init(struct pci_dev *pdev)
{
pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
--
2.17.0.441.gb46fe60e1d-goog
On Wed, May 23, 2018 at 10:58:04AM -0700, Rajat Jain wrote:
> ---
> v2: Fix the license header as per Greg's suggestions
> (Since there is disagreement with using "//" vs "/* */" for license
> I decided to keep the one preferred by Linus, also used by others
> in this directory)
The rules are pretty simple for how to do this, and they are documented
in Documentation/process/license-rules.rst, please just follow that like
the rest of the kernel has done.
thanks,
greg k-h
On 2018-05-23 23:28, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> and provide a pointer to it in
> Documentation/PCI/pcieaer-howto.txt
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> v2: Move the documentation to Documentation/ABI/
>
> .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
> Documentation/PCI/pcieaer-howto.txt | 5 +
> 2 files changed, 108 insertions(+)
> create mode 100644
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> new file mode 100644
> index 000000000000..f55c389290ac
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> @@ -0,0 +1,103 @@
> +==========================
> +PCIe Device AER statistics
> +==========================
> +These attributes show up under all the devices that are AER capable.
> These
> +statistical counters indicate the errors "as seen/reported by the
> device".
> +Note that this may mean that if an end point is causing problems, the
> AER
> +counters may increment at its link partner (e.g. root port) because
> the
> +errors will be "seen" / reported by the link partner and not the the
> +problematic end point itself (which may report all counters as 0 as it
> never
> +saw any problems).
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of correctable errors seen and reported by
> this
> + PCI device using ERR_COR.
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of uncorrectable fatal errors seen and
> reported
> + by this PCI device using ERR_FATAL.
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of uncorrectable non-fatal errors seen and
> reported
> + by this PCI device using ERR_NONFATAL.
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Breakdown of of correctable errors seen and reported by
> this
> + PCI device using ERR_COR. A sample result looks like this:
> +-----------------------------------------
> +Receiver Error = 0x174
> +Bad TLP = 0x19
> +Bad DLLP = 0x3
> +RELAY_NUM Rollover = 0x0
> +Replay Timer Timeout = 0x1
> +Advisory Non-Fatal = 0x0
> +Corrected Internal Error = 0x0
> +Header Log Overflow = 0x0
> +-----------------------------------------
why hex display ? decimal is easy to read as these are counters.
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Breakdown of of correctable errors seen and reported by
> this
> + PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
> + looks like this:
> +-----------------------------------------
> +Undefined = 0x0
> +Data Link Protocol = 0x0
> +Surprise Down Error = 0x0
> +Poisoned TLP = 0x0
> +Flow Control Protocol = 0x0
> +Completion Timeout = 0x0
> +Completer Abort = 0x0
> +Unexpected Completion = 0x0
> +Receiver Overflow = 0x0
> +Malformed TLP = 0x0
> +ECRC = 0x0
> +Unsupported Request = 0x0
> +ACS Violation = 0x0
> +Uncorrectable Internal Error = 0x0
> +MC Blocked TLP = 0x0
> +AtomicOp Egress Blocked = 0x0
> +TLP Prefix Blocked Error = 0x0
> +-----------------------------------------
> +
> +============================
> +PCIe Rootport AER statistics
> +============================
> +These attributes showup under only the rootports that are AER capable.
> These
> +indicate the number of error messages as "reported to" the rootport.
> Please note
> +that the rootports also transmit (internally) the ERR_* messages for
> errors seen
> +by the internal rootport PCI device, so these counters includes them
> and are
> +thus cumulative of all the error messages on the PCI hierarchy
> originating
> +at that root port.
what about switches and bridges ?
Also Can you give some idea as e.g what is the difference between
dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
are same pci_dev.
rootport_total_fatal_errs gives me an idea that how many times things
have been failed under this pci_dev ?
which means num of downstream link problems. but I am still trying to
make sense as how it could be used,
since we dont have BDF information associated with the number of errors
anywhere (except these AER print messages)
and dev_total_fatal_errs as you mentioned above that problematic EP,
then say root-port will report it and increment
dev_total_fatal_errs ++
does it also increment root-port_total_fatal_errs ++ in above scenario ?
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of ERR_COR messages reported to rootport.
> +
> +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of ERR_FATAL messages reported to rootport.
> +
> +Where:
> /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
> +Date: May 2018
> +Kernel Version: 4.17.0
> +Contact: [email protected], [email protected]
> +Description: Total number of ERR_NONFATAL messages reported to
> rootport.
> diff --git a/Documentation/PCI/pcieaer-howto.txt
> b/Documentation/PCI/pcieaer-howto.txt
> index acd0dddd6bb8..91b6e677cb8c 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
> device who sends
> the error message to root port. Pls. refer to pci express specs for
> other fields.
>
> +2.4 AER Statistics / Counters
> +
> +When PCIe AER errors are captured, the counters / statistics are also
> exposed
> +in form of sysfs attributes which are documented at
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>
> 3. Developer Guide
Hello,
On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
>
> On 2018-05-23 23:28, Rajat Jain wrote:
> > Add the PCI AER statistics details to
> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > and provide a pointer to it in
> > Documentation/PCI/pcieaer-howto.txt
> >
> > Signed-off-by: Rajat Jain <[email protected]>
> > ---
> > v2: Move the documentation to Documentation/ABI/
> >
> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
> > Documentation/PCI/pcieaer-howto.txt | 5 +
> > 2 files changed, 108 insertions(+)
> > create mode 100644
> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > new file mode 100644
> > index 000000000000..f55c389290ac
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > @@ -0,0 +1,103 @@
> > +==========================
> > +PCIe Device AER statistics
> > +==========================
> > +These attributes show up under all the devices that are AER capable.
> > These
> > +statistical counters indicate the errors "as seen/reported by the
> > device".
> > +Note that this may mean that if an end point is causing problems, the
> > AER
> > +counters may increment at its link partner (e.g. root port) because
> > the
> > +errors will be "seen" / reported by the link partner and not the the
> > +problematic end point itself (which may report all counters as 0 as it
> > never
> > +saw any problems).
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of correctable errors seen and reported by
> > this
> > + PCI device using ERR_COR.
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of uncorrectable fatal errors seen and
> > reported
> > + by this PCI device using ERR_FATAL.
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of uncorrectable non-fatal errors seen and
> > reported
> > + by this PCI device using ERR_NONFATAL.
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Breakdown of of correctable errors seen and reported by
> > this
> > + PCI device using ERR_COR. A sample result looks like this:
> > +-----------------------------------------
> > +Receiver Error = 0x174
> > +Bad TLP = 0x19
> > +Bad DLLP = 0x3
> > +RELAY_NUM Rollover = 0x0
> > +Replay Timer Timeout = 0x1
> > +Advisory Non-Fatal = 0x0
> > +Corrected Internal Error = 0x0
> > +Header Log Overflow = 0x0
> > +-----------------------------------------
> why hex display ? decimal is easy to read as these are counters.
Have no particular preference. Since these can be potentially large
numbers, just had a random thought that hex might make it more
concise. I can change to decimal if that is preferable.
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Breakdown of of correctable errors seen and reported by
> > this
> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
> > + looks like this:
> > +-----------------------------------------
> > +Undefined = 0x0
> > +Data Link Protocol = 0x0
> > +Surprise Down Error = 0x0
> > +Poisoned TLP = 0x0
> > +Flow Control Protocol = 0x0
> > +Completion Timeout = 0x0
> > +Completer Abort = 0x0
> > +Unexpected Completion = 0x0
> > +Receiver Overflow = 0x0
> > +Malformed TLP = 0x0
> > +ECRC = 0x0
> > +Unsupported Request = 0x0
> > +ACS Violation = 0x0
> > +Uncorrectable Internal Error = 0x0
> > +MC Blocked TLP = 0x0
> > +AtomicOp Egress Blocked = 0x0
> > +TLP Prefix Blocked Error = 0x0
> > +-----------------------------------------
> > +
> > +============================
> > +PCIe Rootport AER statistics
> > +============================
> > +These attributes showup under only the rootports that are AER capable.
> > These
> > +indicate the number of error messages as "reported to" the rootport.
> > Please note
> > +that the rootports also transmit (internally) the ERR_* messages for
> > errors seen
> > +by the internal rootport PCI device, so these counters includes them
> > and are
> > +thus cumulative of all the error messages on the PCI hierarchy
> > originating
> > +at that root port.
>
> what about switches and bridges ?
What about them? AIUI, the switches forward the ERR_ messages from
downstream devices to the rootport, like they do with standard
messages. They can potentially generate their own ERR_ message and
that would be reported no different than other end point devices.
> Also Can you give some idea as e.g what is the difference between
> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
> are same pci_dev.
For a pci_dev representing the rootport:
dev_total_fatal_errors = how many times this PCI device *experienced*
a fatal problem on its own (i.e. either link issues while talking to
its link partner, or some internal errors).
rootport_total_fatal_errors = how many times this rootport was
*informed* about a problem (via ERR_* messages) in the PCI hierarchy
that originates at it (can be any link further downstream). This
includes the dev_total_fatal_errors also, because any errors detected
by the rootport are also "informed" to itself via ERR_* messages. In
reality, this is just the total number of ERR_FATAL messages received
at the rootport. This sysfs attribute will only exist for root ports.
>
> rootport_total_fatal_errs gives me an idea that how many times things
> have been failed under this pci_dev ?
Yes, as above.
> which means num of downstream link problems. but I am still trying to
> make sense as how it could be used,
> since we dont have BDF information associated with the number of errors
> anywhere (except these AER print messages)
>
Agree. That is a limitation. The challenges being more record keeping,
more complicated sysfs representation, and given that PCI devices may
come and go, how do we know it is the same device before we collate
their stats etc.
>
> and dev_total_fatal_errs as you mentioned above that problematic EP,
> then say root-port will report it and increment
> dev_total_fatal_errs ++
> does it also increment root-port_total_fatal_errs ++ in above scenario ?
Yes, as above, it will also root-port_total_fatal_errs++ for the root
port of that hierarchy.
Thanks,
Rajat
>
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of ERR_COR messages reported to rootport.
> > +
> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of ERR_FATAL messages reported to rootport.
> > +
> > +Where:
> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
> > +Date: May 2018
> > +Kernel Version: 4.17.0
> > +Contact: [email protected], [email protected]
> > +Description: Total number of ERR_NONFATAL messages reported to
> > rootport.
> > diff --git a/Documentation/PCI/pcieaer-howto.txt
> > b/Documentation/PCI/pcieaer-howto.txt
> > index acd0dddd6bb8..91b6e677cb8c 100644
> > --- a/Documentation/PCI/pcieaer-howto.txt
> > +++ b/Documentation/PCI/pcieaer-howto.txt
> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
> > device who sends
> > the error message to root port. Pls. refer to pci express specs for
> > other fields.
> >
> > +2.4 AER Statistics / Counters
> > +
> > +When PCIe AER errors are captured, the counters / statistics are also
> > exposed
> > +in form of sysfs attributes which are documented at
> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >
> > 3. Developer Guide
Sorry, correction needed in my statement below:
On Mon, Jun 18, 2018 at 5:11 PM, Rajat Jain <[email protected]> wrote:
> Hello,
>
> On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
>>
>> On 2018-05-23 23:28, Rajat Jain wrote:
>> > Add the PCI AER statistics details to
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > and provide a pointer to it in
>> > Documentation/PCI/pcieaer-howto.txt
>> >
>> > Signed-off-by: Rajat Jain <[email protected]>
>> > ---
>> > v2: Move the documentation to Documentation/ABI/
>> >
>> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>> > Documentation/PCI/pcieaer-howto.txt | 5 +
>> > 2 files changed, 108 insertions(+)
>> > create mode 100644
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > new file mode 100644
>> > index 000000000000..f55c389290ac
>> > --- /dev/null
>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > @@ -0,0 +1,103 @@
>> > +==========================
>> > +PCIe Device AER statistics
>> > +==========================
>> > +These attributes show up under all the devices that are AER capable.
>> > These
>> > +statistical counters indicate the errors "as seen/reported by the
>> > device".
>> > +Note that this may mean that if an end point is causing problems, the
>> > AER
>> > +counters may increment at its link partner (e.g. root port) because
>> > the
>> > +errors will be "seen" / reported by the link partner and not the the
>> > +problematic end point itself (which may report all counters as 0 as it
>> > never
>> > +saw any problems).
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_COR.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of uncorrectable fatal errors seen and
>> > reported
>> > + by this PCI device using ERR_FATAL.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of uncorrectable non-fatal errors seen and
>> > reported
>> > + by this PCI device using ERR_NONFATAL.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_COR. A sample result looks like this:
>> > +-----------------------------------------
>> > +Receiver Error = 0x174
>> > +Bad TLP = 0x19
>> > +Bad DLLP = 0x3
>> > +RELAY_NUM Rollover = 0x0
>> > +Replay Timer Timeout = 0x1
>> > +Advisory Non-Fatal = 0x0
>> > +Corrected Internal Error = 0x0
>> > +Header Log Overflow = 0x0
>> > +-----------------------------------------
>> why hex display ? decimal is easy to read as these are counters.
>
> Have no particular preference. Since these can be potentially large
> numbers, just had a random thought that hex might make it more
> concise. I can change to decimal if that is preferable.
>
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
>> > + looks like this:
>> > +-----------------------------------------
>> > +Undefined = 0x0
>> > +Data Link Protocol = 0x0
>> > +Surprise Down Error = 0x0
>> > +Poisoned TLP = 0x0
>> > +Flow Control Protocol = 0x0
>> > +Completion Timeout = 0x0
>> > +Completer Abort = 0x0
>> > +Unexpected Completion = 0x0
>> > +Receiver Overflow = 0x0
>> > +Malformed TLP = 0x0
>> > +ECRC = 0x0
>> > +Unsupported Request = 0x0
>> > +ACS Violation = 0x0
>> > +Uncorrectable Internal Error = 0x0
>> > +MC Blocked TLP = 0x0
>> > +AtomicOp Egress Blocked = 0x0
>> > +TLP Prefix Blocked Error = 0x0
>> > +-----------------------------------------
>> > +
>> > +============================
>> > +PCIe Rootport AER statistics
>> > +============================
>> > +These attributes showup under only the rootports that are AER capable.
>> > These
>> > +indicate the number of error messages as "reported to" the rootport.
>> > Please note
>> > +that the rootports also transmit (internally) the ERR_* messages for
>> > errors seen
>> > +by the internal rootport PCI device, so these counters includes them
>> > and are
>> > +thus cumulative of all the error messages on the PCI hierarchy
>> > originating
>> > +at that root port.
>>
>> what about switches and bridges ?
>
> What about them? AIUI, the switches forward the ERR_ messages from
> downstream devices to the rootport, like they do with standard
> messages. They can potentially generate their own ERR_ message and
> that would be reported no different than other end point devices.
>
>> Also Can you give some idea as e.g what is the difference between
>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
>> are same pci_dev.
>
> For a pci_dev representing the rootport:
>
> dev_total_fatal_errors = how many times this PCI device *experienced*
> a fatal problem on its own (i.e. either link issues while talking to
> its link partner, or some internal errors).
>
> rootport_total_fatal_errors = how many times this rootport was
> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
Read the above sentence as:
" rootport_total_fatal_errors = how many times this rootport was
*informed* about a FATAL problem (via ERR_FATAL messages) in the PCI hierarchy"
> that originates at it (can be any link further downstream). This
> includes the dev_total_fatal_errors also, because any errors detected
> by the rootport are also "informed" to itself via ERR_* messages. In
> reality, this is just the total number of ERR_FATAL messages received
> at the rootport. This sysfs attribute will only exist for root ports.
>
>>
>> rootport_total_fatal_errs gives me an idea that how many times things
>> have been failed under this pci_dev ?
>
> Yes, as above.
>
>> which means num of downstream link problems. but I am still trying to
>> make sense as how it could be used,
>> since we dont have BDF information associated with the number of errors
>> anywhere (except these AER print messages)
>>
>
> Agree. That is a limitation. The challenges being more record keeping,
> more complicated sysfs representation, and given that PCI devices may
> come and go, how do we know it is the same device before we collate
> their stats etc.
>
>>
>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>> then say root-port will report it and increment
>> dev_total_fatal_errs ++
>> does it also increment root-port_total_fatal_errs ++ in above scenario ?
>
> Yes, as above, it will also root-port_total_fatal_errs++ for the root
> port of that hierarchy.
>
> Thanks,
>
> Rajat
>
>>
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_COR messages reported to rootport.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>> > +
>> > +Where:
>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_NONFATAL messages reported to
>> > rootport.
>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>> > b/Documentation/PCI/pcieaer-howto.txt
>> > index acd0dddd6bb8..91b6e677cb8c 100644
>> > --- a/Documentation/PCI/pcieaer-howto.txt
>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>> > device who sends
>> > the error message to root port. Pls. refer to pci express specs for
>> > other fields.
>> >
>> > +2.4 AER Statistics / Counters
>> > +
>> > +When PCIe AER errors are captured, the counters / statistics are also
>> > exposed
>> > +in form of sysfs attributes which are documented at
>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > 3. Developer Guide
On 2018-06-19 05:41, Rajat Jain wrote:
> Hello,
>
> On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
>>
>> On 2018-05-23 23:28, Rajat Jain wrote:
>> > Add the PCI AER statistics details to
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > and provide a pointer to it in
>> > Documentation/PCI/pcieaer-howto.txt
>> >
>> > Signed-off-by: Rajat Jain <[email protected]>
>> > ---
>> > v2: Move the documentation to Documentation/ABI/
>> >
>> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>> > Documentation/PCI/pcieaer-howto.txt | 5 +
>> > 2 files changed, 108 insertions(+)
>> > create mode 100644
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > new file mode 100644
>> > index 000000000000..f55c389290ac
>> > --- /dev/null
>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > @@ -0,0 +1,103 @@
>> > +==========================
>> > +PCIe Device AER statistics
>> > +==========================
>> > +These attributes show up under all the devices that are AER capable.
>> > These
>> > +statistical counters indicate the errors "as seen/reported by the
>> > device".
>> > +Note that this may mean that if an end point is causing problems, the
>> > AER
>> > +counters may increment at its link partner (e.g. root port) because
>> > the
>> > +errors will be "seen" / reported by the link partner and not the the
>> > +problematic end point itself (which may report all counters as 0 as it
>> > never
>> > +saw any problems).
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_COR.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of uncorrectable fatal errors seen and
>> > reported
>> > + by this PCI device using ERR_FATAL.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of uncorrectable non-fatal errors seen and
>> > reported
>> > + by this PCI device using ERR_NONFATAL.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_COR. A sample result looks like this:
>> > +-----------------------------------------
>> > +Receiver Error = 0x174
>> > +Bad TLP = 0x19
>> > +Bad DLLP = 0x3
>> > +RELAY_NUM Rollover = 0x0
>> > +Replay Timer Timeout = 0x1
>> > +Advisory Non-Fatal = 0x0
>> > +Corrected Internal Error = 0x0
>> > +Header Log Overflow = 0x0
>> > +-----------------------------------------
>> why hex display ? decimal is easy to read as these are counters.
>
> Have no particular preference. Since these can be potentially large
> numbers, just had a random thought that hex might make it more
> concise. I can change to decimal if that is preferable.
>
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
>> > + looks like this:
>> > +-----------------------------------------
>> > +Undefined = 0x0
>> > +Data Link Protocol = 0x0
>> > +Surprise Down Error = 0x0
>> > +Poisoned TLP = 0x0
>> > +Flow Control Protocol = 0x0
>> > +Completion Timeout = 0x0
>> > +Completer Abort = 0x0
>> > +Unexpected Completion = 0x0
>> > +Receiver Overflow = 0x0
>> > +Malformed TLP = 0x0
>> > +ECRC = 0x0
>> > +Unsupported Request = 0x0
>> > +ACS Violation = 0x0
>> > +Uncorrectable Internal Error = 0x0
>> > +MC Blocked TLP = 0x0
>> > +AtomicOp Egress Blocked = 0x0
>> > +TLP Prefix Blocked Error = 0x0
>> > +-----------------------------------------
>> > +
>> > +============================
>> > +PCIe Rootport AER statistics
>> > +============================
>> > +These attributes showup under only the rootports that are AER capable.
>> > These
>> > +indicate the number of error messages as "reported to" the rootport.
>> > Please note
>> > +that the rootports also transmit (internally) the ERR_* messages for
>> > errors seen
>> > +by the internal rootport PCI device, so these counters includes them
>> > and are
>> > +thus cumulative of all the error messages on the PCI hierarchy
>> > originating
>> > +at that root port.
>>
>> what about switches and bridges ?
>
> What about them? AIUI, the switches forward the ERR_ messages from
> downstream devices to the rootport, like they do with standard
> messages. They can potentially generate their own ERR_ message and
> that would be reported no different than other end point devices.
yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
contained by switch
and the error handling code thinks that, the error is contained by
switch irrespective of
AER or DPC, and it will think that the problem could be with
Switch/bridge upstream link.
hence the pci_dev of the switch where you should be increment your
counters.
of course ER_FATAL would have traversed till RP, but that doesnt meant
that
you account the error there.
>
>> Also Can you give some idea as e.g what is the difference between
>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that
>> both
>> are same pci_dev.
>
> For a pci_dev representing the rootport:
>
> dev_total_fatal_errors = how many times this PCI device *experienced*
> a fatal problem on its own (i.e. either link issues while talking to
> its link partner, or some internal errors).
>
> rootport_total_fatal_errors = how many times this rootport was
> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
> that originates at it (can be any link further downstream). This
> includes the dev_total_fatal_errors also, because any errors detected
> by the rootport are also "informed" to itself via ERR_* messages. In
> reality, this is just the total number of ERR_FATAL messages received
> at the rootport. This sysfs attribute will only exist for root ports.
>
>>
>> rootport_total_fatal_errs gives me an idea that how many times things
>> have been failed under this pci_dev ?
>
> Yes, as above.
>
>> which means num of downstream link problems. but I am still trying to
>> make sense as how it could be used,
>> since we dont have BDF information associated with the number of
>> errors
>> anywhere (except these AER print messages)
>>
>
> Agree. That is a limitation. The challenges being more record keeping,
> more complicated sysfs representation, and given that PCI devices may
> come and go, how do we know it is the same device before we collate
> their stats etc.
>
>>
>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>> then say root-port will report it and increment
>> dev_total_fatal_errs ++
>> does it also increment root-port_total_fatal_errs ++ in above scenario
>> ?
>
> Yes, as above, it will also root-port_total_fatal_errs++ for the root
> port of that hierarchy.
>
> Thanks,
>
> Rajat
>
>>
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_COR messages reported to rootport.
>> > +
>> > +Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>> > +
>> > +Where:
>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>> > +Date: May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact: [email protected], [email protected]
>> > +Description: Total number of ERR_NONFATAL messages reported to
>> > rootport.
>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>> > b/Documentation/PCI/pcieaer-howto.txt
>> > index acd0dddd6bb8..91b6e677cb8c 100644
>> > --- a/Documentation/PCI/pcieaer-howto.txt
>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>> > device who sends
>> > the error message to root port. Pls. refer to pci express specs for
>> > other fields.
>> >
>> > +2.4 AER Statistics / Counters
>> > +
>> > +When PCIe AER errors are captured, the counters / statistics are also
>> > exposed
>> > +in form of sysfs attributes which are documented at
>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > 3. Developer Guide
On Mon, Jun 18, 2018 at 11:03 PM, <[email protected]> wrote:
> On 2018-06-19 05:41, Rajat Jain wrote:
>>
>> Hello,
>>
>> On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
>>>
>>>
>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>> > Add the PCI AER statistics details to
>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > and provide a pointer to it in
>>> > Documentation/PCI/pcieaer-howto.txt
>>> >
>>> > Signed-off-by: Rajat Jain <[email protected]>
>>> > ---
>>> > v2: Move the documentation to Documentation/ABI/
>>> >
>>> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>>> > Documentation/PCI/pcieaer-howto.txt | 5 +
>>> > 2 files changed, 108 insertions(+)
>>> > create mode 100644
>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > new file mode 100644
>>> > index 000000000000..f55c389290ac
>>> > --- /dev/null
>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > @@ -0,0 +1,103 @@
>>> > +==========================
>>> > +PCIe Device AER statistics
>>> > +==========================
>>> > +These attributes show up under all the devices that are AER capable.
>>> > These
>>> > +statistical counters indicate the errors "as seen/reported by the
>>> > device".
>>> > +Note that this may mean that if an end point is causing problems, the
>>> > AER
>>> > +counters may increment at its link partner (e.g. root port) because
>>> > the
>>> > +errors will be "seen" / reported by the link partner and not the the
>>> > +problematic end point itself (which may report all counters as 0 as it
>>> > never
>>> > +saw any problems).
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of correctable errors seen and reported by
>>> > this
>>> > + PCI device using ERR_COR.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of uncorrectable fatal errors seen and
>>> > reported
>>> > + by this PCI device using ERR_FATAL.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>> > reported
>>> > + by this PCI device using ERR_NONFATAL.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Breakdown of of correctable errors seen and reported by
>>> > this
>>> > + PCI device using ERR_COR. A sample result looks like
>>> > this:
>>> > +-----------------------------------------
>>> > +Receiver Error = 0x174
>>> > +Bad TLP = 0x19
>>> > +Bad DLLP = 0x3
>>> > +RELAY_NUM Rollover = 0x0
>>> > +Replay Timer Timeout = 0x1
>>> > +Advisory Non-Fatal = 0x0
>>> > +Corrected Internal Error = 0x0
>>> > +Header Log Overflow = 0x0
>>> > +-----------------------------------------
>>> why hex display ? decimal is easy to read as these are counters.
>>
>>
>> Have no particular preference. Since these can be potentially large
>> numbers, just had a random thought that hex might make it more
>> concise. I can change to decimal if that is preferable.
>>
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Breakdown of of correctable errors seen and reported by
>>> > this
>>> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>> > result
>>> > + looks like this:
>>> > +-----------------------------------------
>>> > +Undefined = 0x0
>>> > +Data Link Protocol = 0x0
>>> > +Surprise Down Error = 0x0
>>> > +Poisoned TLP = 0x0
>>> > +Flow Control Protocol = 0x0
>>> > +Completion Timeout = 0x0
>>> > +Completer Abort = 0x0
>>> > +Unexpected Completion = 0x0
>>> > +Receiver Overflow = 0x0
>>> > +Malformed TLP = 0x0
>>> > +ECRC = 0x0
>>> > +Unsupported Request = 0x0
>>> > +ACS Violation = 0x0
>>> > +Uncorrectable Internal Error = 0x0
>>> > +MC Blocked TLP = 0x0
>>> > +AtomicOp Egress Blocked = 0x0
>>> > +TLP Prefix Blocked Error = 0x0
>>> > +-----------------------------------------
>>> > +
>>> > +============================
>>> > +PCIe Rootport AER statistics
>>> > +============================
>>> > +These attributes showup under only the rootports that are AER capable.
>>> > These
>>> > +indicate the number of error messages as "reported to" the rootport.
>>> > Please note
>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>> > errors seen
>>> > +by the internal rootport PCI device, so these counters includes them
>>> > and are
>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>> > originating
>>> > +at that root port.
>>>
>>> what about switches and bridges ?
>>
>>
>> What about them? AIUI, the switches forward the ERR_ messages from
>> downstream devices to the rootport, like they do with standard
>> messages. They can potentially generate their own ERR_ message and
>> that would be reported no different than other end point devices.
>
>
>
> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
> contained by switch
> and the error handling code thinks that, the error is contained by switch
> irrespective of
> AER or DPC, and it will think that the problem could be with Switch/bridge
> upstream link.
>
> hence the pci_dev of the switch where you should be increment your counters.
> of course ER_FATAL would have traversed till RP, but that doesnt meant that
> you account the error there.
In this case, for the pci_dev for the rootport:
- rootport_total_fatal_errors will be incremented (since it will get ERR_FATAL)
- dev_total_fatal_errors will not be incremented.
The dev_total_fatal_errors will be incremented only for the pci device
identified by the "Error Source Identification Register" in the PCIe
spec. Does this help clarify?
>
>
>>
>>> Also Can you give some idea as e.g what is the difference between
>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
>>> are same pci_dev.
>>
>>
>> For a pci_dev representing the rootport:
>>
>> dev_total_fatal_errors = how many times this PCI device *experienced*
>> a fatal problem on its own (i.e. either link issues while talking to
>> its link partner, or some internal errors).
>>
>> rootport_total_fatal_errors = how many times this rootport was
>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>> that originates at it (can be any link further downstream). This
>> includes the dev_total_fatal_errors also, because any errors detected
>> by the rootport are also "informed" to itself via ERR_* messages. In
>> reality, this is just the total number of ERR_FATAL messages received
>> at the rootport. This sysfs attribute will only exist for root ports.
>>
>>>
>>> rootport_total_fatal_errs gives me an idea that how many times things
>>> have been failed under this pci_dev ?
>>
>>
>> Yes, as above.
>>
>>> which means num of downstream link problems. but I am still trying to
>>> make sense as how it could be used,
>>> since we dont have BDF information associated with the number of errors
>>> anywhere (except these AER print messages)
>>>
>>
>> Agree. That is a limitation. The challenges being more record keeping,
>> more complicated sysfs representation, and given that PCI devices may
>> come and go, how do we know it is the same device before we collate
>> their stats etc.
>>
>>>
>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>>> then say root-port will report it and increment
>>> dev_total_fatal_errs ++
>>> does it also increment root-port_total_fatal_errs ++ in above scenario ?
>>
>>
>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
>> port of that hierarchy.
>>
>> Thanks,
>>
>> Rajat
>>
>>>
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>>> > +Date: May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact: [email protected], [email protected]
>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>> > rootport.
>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>> > b/Documentation/PCI/pcieaer-howto.txt
>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>> > device who sends
>>> > the error message to root port. Pls. refer to pci express specs for
>>> > other fields.
>>> >
>>> > +2.4 AER Statistics / Counters
>>> > +
>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>> > exposed
>>> > +in form of sysfs attributes which are documented at
>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> > 3. Developer Guide
On Wed, May 23, 2018 at 10:58:03AM -0700, Rajat Jain wrote:
> This patchset exposes the AER stats via the sysfs attributes.
>
> Patchset v2 has minor changes to v1 based on the review comments,
> no functional change.
> Primarily:
> * Fix license header
> * Use tabs instead of spaces
> * Remove use on unlikely() etc
> * Move documentation to Documentation/ABI/
>
> Rajat Jain (5):
> PCI/AER: Define and allocate aer_stats structure for AER capable
> devices
> PCI/AER: Add sysfs stats for AER capable devices
> PCI/AER: Add sysfs attributes to provide breakdown of AERs
> PCI/AER: Add sysfs attributes for rootport cumulative stats
> Documentation/ABI: Add details of PCI AER statistics
>
> .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++
> Documentation/PCI/pcieaer-howto.txt | 5 +
> drivers/pci/pci-sysfs.c | 3 +
> drivers/pci/pci.h | 4 +-
> drivers/pci/pcie/aer/Makefile | 2 +-
> drivers/pci/pcie/aer/aerdrv.h | 15 ++
> drivers/pci/pcie/aer/aerdrv_core.c | 11 +
> drivers/pci/pcie/aer/aerdrv_errprint.c | 7 +-
> drivers/pci/pcie/aer/aerdrv_stats.c | 192 ++++++++++++++++++
> drivers/pci/probe.c | 1 +
> include/linux/pci.h | 3 +
> 11 files changed, 342 insertions(+), 4 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
I broke this by putting all the AER code in one file in v4.18-rc1,
sorry! Would you mind rebasing these on top of that?
Since everything AER-related is now in aer.c, I'd suggest putting the
stats code there, too.
Sure, no problem. Will do.
On Tue, Jun 19, 2018 at 3:16 PM Bjorn Helgaas <[email protected]> wrote:
>
> On Wed, May 23, 2018 at 10:58:03AM -0700, Rajat Jain wrote:
> > This patchset exposes the AER stats via the sysfs attributes.
> >
> > Patchset v2 has minor changes to v1 based on the review comments,
> > no functional change.
> > Primarily:
> > * Fix license header
> > * Use tabs instead of spaces
> > * Remove use on unlikely() etc
> > * Move documentation to Documentation/ABI/
> >
> > Rajat Jain (5):
> > PCI/AER: Define and allocate aer_stats structure for AER capable
> > devices
> > PCI/AER: Add sysfs stats for AER capable devices
> > PCI/AER: Add sysfs attributes to provide breakdown of AERs
> > PCI/AER: Add sysfs attributes for rootport cumulative stats
> > Documentation/ABI: Add details of PCI AER statistics
> >
> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++
> > Documentation/PCI/pcieaer-howto.txt | 5 +
> > drivers/pci/pci-sysfs.c | 3 +
> > drivers/pci/pci.h | 4 +-
> > drivers/pci/pcie/aer/Makefile | 2 +-
> > drivers/pci/pcie/aer/aerdrv.h | 15 ++
> > drivers/pci/pcie/aer/aerdrv_core.c | 11 +
> > drivers/pci/pcie/aer/aerdrv_errprint.c | 7 +-
> > drivers/pci/pcie/aer/aerdrv_stats.c | 192 ++++++++++++++++++
> > drivers/pci/probe.c | 1 +
> > include/linux/pci.h | 3 +
> > 11 files changed, 342 insertions(+), 4 deletions(-)
> > create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
>
> I broke this by putting all the AER code in one file in v4.18-rc1,
> sorry! Would you mind rebasing these on top of that?
>
> Since everything AER-related is now in aer.c, I'd suggest putting the
> stats code there, too.
On 06/19/2018 05:16 PM, Bjorn Helgaas wrote:
> On Wed, May 23, 2018 at 10:58:03AM -0700, Rajat Jain wrote:
>> This patchset exposes the AER stats via the sysfs attributes.
>>
>> Patchset v2 has minor changes to v1 based on the review comments,
>> no functional change.
>> Primarily:
>> * Fix license header
>> * Use tabs instead of spaces
>> * Remove use on unlikely() etc
>> * Move documentation to Documentation/ABI/
>>
>> Rajat Jain (5):
>> PCI/AER: Define and allocate aer_stats structure for AER capable
>> devices
>> PCI/AER: Add sysfs stats for AER capable devices
>> PCI/AER: Add sysfs attributes to provide breakdown of AERs
>> PCI/AER: Add sysfs attributes for rootport cumulative stats
>> Documentation/ABI: Add details of PCI AER statistics
>>
>> .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++
>> Documentation/PCI/pcieaer-howto.txt | 5 +
>> drivers/pci/pci-sysfs.c | 3 +
>> drivers/pci/pci.h | 4 +-
>> drivers/pci/pcie/aer/Makefile | 2 +-
>> drivers/pci/pcie/aer/aerdrv.h | 15 ++
>> drivers/pci/pcie/aer/aerdrv_core.c | 11 +
>> drivers/pci/pcie/aer/aerdrv_errprint.c | 7 +-
>> drivers/pci/pcie/aer/aerdrv_stats.c | 192 ++++++++++++++++++
>> drivers/pci/probe.c | 1 +
>> include/linux/pci.h | 3 +
>> 11 files changed, 342 insertions(+), 4 deletions(-)
>> create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> create mode 100644 drivers/pci/pcie/aer/aerdrv_stats.c
>
> I broke this by putting all the AER code in one file in v4.18-rc1,
I see the next phoronix headline:
"Linux maintainer breaks things and watches in amusement as contributors
rush to the fix"
;)
Alex
> sorry! Would you mind rebasing these on top of that?
>
> Since everything AER-related is now in aer.c, I'd suggest putting the
> stats code there, too.
>
On Tue, 19 Jun 2018 17:20:28 -0500
"Alex G." <[email protected]> wrote:
> I see the next phoronix headline:
>
> "Linux maintainer breaks things and watches in amusement as contributors
> rush to the fix"
But isn't this the daily routine of every Linux maintainer?
-- Steve
On 06/19/2018 05:25 PM, Steven Rostedt wrote:
> On Tue, 19 Jun 2018 17:20:28 -0500
> "Alex G." <[email protected]> wrote:
>
>> I see the next phoronix headline:
>>
>> "Linux maintainer breaks things and watches in amusement as contributors
>> rush to the fix"
>
> But isn't this the daily routine of every Linux maintainer?
Phoronix wrote articles about much less interesting things than that.
Alex
Add the following AER sysfs stats to represent the counters for each
kind of error as seen by the device:
dev_total_cor_errs
dev_total_fatal_errs
dev_total_nonfatal_errs
Signed-off-by: Rajat Jain <[email protected]>
---
v3: Merge everything in aer.c, use "%llu" in place of "%llx"
drivers/pci/pci-sysfs.c | 3 ++
drivers/pci/pci.h | 4 ++-
drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 0c4653c1d2ce..9f1cb9051d7d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1746,6 +1746,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
#endif
&pci_bridge_attr_group,
&pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+ &aer_stats_attr_group,
+#endif
NULL,
};
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c358e7a07f3f..9a28ec600225 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,7 +181,9 @@ extern const struct attribute_group *pci_dev_groups[];
extern const struct attribute_group *pcibus_groups[];
extern const struct device_type pci_dev_type;
extern const struct attribute_group *pci_bus_groups[];
-
+#ifdef CONFIG_PCIEAER
+extern const struct attribute_group aer_stats_attr_group;
+#endif
/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f9fa994b6c33..ce0d675d7bd3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -573,6 +573,79 @@ static const char *aer_agent_string[] = {
"Transmitter ID"
};
+#define aer_stats_aggregate_attr(field) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "%llu\n", pdev->aer_stats->field); \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_aggregate_attr(dev_total_cor_errs);
+aer_stats_aggregate_attr(dev_total_fatal_errs);
+aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+
+static struct attribute *aer_stats_attrs[] __ro_after_init = {
+ &dev_attr_dev_total_cor_errs.attr,
+ &dev_attr_dev_total_fatal_errs.attr,
+ &dev_attr_dev_total_nonfatal_errs.attr,
+ NULL
+};
+
+static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_stats)
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group aer_stats_attr_group = {
+ .name = "aer_stats",
+ .attrs = aer_stats_attrs,
+ .is_visible = aer_stats_attrs_are_visible,
+};
+
+static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_info *info)
+{
+ int status, i, max = -1;
+ u64 *counter = NULL;
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ switch (info->severity) {
+ case AER_CORRECTABLE:
+ aer_stats->dev_total_cor_errs++;
+ counter = &aer_stats->dev_cor_errs[0];
+ max = AER_MAX_TYPEOF_CORRECTABLE_ERRS;
+ break;
+ case AER_NONFATAL:
+ aer_stats->dev_total_nonfatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ case AER_FATAL:
+ aer_stats->dev_total_fatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ }
+
+ status = (info->status & ~info->mask);
+ for (i = 0; i < max; i++)
+ if (status & (1 << i))
+ counter[i]++;
+}
+
static void __print_tlp_header(struct pci_dev *dev,
struct aer_header_log_regs *t)
{
@@ -605,6 +678,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_err(dev, " [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
}
+ pci_dev_aer_stats_incr(dev, info);
}
static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
--
2.18.0.rc1.244.gcf134e6275-goog
Add the PCI AER statistics details to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
and provide a pointer to it in
Documentation/PCI/pcieaer-howto.txt
Signed-off-by: Rajat Jain <[email protected]>
---
v3: Add some more details, use decimal instead of hex
.../testing/sysfs-bus-pci-devices-aer_stats | 111 ++++++++++++++++++
Documentation/PCI/pcieaer-howto.txt | 5 +
2 files changed, 116 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
new file mode 100644
index 000000000000..3ed5a682be87
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -0,0 +1,111 @@
+==========================
+PCIe Device AER statistics
+==========================
+These attributes show up under all the devices that are AER capable. These
+statistical counters indicate the errors "as seen/reported by the device".
+Note that this may mean that if an end point is causing problems, the AER
+counters may increment at its link partner (e.g. root port) because the
+errors will be "seen" / reported by the link partner and not the the
+problematic end point itself (which may report all counters as 0 as it never
+saw any problems).
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of correctable errors seen and reported by this
+ PCI device using ERR_COR.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable fatal errors seen and reported
+ by this PCI device using ERR_FATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable non-fatal errors seen and reported
+ by this PCI device using ERR_NONFATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of correctable errors seen and reported by this
+ PCI device using ERR_COR. Note that the sum total of all errors
+ in dev_breakdown_correctable may exceed dev_total_cor_errs
+ because a device is allowed to merge multiple correctable and
+ send a single ERR_COR for them (which is what dev_total_cor_errs
+ counts). A sample output for this attribute looks like this:
+-----------------------------------------
+Receiver Error = 174
+Bad TLP = 19
+Bad DLLP = 3
+RELAY_NUM Rollover = 0
+Replay Timer Timeout = 1
+Advisory Non-Fatal = 0
+Corrected Internal Error = 0
+Header Log Overflow = 0
+-----------------------------------------
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of of correctable errors seen and reported by this
+ PCI device using ERR_FATAL or ERR_NONFATAL. Note that the sum
+ total of all errors in dev_breakdown_uncorrectable may exceed
+ (dev_total_fatal_errs + dev_total_nonfatal_errs) because a
+ device is allowed to merge multiple errors at the same severity
+ and send a single ERR_FATAL/ERR_NON_FATAL for them.
+ A sample output for this attribute looks like this:
+-----------------------------------------
+Undefined = 0
+Data Link Protocol = 0
+Surprise Down Error = 0
+Poisoned TLP = 0
+Flow Control Protocol = 0
+Completion Timeout = 0
+Completer Abort = 0
+Unexpected Completion = 0
+Receiver Overflow = 0
+Malformed TLP = 0
+ECRC = 0
+Unsupported Request = 0
+ACS Violation = 0
+Uncorrectable Internal Error = 0
+MC Blocked TLP = 0
+AtomicOp Egress Blocked = 0
+TLP Prefix Blocked Error = 0
+-----------------------------------------
+
+============================
+PCIe Rootport AER statistics
+============================
+These attributes showup under only the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters includes them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_COR messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_FATAL messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_NONFATAL messages reported to rootport.
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd0dddd6bb8..91b6e677cb8c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for
other fields.
+2.4 AER Statistics / Counters
+
+When PCIe AER errors are captured, the counters / statistics are also exposed
+in form of sysfs attributes which are documented at
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
3. Developer Guide
--
2.18.0.rc1.244.gcf134e6275-goog
Define a structure to hold the AER statistics. There are 2 groups
of statistics: dev_* counters that are to be collected for all AER
capable devices and rootport_* counters that are collected for all
(AER capable) rootports only. Allocate and free this structure when
device is added or released (thus counters survive the lifetime of the
device).
Signed-off-by: Rajat Jain <[email protected]>
---
v3: Merge everything in aer.c
drivers/pci/pcie/aer.c | 60 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 +++
3 files changed, 64 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..f9fa994b6c33 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -33,6 +33,9 @@
#define AER_ERROR_SOURCES_MAX 100
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
+#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
+
struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
@@ -76,6 +79,40 @@ struct aer_rpc {
*/
};
+/* AER stats for the device */
+struct aer_stats {
+
+ /*
+ * Fields for all AER capable devices. They indicate the errors
+ * "as seen by this device". Note that this may mean that if an
+ * end point is causing problems, the AER counters may increment
+ * at its link partner (e.g. root port) because the errors will be
+ * "seen" by the link partner and not the the problematic end point
+ * itself (which may report all counters as 0 as it never saw any
+ * problems).
+ */
+ /* Individual counters for different type of correctable errors */
+ u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+ /* Individual counters for different type of uncorrectable errors */
+ u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+ /* Total number of correctable errors seen by this device */
+ u64 dev_total_cor_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_fatal_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_nonfatal_errs;
+
+ /*
+ * Fields for Root ports only, these indicate the total number of
+ * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
+ * rootport, INCLUDING the ones that are generated internally (by
+ * the rootport itself)
+ */
+ u64 rootport_total_cor_errs;
+ u64 rootport_total_fatal_errs;
+ u64 rootport_total_nonfatal_errs;
+};
+
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
PCI_ERR_UNC_ECRC| \
PCI_ERR_UNC_UNSUP| \
@@ -402,12 +439,35 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
return 0;
}
+static int pci_aer_stats_init(struct pci_dev *pdev)
+{
+ pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ if (!pdev->aer_stats) {
+ dev_err(&pdev->dev, "No memory for aer_stats\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void pci_aer_stats_exit(struct pci_dev *pdev)
+{
+ kfree(pdev->aer_stats);
+ pdev->aer_stats = NULL;
+}
+
int pci_aer_init(struct pci_dev *dev)
{
dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+ if (!dev->aer_cap || pci_aer_stats_init(dev))
+ return -EIO;
return pci_cleanup_aer_error_status_regs(dev);
}
+void pci_aer_exit(struct pci_dev *dev)
+{
+ pci_aer_stats_exit(dev);
+}
+
#define AER_AGENT_RECEIVER 0
#define AER_AGENT_REQUESTER 1
#define AER_AGENT_COMPLETER 2
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..48edd0c9e4bc 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
+ pci_aer_exit(dev);
pci_vpd_release(dev);
pci_iov_release(dev);
pci_free_cap_save_buffers(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 340029b2fb38..8d59c6c19a19 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -299,6 +299,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
+ struct aer_stats *aer_stats; /* AER stats for this device */
#endif
u8 pcie_cap; /* PCIe capability offset */
u8 msi_cap; /* MSI capability offset */
@@ -1471,10 +1472,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
void pci_no_aer(void);
bool pci_aer_available(void);
int pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
#else
static inline void pci_no_aer(void) { }
static inline bool pci_aer_available(void) { return false; }
static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
#endif
#ifdef CONFIG_PCIE_ECRC
--
2.18.0.rc1.244.gcf134e6275-goog
Add sysfs attributes for rootport statistics (that are cumulative
of all the ERR_* messages seen on this PCI hierarchy).
Signed-off-by: Rajat Jain <[email protected]>
---
v3: Merge everything in aer.c, use "%llu" in place of "%llx"
drivers/pci/pcie/aer.c | 37 +++++++++++++++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 876f03799810..b6d0a0b56d65 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -586,6 +586,9 @@ static DEVICE_ATTR_RO(field)
aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+aer_stats_aggregate_attr(rootport_total_cor_errs);
+aer_stats_aggregate_attr(rootport_total_fatal_errs);
+aer_stats_aggregate_attr(rootport_total_nonfatal_errs);
#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
static ssize_t \
@@ -598,10 +601,10 @@ aer_stats_aggregate_attr(dev_total_nonfatal_errs);
u64 *stats = pdev->aer_stats->stats_array; \
for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
if (strings_array[i]) \
- str += sprintf(str, "%s = 0x%llx\n", \
+ str += sprintf(str, "%s = %llu\n", \
strings_array[i], stats[i]); \
else if (stats[i]) \
- str += sprintf(str, #stats_array "bit[%d] = 0x%llx\n",\
+ str += sprintf(str, #stats_array "bit[%d] = %llu\n",\
i, stats[i]); \
} \
return str-buf; \
@@ -619,6 +622,9 @@ static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_nonfatal_errs.attr,
&dev_attr_dev_breakdown_correctable.attr,
&dev_attr_dev_breakdown_uncorrectable.attr,
+ &dev_attr_rootport_total_cor_errs.attr,
+ &dev_attr_rootport_total_fatal_errs.attr,
+ &dev_attr_rootport_total_nonfatal_errs.attr,
NULL
};
@@ -631,6 +637,12 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
if (!pdev->aer_stats)
return 0;
+ if ((a == &dev_attr_rootport_total_cor_errs.attr ||
+ a == &dev_attr_rootport_total_fatal_errs.attr ||
+ a == &dev_attr_rootport_total_nonfatal_errs.attr) &&
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT)
+ return 0;
+
return a->mode;
}
@@ -674,6 +686,25 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
counter[i]++;
}
+void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src)
+{
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ if (e_src->status & PCI_ERR_ROOT_COR_RCV)
+ aer_stats->rootport_total_cor_errs++;
+
+ if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
+ if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
+ aer_stats->rootport_total_fatal_errs++;
+ else
+ aer_stats->rootport_total_nonfatal_errs++;
+ }
+}
+
static void __print_tlp_header(struct pci_dev *dev,
struct aer_header_log_regs *t)
{
@@ -1124,6 +1155,8 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
struct pci_dev *pdev = rpc->rpd;
struct aer_err_info *e_info = &rpc->e_info;
+ pci_rootport_aer_stats_incr(pdev, e_src);
+
/*
* There is a possibility that both correctable error and
* uncorrectable error being logged. Report correctable error first.
--
2.18.0.rc1.244.gcf134e6275-goog
Add sysfs attributes to provide breakdown of the AERs seen,
into different type of correctable or uncorrectable errors:
dev_breakdown_correctable
dev_breakdown_uncorrectable
Signed-off-by: Rajat Jain <[email protected]>
---
v3: Merge everything in aer.c, use "%llu" in place of "%llx"
drivers/pci/pcie/aer.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ce0d675d7bd3..876f03799810 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -587,10 +587,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ unsigned int i; \
+ char *str = buf; \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ u64 *stats = pdev->aer_stats->stats_array; \
+ for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
+ if (strings_array[i]) \
+ str += sprintf(str, "%s = 0x%llx\n", \
+ strings_array[i], stats[i]); \
+ else if (stats[i]) \
+ str += sprintf(str, #stats_array "bit[%d] = 0x%llx\n",\
+ i, stats[i]); \
+ } \
+ return str-buf; \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
+ aer_correctable_error_string);
+aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
+ aer_uncorrectable_error_string);
+
static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_cor_errs.attr,
&dev_attr_dev_total_fatal_errs.attr,
&dev_attr_dev_total_nonfatal_errs.attr,
+ &dev_attr_dev_breakdown_correctable.attr,
+ &dev_attr_dev_breakdown_uncorrectable.attr,
NULL
};
--
2.18.0.rc1.244.gcf134e6275-goog
Hi Rajat,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on pci/next]
[also build test WARNING on v4.18-rc1 next-20180619]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Rajat-Jain/PCI-AER-Define-and-allocate-aer_stats-structure-for-AER-capable-devices/20180620-091551
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
>> drivers/pci/pcie/aer.c:689:6: sparse: symbol 'pci_rootport_aer_stats_incr' was not declared. Should it be static?
Please review and possibly fold the followup patch.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Add the following AER sysfs stats to represent the counters for each
kind of error as seen by the device:
dev_total_cor_errs
dev_total_fatal_errs
dev_total_nonfatal_errs
Signed-off-by: Rajat Jain <[email protected]>
---
v4: Same as v3
v3: Merge everything in aer.c, use "%llu" in place of "%llx"
drivers/pci/pci-sysfs.c | 3 ++
drivers/pci/pci.h | 4 ++-
drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 0c4653c1d2ce..9f1cb9051d7d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1746,6 +1746,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
#endif
&pci_bridge_attr_group,
&pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+ &aer_stats_attr_group,
+#endif
NULL,
};
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c358e7a07f3f..9a28ec600225 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,7 +181,9 @@ extern const struct attribute_group *pci_dev_groups[];
extern const struct attribute_group *pcibus_groups[];
extern const struct device_type pci_dev_type;
extern const struct attribute_group *pci_bus_groups[];
-
+#ifdef CONFIG_PCIEAER
+extern const struct attribute_group aer_stats_attr_group;
+#endif
/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f9fa994b6c33..ce0d675d7bd3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -573,6 +573,79 @@ static const char *aer_agent_string[] = {
"Transmitter ID"
};
+#define aer_stats_aggregate_attr(field) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "%llu\n", pdev->aer_stats->field); \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_aggregate_attr(dev_total_cor_errs);
+aer_stats_aggregate_attr(dev_total_fatal_errs);
+aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+
+static struct attribute *aer_stats_attrs[] __ro_after_init = {
+ &dev_attr_dev_total_cor_errs.attr,
+ &dev_attr_dev_total_fatal_errs.attr,
+ &dev_attr_dev_total_nonfatal_errs.attr,
+ NULL
+};
+
+static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_stats)
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group aer_stats_attr_group = {
+ .name = "aer_stats",
+ .attrs = aer_stats_attrs,
+ .is_visible = aer_stats_attrs_are_visible,
+};
+
+static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_info *info)
+{
+ int status, i, max = -1;
+ u64 *counter = NULL;
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ switch (info->severity) {
+ case AER_CORRECTABLE:
+ aer_stats->dev_total_cor_errs++;
+ counter = &aer_stats->dev_cor_errs[0];
+ max = AER_MAX_TYPEOF_CORRECTABLE_ERRS;
+ break;
+ case AER_NONFATAL:
+ aer_stats->dev_total_nonfatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ case AER_FATAL:
+ aer_stats->dev_total_fatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ }
+
+ status = (info->status & ~info->mask);
+ for (i = 0; i < max; i++)
+ if (status & (1 << i))
+ counter[i]++;
+}
+
static void __print_tlp_header(struct pci_dev *dev,
struct aer_header_log_regs *t)
{
@@ -605,6 +678,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_err(dev, " [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
}
+ pci_dev_aer_stats_incr(dev, info);
}
static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
--
2.18.0.rc1.244.gcf134e6275-goog
Define a structure to hold the AER statistics. There are 2 groups
of statistics: dev_* counters that are to be collected for all AER
capable devices and rootport_* counters that are collected for all
(AER capable) rootports only. Allocate and free this structure when
device is added or released (thus counters survive the lifetime of the
device).
Signed-off-by: Rajat Jain <[email protected]>
---
v4: Same as v3
v3: Merge everything in aer.c
drivers/pci/pcie/aer.c | 60 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 +++
3 files changed, 64 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..f9fa994b6c33 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -33,6 +33,9 @@
#define AER_ERROR_SOURCES_MAX 100
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
+#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
+
struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
@@ -76,6 +79,40 @@ struct aer_rpc {
*/
};
+/* AER stats for the device */
+struct aer_stats {
+
+ /*
+ * Fields for all AER capable devices. They indicate the errors
+ * "as seen by this device". Note that this may mean that if an
+ * end point is causing problems, the AER counters may increment
+ * at its link partner (e.g. root port) because the errors will be
+ * "seen" by the link partner and not the the problematic end point
+ * itself (which may report all counters as 0 as it never saw any
+ * problems).
+ */
+ /* Individual counters for different type of correctable errors */
+ u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+ /* Individual counters for different type of uncorrectable errors */
+ u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+ /* Total number of correctable errors seen by this device */
+ u64 dev_total_cor_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_fatal_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_nonfatal_errs;
+
+ /*
+ * Fields for Root ports only, these indicate the total number of
+ * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
+ * rootport, INCLUDING the ones that are generated internally (by
+ * the rootport itself)
+ */
+ u64 rootport_total_cor_errs;
+ u64 rootport_total_fatal_errs;
+ u64 rootport_total_nonfatal_errs;
+};
+
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
PCI_ERR_UNC_ECRC| \
PCI_ERR_UNC_UNSUP| \
@@ -402,12 +439,35 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
return 0;
}
+static int pci_aer_stats_init(struct pci_dev *pdev)
+{
+ pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ if (!pdev->aer_stats) {
+ dev_err(&pdev->dev, "No memory for aer_stats\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void pci_aer_stats_exit(struct pci_dev *pdev)
+{
+ kfree(pdev->aer_stats);
+ pdev->aer_stats = NULL;
+}
+
int pci_aer_init(struct pci_dev *dev)
{
dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+ if (!dev->aer_cap || pci_aer_stats_init(dev))
+ return -EIO;
return pci_cleanup_aer_error_status_regs(dev);
}
+void pci_aer_exit(struct pci_dev *dev)
+{
+ pci_aer_stats_exit(dev);
+}
+
#define AER_AGENT_RECEIVER 0
#define AER_AGENT_REQUESTER 1
#define AER_AGENT_COMPLETER 2
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..48edd0c9e4bc 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
+ pci_aer_exit(dev);
pci_vpd_release(dev);
pci_iov_release(dev);
pci_free_cap_save_buffers(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 340029b2fb38..8d59c6c19a19 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -299,6 +299,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
+ struct aer_stats *aer_stats; /* AER stats for this device */
#endif
u8 pcie_cap; /* PCIe capability offset */
u8 msi_cap; /* MSI capability offset */
@@ -1471,10 +1472,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
void pci_no_aer(void);
bool pci_aer_available(void);
int pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
#else
static inline void pci_no_aer(void) { }
static inline bool pci_aer_available(void) { return false; }
static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
#endif
#ifdef CONFIG_PCIE_ECRC
--
2.18.0.rc1.244.gcf134e6275-goog
Add the PCI AER statistics details to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
and provide a pointer to it in
Documentation/PCI/pcieaer-howto.txt
Signed-off-by: Rajat Jain <[email protected]>
---
v5: Same as v4
v4: Same as v3
v3: Add some more details
.../testing/sysfs-bus-pci-devices-aer_stats | 111 ++++++++++++++++++
Documentation/PCI/pcieaer-howto.txt | 5 +
2 files changed, 116 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
new file mode 100644
index 000000000000..3ed5a682be87
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -0,0 +1,111 @@
+==========================
+PCIe Device AER statistics
+==========================
+These attributes show up under all the devices that are AER capable. These
+statistical counters indicate the errors "as seen/reported by the device".
+Note that this may mean that if an end point is causing problems, the AER
+counters may increment at its link partner (e.g. root port) because the
+errors will be "seen" / reported by the link partner and not the the
+problematic end point itself (which may report all counters as 0 as it never
+saw any problems).
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of correctable errors seen and reported by this
+ PCI device using ERR_COR.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable fatal errors seen and reported
+ by this PCI device using ERR_FATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of uncorrectable non-fatal errors seen and reported
+ by this PCI device using ERR_NONFATAL.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of correctable errors seen and reported by this
+ PCI device using ERR_COR. Note that the sum total of all errors
+ in dev_breakdown_correctable may exceed dev_total_cor_errs
+ because a device is allowed to merge multiple correctable and
+ send a single ERR_COR for them (which is what dev_total_cor_errs
+ counts). A sample output for this attribute looks like this:
+-----------------------------------------
+Receiver Error = 174
+Bad TLP = 19
+Bad DLLP = 3
+RELAY_NUM Rollover = 0
+Replay Timer Timeout = 1
+Advisory Non-Fatal = 0
+Corrected Internal Error = 0
+Header Log Overflow = 0
+-----------------------------------------
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Breakdown of of correctable errors seen and reported by this
+ PCI device using ERR_FATAL or ERR_NONFATAL. Note that the sum
+ total of all errors in dev_breakdown_uncorrectable may exceed
+ (dev_total_fatal_errs + dev_total_nonfatal_errs) because a
+ device is allowed to merge multiple errors at the same severity
+ and send a single ERR_FATAL/ERR_NON_FATAL for them.
+ A sample output for this attribute looks like this:
+-----------------------------------------
+Undefined = 0
+Data Link Protocol = 0
+Surprise Down Error = 0
+Poisoned TLP = 0
+Flow Control Protocol = 0
+Completion Timeout = 0
+Completer Abort = 0
+Unexpected Completion = 0
+Receiver Overflow = 0
+Malformed TLP = 0
+ECRC = 0
+Unsupported Request = 0
+ACS Violation = 0
+Uncorrectable Internal Error = 0
+MC Blocked TLP = 0
+AtomicOp Egress Blocked = 0
+TLP Prefix Blocked Error = 0
+-----------------------------------------
+
+============================
+PCIe Rootport AER statistics
+============================
+These attributes showup under only the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters includes them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_COR messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_FATAL messages reported to rootport.
+
+Where: /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
+Date: May 2018
+Kernel Version: 4.17.0
+Contact: [email protected], [email protected]
+Description: Total number of ERR_NONFATAL messages reported to rootport.
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd0dddd6bb8..91b6e677cb8c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
the error message to root port. Pls. refer to pci express specs for
other fields.
+2.4 AER Statistics / Counters
+
+When PCIe AER errors are captured, the counters / statistics are also exposed
+in form of sysfs attributes which are documented at
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
3. Developer Guide
--
2.18.0.rc1.244.gcf134e6275-goog
Add sysfs attributes for rootport statistics (that are cumulative
of all the ERR_* messages seen on this PCI hierarchy).
Signed-off-by: Rajat Jain <[email protected]>
---
v5: Same as v4
v4: Make pci_rootport_aer_stats_incr() static, 1 hunk of previous patch was included by mistake, fix that.
v3: Merge everything in aer.c
drivers/pci/pcie/aer.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c989bb5bb6f1..d9def7fabd81 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -586,6 +586,9 @@ static DEVICE_ATTR_RO(field)
aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+aer_stats_aggregate_attr(rootport_total_cor_errs);
+aer_stats_aggregate_attr(rootport_total_fatal_errs);
+aer_stats_aggregate_attr(rootport_total_nonfatal_errs);
#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
static ssize_t \
@@ -619,6 +622,9 @@ static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_nonfatal_errs.attr,
&dev_attr_dev_breakdown_correctable.attr,
&dev_attr_dev_breakdown_uncorrectable.attr,
+ &dev_attr_rootport_total_cor_errs.attr,
+ &dev_attr_rootport_total_fatal_errs.attr,
+ &dev_attr_rootport_total_nonfatal_errs.attr,
NULL
};
@@ -631,6 +637,12 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
if (!pdev->aer_stats)
return 0;
+ if ((a == &dev_attr_rootport_total_cor_errs.attr ||
+ a == &dev_attr_rootport_total_fatal_errs.attr ||
+ a == &dev_attr_rootport_total_nonfatal_errs.attr) &&
+ pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT)
+ return 0;
+
return a->mode;
}
@@ -674,6 +686,25 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
counter[i]++;
}
+static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_source *e_src)
+{
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ if (e_src->status & PCI_ERR_ROOT_COR_RCV)
+ aer_stats->rootport_total_cor_errs++;
+
+ if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
+ if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
+ aer_stats->rootport_total_fatal_errs++;
+ else
+ aer_stats->rootport_total_nonfatal_errs++;
+ }
+}
+
static void __print_tlp_header(struct pci_dev *dev,
struct aer_header_log_regs *t)
{
@@ -1124,6 +1155,8 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
struct pci_dev *pdev = rpc->rpd;
struct aer_err_info *e_info = &rpc->e_info;
+ pci_rootport_aer_stats_incr(pdev, e_src);
+
/*
* There is a possibility that both correctable error and
* uncorrectable error being logged. Report correctable error first.
--
2.18.0.rc1.244.gcf134e6275-goog
Add the following AER sysfs stats to represent the counters for each
kind of error as seen by the device:
dev_total_cor_errs
dev_total_fatal_errs
dev_total_nonfatal_errs
Signed-off-by: Rajat Jain <[email protected]>
---
v5: same as v4
v4: Same as v3
v3: Merge everything in aer.c, use "%llu" in place of "%llx"
drivers/pci/pci-sysfs.c | 3 ++
drivers/pci/pci.h | 4 ++-
drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 0c4653c1d2ce..9f1cb9051d7d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1746,6 +1746,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
#endif
&pci_bridge_attr_group,
&pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+ &aer_stats_attr_group,
+#endif
NULL,
};
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c358e7a07f3f..9a28ec600225 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,7 +181,9 @@ extern const struct attribute_group *pci_dev_groups[];
extern const struct attribute_group *pcibus_groups[];
extern const struct device_type pci_dev_type;
extern const struct attribute_group *pci_bus_groups[];
-
+#ifdef CONFIG_PCIEAER
+extern const struct attribute_group aer_stats_attr_group;
+#endif
/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f9fa994b6c33..ce0d675d7bd3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -573,6 +573,79 @@ static const char *aer_agent_string[] = {
"Transmitter ID"
};
+#define aer_stats_aggregate_attr(field) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "%llu\n", pdev->aer_stats->field); \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_aggregate_attr(dev_total_cor_errs);
+aer_stats_aggregate_attr(dev_total_fatal_errs);
+aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+
+static struct attribute *aer_stats_attrs[] __ro_after_init = {
+ &dev_attr_dev_total_cor_errs.attr,
+ &dev_attr_dev_total_fatal_errs.attr,
+ &dev_attr_dev_total_nonfatal_errs.attr,
+ NULL
+};
+
+static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_stats)
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group aer_stats_attr_group = {
+ .name = "aer_stats",
+ .attrs = aer_stats_attrs,
+ .is_visible = aer_stats_attrs_are_visible,
+};
+
+static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
+ struct aer_err_info *info)
+{
+ int status, i, max = -1;
+ u64 *counter = NULL;
+ struct aer_stats *aer_stats = pdev->aer_stats;
+
+ if (!aer_stats)
+ return;
+
+ switch (info->severity) {
+ case AER_CORRECTABLE:
+ aer_stats->dev_total_cor_errs++;
+ counter = &aer_stats->dev_cor_errs[0];
+ max = AER_MAX_TYPEOF_CORRECTABLE_ERRS;
+ break;
+ case AER_NONFATAL:
+ aer_stats->dev_total_nonfatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ case AER_FATAL:
+ aer_stats->dev_total_fatal_errs++;
+ counter = &aer_stats->dev_uncor_errs[0];
+ max = AER_MAX_TYPEOF_UNCORRECTABLE_ERRS;
+ break;
+ }
+
+ status = (info->status & ~info->mask);
+ for (i = 0; i < max; i++)
+ if (status & (1 << i))
+ counter[i]++;
+}
+
static void __print_tlp_header(struct pci_dev *dev,
struct aer_header_log_regs *t)
{
@@ -605,6 +678,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_err(dev, " [%2d] Unknown Error Bit%s\n",
i, info->first_error == i ? " (First)" : "");
}
+ pci_dev_aer_stats_incr(dev, info);
}
static void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
--
2.18.0.rc1.244.gcf134e6275-goog
Define a structure to hold the AER statistics. There are 2 groups
of statistics: dev_* counters that are to be collected for all AER
capable devices and rootport_* counters that are collected for all
(AER capable) rootports only. Allocate and free this structure when
device is added or released (thus counters survive the lifetime of the
device).
Signed-off-by: Rajat Jain <[email protected]>
---
v5: Same as v4
v4: Same as v3
v3: Merge everything in aer.c
drivers/pci/pcie/aer.c | 60 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 1 +
include/linux/pci.h | 3 +++
3 files changed, 64 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..f9fa994b6c33 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -33,6 +33,9 @@
#define AER_ERROR_SOURCES_MAX 100
#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
+#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
+
struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;
@@ -76,6 +79,40 @@ struct aer_rpc {
*/
};
+/* AER stats for the device */
+struct aer_stats {
+
+ /*
+ * Fields for all AER capable devices. They indicate the errors
+ * "as seen by this device". Note that this may mean that if an
+ * end point is causing problems, the AER counters may increment
+ * at its link partner (e.g. root port) because the errors will be
+ * "seen" by the link partner and not the the problematic end point
+ * itself (which may report all counters as 0 as it never saw any
+ * problems).
+ */
+ /* Individual counters for different type of correctable errors */
+ u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+ /* Individual counters for different type of uncorrectable errors */
+ u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+ /* Total number of correctable errors seen by this device */
+ u64 dev_total_cor_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_fatal_errs;
+ /* Total number of fatal uncorrectable errors seen by this device */
+ u64 dev_total_nonfatal_errs;
+
+ /*
+ * Fields for Root ports only, these indicate the total number of
+ * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
+ * rootport, INCLUDING the ones that are generated internally (by
+ * the rootport itself)
+ */
+ u64 rootport_total_cor_errs;
+ u64 rootport_total_fatal_errs;
+ u64 rootport_total_nonfatal_errs;
+};
+
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
PCI_ERR_UNC_ECRC| \
PCI_ERR_UNC_UNSUP| \
@@ -402,12 +439,35 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
return 0;
}
+static int pci_aer_stats_init(struct pci_dev *pdev)
+{
+ pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ if (!pdev->aer_stats) {
+ dev_err(&pdev->dev, "No memory for aer_stats\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void pci_aer_stats_exit(struct pci_dev *pdev)
+{
+ kfree(pdev->aer_stats);
+ pdev->aer_stats = NULL;
+}
+
int pci_aer_init(struct pci_dev *dev)
{
dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
+ if (!dev->aer_cap || pci_aer_stats_init(dev))
+ return -EIO;
return pci_cleanup_aer_error_status_regs(dev);
}
+void pci_aer_exit(struct pci_dev *dev)
+{
+ pci_aer_stats_exit(dev);
+}
+
#define AER_AGENT_RECEIVER 0
#define AER_AGENT_REQUESTER 1
#define AER_AGENT_COMPLETER 2
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..48edd0c9e4bc 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
+ pci_aer_exit(dev);
pci_vpd_release(dev);
pci_iov_release(dev);
pci_free_cap_save_buffers(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 340029b2fb38..8d59c6c19a19 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -299,6 +299,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
+ struct aer_stats *aer_stats; /* AER stats for this device */
#endif
u8 pcie_cap; /* PCIe capability offset */
u8 msi_cap; /* MSI capability offset */
@@ -1471,10 +1472,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
void pci_no_aer(void);
bool pci_aer_available(void);
int pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
#else
static inline void pci_no_aer(void) { }
static inline bool pci_aer_available(void) { return false; }
static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
#endif
#ifdef CONFIG_PCIE_ECRC
--
2.18.0.rc1.244.gcf134e6275-goog
Add sysfs attributes to provide breakdown of the AERs seen,
into different type of correctable or uncorrectable errors:
dev_breakdown_correctable
dev_breakdown_uncorrectable
Signed-off-by: Rajat Jain <[email protected]>
---
v5: Fix the signature
v4: use "%llu" in place of "%llx"
v3: Merge everything in aer.c
drivers/pci/pcie/aer.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ce0d675d7bd3..c989bb5bb6f1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -587,10 +587,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
aer_stats_aggregate_attr(dev_total_fatal_errs);
aer_stats_aggregate_attr(dev_total_nonfatal_errs);
+#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
+ static ssize_t \
+ field##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ unsigned int i; \
+ char *str = buf; \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ u64 *stats = pdev->aer_stats->stats_array; \
+ for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
+ if (strings_array[i]) \
+ str += sprintf(str, "%s = 0x%llu\n", \
+ strings_array[i], stats[i]); \
+ else if (stats[i]) \
+ str += sprintf(str, #stats_array "bit[%d] = 0x%llu\n",\
+ i, stats[i]); \
+ } \
+ return str-buf; \
+} \
+static DEVICE_ATTR_RO(field)
+
+aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
+ aer_correctable_error_string);
+aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
+ aer_uncorrectable_error_string);
+
static struct attribute *aer_stats_attrs[] __ro_after_init = {
&dev_attr_dev_total_cor_errs.attr,
&dev_attr_dev_total_fatal_errs.attr,
&dev_attr_dev_total_nonfatal_errs.attr,
+ &dev_attr_dev_breakdown_correctable.attr,
+ &dev_attr_dev_breakdown_uncorrectable.attr,
NULL
};
--
2.18.0.rc1.244.gcf134e6275-goog
On 2018-06-19 22:01, Rajat Jain wrote:
> On Mon, Jun 18, 2018 at 11:03 PM, <[email protected]> wrote:
>> On 2018-06-19 05:41, Rajat Jain wrote:
>>>
>>> Hello,
>>>
>>> On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
>>>>
>>>>
>>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>>> > Add the PCI AER statistics details to
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > and provide a pointer to it in
>>>> > Documentation/PCI/pcieaer-howto.txt
>>>> >
>>>> > Signed-off-by: Rajat Jain <[email protected]>
>>>> > ---
>>>> > v2: Move the documentation to Documentation/ABI/
>>>> >
>>>> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
>>>> > Documentation/PCI/pcieaer-howto.txt | 5 +
>>>> > 2 files changed, 108 insertions(+)
>>>> > create mode 100644
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > new file mode 100644
>>>> > index 000000000000..f55c389290ac
>>>> > --- /dev/null
>>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > @@ -0,0 +1,103 @@
>>>> > +==========================
>>>> > +PCIe Device AER statistics
>>>> > +==========================
>>>> > +These attributes show up under all the devices that are AER capable.
>>>> > These
>>>> > +statistical counters indicate the errors "as seen/reported by the
>>>> > device".
>>>> > +Note that this may mean that if an end point is causing problems, the
>>>> > AER
>>>> > +counters may increment at its link partner (e.g. root port) because
>>>> > the
>>>> > +errors will be "seen" / reported by the link partner and not the the
>>>> > +problematic end point itself (which may report all counters as 0 as it
>>>> > never
>>>> > +saw any problems).
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of correctable errors seen and reported by
>>>> > this
>>>> > + PCI device using ERR_COR.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of uncorrectable fatal errors seen and
>>>> > reported
>>>> > + by this PCI device using ERR_FATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>>> > reported
>>>> > + by this PCI device using ERR_NONFATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > + PCI device using ERR_COR. A sample result looks like
>>>> > this:
>>>> > +-----------------------------------------
>>>> > +Receiver Error = 0x174
>>>> > +Bad TLP = 0x19
>>>> > +Bad DLLP = 0x3
>>>> > +RELAY_NUM Rollover = 0x0
>>>> > +Replay Timer Timeout = 0x1
>>>> > +Advisory Non-Fatal = 0x0
>>>> > +Corrected Internal Error = 0x0
>>>> > +Header Log Overflow = 0x0
>>>> > +-----------------------------------------
>>>> why hex display ? decimal is easy to read as these are counters.
>>>
>>>
>>> Have no particular preference. Since these can be potentially large
>>> numbers, just had a random thought that hex might make it more
>>> concise. I can change to decimal if that is preferable.
>>>
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>>> > result
>>>> > + looks like this:
>>>> > +-----------------------------------------
>>>> > +Undefined = 0x0
>>>> > +Data Link Protocol = 0x0
>>>> > +Surprise Down Error = 0x0
>>>> > +Poisoned TLP = 0x0
>>>> > +Flow Control Protocol = 0x0
>>>> > +Completion Timeout = 0x0
>>>> > +Completer Abort = 0x0
>>>> > +Unexpected Completion = 0x0
>>>> > +Receiver Overflow = 0x0
>>>> > +Malformed TLP = 0x0
>>>> > +ECRC = 0x0
>>>> > +Unsupported Request = 0x0
>>>> > +ACS Violation = 0x0
>>>> > +Uncorrectable Internal Error = 0x0
>>>> > +MC Blocked TLP = 0x0
>>>> > +AtomicOp Egress Blocked = 0x0
>>>> > +TLP Prefix Blocked Error = 0x0
>>>> > +-----------------------------------------
>>>> > +
>>>> > +============================
>>>> > +PCIe Rootport AER statistics
>>>> > +============================
>>>> > +These attributes showup under only the rootports that are AER capable.
>>>> > These
>>>> > +indicate the number of error messages as "reported to" the rootport.
>>>> > Please note
>>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>>> > errors seen
>>>> > +by the internal rootport PCI device, so these counters includes them
>>>> > and are
>>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>>> > originating
>>>> > +at that root port.
>>>>
>>>> what about switches and bridges ?
>>>
>>>
>>> What about them? AIUI, the switches forward the ERR_ messages from
>>> downstream devices to the rootport, like they do with standard
>>> messages. They can potentially generate their own ERR_ message and
>>> that would be reported no different than other end point devices.
>>
>>
>>
>> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
>> contained by switch
>> and the error handling code thinks that, the error is contained by
>> switch
>> irrespective of
>> AER or DPC, and it will think that the problem could be with
>> Switch/bridge
>> upstream link.
>>
>> hence the pci_dev of the switch where you should be increment your
>> counters.
>> of course ER_FATAL would have traversed till RP, but that doesnt meant
>> that
>> you account the error there.
>
> In this case, for the pci_dev for the rootport:
> - rootport_total_fatal_errors will be incremented (since it will get
> ERR_FATAL)
> - dev_total_fatal_errors will not be incremented.
ok but my confusion is: should you not be incrementing counter against
pci_dev of switch ? and not the RP ?
because the problem was with upstream link of the EP (e.g. switch)
>
> The dev_total_fatal_errors will be incremented only for the pci device
> identified by the "Error Source Identification Register" in the PCIe
> spec. Does this help clarify?
>
>>
>>
>>>
>>>> Also Can you give some idea as e.g what is the difference between
>>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that
>>>> both
>>>> are same pci_dev.
>>>
>>>
>>> For a pci_dev representing the rootport:
>>>
>>> dev_total_fatal_errors = how many times this PCI device *experienced*
>>> a fatal problem on its own (i.e. either link issues while talking to
>>> its link partner, or some internal errors).
>>>
>>> rootport_total_fatal_errors = how many times this rootport was
>>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>>> that originates at it (can be any link further downstream). This
>>> includes the dev_total_fatal_errors also, because any errors detected
>>> by the rootport are also "informed" to itself via ERR_* messages. In
>>> reality, this is just the total number of ERR_FATAL messages received
>>> at the rootport. This sysfs attribute will only exist for root ports.
>>>
>>>>
>>>> rootport_total_fatal_errs gives me an idea that how many times
>>>> things
>>>> have been failed under this pci_dev ?
>>>
>>>
>>> Yes, as above.
>>>
>>>> which means num of downstream link problems. but I am still trying
>>>> to
>>>> make sense as how it could be used,
>>>> since we dont have BDF information associated with the number of
>>>> errors
>>>> anywhere (except these AER print messages)
>>>>
>>>
>>> Agree. That is a limitation. The challenges being more record
>>> keeping,
>>> more complicated sysfs representation, and given that PCI devices may
>>> come and go, how do we know it is the same device before we collate
>>> their stats etc.
>>>
>>>>
>>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>>>> then say root-port will report it and increment
>>>> dev_total_fatal_errs ++
>>>> does it also increment root-port_total_fatal_errs ++ in above
>>>> scenario ?
>>>
>>>
>>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
>>> port of that hierarchy.
>>>
>>> Thanks,
>>>
>>> Rajat
>>>
>>>>
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>>>> > +Date: May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact: [email protected], [email protected]
>>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>>> > rootport.
>>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>>> > b/Documentation/PCI/pcieaer-howto.txt
>>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>>> > device who sends
>>>> > the error message to root port. Pls. refer to pci express specs for
>>>> > other fields.
>>>> >
>>>> > +2.4 AER Statistics / Counters
>>>> > +
>>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>>> > exposed
>>>> > +in form of sysfs attributes which are documented at
>>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > 3. Developer Guide
On Wed, Jun 20, 2018 at 04:41:43PM -0700, Rajat Jain wrote:
> Define a structure to hold the AER statistics. There are 2 groups
> of statistics: dev_* counters that are to be collected for all AER
> capable devices and rootport_* counters that are collected for all
> (AER capable) rootports only. Allocate and free this structure when
> device is added or released (thus counters survive the lifetime of the
> device).
>
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> v5: Same as v4
> v4: Same as v3
> v3: Merge everything in aer.c
>
> drivers/pci/pcie/aer.c | 60 ++++++++++++++++++++++++++++++++++++++++++
> drivers/pci/probe.c | 1 +
> include/linux/pci.h | 3 +++
> 3 files changed, 64 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e88386af28..f9fa994b6c33 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -33,6 +33,9 @@
> #define AER_ERROR_SOURCES_MAX 100
> #define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
>
> +#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
> +#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
> +
> struct aer_err_info {
> struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
> int error_dev_num;
> @@ -76,6 +79,40 @@ struct aer_rpc {
> */
> };
>
> +/* AER stats for the device */
> +struct aer_stats {
> +
> + /*
> + * Fields for all AER capable devices. They indicate the errors
> + * "as seen by this device". Note that this may mean that if an
> + * end point is causing problems, the AER counters may increment
> + * at its link partner (e.g. root port) because the errors will be
> + * "seen" by the link partner and not the the problematic end point
> + * itself (which may report all counters as 0 as it never saw any
> + * problems).
> + */
> + /* Individual counters for different type of correctable errors */
> + u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
> + /* Individual counters for different type of uncorrectable errors */
> + u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
> + /* Total number of correctable errors seen by this device */
> + u64 dev_total_cor_errs;
> + /* Total number of fatal uncorrectable errors seen by this device */
> + u64 dev_total_fatal_errs;
> + /* Total number of fatal uncorrectable errors seen by this device */
> + u64 dev_total_nonfatal_errs;
> +
> + /*
> + * Fields for Root ports only, these indicate the total number of
> + * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
> + * rootport, INCLUDING the ones that are generated internally (by
> + * the rootport itself)
Strictly speaking, I think these are applicable for both root ports
and root complex event collectors, right?
> + */
> + u64 rootport_total_cor_errs;
> + u64 rootport_total_fatal_errs;
> + u64 rootport_total_nonfatal_errs;
> +};
> +
> #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
> PCI_ERR_UNC_ECRC| \
> PCI_ERR_UNC_UNSUP| \
> @@ -402,12 +439,35 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
> return 0;
> }
>
> +static int pci_aer_stats_init(struct pci_dev *pdev)
> +{
> + pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> + if (!pdev->aer_stats) {
> + dev_err(&pdev->dev, "No memory for aer_stats\n");
pci_err(), if we need the message at all.
Based on c7abb2352c29 ("PCI: Remove unnecessary messages for memory
allocation failures"), I'd be inclined to drop the message completely.
> + return -ENOMEM;
> + }
> + return 0;
> +}
> +
> +static void pci_aer_stats_exit(struct pci_dev *pdev)
> +{
> + kfree(pdev->aer_stats);
> + pdev->aer_stats = NULL;
> +}
> +
> int pci_aer_init(struct pci_dev *dev)
> {
> dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
> + if (!dev->aer_cap || pci_aer_stats_init(dev))
> + return -EIO;
This skips pci_cleanup_aer_error_status_regs() if the kzalloc() fails.
I think we should still do pci_cleanup_aer_error_status_regs(), even
if the alloc fails.
Nobody checks the return value of pci_aer_init(), so I think you can
simplify this by making these functions void.
Maybe even squash them together, i.e., do the kzalloc() directly in
pci_aer_init() and the kfree() directly in pci_aer_exit()?
> return pci_cleanup_aer_error_status_regs(dev);
> }
>
> +void pci_aer_exit(struct pci_dev *dev)
> +{
> + pci_aer_stats_exit(dev);
> +}
> +
> #define AER_AGENT_RECEIVER 0
> #define AER_AGENT_REQUESTER 1
> #define AER_AGENT_COMPLETER 2
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac876e32de4b..48edd0c9e4bc 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
>
> static void pci_release_capabilities(struct pci_dev *dev)
> {
> + pci_aer_exit(dev);
> pci_vpd_release(dev);
> pci_iov_release(dev);
> pci_free_cap_save_buffers(dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 340029b2fb38..8d59c6c19a19 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -299,6 +299,7 @@ struct pci_dev {
> u8 hdr_type; /* PCI header type (`multi' flag masked out) */
> #ifdef CONFIG_PCIEAER
> u16 aer_cap; /* AER capability offset */
> + struct aer_stats *aer_stats; /* AER stats for this device */
> #endif
> u8 pcie_cap; /* PCIe capability offset */
> u8 msi_cap; /* MSI capability offset */
> @@ -1471,10 +1472,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
> void pci_no_aer(void);
> bool pci_aer_available(void);
> int pci_aer_init(struct pci_dev *dev);
> +void pci_aer_exit(struct pci_dev *dev);
With the exception of pci_aer_available(), these are only used inside
drivers/pci. This might be a good opportunity to move those private
things to drivers/pci/pci.h (in a separate patch, of course).
> #else
> static inline void pci_no_aer(void) { }
> static inline bool pci_aer_available(void) { return false; }
> static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
> +static inline void pci_aer_exit(struct pci_dev *d) { }
> #endif
>
> #ifdef CONFIG_PCIE_ECRC
> --
> 2.18.0.rc1.244.gcf134e6275-goog
>
[+cc Tyler for AER dmesg decoding]
I really like this idea a lot; thanks for putting it together!
On Wed, Jun 20, 2018 at 04:41:45PM -0700, Rajat Jain wrote:
> Add sysfs attributes to provide breakdown of the AERs seen,
> into different type of correctable or uncorrectable errors:
>
> dev_breakdown_correctable
> dev_breakdown_uncorrectable
- Can you include a more complete sysfs path here in the commit log,
as well as a snippet of the contents? From the doc patch, I think
it is currently:
/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
- I'm not sure it's worth making a new subdirectory. What if you
simply added these?
/sys/bus/pci/devices/<dev>/aer_correctable
/sys/bus/pci/devices/<dev>/aer_uncorrectable
or perhaps, since you split the "total" files into
cor/nonfatal/fatal, these could match?
/sys/bus/pci/devices/<dev>/aer_correctable
/sys/bus/pci/devices/<dev>/aer_nonfatal
/sys/bus/pci/devices/<dev>/aer_fatal
I think the nonfatal/fatal distinction might be worth exposing
because some of those are configurable and the kernel handling is
significantly different. So I think it would make this more
approachable if the "remove/re-enumerate" situations that will be
obvious in dmesg logs were clearly connected with "aer_fatal"
statistics, as opposed to being connected to some subset of what's
in "aer_uncorrectable".
- Possibly the totals that you currently have in dev_total_cor_errs
could even be added to the bottom of these? Not sure what direction
would be best, and as you say, there's the potential for confusion
because the individual items won't add up to the totals. If they
were in the same file, maybe that could be addressed in the label.
- Can you include the related doc update in the same patch? That way
the doc update is more likely to be backported along with the patch.
- I was going to ask whether these should all be in a single file or
whether they should be split up so there's a separate file for each
type or error, each containing a single number. But
Documentation/filesystems/sysfs.txt says either is OK and
/sys/devices/system/node/node0/vmstat is an example of a similar
situation in an existing file, so I think what you did is perfect.
> Signed-off-by: Rajat Jain <[email protected]>
> ---
> v5: Fix the signature
> v4: use "%llu" in place of "%llx"
> v3: Merge everything in aer.c
>
> drivers/pci/pcie/aer.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ce0d675d7bd3..c989bb5bb6f1 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -587,10 +587,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
> aer_stats_aggregate_attr(dev_total_fatal_errs);
> aer_stats_aggregate_attr(dev_total_nonfatal_errs);
>
> +#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
> + static ssize_t \
> + field##_show(struct device *dev, struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + unsigned int i; \
> + char *str = buf; \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + u64 *stats = pdev->aer_stats->stats_array; \
Nit: add a blank line here.
> + for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
> + if (strings_array[i]) \
> + str += sprintf(str, "%s = 0x%llu\n", \
> + strings_array[i], stats[i]); \
> + else if (stats[i]) \
> + str += sprintf(str, #stats_array "bit[%d] = 0x%llu\n",\
> + i, stats[i]); \
- I like the way this uses the same text as used in dmesg
(aer_correctable_error_string[] and
aer_uncorrectable_error_string[]).
- I think this incorrectly prints a "0x" prefix for a decimal number
(probably an artifact of your v4 change).
- Tyler posted a patch [1] to update those dmesg strings so they match
the way lspci decodes them. I really liked that update, but we
never quite finished it. If we're going to do that, it would be
nice to do it first, so we don't publish new sysfs files, then
immediately change the labels used in them.
- IIRC, Tyler's patch had the nice property of changing the strings so
each error name had no spaces, which would make it a little easier
to parse this sysfs file: each line would be a single identifier
followed by a single number (I would probably remove the "=" from
the middle).
[1] https://lkml.kernel.org/r/[email protected]
> + } \
> + return str-buf; \
> +} \
> +static DEVICE_ATTR_RO(field)
> +
> +aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
> + aer_correctable_error_string);
> +aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
> + aer_uncorrectable_error_string);
> +
> static struct attribute *aer_stats_attrs[] __ro_after_init = {
> &dev_attr_dev_total_cor_errs.attr,
> &dev_attr_dev_total_fatal_errs.attr,
> &dev_attr_dev_total_nonfatal_errs.attr,
> + &dev_attr_dev_breakdown_correctable.attr,
> + &dev_attr_dev_breakdown_uncorrectable.attr,
> NULL
> };
>
> --
> 2.18.0.rc1.244.gcf134e6275-goog
>
On Thu, Jun 21, 2018 at 6:17 AM, Bjorn Helgaas <[email protected]> wrote:
> On Wed, Jun 20, 2018 at 04:41:43PM -0700, Rajat Jain wrote:
>> Define a structure to hold the AER statistics. There are 2 groups
>> of statistics: dev_* counters that are to be collected for all AER
>> capable devices and rootport_* counters that are collected for all
>> (AER capable) rootports only. Allocate and free this structure when
>> device is added or released (thus counters survive the lifetime of the
>> device).
>>
>> Signed-off-by: Rajat Jain <[email protected]>
>> ---
>> v5: Same as v4
>> v4: Same as v3
>> v3: Merge everything in aer.c
>>
>> drivers/pci/pcie/aer.c | 60 ++++++++++++++++++++++++++++++++++++++++++
>> drivers/pci/probe.c | 1 +
>> include/linux/pci.h | 3 +++
>> 3 files changed, 64 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..f9fa994b6c33 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -33,6 +33,9 @@
>> #define AER_ERROR_SOURCES_MAX 100
>> #define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
>>
>> +#define AER_MAX_TYPEOF_CORRECTABLE_ERRS 16 /* as per PCI_ERR_COR_STATUS */
>> +#define AER_MAX_TYPEOF_UNCORRECTABLE_ERRS 26 /* as per PCI_ERR_UNCOR_STATUS*/
>> +
>> struct aer_err_info {
>> struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>> int error_dev_num;
>> @@ -76,6 +79,40 @@ struct aer_rpc {
>> */
>> };
>>
>> +/* AER stats for the device */
>> +struct aer_stats {
>> +
>> + /*
>> + * Fields for all AER capable devices. They indicate the errors
>> + * "as seen by this device". Note that this may mean that if an
>> + * end point is causing problems, the AER counters may increment
>> + * at its link partner (e.g. root port) because the errors will be
>> + * "seen" by the link partner and not the the problematic end point
>> + * itself (which may report all counters as 0 as it never saw any
>> + * problems).
>> + */
>> + /* Individual counters for different type of correctable errors */
>> + u64 dev_cor_errs[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
>> + /* Individual counters for different type of uncorrectable errors */
>> + u64 dev_uncor_errs[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
>> + /* Total number of correctable errors seen by this device */
>> + u64 dev_total_cor_errs;
>> + /* Total number of fatal uncorrectable errors seen by this device */
>> + u64 dev_total_fatal_errs;
>> + /* Total number of fatal uncorrectable errors seen by this device */
>> + u64 dev_total_nonfatal_errs;
>> +
>> + /*
>> + * Fields for Root ports only, these indicate the total number of
>> + * ERR_COR, ERR_FATAL, and ERR_NONFATAL messages received by the
>> + * rootport, INCLUDING the ones that are generated internally (by
>> + * the rootport itself)
>
> Strictly speaking, I think these are applicable for both root ports
> and root complex event collectors, right?
Correct, I will reword this comment to state this.
>
>> + */
>> + u64 rootport_total_cor_errs;
>> + u64 rootport_total_fatal_errs;
>> + u64 rootport_total_nonfatal_errs;
>> +};
>> +
>> #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
>> PCI_ERR_UNC_ECRC| \
>> PCI_ERR_UNC_UNSUP| \
>> @@ -402,12 +439,35 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>> return 0;
>> }
>>
>> +static int pci_aer_stats_init(struct pci_dev *pdev)
>> +{
>> + pdev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
>> + if (!pdev->aer_stats) {
>> + dev_err(&pdev->dev, "No memory for aer_stats\n");
>
> pci_err(), if we need the message at all.
>
> Based on c7abb2352c29 ("PCI: Remove unnecessary messages for memory
> allocation failures"), I'd be inclined to drop the message completely.
Will do.
>
>> + return -ENOMEM;
>> + }
>> + return 0;
>> +}
>> +
>> +static void pci_aer_stats_exit(struct pci_dev *pdev)
>> +{
>> + kfree(pdev->aer_stats);
>> + pdev->aer_stats = NULL;
>> +}
>> +
>> int pci_aer_init(struct pci_dev *dev)
>> {
>> dev->aer_cap = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
>> + if (!dev->aer_cap || pci_aer_stats_init(dev))
>> + return -EIO;
>
> This skips pci_cleanup_aer_error_status_regs() if the kzalloc() fails.
> I think we should still do pci_cleanup_aer_error_status_regs(), even
> if the alloc fails.
Will do.
>
> Nobody checks the return value of pci_aer_init(), so I think you can
> simplify this by making these functions void.
Will do.
>
> Maybe even squash them together, i.e., do the kzalloc() directly in
> pci_aer_init() and the kfree() directly in pci_aer_exit()?
Will do.
>
>> return pci_cleanup_aer_error_status_regs(dev);
>> }
>>
>> +void pci_aer_exit(struct pci_dev *dev)
>> +{
>> + pci_aer_stats_exit(dev);
>> +}
>> +
>> #define AER_AGENT_RECEIVER 0
>> #define AER_AGENT_REQUESTER 1
>> #define AER_AGENT_COMPLETER 2
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index ac876e32de4b..48edd0c9e4bc 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
>>
>> static void pci_release_capabilities(struct pci_dev *dev)
>> {
>> + pci_aer_exit(dev);
>> pci_vpd_release(dev);
>> pci_iov_release(dev);
>> pci_free_cap_save_buffers(dev);
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 340029b2fb38..8d59c6c19a19 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -299,6 +299,7 @@ struct pci_dev {
>> u8 hdr_type; /* PCI header type (`multi' flag masked out) */
>> #ifdef CONFIG_PCIEAER
>> u16 aer_cap; /* AER capability offset */
>> + struct aer_stats *aer_stats; /* AER stats for this device */
>> #endif
>> u8 pcie_cap; /* PCIe capability offset */
>> u8 msi_cap; /* MSI capability offset */
>> @@ -1471,10 +1472,12 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
>> void pci_no_aer(void);
>> bool pci_aer_available(void);
>> int pci_aer_init(struct pci_dev *dev);
>> +void pci_aer_exit(struct pci_dev *dev);
>
> With the exception of pci_aer_available(), these are only used inside
> drivers/pci. This might be a good opportunity to move those private
> things to drivers/pci/pci.h (in a separate patch, of course).
Will do.
>
>> #else
>> static inline void pci_no_aer(void) { }
>> static inline bool pci_aer_available(void) { return false; }
>> static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
>> +static inline void pci_aer_exit(struct pci_dev *d) { }
>> #endif
>>
>> #ifdef CONFIG_PCIE_ECRC
>> --
>> 2.18.0.rc1.244.gcf134e6275-goog
>>
On Thu, Jun 21, 2018 at 11:48 AM, Bjorn Helgaas <[email protected]> wrote:
> [+cc Tyler for AER dmesg decoding]
>
> I really like this idea a lot; thanks for putting it together!
>
> On Wed, Jun 20, 2018 at 04:41:45PM -0700, Rajat Jain wrote:
>> Add sysfs attributes to provide breakdown of the AERs seen,
>> into different type of correctable or uncorrectable errors:
>>
>> dev_breakdown_correctable
>> dev_breakdown_uncorrectable
>
> - Can you include a more complete sysfs path here in the commit log,
> as well as a snippet of the contents? From the doc patch, I think
> it is currently:
>
> /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>
> - I'm not sure it's worth making a new subdirectory. What if you
> simply added these?
Its your call. We're going to be creating 6 files for aer_stats (I'll
be following your suggestion below), and I think it may clutter the
directory. In my next patch, I'm going to remove the sub directory,
but we can add that later if you feel so.
>
> /sys/bus/pci/devices/<dev>/aer_correctable
> /sys/bus/pci/devices/<dev>/aer_uncorrectable
>
> or perhaps, since you split the "total" files into
> cor/nonfatal/fatal, these could match?
>
> /sys/bus/pci/devices/<dev>/aer_correctable
> /sys/bus/pci/devices/<dev>/aer_nonfatal
> /sys/bus/pci/devices/<dev>/aer_fatal
This sounds like a better idea.
>
> I think the nonfatal/fatal distinction might be worth exposing
> because some of those are configurable and the kernel handling is
> significantly different. So I think it would make this more
> approachable if the "remove/re-enumerate" situations that will be
> obvious in dmesg logs were clearly connected with "aer_fatal"
> statistics, as opposed to being connected to some subset of what's
> in "aer_uncorrectable".
Agree, however note that theoretically, the classification of
uncorrectable errors into fatal or non fatal can be programmed /
changed (by who?), so it is possible that some of the same types of
errors may show up such that some instances in counted in fatal and
some in non-fatal (depending on whether those bits were set while
handling ERR_FATAL or ERR_NONFATAL respectively). Not that I think
there is something wrong with this, just thought I will mention.
>
> - Possibly the totals that you currently have in dev_total_cor_errs
> could even be added to the bottom of these? Not sure what direction
> would be best, and as you say, there's the potential for confusion
> because the individual items won't add up to the totals. If they
> were in the same file, maybe that could be addressed in the label.
Agree, this also sounds good.
>
> - Can you include the related doc update in the same patch? That way
> the doc update is more likely to be backported along with the patch.
Will do.
>
> - I was going to ask whether these should all be in a single file or
> whether they should be split up so there's a separate file for each
> type or error, each containing a single number. But
> Documentation/filesystems/sysfs.txt says either is OK and
> /sys/devices/system/node/node0/vmstat is an example of a similar
> situation in an existing file, so I think what you did is perfect.
Thank you, I initially thought of having a different file for each
error, but then it looked like we're be having much more files - at
least large enough for the number of files to overwhelm the user
space.
Thanks,
Rajat
>
>> Signed-off-by: Rajat Jain <[email protected]>
>> ---
>> v5: Fix the signature
>> v4: use "%llu" in place of "%llx"
>> v3: Merge everything in aer.c
>>
>> drivers/pci/pcie/aer.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ce0d675d7bd3..c989bb5bb6f1 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -587,10 +587,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
>> aer_stats_aggregate_attr(dev_total_fatal_errs);
>> aer_stats_aggregate_attr(dev_total_nonfatal_errs);
>>
>> +#define aer_stats_breakdown_attr(field, stats_array, strings_array) \
>> + static ssize_t \
>> + field##_show(struct device *dev, struct device_attribute *attr, \
>> + char *buf) \
>> +{ \
>> + unsigned int i; \
>> + char *str = buf; \
>> + struct pci_dev *pdev = to_pci_dev(dev); \
>> + u64 *stats = pdev->aer_stats->stats_array; \
>
> Nit: add a blank line here.
Will do.
>
>> + for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
>> + if (strings_array[i]) \
>> + str += sprintf(str, "%s = 0x%llu\n", \
>> + strings_array[i], stats[i]); \
>> + else if (stats[i]) \
>> + str += sprintf(str, #stats_array "bit[%d] = 0x%llu\n",\
>> + i, stats[i]); \
>
> - I like the way this uses the same text as used in dmesg
> (aer_correctable_error_string[] and
> aer_uncorrectable_error_string[]).
>
> - I think this incorrectly prints a "0x" prefix for a decimal number
> (probably an artifact of your v4 change).
Will do.
>
> - Tyler posted a patch [1] to update those dmesg strings so they match
> the way lspci decodes them. I really liked that update, but we
> never quite finished it. If we're going to do that, it would be
> nice to do it first, so we don't publish new sysfs files, then
> immediately change the labels used in them.
Sure, I guess you can push them in the right order.
>
> - IIRC, Tyler's patch had the nice property of changing the strings so
> each error name had no spaces, which would make it a little easier
> to parse this sysfs file: each line would be a single identifier
> followed by a single number (I would probably remove the "=" from
> the middle).
Will do.
>
> [1] https://lkml.kernel.org/r/[email protected]
>
>> + } \
>> + return str-buf; \
>> +} \
>> +static DEVICE_ATTR_RO(field)
>> +
>> +aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
>> + aer_correctable_error_string);
>> +aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
>> + aer_uncorrectable_error_string);
>> +
>> static struct attribute *aer_stats_attrs[] __ro_after_init = {
>> &dev_attr_dev_total_cor_errs.attr,
>> &dev_attr_dev_total_fatal_errs.attr,
>> &dev_attr_dev_total_nonfatal_errs.attr,
>> + &dev_attr_dev_breakdown_correctable.attr,
>> + &dev_attr_dev_breakdown_uncorrectable.attr,
>> NULL
>> };
>>
>> --
>> 2.18.0.rc1.244.gcf134e6275-goog
>>
On Thu, Jun 21, 2018 at 2:19 AM <[email protected]> wrote:
>
> On 2018-06-19 22:01, Rajat Jain wrote:
> > On Mon, Jun 18, 2018 at 11:03 PM, <[email protected]> wrote:
> >> On 2018-06-19 05:41, Rajat Jain wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Sat, Jun 16, 2018 at 10:24 PM <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 2018-05-23 23:28, Rajat Jain wrote:
> >>>> > Add the PCI AER statistics details to
> >>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > and provide a pointer to it in
> >>>> > Documentation/PCI/pcieaer-howto.txt
> >>>> >
> >>>> > Signed-off-by: Rajat Jain <[email protected]>
> >>>> > ---
> >>>> > v2: Move the documentation to Documentation/ABI/
> >>>> >
> >>>> > .../testing/sysfs-bus-pci-devices-aer_stats | 103 ++++++++++++++++++
> >>>> > Documentation/PCI/pcieaer-howto.txt | 5 +
> >>>> > 2 files changed, 108 insertions(+)
> >>>> > create mode 100644
> >>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> >
> >>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > new file mode 100644
> >>>> > index 000000000000..f55c389290ac
> >>>> > --- /dev/null
> >>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > @@ -0,0 +1,103 @@
> >>>> > +==========================
> >>>> > +PCIe Device AER statistics
> >>>> > +==========================
> >>>> > +These attributes show up under all the devices that are AER capable.
> >>>> > These
> >>>> > +statistical counters indicate the errors "as seen/reported by the
> >>>> > device".
> >>>> > +Note that this may mean that if an end point is causing problems, the
> >>>> > AER
> >>>> > +counters may increment at its link partner (e.g. root port) because
> >>>> > the
> >>>> > +errors will be "seen" / reported by the link partner and not the the
> >>>> > +problematic end point itself (which may report all counters as 0 as it
> >>>> > never
> >>>> > +saw any problems).
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of correctable errors seen and reported by
> >>>> > this
> >>>> > + PCI device using ERR_COR.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of uncorrectable fatal errors seen and
> >>>> > reported
> >>>> > + by this PCI device using ERR_FATAL.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of uncorrectable non-fatal errors seen and
> >>>> > reported
> >>>> > + by this PCI device using ERR_NONFATAL.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Breakdown of of correctable errors seen and reported by
> >>>> > this
> >>>> > + PCI device using ERR_COR. A sample result looks like
> >>>> > this:
> >>>> > +-----------------------------------------
> >>>> > +Receiver Error = 0x174
> >>>> > +Bad TLP = 0x19
> >>>> > +Bad DLLP = 0x3
> >>>> > +RELAY_NUM Rollover = 0x0
> >>>> > +Replay Timer Timeout = 0x1
> >>>> > +Advisory Non-Fatal = 0x0
> >>>> > +Corrected Internal Error = 0x0
> >>>> > +Header Log Overflow = 0x0
> >>>> > +-----------------------------------------
> >>>> why hex display ? decimal is easy to read as these are counters.
> >>>
> >>>
> >>> Have no particular preference. Since these can be potentially large
> >>> numbers, just had a random thought that hex might make it more
> >>> concise. I can change to decimal if that is preferable.
> >>>
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Breakdown of of correctable errors seen and reported by
> >>>> > this
> >>>> > + PCI device using ERR_FATAL or ERR_NONFATAL. A sample
> >>>> > result
> >>>> > + looks like this:
> >>>> > +-----------------------------------------
> >>>> > +Undefined = 0x0
> >>>> > +Data Link Protocol = 0x0
> >>>> > +Surprise Down Error = 0x0
> >>>> > +Poisoned TLP = 0x0
> >>>> > +Flow Control Protocol = 0x0
> >>>> > +Completion Timeout = 0x0
> >>>> > +Completer Abort = 0x0
> >>>> > +Unexpected Completion = 0x0
> >>>> > +Receiver Overflow = 0x0
> >>>> > +Malformed TLP = 0x0
> >>>> > +ECRC = 0x0
> >>>> > +Unsupported Request = 0x0
> >>>> > +ACS Violation = 0x0
> >>>> > +Uncorrectable Internal Error = 0x0
> >>>> > +MC Blocked TLP = 0x0
> >>>> > +AtomicOp Egress Blocked = 0x0
> >>>> > +TLP Prefix Blocked Error = 0x0
> >>>> > +-----------------------------------------
> >>>> > +
> >>>> > +============================
> >>>> > +PCIe Rootport AER statistics
> >>>> > +============================
> >>>> > +These attributes showup under only the rootports that are AER capable.
> >>>> > These
> >>>> > +indicate the number of error messages as "reported to" the rootport.
> >>>> > Please note
> >>>> > +that the rootports also transmit (internally) the ERR_* messages for
> >>>> > errors seen
> >>>> > +by the internal rootport PCI device, so these counters includes them
> >>>> > and are
> >>>> > +thus cumulative of all the error messages on the PCI hierarchy
> >>>> > originating
> >>>> > +at that root port.
> >>>>
> >>>> what about switches and bridges ?
> >>>
> >>>
> >>> What about them? AIUI, the switches forward the ERR_ messages from
> >>> downstream devices to the rootport, like they do with standard
> >>> messages. They can potentially generate their own ERR_ message and
> >>> that would be reported no different than other end point devices.
> >>
> >>
> >>
> >> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
> >> contained by switch
> >> and the error handling code thinks that, the error is contained by
> >> switch
> >> irrespective of
> >> AER or DPC, and it will think that the problem could be with
> >> Switch/bridge
> >> upstream link.
> >>
> >> hence the pci_dev of the switch where you should be increment your
> >> counters.
> >> of course ER_FATAL would have traversed till RP, but that doesnt meant
> >> that
> >> you account the error there.
> >
> > In this case, for the pci_dev for the rootport:
> > - rootport_total_fatal_errors will be incremented (since it will get
> > ERR_FATAL)
> > - dev_total_fatal_errors will not be incremented.
>
> ok but my confusion is: should you not be incrementing counter against
> pci_dev of switch ? and not the RP ?
> because the problem was with upstream link of the EP (e.g. switch)
The question is who sent the ERR_* message to the rootport? That is
the guy who noticed the problem, and will most likely be the switch
port in your case. It is this guy whose counter shall be incremented.
>
> >
> > The dev_total_fatal_errors will be incremented only for the pci device
> > identified by the "Error Source Identification Register" in the PCIe
> > spec. Does this help clarify?
>
> >
> >>
> >>
> >>>
> >>>> Also Can you give some idea as e.g what is the difference between
> >>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that
> >>>> both
> >>>> are same pci_dev.
> >>>
> >>>
> >>> For a pci_dev representing the rootport:
> >>>
> >>> dev_total_fatal_errors = how many times this PCI device *experienced*
> >>> a fatal problem on its own (i.e. either link issues while talking to
> >>> its link partner, or some internal errors).
> >>>
> >>> rootport_total_fatal_errors = how many times this rootport was
> >>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
> >>> that originates at it (can be any link further downstream). This
> >>> includes the dev_total_fatal_errors also, because any errors detected
> >>> by the rootport are also "informed" to itself via ERR_* messages. In
> >>> reality, this is just the total number of ERR_FATAL messages received
> >>> at the rootport. This sysfs attribute will only exist for root ports.
> >>>
> >>>>
> >>>> rootport_total_fatal_errs gives me an idea that how many times
> >>>> things
> >>>> have been failed under this pci_dev ?
> >>>
> >>>
> >>> Yes, as above.
> >>>
> >>>> which means num of downstream link problems. but I am still trying
> >>>> to
> >>>> make sense as how it could be used,
> >>>> since we dont have BDF information associated with the number of
> >>>> errors
> >>>> anywhere (except these AER print messages)
> >>>>
> >>>
> >>> Agree. That is a limitation. The challenges being more record
> >>> keeping,
> >>> more complicated sysfs representation, and given that PCI devices may
> >>> come and go, how do we know it is the same device before we collate
> >>> their stats etc.
> >>>
> >>>>
> >>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
> >>>> then say root-port will report it and increment
> >>>> dev_total_fatal_errs ++
> >>>> does it also increment root-port_total_fatal_errs ++ in above
> >>>> scenario ?
> >>>
> >>>
> >>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
> >>> port of that hierarchy.
> >>>
> >>> Thanks,
> >>>
> >>> Rajat
> >>>
> >>>>
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of ERR_COR messages reported to rootport.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
> >>>> > +Date: May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact: [email protected], [email protected]
> >>>> > +Description: Total number of ERR_NONFATAL messages reported to
> >>>> > rootport.
> >>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
> >>>> > b/Documentation/PCI/pcieaer-howto.txt
> >>>> > index acd0dddd6bb8..91b6e677cb8c 100644
> >>>> > --- a/Documentation/PCI/pcieaer-howto.txt
> >>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
> >>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
> >>>> > device who sends
> >>>> > the error message to root port. Pls. refer to pci express specs for
> >>>> > other fields.
> >>>> >
> >>>> > +2.4 AER Statistics / Counters
> >>>> > +
> >>>> > +When PCIe AER errors are captured, the counters / statistics are also
> >>>> > exposed
> >>>> > +in form of sysfs attributes which are documented at
> >>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> >
> >>>> > 3. Developer Guide
On 6/21/2018 5:25 PM, Rajat Jain wrote:
> On Thu, Jun 21, 2018 at 11:48 AM, Bjorn Helgaas <[email protected]> wrote:
>> [+cc Tyler for AER dmesg decoding]
>>
>> - Tyler posted a patch [1] to update those dmesg strings so they match
>> the way lspci decodes them. I really liked that update, but we
>> never quite finished it. If we're going to do that, it would be
>> nice to do it first, so we don't publish new sysfs files, then
>> immediately change the labels used in them.
> Sure, I guess you can push them in the right order.
The way the prints are handled has already been unified in 4.18rc1:
https://elixir.bootlin.com/linux/v4.18-rc1/source/drivers/pci/pcie/aer.c#L636
So that patch isn't needed anymore in it's entirety.
>> - IIRC, Tyler's patch had the nice property of changing the strings so
>> each error name had no spaces, which would make it a little easier
>> to parse this sysfs file: each line would be a single identifier
>> followed by a single number (I would probably remove the "=" from
>> the middle).
>
> Will do.
Would you like me to send a patch with just the string changes?
Thanks,
Tyler
>> [1] https://lkml.kernel.org/r/[email protected]
>>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
On Fri, Jun 22, 2018 at 12:38:50PM -0400, Tyler Baicar wrote:
> On 6/21/2018 5:25 PM, Rajat Jain wrote:
> > On Thu, Jun 21, 2018 at 11:48 AM, Bjorn Helgaas <[email protected]> wrote:
> > > [+cc Tyler for AER dmesg decoding]
> > >
> > > - Tyler posted a patch [1] to update those dmesg strings so they match
> > > the way lspci decodes them. I really liked that update, but we
> > > never quite finished it. If we're going to do that, it would be
> > > nice to do it first, so we don't publish new sysfs files, then
> > > immediately change the labels used in them.
> > Sure, I guess you can push them in the right order.
> The way the prints are handled has already been unified in 4.18rc1:
>
> https://elixir.bootlin.com/linux/v4.18-rc1/source/drivers/pci/pcie/aer.c#L636
>
> So that patch isn't needed anymore in it's entirety.
> > > - IIRC, Tyler's patch had the nice property of changing the strings so
> > > each error name had no spaces, which would make it a little easier
> > > to parse this sysfs file: each line would be a single identifier
> > > followed by a single number (I would probably remove the "=" from
> > > the middle).
> >
> > Will do.
> Would you like me to send a patch with just the string changes?
That would be awesome! Sorry, I didn't realize that half of that got
done via another route.
Bjorn