2021-10-04 21:47:21

by Naveen Naidu

Subject: [PATCH v3 0/8] Fix long standing AER Error Handling Issues

This patch series fixes some of the long-standing AER error handling
issues we have.

Currently we have the following issues:
- Confusing message in aer_print_error()
- aer_err_info is not completely initialized in the DPC path before
  we print the AER logs
- A bug [1] in the clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The primary aim of this patch series is to converge the APEI and the
native AER error handling paths. The current code has two different
behaviours for the same functionality, most notably in how the AER
registers are cleared.

This patch series tries to bring the same semantics, and hence more
commonality, to the APEI part of the code and the native OS handling
of AER errors.

PATCH 1:
- Fixes the first issue

PATCH 2 - 4:
- Fixes the second issue
- "Patch 3/8" depends on "Patch 2/8" in the series

PATCH 5 - 7:
- Converges the various paths and brings more commonality
  between them
- "Patch 6/8" depends on "Patch 1/8"

PATCH 8:
- Adds extra information in AER error logs.

Thanks,
Naveen Naidu

Changelog
=========

v3:
- Fix up mail formatting and resend the patches again.
Really sorry for all the spam. I messed up in the first try and
instead of fixing it well in v2, I messed up again. I have fixed
everything now. Apologies for the inconvenience caused. I'll make
sure to not repeat it again.

v2:
- Apologies for the mistake, I forgot to cc the linux-pci mailing
  list. Resent the email with linux-pci in cc.

Naveen Naidu (8):
[PATCH v3 1/8] PCI/AER: Remove ID from aer_agent_string[]
[PATCH v3 2/8] PCI: Cleanup struct aer_err_info
[PATCH v3 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
[PATCH v3 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
[PATCH v3 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
[PATCH v3 6/8] PCI/AER: Clear error device AER registers in aer_irq()
[PATCH v3 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
[PATCH v3 8/8] PCI/AER: Include DEVCTL in aer_print_error()

drivers/pci/pci.h | 23 +++-
drivers/pci/pcie/aer.c | 265 ++++++++++++++++++++++++++++-------------
drivers/pci/pcie/dpc.c | 9 +-
drivers/pci/pcie/err.c | 9 +-
4 files changed, 207 insertions(+), 99 deletions(-)

--
2.25.1


2021-10-04 22:23:32

by Naveen Naidu

Subject: [PATCH v3 8/8] PCI/AER: Include DEVCTL in aer_print_error()

Print the contents of the Device Control Register of the device that
detected the error. This might help diagnose errors faster.

Sample output from a dummy error injected by aer-inject:

pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f
pcieport 0000:00:03.0: [ 6] BadTLP
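
To make the new field easier to read in a log, here is a minimal user-space sketch of how the low four bits of devctl could be decoded. It is an illustration only and not part of this patch: decode_devctl() is a hypothetical helper, and the bit values are assumed to match the PCI_EXP_DEVCTL_* error-reporting enables in include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Error-reporting enable bits of the PCIe Device Control register.
 * Redefined here so the sketch builds outside the kernel tree; the values
 * are assumed to match include/uapi/linux/pci_regs.h.
 */
#define PCI_EXP_DEVCTL_CERE	0x0001	/* Correctable Error Reporting En. */
#define PCI_EXP_DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE	0x0008	/* Unsupported Request Reporting En. */

/*
 * Hypothetical helper (not in the patch): write the names of the
 * error-reporting bits that are set in devctl into buf, space separated.
 * buf is left as an empty string when none of the four bits is set.
 */
static void decode_devctl(uint16_t devctl, char *buf, size_t len)
{
	static const struct {
		uint16_t bit;
		const char *name;
	} bits[] = {
		{ PCI_EXP_DEVCTL_CERE,  "CERE"  },
		{ PCI_EXP_DEVCTL_NFERE, "NFERE" },
		{ PCI_EXP_DEVCTL_FERE,  "FERE"  },
		{ PCI_EXP_DEVCTL_URRE,  "URRE"  },
	};
	size_t used = 0;
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		int n;

		if (!(devctl & bits[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer too small, stop decoding */
		used += (size_t)n;
	}
}
```

For the sample above, devctl=0x000f has bits 0 to 3 set, so the helper would yield "CERE NFERE FERE URRE": all four error-reporting enables are on for this port.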

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aer.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
 	u32 status;			/* COR/UNCOR Error Status */
 	u32 mask;			/* COR/UNCOR Error Mask */
 	struct aer_header_log_regs tlp;	/* TLP Header */
+
+	u16 devctl;
 };

 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 91f91d6ab052..42cae01b6887 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		   aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);

-	pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
-		   dev->vendor, dev->device, info->status, info->mask);
+	pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+		   dev->vendor, dev->device, info->status, info->mask, info->devctl);

 	__aer_print_error(dev, info);

@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	if (!aer)
 		return 0;

+	/*
+	 * Cache the value of Device Control Register now, because later the
+	 * device might not be available
+	 */
+	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
 	if (info->severity == AER_CORRECTABLE) {
 		pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
 			&info->status);
--
2.25.1