2021-10-05 17:20:58

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 0/8] Fix long standing AER Error Handling Issues

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues:
- Confusing message in aer_print_error()
- aer_err_info not being initialized completely in DPC path before
we print the AER logs
- A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The primary aim of this patch series is to converge the APEI path and the
native AER error handling paths. In our current code, we find that we
have two different behaviours (especially when it comes to clearing of
the AER registers) for the same functionality.

This patch series, tries to bring the same semantics and hence more
commonanlity between the APEI part of code and the native OS
handling of AER errors.

PATCH 1:
- Fixes the first issue

PATCH 2 - 4:
- Fixes the second issue
- "Patch 3/8" is dependent on "Patch 2/8" in the series

PATCH 5 - 7
- Deals with converging the various paths and brings more
commonality between them
- "Patch 6/8" depends on "Patch 1/8"

PATCH 8:
- Adds extra information in AER error logs.

Thanks,
Naveen Naidu

Changelog
=========

v4:
- Implement review comments
- Make "Patch 1/8" commit message more meaningful
- Fix the code comment error detected by kernel test robot
in "Patch 6/8"

v2 and v3:
- Fix up mail formatting and include the appropriate receipients for
the patch.

Naveen Naidu (8):
[PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]
[PATCH v4 2/8] PCI: Cleanup struct aer_err_info
[PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()
[PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()
[PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
[PATCH v4 6/8] PCI/AER: Clear error device AER registers in aer_irq()
[PATCH v4 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()
[PATCH v4 8/8] PCI/AER: Include DEVCTL in aer_print_error()

drivers/pci/pci.h | 23 +++-
drivers/pci/pcie/aer.c | 269 ++++++++++++++++++++++++++++-------------
drivers/pci/pcie/dpc.c | 9 +-
drivers/pci/pcie/err.c | 9 +-
4 files changed, 209 insertions(+), 101 deletions(-)

--
2.25.1


2021-10-05 17:21:17

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 2/8] PCI: Cleanup struct aer_err_info

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
- id: 16 bits - Represents the Error Source Requester ID
- status: 32 bits - COR/UNCOR Error Status
- mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pci.h | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
int error_dev_num;

- unsigned int id:16;
+ u16 id;

unsigned int severity:2; /* 0:NONFATAL | 1:FATAL | 2:COR */
- unsigned int __pad1:5;
unsigned int multi_error_valid:1;

unsigned int first_error:5;
- unsigned int __pad2:2;
unsigned int tlp_header_valid:1;

- unsigned int status; /* COR/UNCOR Error Status */
- unsigned int mask; /* COR/UNCOR Error Mask */
+ u32 status; /* COR/UNCOR Error Status */
+ u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
};

--
2.25.1

2021-10-05 17:23:07

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

dpc_process_error() clears both AER fatal and non fatal status
registers. Instead of clearing each status registers via a different
function call use pci_aer_clear_status().

This helps clean up the code a bit.

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pcie/dpc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df3f3a10f8bc..faf4a1e77fab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_nonfatal_status(pdev);
- pci_aer_clear_fatal_status(pdev);
+ pci_aer_clear_status(pdev);
}
}

--
2.25.1

2021-10-05 17:23:14

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

In the EDR path, AER registers are cleared *after* DPC error event is
processed. The process stack in EDR is:

edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status()
pcie_do_recovery()

But in DPC path, AER status registers are cleared *while* processing
the error. The process stack in DPC is:

dpc_handler()
dpc_process_error()
pci_aer_clear_status()
pcie_do_recovery()

In EDR path, AER status registers are cleared irrespective of whether
the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
AER status registers are cleared only when it's an unmasked uncorrectable
error.

This leads to two different behaviours for the same task (handling of
DPC errors) in FFS systems and when native OS has control.

Bring the same semantics for clearing the AER status register in EDR
path and DPC path.

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pcie/dpc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index faf4a1e77fab..68899a3db126 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
aer_print_error(pdev, &info);
- pci_aer_clear_status(pdev);
}
}

@@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
struct pci_dev *pdev = context;

dpc_process_error(pdev);
+ pci_aer_clear_status(pdev);

/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
--
2.25.1

2021-10-05 17:23:21

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery()

pcie_do_recovery() is shared across the following paths:
- ACPI APEI
- Native AER path
- EDR
- DPC

ACPI APEI
==========

ghes_handle_aer()
aer_recover_queue()
kfifo_in_spinlocked(aer_recover_ring)

aer_recover_work_func()
while (kfifo_get(aer_recover_ring))
pcie_do_recovery()

In this path the system firmware clears the AER registers before
handing off the record to the OS in ghes_handle_aer()

Native AER
==========

aer_irq()
aer_add_err_devices_to_queue()
kfifo_put(&rpc->aer_fifo, *e_dev)
clear_error_source_aer_registers() <---- AER registers are cleard

aer_isr()
aer_isr_one_error()
handle_error_source()
pcie_do_recovery()

The AER registers are cleared during the handling of IRQ, i.e before we
the recovery starts.

DPC
=====

dpc_handler()
dpc_process_error()
pci_aer_clear_status() <---- AER registers are cleared
pcie_do_recovery()

EDR
====

edr_handle_event()
dpc_process_error()
pci_aer_raw_clear_status() <---- AER registers are cleared
pcie_do_recovery()

In all the above paths, the AER registers are cleared before
pcie_do_recovery(). The non fatal status AER registers are again cleared
in pcie_do_recovery(). This is redundant.

Remove redundant clearing of AER register in pcie_do_recovery()

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pcie/err.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b576aa890c76..fe04b0ae22f4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -231,14 +231,11 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,

/*
* If we have native control of AER, clear error status in the device
- * that detected the error. If the platform retained control of AER,
- * it is responsible for clearing this status. In that case, the
- * signaling device may not even be visible to the OS.
+ * that detected the error.
*/
- if (host->native_aer || pcie_ports_native) {
+ if (host->native_aer || pcie_ports_native)
pcie_clear_device_status(dev);
- pci_aer_clear_nonfatal_status(dev);
- }
+
pci_info(bridge, "device recovery successful\n");
return status;

--
2.25.1

2021-10-05 17:24:33

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

In the dpc_process_error() path, info->id isn't initialized before being
passed to aer_print_error(). In the corresponding AER path, it is
initialized in aer_isr_one_error().

The error message shown during Coverity Scan is:

Coverity #1461602
CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
8. uninit_use_in_call: Using uninitialized value info.id when calling aer_print_error.

Initialize the "info->id" before passing it to aer_print_error()

Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pcie/dpc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..df3f3a10f8bc 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,

void dpc_process_error(struct pci_dev *pdev)
{
- u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
+ u16 cap = pdev->dpc_cap, status, reason, ext_reason;
struct aer_err_info info;

pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
- pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
+ pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);

pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
- status, source);
+ status, info.id);

reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
--
2.25.1

2021-10-05 17:24:55

by Naveen Naidu

[permalink] [raw]
Subject: [PATCH v4 8/8] PCI/AER: Include DEVCTL in aer_print_error()

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

Sample output from dummy error injected by aer-inject:

pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
pcieport 0000:00:03.0: device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f <-- devctl added to the error log
pcieport 0000:00:03.0: [ 6] BadTLP

Signed-off-by: Naveen Naidu <[email protected]>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aer.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
u32 status; /* COR/UNCOR Error Status */
u32 mask; /* COR/UNCOR Error Mask */
struct aer_header_log_regs tlp; /* TLP Header */
+
+ u16 devctl;
};

/* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3937f5384e4..fdeef9deb016 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);

- pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device, info->status, info->mask);
+ pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+ dev->vendor, dev->device, info->status, info->mask, info->devctl);

__aer_print_error(dev, info);

@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
if (!aer)
return 0;

+ /*
+ * Cache the value of Device Control Register now, because later the
+ * device might not be available
+ */
+ pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
if (info->severity == AER_CORRECTABLE) {
pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
&info->status);
--
2.25.1

2021-10-21 01:30:37

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

On Tue, Oct 05, 2021 at 10:48:10PM +0530, Naveen Naidu wrote:
> In the dpc_process_error() path, info->id isn't initialized before being
> passed to aer_print_error(). In the corresponding AER path, it is
> initialized in aer_isr_one_error().
>
> The error message shown during Coverity Scan is:
>
> Coverity #1461602
> CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
> 8. uninit_use_in_call: Using uninitialized value info.id when calling aer_print_error.
>
> Initialize the "info->id" before passing it to aer_print_error()
>
> Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
> Signed-off-by: Naveen Naidu <[email protected]>
> ---
> drivers/pci/pcie/dpc.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index c556e7beafe3..df3f3a10f8bc 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>
> void dpc_process_error(struct pci_dev *pdev)
> {
> - u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> + u16 cap = pdev->dpc_cap, status, reason, ext_reason;
> struct aer_err_info info;
>
> pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> - pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> + pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
>
> pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> - status, source);
> + status, info.id);
>
> reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;

Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
Trigger Reason indicates ERR_NONFATAL or ERR_FATAL. So I think we
need to extract this reason before reading PCI_EXP_DPC_SOURCE_ID,
e.g.,

reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
if (reason == 1 || reason == 2)
pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
else
info.id = 0;

> ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
> --
> 2.25.1
>
> _______________________________________________
> Linux-kernel-mentees mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 01:44:53

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> dpc_process_error() clears both AER fatal and non fatal status
> registers. Instead of clearing each status registers via a different
> function call use pci_aer_clear_status().
>
> This helps clean up the code a bit.
>
> Signed-off-by: Naveen Naidu <[email protected]>
> ---
> drivers/pci/pcie/dpc.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index df3f3a10f8bc..faf4a1e77fab 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> dpc_get_aer_uncorrect_severity(pdev, &info) &&
> aer_get_device_error_info(pdev, &info)) {
> aer_print_error(pdev, &info);
> - pci_aer_clear_nonfatal_status(pdev);
> - pci_aer_clear_fatal_status(pdev);
> + pci_aer_clear_status(pdev);

The commit log suggests that this is a simple cleanup that doesn't
change any behavior, but that's not quite true:

- The new code would clear PCI_ERR_ROOT_STATUS, but the old code
does not.

- The old code masks the status bits with the severity bits before
clearing, but the new code does not.

The commit log needs to show why these changes are what we want.

> }
> }
>
> --
> 2.25.1
>
> _______________________________________________
> Linux-kernel-mentees mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 02:11:24

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

[+cc Keith, Sinan, Oza]

On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> In the EDR path, AER registers are cleared *after* DPC error event is
> processed. The process stack in EDR is:
>
> edr_handle_event()
> dpc_process_error()
> pci_aer_raw_clear_status()
> pcie_do_recovery()
>
> But in DPC path, AER status registers are cleared *while* processing
> the error. The process stack in DPC is:
>
> dpc_handler()
> dpc_process_error()
> pci_aer_clear_status()
> pcie_do_recovery()

These are accurate but they both include dpc_process_error(), so we
need a hint to show why the one here is different from the one in the
EDR path, e.g.,

dpc_handler
dpc_process_error
if (reason == 0)
pci_aer_clear_status # uncorrectable errors only
pcie_do_recovery

> In EDR path, AER status registers are cleared irrespective of whether
> the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> AER status registers are cleared only when it's an unmasked uncorrectable
> error.
>
> This leads to two different behaviours for the same task (handling of
> DPC errors) in FFS systems and when native OS has control.

FFS?

I'd really like to have a specific example of how a user would observe
this difference. I know you probably don't have two systems to
compare like that, but maybe we can work it out manually.

I guess you're saying the problem is in the native DPC handling, and
we don't clear the AER status registers for ERR_NONFATAL,
ERR_NONFATAL, etc., right?

I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
status in DPC event handling"), where Keith explicitly mentions those
cases. The commit log here should connect back to that and explain
whether something has changed.

I cc'd Keith and the reviewers of that change in case any of them have
time to dig into this again.

> Bring the same semantics for clearing the AER status register in EDR
> path and DPC path.
>
> Signed-off-by: Naveen Naidu <[email protected]>
> ---
> drivers/pci/pcie/dpc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index faf4a1e77fab..68899a3db126 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
> dpc_get_aer_uncorrect_severity(pdev, &info) &&
> aer_get_device_error_info(pdev, &info)) {
> aer_print_error(pdev, &info);
> - pci_aer_clear_status(pdev);
> }
> }
>
> @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
> struct pci_dev *pdev = context;
>
> dpc_process_error(pdev);
> + pci_aer_clear_status(pdev);
>
> /* We configure DPC so it only triggers on ERR_FATAL */
> pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> --
> 2.25.1
>
> _______________________________________________
> Linux-kernel-mentees mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 16:35:29

by Naveen Naidu

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:10PM +0530, Naveen Naidu wrote:
> > In the dpc_process_error() path, info->id isn't initialized before being
> > passed to aer_print_error(). In the corresponding AER path, it is
> > initialized in aer_isr_one_error().
> >
> > The error message shown during Coverity Scan is:
> >
> > Coverity #1461602
> > CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
> > 8. uninit_use_in_call: Using uninitialized value info.id when calling aer_print_error.
> >
> > Initialize the "info->id" before passing it to aer_print_error()
> >
> > Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
> > Signed-off-by: Naveen Naidu <[email protected]>
> > ---
> > drivers/pci/pcie/dpc.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index c556e7beafe3..df3f3a10f8bc 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >
> > void dpc_process_error(struct pci_dev *pdev)
> > {
> > - u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> > + u16 cap = pdev->dpc_cap, status, reason, ext_reason;
> > struct aer_err_info info;
> >
> > pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> > - pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
> >
> > pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> > - status, source);
> > + status, info.id);
> >
> > reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
>
> Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
> Trigger Reason indicates ERR_NONFATAL or ERR_FATAL. So I think we
> need to extract this reason before reading PCI_EXP_DPC_SOURCE_ID,
> e.g.,
>
> reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
> if (reason == 1 || reason == 2)
> pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
> else
> info.id = 0;
>

Thank you for the review, I'll make this change when I send a v5 for the
patch series.

> > ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
> > --
> > 2.25.1
> >
> > _______________________________________________
> > Linux-kernel-mentees mailing list
> > [email protected]
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 16:39:29

by Naveen Naidu

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> > dpc_process_error() clears both AER fatal and non fatal status
> > registers. Instead of clearing each status registers via a different
> > function call use pci_aer_clear_status().
> >
> > This helps clean up the code a bit.
> >
> > Signed-off-by: Naveen Naidu <[email protected]>
> > ---
> > drivers/pci/pcie/dpc.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index df3f3a10f8bc..faf4a1e77fab 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> > dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > aer_get_device_error_info(pdev, &info)) {
> > aer_print_error(pdev, &info);
> > - pci_aer_clear_nonfatal_status(pdev);
> > - pci_aer_clear_fatal_status(pdev);
> > + pci_aer_clear_status(pdev);
>
> The commit log suggests that this is a simple cleanup that doesn't
> change any behavior, but that's not quite true:
>
> - The new code would clear PCI_ERR_ROOT_STATUS, but the old code
> does not.
>
> - The old code masks the status bits with the severity bits before
> clearing, but the new code does not.
>
> The commit log needs to show why these changes are what we want.
>

Reading through the code again, I realize how wrong(stupid) I was when
making this patch. I was thinking that:

pci_aer_clear_status() = pci_aer_clear_fatal_status() + pci_aer_clear_nonfatal_status()

Now I understand, that it is not at all the case. I apologize for the
mistake. I'll make sure to be meticulous while reading functions and not
just assume their behaviour just from their function names.

I'll drop this patch in the next version of the patch series I make.

Apologies again ^^'

> > }
> > }
> >
> > --
> > 2.25.1
> >
> > _______________________________________________
> > Linux-kernel-mentees mailing list
> > [email protected]
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 16:55:22

by Naveen Naidu

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

On 20/10, Bjorn Helgaas wrote:
> [+cc Keith, Sinan, Oza]
>
> On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> > In the EDR path, AER registers are cleared *after* DPC error event is
> > processed. The process stack in EDR is:
> >
> > edr_handle_event()
> > dpc_process_error()
> > pci_aer_raw_clear_status()
> > pcie_do_recovery()
> >
> > But in DPC path, AER status registers are cleared *while* processing
> > the error. The process stack in DPC is:
> >
> > dpc_handler()
> > dpc_process_error()
> > pci_aer_clear_status()
> > pcie_do_recovery()
>
> These are accurate but they both include dpc_process_error(), so we
> need a hint to show why the one here is different from the one in the
> EDR path, e.g.,
>
> dpc_handler
> dpc_process_error
> if (reason == 0)
> pci_aer_clear_status # uncorrectable errors only
> pcie_do_recovery
>
> > In EDR path, AER status registers are cleared irrespective of whether
> > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> > AER status registers are cleared only when it's an unmasked uncorrectable
> > error.
> >
> > This leads to two different behaviours for the same task (handling of
> > DPC errors) in FFS systems and when native OS has control.
>
> FFS?
>

Firmware First Systems

> I'd really like to have a specific example of how a user would observe
> this difference. I know you probably don't have two systems to
> compare like that, but maybe we can work it out manually.
>

Apologies again! Reading through the code again and the specification, I
realize that my understanding was very incorrect at the time of making
this patch. I grossly oversimplified EDR and DPC when I was learning
about it.

I'll drop this patch when I send the v5 for the series.

Apologies again ^^'

> I guess you're saying the problem is in the native DPC handling, and
> we don't clear the AER status registers for ERR_NONFATAL,
> ERR_NONFATAL, etc., right?
>

But yes, I did have this question though (I wasn't able to find the
answers to it when reading the spec). Why do we not clear the entire
ERR_NONFATAL and ERR_FATAL registers in the DPC path just like EDR does
using the pci_aer_raw_clear_status() before going to pcie_do_recovery()

I am sure I might have missed something in the spec. I guess I'll
look/re-read these bits again.

Thanks for the review :)

> I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
> status in DPC event handling"), where Keith explicitly mentions those
> cases. The commit log here should connect back to that and explain
> whether something has changed.
>
> I cc'd Keith and the reviewers of that change in case any of them have
> time to dig into this again.
>
> > Bring the same semantics for clearing the AER status register in EDR
> > path and DPC path.
> >
> > Signed-off-by: Naveen Naidu <[email protected]>
> > ---
> > drivers/pci/pcie/dpc.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index faf4a1e77fab..68899a3db126 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
> > dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > aer_get_device_error_info(pdev, &info)) {
> > aer_print_error(pdev, &info);
> > - pci_aer_clear_status(pdev);
> > }
> > }
> >
> > @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
> > struct pci_dev *pdev = context;
> >
> > dpc_process_error(pdev);
> > + pci_aer_clear_status(pdev);
> >
> > /* We configure DPC so it only triggers on ERR_FATAL */
> > pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> > --
> > 2.25.1
> >
> > _______________________________________________
> > Linux-kernel-mentees mailing list
> > [email protected]
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

2021-10-21 17:07:52

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

On Thu, Oct 21, 2021 at 10:06:11PM +0530, Naveen Naidu wrote:
> On 20/10, Bjorn Helgaas wrote:
> > On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> > > dpc_process_error() clears both AER fatal and non fatal status
> > > registers. Instead of clearing each status registers via a different
> > > function call use pci_aer_clear_status().
> > >
> > > This helps clean up the code a bit.
> > >
> > > Signed-off-by: Naveen Naidu <[email protected]>
> > > ---
> > > drivers/pci/pcie/dpc.c | 3 +--
> > > 1 file changed, 1 insertion(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > > index df3f3a10f8bc..faf4a1e77fab 100644
> > > --- a/drivers/pci/pcie/dpc.c
> > > +++ b/drivers/pci/pcie/dpc.c
> > > @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> > > dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > > aer_get_device_error_info(pdev, &info)) {
> > > aer_print_error(pdev, &info);
> > > - pci_aer_clear_nonfatal_status(pdev);
> > > - pci_aer_clear_fatal_status(pdev);
> > > + pci_aer_clear_status(pdev);
> >
> > The commit log suggests that this is a simple cleanup that doesn't
> > change any behavior, but that's not quite true:
> >
> > - The new code would clear PCI_ERR_ROOT_STATUS, but the old code
> > does not.
> >
> > - The old code masks the status bits with the severity bits before
> > clearing, but the new code does not.
> >
> > The commit log needs to show why these changes are what we want.
> >
>
> Reading through the code again, I realize how wrong(stupid) I was when
> making this patch. I was thinking that:
>
> pci_aer_clear_status() = pci_aer_clear_fatal_status() + pci_aer_clear_nonfatal_status()
>
> Now I understand, that it is not at all the case. I apologize for the
> mistake. I'll make sure to be meticulous while reading functions and not
> just assume their behaviour just from their function names.

No problem, one could argue that the collection of pci_aer_clear_*()
functions that do slightly different things is itself a defect.

2021-10-21 17:14:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

On Thu, Oct 21, 2021 at 10:23:30PM +0530, Naveen Naidu wrote:
> On 20/10, Bjorn Helgaas wrote:
> > On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:

> > > In EDR path, AER status registers are cleared irrespective of whether
> > > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> > > AER status registers are cleared only when it's an unmasked uncorrectable
> > > error.
> > >
> > > This leads to two different behaviours for the same task (handling of
> > > DPC errors) in FFS systems and when native OS has control.
> >
> > FFS?
>
> Firmware First Systems

I assumed that's what it was, but it's helpful to use the same terms
used by the specs to make things easier to find. I don't think it's
actually the case that "Firmware First" necessary applies to the
entire system, since the ACPI FIRMWARE_FIRST flag is a per-error
source thing, not a per-system thing.