2023-07-21 22:10:28

by Smita Koralahalli

Subject: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <[email protected]>
---
v2:
Added fixes tag.
Included what the patch fixes in commit message.
---
drivers/cxl/pci.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
 		return 0;
 	}

-	/* BIOS has CXL error control */
-	if (!host_bridge->native_cxl_error)
-		return -ENXIO;
+	/* BIOS has PCIe AER error control */
+	if (!host_bridge->native_aer)
+		return 0;

 	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
 	if (rc)
--
2.17.1
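
The gating change in the patch above can be sketched as a standalone C model. This is not kernel code: the struct, the enum, and the function names are hypothetical and exist only for illustration. Only the two flag names (native_aer, native_cxl_error) and the two return behaviours (-ENXIO before, 0 after) are taken from the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two _OSC-derived flags named in the diff. */
struct osc_grants {
	bool native_aer;       /* OS was granted PCIe AER control */
	bool native_cxl_error; /* OS was granted CXL Memory Error Reporting control */
};

/* Hypothetical outcome of the early check in the RAS unmask path. */
enum gate_result {
	GATE_UNMASK,   /* check passed, unmasking would proceed */
	GATE_SKIP_OK,  /* skipped, models "return 0" */
	GATE_SKIP_ERR, /* skipped, models "return -ENXIO" */
};

/* Before the patch: keyed on CXL Memory Error Reporting control. */
static enum gate_result ras_gate_before(const struct osc_grants *g)
{
	if (!g->native_cxl_error)
		return GATE_SKIP_ERR;
	return GATE_UNMASK;
}

/* After the patch: keyed on AER control only. */
static enum gate_result ras_gate_after(const struct osc_grants *g)
{
	if (!g->native_aer)
		return GATE_SKIP_OK;
	return GATE_UNMASK;
}
```

The case the commit message describes is native AER combined with FW-First CXL Memory Error Reporting: in this model the old gate skips unmasking, while the new gate lets it proceed.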



Subject: Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers



On 7/21/23 2:47 PM, Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
>
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
>
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.
>
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <[email protected]>
> ---
> v2:
> Added fixes tag.
> Included what the patch fixes in commit message.
> ---
> drivers/cxl/pci.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> return 0;
> }
>
> - /* BIOS has CXL error control */
> - if (!host_bridge->native_cxl_error)
> - return -ENXIO;
> + /* BIOS has PCIe AER error control */
> + if (!host_bridge->native_aer)
> + return 0;

Why not directly use pcie_aer_is_native() here?

>
> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
> if (rc)

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2023-07-24 12:16:33

by Robert Richter

Subject: Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

On 21.07.23 21:47:38, Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
>
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
>
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.
>
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <[email protected]>

Reviewed-by: Robert Richter <[email protected]>

> ---
> v2:
> Added fixes tag.
> Included what the patch fixes in commit message.
> ---
> drivers/cxl/pci.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> return 0;
> }
>
> - /* BIOS has CXL error control */
> - if (!host_bridge->native_cxl_error)
> - return -ENXIO;
> + /* BIOS has PCIe AER error control */
> + if (!host_bridge->native_aer)
> + return 0;
>
> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
> if (rc)
> --
> 2.17.1
>

2023-07-24 23:32:59

by Smita Koralahalli

Subject: Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers


>> - /* BIOS has CXL error control */
>> - if (!host_bridge->native_cxl_error)
>> - return -ENXIO;
>> + /* BIOS has PCIe AER error control */
>> + if (!host_bridge->native_aer)
>> + return 0;
>
> Why not directly use pcie_aer_is_native() here?
Yeah, this was in my v1, but I changed it per Robert's comments so that
the fix applies cleanly in automated backports.

https://lore.kernel.org/all/[email protected]/

Please advise.

Thanks,
Smita
>
>>
>> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>> if (rc)
>


2023-08-08 16:26:59

by Dan Williams

Subject: Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Smita Koralahalli wrote:
>
> >> - /* BIOS has CXL error control */
> >> - if (!host_bridge->native_cxl_error)
> >> - return -ENXIO;
> >> + /* BIOS has PCIe AER error control */
> >> + if (!host_bridge->native_aer)
> >> + return 0;
> >
> > Why not directly use pcie_aer_is_native() here?
> Yeah, this was in my v1, but I changed it per Robert's comments so that
> the fix applies cleanly in automated backports.
>
> https://lore.kernel.org/all/[email protected]/
>
> Please advise.

Keep it the way you have it. Minimizing the backport is the right call.

2023-08-08 17:30:50

by Dan Williams

Subject: RE: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
>
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
>
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.

I feel like this could be said more succinctly and with an indication of
what the end user should expect to see. Something like:

"cxl_pci fails to unmask CXL protocol errors when CXL memory error
reporting is not granted native control. Given that CXL memory error
reporting uses the event interface and protocol errors use AER, unmask
protocol errors based only on the native AER setting. Without this
change end user deployments will fail to report protocol errors in the
case where native memory error handling is not granted to Linux."

>
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <[email protected]>
> ---
> v2:
> Added fixes tag.
> Included what the patch fixes in commit message.
> ---
> drivers/cxl/pci.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
> return 0;
> }
>
> - /* BIOS has CXL error control */
> - if (!host_bridge->native_cxl_error)
> - return -ENXIO;
> + /* BIOS has PCIe AER error control */
> + if (!host_bridge->native_aer)
> + return 0;

The error code does not matter here, and changing it makes the patch
noisier than it needs to be. So just leave it as:

return -ENXIO;

>
> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
> if (rc)
> --
> 2.17.1
>




2023-08-08 22:01:42

by Smita Koralahalli

Subject: Re: [PATCH v2 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

On 8/7/2023 8:17 PM, Dan Williams wrote:
> Smita Koralahalli wrote:
>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>> of AER should also own CXL Protocol Error Management as there is no
>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>> reported on Protocol errors should check for AER _OSC rather than CXL
>> Memory Error Reporting Control _OSC.
>>
>> The CXL Memory Error Reporting Control _OSC specifically highlights
>> handling Memory Error Logging and Signaling Enhancements. These kinds of
>> errors are reported through a device's mailbox and can be managed
>> independently from CXL Protocol Errors.
>>
>> This change fixes handling and reporting CXL Protocol Errors and RAS
>> registers natively with native AER and FW-First CXL Memory Error Reporting
>> Control.
>
> I feel like this could be said more succinctly and with an indication of
> what the end user should expect to see. Something like:
>
> "cxl_pci fails to unmask CXL protocol errors when CXL memory error
> reporting is not granted native control. Given that CXL memory error
> reporting uses the event interface and protocol errors use AER, unmask
> protocol errors based only on the native AER setting. Without this
> change end user deployments will fail to report protocol errors in the
> case where native memory error handling is not granted to Linux."

Sure, will make the change for a clearer description. Thanks!
>
>>
>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>
>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>> Signed-off-by: Smita Koralahalli <[email protected]>
>> ---
>> v2:
>> Added fixes tag.
>> Included what the patch fixes in commit message.
>> ---
>> drivers/cxl/pci.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index 1cb1494c28fe..2323169b6e5f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>> return 0;
>> }
>>
>> - /* BIOS has CXL error control */
>> - if (!host_bridge->native_cxl_error)
>> - return -ENXIO;
>> + /* BIOS has PCIe AER error control */
>> + if (!host_bridge->native_aer)
>> + return 0;
>
> The error code does not matter here, and changing it makes the patch
> noisier than it needs to be. So just leave it as:

Doing this will return an error from cxl_pci probe thereby failing the
device node creation in FW-First AER/DPC. I cannot think of other places
where we reference the device node in FW-First mode but I have a place
where this could potentially be a roadblock.

I'm trying to add trace events support for FW-First Protocol Errors.
https://lore.kernel.org/linux-cxl/[email protected]/T/#mcaf8a78c1295372ab811be7e1ccb6a8a4d99f3e9

We already have trace_cxl_aer_correctable_error(), and a similar
function for uncorrectable errors, for native protocol error reporting.
I was trying to reuse the same functions for FW-First as well. They
reference the CXL memory device node, which will be NULL in FW-First
if this returns an error.

I don't mind having a separate trace event function for FW-First mode, as
it would simplify things, especially when dealing with RCH DP. But there
may be other places where we might reference this device node
in FW-First. Please advise.

Thanks,
Smita

>
> return -ENXIO;
>
>>
>> rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>> if (rc)
>> --
>> 2.17.1
>>
>
>
>