2023-11-02 15:53:13

by Terry Bowman

Subject: [PATCH] cxl/pci: Change CXL AER support check to use native AER

Native CXL protocol errors are delivered to the OS through AER
reporting. The owner of AER owns CXL Protocol error management with
respect to _OSC negotiation.[1] CXL device errors are handled by a
separate interrupt with native control gated by _OSC control field
'CXL Memory Error Reporting Control'.

The CXL driver incorrectly checks for 'CXL Memory Error Reporting
Control' before accessing AER registers and caching RCH downport
AER registers. Replace the current check in these 2 cases with
native AER checks.

[1] CXL 3.0 - 9.17.2 CXL _OSC, Table-9-26, Interpretation of CXL
_OSC Support Fields, p.641

Fixes: 5d2ffbe4b81a ("cxl/port: Store the downstream port's Component Register mappings in struct cxl_dport")
Signed-off-by: Terry Bowman <[email protected]>
Reviewed-by: Smita Koralahalli <[email protected]>
---
drivers/cxl/core/pci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 01c441f2e25e..b29f6d09744b 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -812,7 +812,7 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	 * the root cmd register's interrupts is required. But, PCI spec
 	 * shows these are disabled by default on reset.
 	 */
-	if (bridge->native_cxl_error) {
+	if (bridge->native_aer) {
 		aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
 				PCI_ERR_ROOT_CMD_NONFATAL_EN |
 				PCI_ERR_ROOT_CMD_FATAL_EN);
@@ -828,7 +828,7 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
 	struct pci_host_bridge *host_bridge;
 
 	host_bridge = to_pci_host_bridge(dport_dev);
-	if (host_bridge->native_cxl_error)
+	if (host_bridge->native_aer)
 		dport->rcrb.aer_cap = cxl_rcrb_to_aer(dport_dev, dport->rcrb.base);
 
 	dport->reg_map.host = host;
--
2.34.1


2023-11-02 20:21:06

by Alison Schofield

Subject: Re: [PATCH] cxl/pci: Change CXL AER support check to use native AER

On Thu, Nov 02, 2023 at 10:52:32AM -0500, Terry Bowman wrote:
> Native CXL protocol errors are delivered to the OS through AER
> reporting. The owner of AER owns CXL Protocol error management with
> respect to _OSC negotiation.[1] CXL device errors are handled by a
> separate interrupt with native control gated by _OSC control field
> 'CXL Memory Error Reporting Control'.
>
> The CXL driver incorrectly checks for 'CXL Memory Error Reporting
> Control' before accessing AER registers and caching RCH downport
> AER registers. Replace the current check in these 2 cases with
> native AER checks.

Hi Terry, Does this have a user visible impact?

Alison

>
--snip

2023-11-02 21:10:27

by Dan Williams

Subject: RE: [PATCH] cxl/pci: Change CXL AER support check to use native AER

Terry Bowman wrote:
> Native CXL protocol errors are delivered to the OS through AER
> reporting. The owner of AER owns CXL Protocol error management with
> respect to _OSC negotiation.[1] CXL device errors are handled by a
> separate interrupt with native control gated by _OSC control field
> 'CXL Memory Error Reporting Control'.
>
> The CXL driver incorrectly checks for 'CXL Memory Error Reporting
> Control' before accessing AER registers and caching RCH downport
> AER registers. Replace the current check in these 2 cases with
> native AER checks.
>
> [1] CXL 3.0 - 9.17.2 CXL _OSC, Table-9-26, Interpretation of CXL
> _OSC Support Fields, p.641

Makes sense, applied.

2023-11-02 21:17:33

by Dan Williams

Subject: Re: [PATCH] cxl/pci: Change CXL AER support check to use native AER

Alison Schofield wrote:
> On Thu, Nov 02, 2023 at 10:52:32AM -0500, Terry Bowman wrote:
> > Native CXL protocol errors are delivered to the OS through AER
> > reporting. The owner of AER owns CXL Protocol error management with
> > respect to _OSC negotiation.[1] CXL device errors are handled by a
> > separate interrupt with native control gated by _OSC control field
> > 'CXL Memory Error Reporting Control'.
> >
> > The CXL driver incorrectly checks for 'CXL Memory Error Reporting
> > Control' before accessing AER registers and caching RCH downport
> > AER registers. Replace the current check in these 2 cases with
> > native AER checks.
>
> Hi Terry, Does this have a user visible impact?

Saw this after I applied it. It is good feedback in general.

The reason I did not ask for this clarification was that this is fixing
brand new code and was just using the wrong flag, so I had the context.
A backporter will never need to make a judgement call about this patch.

The end user impact is that CXL protocol errors that could be handled by
AER will not be handled if Linux failed to negotiate memory error
handling. Memory errors are strictly related to memory-error-record
events, not protocol errors.

2023-11-02 21:31:51

by Dan Williams

Subject: Re: [PATCH] cxl/pci: Change CXL AER support check to use native AER

Dan Williams wrote:
> Alison Schofield wrote:
> > On Thu, Nov 02, 2023 at 10:52:32AM -0500, Terry Bowman wrote:
> > > Native CXL protocol errors are delivered to the OS through AER
> > > reporting. The owner of AER owns CXL Protocol error management with
> > > respect to _OSC negotiation.[1] CXL device errors are handled by a
> > > separate interrupt with native control gated by _OSC control field
> > > 'CXL Memory Error Reporting Control'.
> > >
> > > The CXL driver incorrectly checks for 'CXL Memory Error Reporting
> > > Control' before accessing AER registers and caching RCH downport
> > > AER registers. Replace the current check in these 2 cases with
> > > native AER checks.
> >
> > Hi Terry, Does this have a user visible impact?
>
> Saw this after I applied it. It is good feedback in general.
>
> The reason I did not ask for this clarification was that this is fixing
> brand new code and was just using the wrong flag, so I had the context.
> A backporter will never need to make a judgement call about this patch.
>
> The end user impact is that CXL protocol errors that could be handled by
> AER will not be handled if Linux failed to negotiate memory error
> handling. Memory errors are strictly related to memory-error-record
> events, not protocol errors.

However, to that point the "Fixes:" tag looks wrong, it should be:

f05fd10d138d cxl/pci: Add RCH downstream port AER register discovery

2023-11-02 23:31:01

by Terry Bowman

Subject: Re: [PATCH] cxl/pci: Change CXL AER support check to use native AER

Hi Dan and Alison,

On 11/2/23 16:31, Dan Williams wrote:
> Dan Williams wrote:
>> Alison Schofield wrote:
>>> On Thu, Nov 02, 2023 at 10:52:32AM -0500, Terry Bowman wrote:
>>>> Native CXL protocol errors are delivered to the OS through AER
>>>> reporting. The owner of AER owns CXL Protocol error management with
>>>> respect to _OSC negotiation.[1] CXL device errors are handled by a
>>>> separate interrupt with native control gated by _OSC control field
>>>> 'CXL Memory Error Reporting Control'.
>>>>
>>>> The CXL driver incorrectly checks for 'CXL Memory Error Reporting
>>>> Control' before accessing AER registers and caching RCH downport
>>>> AER registers. Replace the current check in these 2 cases with
>>>> native AER checks.
>>>
>>> Hi Terry, Does this have a user visible impact?
>>
>> Saw this after I applied it. It is good feedback in general.
>>
>> The reason I did not ask for this clarification was that this is fixing
>> brand new code and was just using the wrong flag, so I had the context.
>> A backporter will never need to make a judgement call about this patch.
>>
>> The end user impact is that CXL protocol errors that could be handled by
>> AER will not be handled if Linux failed to negotiate memory error
>> handling. Memory errors are strictly related to memory-error-record
>> events, not protocol errors.
>
Right, the end-user impact is that, without this fix, RCH error handling
requires native memory error/event _OSC control in order for protocol
errors to be logged.

> However, to that point the "Fixes:" tag looks wrong, it should be:
>
> f05fd10d138d cxl/pci: Add RCH downstream port AER register discovery

Correct, it is f05fd10d138d.

Regards,
Terry