2023-05-24 02:12:42

by Terry Bowman

On 01.06.23 13:59:31, Jonathan Cameron wrote:
> On Tue, 23 May 2023 18:22:00 -0500
> Terry Bowman <[email protected]> wrote:
>
> > From: Robert Richter <[email protected]>
> >
> > CXL RAS capabilities must be enabled and accessible as soon as the CXL
> > endpoint is detected in the PCI hierarchy and bound to the cxl_pci
> > driver. This needs to be independent of other modules such as cxl_port
> > or cxl_mem.
> >
> > CXL RAS capabilities reside in the Component Registers. For an RCH
> > this is determined by probing RCRB which is implemented very late once
> > the CXL Memory Device is created.
> >
> > Change this by moving the RCRB probe to the cxl_pci driver. Do this by
> > using a newly introduced function cxl_pci_find_port() similar to
> > cxl_mem_find_port() to determine the involved dport by the endpoint's
> > PCI handle. Plug this into the existing cxl_pci_setup_regs() function
> > to set up Component Registers. Probe the RCRB in case the Component
> > Registers cannot be located through the CXL Register Locator
> > capability.
> >
> > This unifies the code and sets up the Component Registers early, at
> > the same point for both VH and RCH mode. Only the cxl_pci driver is
> > involved. This allows early mapping of the CXL RAS capability
> > registers.
> >
> > Signed-off-by: Robert Richter <[email protected]>
> > Signed-off-by: Terry Bowman <[email protected]>
>
> One minor wording suggestion inline. I don't really care
> that much about it though, so.
>
> Reviewed-by: Jonathan Cameron <[email protected]>
>
>
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 945ca0304d68..54c486cd65dd 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -274,13 +274,48 @@ static int cxl_pci_setup_mailbox(struct cxl_dev_state *cxlds)
> > return 0;
> > }
> >
> > +/* Extract RCRB, use same function interface as cxl_find_regblock(). */
> > +static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
> > +				  enum cxl_regloc_type type,
> > +				  struct cxl_register_map *map)
> > +{
> > +	struct cxl_dport *dport;
> > +	resource_size_t component_reg_phys;
> > +
> > +	memset(map, 0, sizeof(*map));
> > +	map->dev = &pdev->dev;
> > +	map->resource = CXL_RESOURCE_NONE;
> > +
> > +	if (type != CXL_REGLOC_RBI_COMPONENT)
> > +		return -ENODEV;
> > +
> > +	if (!cxl_pci_find_port(pdev, &dport) || !dport->rch)
> > +		return -ENXIO;
> > +
> > +	component_reg_phys = cxl_probe_rcrb(&pdev->dev, dport->rcrb.base,
> > +					    NULL, CXL_RCRB_UPSTREAM);
> > +	if (component_reg_phys == CXL_RESOURCE_NONE)
> > +		return -ENXIO;
> > +
> > +	map->resource = component_reg_phys;
> > +	map->reg_type = type;
> > +	map->max_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_pci_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
> >  			      struct cxl_register_map *map)
> >  {
> >  	int rc;
> >
> > +	/*
> > +	 * If the Register Locator DVSEC does not contain the
> > +	 * Component Registers, try to extract them from the RCRB if
> > +	 * it is an RCH.
>
> My instinct here was to wonder why having said 'if it is an RCH'
> you don't seem to be checking that first. Perhaps
> change this text to something like.
> * Component Registers, assume it is an RCH and try to extract them
> * from an RCRB.
> */
> ?

Will change that.

Thanks for review,

-Robert

>
> > +	 */
> >  	rc = cxl_find_regblock(pdev, type, map);
> > -	if (rc)
> > +	if (rc && cxl_rcrb_get_comp_regs(pdev, type, map))
> >  		return rc;
> >
> >  	return cxl_setup_regs(map);
>
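
[Annotation] The control flow of the last hunk can be modelled in plain userspace C. This is only a sketch to make the error handling visible. The names struct reg_map, locate_fn, setup_regs() and the toy locators are hypothetical stand-ins and not kernel API. The point it illustrates: the fallback is tried only when the primary lookup fails, and if both fail, the error code of the primary lookup is the one returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cxl_register_map. */
struct reg_map {
	unsigned long resource;	/* physical base, 0 if nothing was found */
};

/* A locator fills @map and returns 0, or returns a negative errno. */
typedef int (*locate_fn)(struct reg_map *map);

/*
 * Try the primary locator first and the fallback only on failure.
 * If both fail, return the primary's error code, mirroring
 * "if (rc && fallback(...)) return rc;" in the hunk above.
 */
static int setup_regs(locate_fn primary, locate_fn fallback,
		      struct reg_map *map)
{
	int rc;

	assert(map != NULL);

	rc = primary(map);
	if (rc && fallback(map))
		return rc;

	return 0;
}

/* Toy locators that only exercise the control flow. */
static int locate_ok(struct reg_map *map)
{
	map->resource = 0x1000;
	return 0;
}

static int locate_nodev(struct reg_map *map)
{
	map->resource = 0;
	return -ENODEV;
}

static int locate_nxio(struct reg_map *map)
{
	map->resource = 0;
	return -ENXIO;
}
```

With these toy locators, setup_regs(locate_nodev, locate_nxio, &m) yields -ENODEV rather than -ENXIO, since the fallback's own error code is discarded.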

2023-06-02 17:06:03

by Robert Richter

Subject: Re: [PATCH v4 23/23] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

On 01.06.23 15:11:34, Jonathan Cameron wrote:
>
> > > > > @@ -1432,6 +1495,7 @@ static int aer_probe(struct pcie_device *dev)
> > > > >  		return status;
> > > > >  	}
> > > > >
> > > > > +	cxl_rch_enable_rcec(port);
> > > >
> > > > Could this be done by the driver that claims the CXL RCiEP? There's
> > > > no point in unmasking the errors before there's a driver with
> > > > pci_error_handlers that can do something with them anyway.
> > >
> > > This sounds reasonable at first glance. The problem is there could
> > > be many devices associated with the RCEC. Not all of them will be
> > > bound to a driver and handler at the same time. We would need to
> > > refcount it or maintain a list of enabled devices. But there is
> > > already something similar by checking dev->driver. But right, AER
> > > errors could be seen and handled then at least on PCI level. I tend to
> > > permanently enable RCEC AER, but that could cause side-effects. What
> > > do you think?
> >
> > IIUC, this really just affects CXL devices, so I think the choice is
> > (1) always unmask internal errors for RCECs where those CXL devices
> > report errors (as this patch does), or (2) unmask when first CXL
> > driver that can handle the errors is loaded and restore previous state
> > when last one is unloaded.
> >
> > If the RCEC *only* handles errors for CXL devices, i.e., not for a mix
> > of vanilla PCIe RCiEPs and CXL RCiEPs, I think I'm OK with (1). I
> > think you said only the CXL driver knows how to collect and interpret
> > the error data. Is it OK that when no such driver is loaded, we field
> > error interrupts silently, without even mentioning that an error
> > occurred? I guess without the driver, the device is probably not in
> > use.
>
> It might be in use. Firmware may well have set up the CXL device and
> even have put the kernel image in that memory for example. OS first RAS
> handling won't be up until the driver loads though. Would be a bit
> odd to mix OS first handling with firmware setup. I'd expect firmware
> first handling in that case, but I don't think anything stops the two
> being mixed.

Right, CXL memory may have been set up by firmware. We will only see
AER errors (for the unmasked error types) then without further CXL
handling, which is IMO OK.

This all assumes a non-CXL aware system can clear the error status by
only using PCIe AER. That is, a CXL RAS error may not trigger again
(or at all) if only the AER status is cleared and not the CXL RAS
status in the capability. I don't know what the spec says here and how
devices actually operate.

Maybe option (2) is easy to implement with the refcount_t API. So with
the first device probed we just enable the RCEC's internal errors and
disable them when the last device is removed. I think CXL RAS errors
will not be triggered then, as internal errors must be unmasked for
this, either in the RCEC or the endpoint. Since that unmasking can
only be done by the CXL driver, a CXL RAS error won't trigger an AER
error message without it.

Thanks,

-Robert
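
[Annotation] The counting scheme described for option (2) can be sketched in userspace C as below. This is a minimal model under stated assumptions, not the kernel implementation: struct rcec_model, rcec_get() and rcec_put() are hypothetical names, a plain int stands in for the reference count, and locking is omitted, so it is only valid single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an RCEC; not a kernel structure. */
struct rcec_model {
	int users;		/* bound CXL devices reporting to this RCEC */
	bool internal_unmasked;	/* models the internal error mask state */
};

/* Call when a CXL driver binds to a device associated with the RCEC. */
static void rcec_get(struct rcec_model *rcec)
{
	if (rcec->users++ == 0)
		rcec->internal_unmasked = true;		/* first user: unmask */
}

/* Call when that driver unbinds again. */
static void rcec_put(struct rcec_model *rcec)
{
	assert(rcec->users > 0);			/* catch unbalanced put */
	if (--rcec->users == 0)
		rcec->internal_unmasked = false;	/* last user: mask */
}
```

Two simplifications are worth noting. The model masks unconditionally on the last put, whereas the proposal quoted above speaks of restoring the previous state. And in the kernel the 0 to 1 and 1 to 0 transitions would need to be serialized with the register writes, so a lock around a counter may fit better than a bare refcount_t.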

2023-06-06 10:07:23

by Jonathan Cameron