2023-03-23 21:39:45

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 0/5] cxl/pci: Add support for RCH RAS error handling

This patchset adds error handling support for restricted CXL host (RCH)
downstream ports. This is necessary because RCH downstream ports are
implemented in RCRBs and report protocol errors through a root complex
event collector (RCEC). The RCH error reporting flow is not currently
supported by the CXL driver and will be added by this patchset.

This patchset uses an updated subject line and is a continuation of:
https://lore.kernel.org/all/[email protected]/

The first patch discovers the RCH dport AER and RAS registers. These will
be mapped later and used in CXL driver error logging.

The second patch exports cper_mem_err_unpack(). cper_mem_err_unpack() is a
dependency for using the cper_print_aer() AER trace logging.

The third patch exports cper_print_aer(). cper_print_aer() is used for
CXL AER error logging because it provides a common format for logging
into dmesg.

The fourth patch is AER port driver changes for forwarding RCH errors to
the RCiEP RCH handler.

The fifth patch maps the AER and RAS registers. This patch also adds the
RCH handler for logging downstream port AER and RAS information.

This is based on cxl/next commit
e686c32590f4 ("dax/kmem: Fix leak of memory-hotplug resources")'

Robert Richter (1):
cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem
dev handler

Terry Bowman (4):
cxl/pci: Add RCH downstream port AER and RAS register discovery
efi/cper: Export cper_mem_err_unpack() for CXL logging
pci/aer: Export cper_print_aer() for CXL driver logging
cxl/pci: Add RCH downstream port error logging

drivers/cxl/core/pci.c | 126 +++++++++++++++++++++++---
drivers/cxl/core/regs.c | 94 +++++++++++++++++---
drivers/cxl/cxl.h | 18 ++++
drivers/cxl/mem.c | 173 +++++++++++++++++++++++++++++++++---
drivers/firmware/efi/cper.c | 1 +
drivers/pci/pcie/aer.c | 46 ++++++++++
6 files changed, 423 insertions(+), 35 deletions(-)

--
2.34.1


2023-03-23 21:39:50

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 1/5] cxl/pci: Add RCH downstream port AER and RAS register discovery

Restricted CXL host (RCH) downstream port AER information is not currently
logged while in the error state. One problem preventing existing PCIe AER
functions from logging errors is the AER registers are not accessible. The
CXL driver requires changes to find RCH downstream port AER registers for
purpose of error logging.

RCH downstream ports are not enumerated during a PCI bus scan and are
instead discovered using system firmware, ACPI in this case.[1] The
downstream port is implemented as a Root Complex Register Block (RCRB).
The RCRB is a 4k memory block containing PCIe registers based on the PCIe
root port.[2] The RCRB includes AER extended capability registers used for
reporting errors. Note, the RCH's AER Capability is located in the RCRB
memory space instead of PCI configuration space, thus its register access
is different. Existing kernel PCIe AER functions can not be used to manage
the downstream port AER capabilities because the port was not enumerated
during PCI scan and the registers are not PCI config accessible.

Discover RCH downstream port AER extended capability registers. This
requires using MMIO accesses to search for extended AER capability in
RCRB register space.

[1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy
[2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB

Co-developed-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Terry Bowman <[email protected]>
---
drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------
drivers/cxl/cxl.h | 5 +++
drivers/cxl/mem.c | 41 ++++++++++++------
3 files changed, 115 insertions(+), 24 deletions(-)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 1476a0299c9b..108a349d8101 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
}
EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL);

+static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map,
+ char *name)
+{
+
+ if (!request_mem_region(map->resource, map->max_size, name))
+ return 0;
+
+ map->base = ioremap(map->resource, map->max_size);
+ if (!map->base) {
+ release_mem_region(map->resource, map->max_size);
+ return 0;
+ }
+
+ return map->base;
+}
+
+static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map)
+{
+ iounmap(map->base);
+ release_mem_region(map->resource, map->max_size);
+}
+
resource_size_t cxl_rcrb_to_component(struct device *dev,
resource_size_t rcrb,
enum cxl_rcrb which)
{
+ struct cxl_register_map map = {
+ .resource = rcrb,
+ .max_size = SZ_4K
+ };
resource_size_t component_reg_phys;
void __iomem *addr;
u32 bar0, bar1;
@@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
u32 id;

if (which == CXL_RCRB_UPSTREAM)
- rcrb += SZ_4K;
+ map.resource += SZ_4K;
+
+ if (!cxl_map_reg(dev, &map, "CXL RCRB"))
+ return CXL_RESOURCE_NONE;

/*
* RCRB's BAR[0..1] point to component block containing CXL
@@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
* the PCI Base spec here, esp. 64 bit extraction and memory
* ranges alignment (6.0, 7.5.1.2.1).
*/
- if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB"))
- return CXL_RESOURCE_NONE;
- addr = ioremap(rcrb, SZ_4K);
- if (!addr) {
- dev_err(dev, "Failed to map region %pr\n", addr);
- release_mem_region(rcrb, SZ_4K);
- return CXL_RESOURCE_NONE;
- }
-
+ addr = map.base;
id = readl(addr + PCI_VENDOR_ID);
cmd = readw(addr + PCI_COMMAND);
bar0 = readl(addr + PCI_BASE_ADDRESS_0);
bar1 = readl(addr + PCI_BASE_ADDRESS_1);
- iounmap(addr);
- release_mem_region(rcrb, SZ_4K);
+ cxl_unmap_reg(dev, &map);

/*
* Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not
@@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
return component_reg_phys;
}
EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL);
+
+u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb)
+{
+ struct cxl_register_map map = {
+ .resource = rcrb,
+ .max_size = SZ_4K,
+ };
+ u32 cap_hdr;
+ u16 offset = 0;
+
+ if (!cxl_map_reg(dev, &map, "CXL RCRB"))
+ return 0;
+
+ cap_hdr = readl(map.base + offset);
+ while (PCI_EXT_CAP_ID(cap_hdr) != PCI_EXT_CAP_ID_ERR) {
+
+ offset = PCI_EXT_CAP_NEXT(cap_hdr);
+ if (!offset) {
+ cxl_unmap_reg(dev, &map);
+ return 0;
+ }
+ cap_hdr = readl(map.base + offset);
+ }
+
+ dev_dbg(dev, "found AER extended capability (0x%x)\n", offset);
+ cxl_unmap_reg(dev, &map);
+
+ return offset;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, CXL);
+
+u16 cxl_component_to_ras(struct device *dev, resource_size_t component_reg_phys)
+{
+ struct cxl_register_map map = {
+ .resource = component_reg_phys,
+ .max_size = CXL_COMPONENT_REG_BLOCK_SIZE,
+ };
+
+ if (!cxl_map_reg(dev, &map, "component"))
+ return 0;
+
+ cxl_probe_component_regs(dev, map.base, &map.component_map);
+ cxl_unmap_reg(dev, &map);
+ if (!map.component_map.ras.valid)
+ return 0;
+
+ return map.component_map.ras.offset;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_component_to_ras, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d853a0238ad7..9fd7df48ce99 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -270,6 +270,9 @@ enum cxl_rcrb {
resource_size_t cxl_rcrb_to_component(struct device *dev,
resource_size_t rcrb,
enum cxl_rcrb which);
+u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb);
+u16 cxl_component_to_ras(struct device *dev,
+ resource_size_t component_reg_phys);

#define CXL_RESOURCE_NONE ((resource_size_t) -1)
#define CXL_TARGET_STRLEN 20
@@ -601,6 +604,8 @@ struct cxl_dport {
int port_id;
resource_size_t component_reg_phys;
resource_size_t rcrb;
+ u16 aer_cap;
+ u16 ras_cap;
bool rch;
struct cxl_port *port;
};
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 39c4b54f0715..12e8e8ebaac0 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,13 +45,38 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
return 0;
}

+static void cxl_rcrb_setup(struct cxl_dev_state *cxlds,
+ struct cxl_dport *parent_dport)
+{
+ struct cxl_memdev *cxlmd = cxlds->cxlmd;
+
+ if (!parent_dport->rch)
+ return;
+
+ /*
+ * The component registers for an RCD might come from the
+ * host-bridge RCRB if they are not already mapped via the
+ * typical register locator mechanism.
+ */
+ if (cxlds->component_reg_phys == CXL_RESOURCE_NONE)
+ cxlds->component_reg_phys = cxl_rcrb_to_component(
+ &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM);
+
+ /* RCH AER is required. CXL3.0 Spec Table 8-12 */
+ parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport,
+ parent_dport->rcrb);
+
+ /* RCH RAS is required. CXL3.0 Spec Table 8-22 */
+ parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport,
+ parent_dport->component_reg_phys);
+}
+
static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
struct cxl_dport *parent_dport)
{
struct cxl_port *parent_port = parent_dport->port;
struct cxl_dev_state *cxlds = cxlmd->cxlds;
struct cxl_port *endpoint, *iter, *down;
- resource_size_t component_reg_phys;
int rc;

/*
@@ -66,17 +91,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
ep->next = down;
}

- /*
- * The component registers for an RCD might come from the
- * host-bridge RCRB if they are not already mapped via the
- * typical register locator mechanism.
- */
- if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE)
- component_reg_phys = cxl_rcrb_to_component(
- &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM);
- else
- component_reg_phys = cxlds->component_reg_phys;
- endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys,
+ cxl_rcrb_setup(cxlds, parent_dport);
+
+ endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
parent_dport);
if (IS_ERR(endpoint))
return PTR_ERR(endpoint);
--
2.34.1

2023-03-23 21:39:57

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 2/5] efi/cper: Export cper_mem_err_unpack() for CXL logging

The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH)
logging. This is not currently possible because CXL is a loadable module
and cper_print_aer() depends on cper_mem_err_unpack() which is not
exported.

Export cper_mem_err_unpack() to enable cper_print_aer() usage in
CXL and other loadable modules.

Signed-off-by: Terry Bowman <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: [email protected]
---
drivers/firmware/efi/cper.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 35c37f667781..ff15e12160ae 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p,

return ret;
}
+EXPORT_SYMBOL_GPL(cper_mem_err_unpack);

static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
int len)
--
2.34.1

2023-03-23 21:40:16

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging

The CXL driver plans to use cper_print_aer() for restricted CXL host
(RCH) logging. cper_print_aer() is not exported and as a result is not
available to the CXL driver or other loadable modules. Export
cper_print_aer() making it available to CXL and other loadable modules.

Signed-off-by: Terry Bowman <[email protected]>
Cc: Mahesh J Salgaonkar <[email protected]>
Cc: "Oliver O'Halloran" <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
---
drivers/pci/pcie/aer.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 625f7b2cafe4..7f0f52d094a4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -809,6 +809,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
trace_aer_event(dev_name(&dev->dev), (status & ~mask),
aer_severity, tlp_header_valid, &aer->header_log);
}
+EXPORT_SYMBOL_GPL(cper_print_aer);
#endif

/**
--
2.34.1

2023-03-23 21:40:51

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

RCH downstream port error logging is missing in the current CXL driver. The
missing AER and RAS error logging is needed for communicating driver error
details to userspace. Update the driver to include PCIe AER and CXL RAS
error logging.

Add RCH downstream port error handling into the existing RCiEP handler.
The downstream port error handler is added to the RCiEP error handler
because the downstream port is implemented in a RCRB, is not PCI
enumerable, and as a result is not directly accessible to the PCI AER
root port driver. The AER root port driver calls the RCiEP handler for
handling RCD errors and RCH downstream port protocol errors.

Update the cxl_mem driver to map the RCH RAS and AER register discovered
earlier. The RAS and AER registers will be used in the RCH error handlers.

Disable RCH downstream port's root port cmd interrupts. Enable RCEC AER
CIE/UIE reporting because they are disabled by default.[1]

Update existing RCiEP correctable and uncorrectable handlers to also call
the RCH handler. The RCH handler will read the downstream port AER
registers, check for error severity, and if an error exists will log
using an existing kernel AER trace routine. The RCH handler will also
reuse the existing RAS logging routine to log downstream port RAS
errors if they exist.

[1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors

Co-developed-by: Robert Richter <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Terry Bowman <[email protected]>
---
drivers/cxl/core/pci.c | 126 +++++++++++++++++++++++++++++++++----
drivers/cxl/core/regs.c | 1 +
drivers/cxl/cxl.h | 13 ++++
drivers/cxl/mem.c | 134 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 262 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 7328a2552411..6e36471969ba 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -5,6 +5,7 @@
#include <linux/delay.h>
#include <linux/pci.h>
#include <linux/pci-doe.h>
+#include <linux/aer.h>
#include <cxlpci.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -605,32 +606,88 @@ void read_cdat_data(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);

-void cxl_cor_error_detected(struct pci_dev *pdev)
+/* Get AER severity. Return false if there is no error. */
+static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
+ int *severity)
+{
+ if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
+ if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
+ *severity = AER_FATAL;
+ else
+ *severity = AER_NONFATAL;
+ return true;
+ }
+
+ if (aer_regs->cor_status & ~aer_regs->cor_mask) {
+ *severity = AER_CORRECTABLE;
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Copy the AER capability registers to a buffer. This is necessary
+ * because RCRB AER capability is MMIO mapped. Clear the status
+ * after copying.
+ *
+ * @aer_base: base address of AER capability block in RCRB
+ * @aer_regs: destination for copying AER capability
+ */
+static bool cxl_rch_get_aer_info(void __iomem *aer_base,
+ struct aer_capability_regs *aer_regs)
+{
+ int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
+ u32 *aer_regs_buf = (u32 *)aer_regs;
+ int n;
+
+ if (!aer_base)
+ return false;
+
+ for (n = 0; n < read_cnt; n++)
+ aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
+
+ writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
+ writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
+
+ return true;
+}
+
+static void __cxl_log_correctable_ras(struct cxl_dev_state *cxlds,
+ void __iomem *ras_base)
{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
void __iomem *addr;
u32 status;

- if (!cxlds->regs.ras)
+ if (!ras_base)
return;

- addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
+ addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
}
}
-EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
+
+static void cxl_log_correctable_ras_endpoint(struct cxl_dev_state *cxlds)
+{
+ return __cxl_log_correctable_ras(cxlds, cxlds->regs.ras);
+}
+
+static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds)
+{
+ return __cxl_log_correctable_ras(cxlds, cxlds->regs.dport_ras);
+}

/* CXL spec rev3.0 8.2.4.16.1 */
-static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
+static void header_log_copy(void __iomem *ras_base, u32 *log)
{
void __iomem *addr;
u32 *log_addr;
int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);

- addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
+ addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
log_addr = log;

for (i = 0; i < log_u32_size; i++) {
@@ -644,17 +701,18 @@ static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
+static bool __cxl_report_and_clear(struct cxl_dev_state *cxlds,
+ void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
u32 status;
u32 fe;

- if (!cxlds->regs.ras)
+ if (!ras_base)
return false;

- addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
+ addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
status = readl(addr);
if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
return false;
@@ -662,7 +720,7 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
/* If multiple errors, log header points to first error from ctrl reg */
if (hweight32(status) > 1) {
void __iomem *rcc_addr =
- cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
+ ras_base + CXL_RAS_CAP_CONTROL_OFFSET;

fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
readl(rcc_addr)));
@@ -670,13 +728,54 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
fe = status;
}

- header_log_copy(cxlds, hl);
+ header_log_copy(ras_base, hl);
trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);

return true;
}

+static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
+{
+ return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
+}
+
+static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds)
+{
+ return __cxl_report_and_clear(cxlds, cxlds->regs.dport_ras);
+}
+
+static void cxl_rch_log_error(struct cxl_dev_state *cxlds)
+{
+ struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+ struct aer_capability_regs aer_regs;
+ int severity;
+
+ if (!cxl_rch_get_aer_info(cxlds->regs.aer, &aer_regs))
+ return;
+
+ if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
+ return;
+
+ cper_print_aer(pdev, severity, &aer_regs);
+
+ if (severity == AER_CORRECTABLE)
+ cxl_log_correctable_ras_dport(cxlds);
+ else
+ cxl_report_and_clear_dport(cxlds);
+}
+
+void cxl_cor_error_detected(struct pci_dev *pdev)
+{
+ struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+
+ if (cxlds->rcd)
+ cxl_rch_log_error(cxlds);
+
+ cxl_log_correctable_ras_endpoint(cxlds);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
+
pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
@@ -685,6 +784,9 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
struct device *dev = &cxlmd->dev;
bool ue;

+ if (cxlds->rcd)
+ cxl_rch_log_error(cxlds);
+
/*
* A frozen channel indicates an impending reset which is fatal to
* CXL.mem operation, and will likely crash the system. On the off
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 108a349d8101..7130f35891da 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,

return ret_val;
}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);

int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
struct cxl_register_map *map, unsigned long map_mask)
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 9fd7df48ce99..7036e34354bc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -66,6 +66,8 @@
#define CXL_DECODER_MIN_GRANULARITY 256
#define CXL_DECODER_MAX_ENCODED_IG 6

+#define PCI_AER_CAPABILITY_LENGTH 56
+
static inline int cxl_hdm_decoder_count(u32 cap_hdr)
{
int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
@@ -209,6 +211,15 @@ struct cxl_regs {
struct_group_tagged(cxl_device_regs, device_regs,
void __iomem *status, *mbox, *memdev;
);
+
+ /*
+ * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
+ * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
+ */
+ struct_group_tagged(cxl_rch_regs, rch_regs,
+ void __iomem *aer;
+ void __iomem *dport_ras;
+ );
};

struct cxl_reg_map {
@@ -249,6 +260,8 @@ struct cxl_register_map {
};
};

+void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
+ resource_size_t length);
void cxl_probe_component_regs(struct device *dev, void __iomem *base,
struct cxl_component_reg_map *map);
void cxl_probe_device_regs(struct device *dev, void __iomem *base,
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 12e8e8ebaac0..e217c44ed749 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -4,6 +4,7 @@
#include <linux/device.h>
#include <linux/module.h>
#include <linux/pci.h>
+#include <linux/aer.h>

#include "cxlmem.h"
#include "cxlpci.h"
@@ -45,6 +46,132 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
return 0;
}

+static int rcec_enable_aer_ints(struct pci_dev *pdev)
+{
+ struct pci_dev *rcec = pdev->rcec;
+ int aer, rc;
+ u32 mask;
+
+ if (!rcec)
+ return -ENODEV;
+
+ /*
+ * Internal errors are masked by default, unmask RCEC's here
+ * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
+ * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
+ */
+ aer = rcec->aer_cap;
+ rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
+ if (rc)
+ return rc;
+ mask &= ~PCI_ERR_UNC_INTN;
+ rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
+ if (rc)
+ return rc;
+
+ rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
+ if (rc)
+ return rc;
+ mask &= ~PCI_ERR_COR_INTERNAL;
+ rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
+
+ return rc;
+}
+
+static void disable_aer(void *_pdev)
+{
+ struct pci_dev *pdev = (struct pci_dev *)_pdev;
+
+ pci_disable_pcie_error_reporting(pdev);
+
+ /*
+ * Keep the RCEC's internal AER enabled. There
+ * could be other RCiEPs using this RCEC.
+ */
+}
+
+static void rch_disable_root_ints(void __iomem *aer_base)
+{
+ u32 aer_cmd_mask, aer_cmd;
+
+ /*
+ * Disable RCH root port command interrupts.
+ * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
+ */
+ aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
+ PCI_ERR_ROOT_CMD_NONFATAL_EN |
+ PCI_ERR_ROOT_CMD_FATAL_EN);
+ aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
+ aer_cmd &= ~aer_cmd_mask;
+ writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
+}
+
+static int cxl_ras_setup_interrupts(struct cxl_dev_state *cxlds)
+{
+ struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+ int rc;
+
+ if (cxlds->rcd) {
+ rch_disable_root_ints(cxlds->regs.aer);
+
+ rc = rcec_enable_aer_ints(pdev);
+ if (rc)
+ return rc;
+ }
+
+ rc = pci_enable_pcie_error_reporting(pdev);
+ if (rc)
+ return rc;
+
+ return devm_add_action_or_reset(&pdev->dev, disable_aer, pdev);
+}
+
+static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
+ struct cxl_dport *parent_dport)
+{
+ struct device *dev = parent_dport->dport;
+ resource_size_t aer_phys, ras_phys;
+ void __iomem *aer, *dport_ras;
+
+ if (!parent_dport->rch)
+ return 0;
+
+ if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
+ parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
+ return -ENODEV;
+
+ aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
+ aer = devm_cxl_iomap_block(dev, aer_phys,
+ PCI_AER_CAPABILITY_LENGTH);
+
+ if (!aer)
+ return -ENOMEM;
+
+ ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
+ dport_ras = devm_cxl_iomap_block(dev, ras_phys,
+ CXL_RAS_CAPABILITY_LENGTH);
+
+ if (!dport_ras)
+ return -ENOMEM;
+
+ cxlds->regs.aer = aer;
+ cxlds->regs.dport_ras = dport_ras;
+
+ return 0;
+}
+
+static int cxl_setup_ras(struct cxl_dev_state *cxlds,
+ struct cxl_dport *parent_dport)
+{
+ int rc;
+
+ rc = cxl_rch_map_ras(cxlds, parent_dport);
+ if (rc)
+ return rc;
+
+ return cxl_ras_setup_interrupts(cxlds);
+}
+
static void cxl_rcrb_setup(struct cxl_dev_state *cxlds,
struct cxl_dport *parent_dport)
{
@@ -93,6 +220,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,

cxl_rcrb_setup(cxlds, parent_dport);

+ rc = cxl_setup_ras(cxlds, parent_dport);
+ /* Continue with RAS setup errors */
+ if (rc)
+ dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
+ else
+ dev_info(&cxlmd->dev, "CXL error handling enabled\n");
+
endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
parent_dport);
if (IS_ERR(endpoint))
--
2.34.1

2023-03-23 21:40:59

by Terry Bowman

[permalink] [raw]
Subject: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

From: Robert Richter <[email protected]>

In RCD mode a CXL device (RCD) is exposed as an RCiEP, but CXL
downstream and upstream ports are not enumerated and not visible in
the PCIe hierarchy. Protocol and link errors are sent to an RCEC.

Now, RCH downstream port-detected errors are signaled as internal AER
errors (UIE/CIE) with the RCEC's source ID. A CXL handler must then
inspect the error status in various CXL registers residing in the
dport's component register space (CXL RAS cap) or the dport's RCRB
(AER ext cap). [1]

This patch connects errors showing up in the RCEC's error handler with
the CXL subsystem. Implement this by forwarding the error to all CXL
devices below the RCEC. Since the entire CXL device is controlled only
using PCIe Configuration Space of device 0, Function 0, only pass it
there [2]. These devices have the Memory Device class code set
(PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver can
implement the handler. The CXL device driver is then responsible to
enable error reporting in the RCEC's AER cap (esp. CIE and UIE) and to
inspect the dport's CXL registers in addition (CXL RAS cap and AER ext
cap).

The reason for choosing this implementation is that a CXL RCEC device
is bound to the AER port driver, but the driver does not allow it to
register a custom specific handler to support CXL. Connecting the RCEC
hard-wired with a CXL handler does not work, as the CXL subsystem
might not be present all the time. The alternative to add an
implementation to the portdrv to allow the registration of a custom
RCEC error handler isn't worth doing it as CXL would be its only user.
Instead, just check for an CXL RCEC and pass it down to the connected
CXL device's error handler.

[1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
[2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices

Co-developed-by: Terry Bowman <[email protected]>
Signed-off-by: Terry Bowman <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Cc: "Oliver O'Halloran" <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/pci/pcie/aer.c | 45 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 7f0f52d094a4..d250a4caa85a 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -943,6 +943,49 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}

+#if IS_ENABLED(CONFIG_CXL_PCI)
+
+static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info);
+
+static int handle_cxl_error_iter(struct pci_dev *dev, void *data)
+{
+ struct aer_err_info *e_info = (struct aer_err_info *)data;
+
+ if (dev->devfn != PCI_DEVFN(0, 0))
+ return 0;
+
+ /* Right now there is only a CXL.mem driver */
+ if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
+ return 0;
+
+ /* pci_dev_put() in handle_error_source() */
+ dev = pci_dev_get(dev);
+ if (dev)
+ handle_error_source(dev, e_info);
+
+ return 0;
+}
+
+static bool is_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+ is_internal_error(info))
+ pcie_walk_rcec(dev, handle_cxl_error_iter, info);
+}
+
+#else
+static inline void handle_cxl_error(struct pci_dev *dev,
+ struct aer_err_info *info) { }
+#endif
+
/**
* handle_error_source - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
@@ -954,6 +997,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
int aer = dev->aer_cap;

+ handle_cxl_error(dev, info);
+
if (info->severity == AER_CORRECTABLE) {
/*
* Correctable error does not need software intervention.
--
2.34.1

2023-03-23 22:25:06

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging

Adding PCI reviewers.

https://lore.kernel.org/all/[email protected]/

On 3/23/23 16:38, Terry Bowman wrote:
> The CXL driver plans to use cper_print_aer() for restricted CXL host
> (RCH) logging. cper_print_aer() is not exported and as a result is not
> available to the CXL driver or other loadable modules. Export
> cper_print_aer() making it available to CXL and other loadable modules.
>
> Signed-off-by: Terry Bowman <[email protected]>
> Cc: Mahesh J Salgaonkar <[email protected]>
> Cc: "Oliver O'Halloran" <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: [email protected]
> ---
> drivers/pci/pcie/aer.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 625f7b2cafe4..7f0f52d094a4 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -809,6 +809,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
> trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> aer_severity, tlp_header_valid, &aer->header_log);
> }
> +EXPORT_SYMBOL_GPL(cper_print_aer);
> #endif
>
> /**

Subject: Re: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging



On 3/23/23 3:20 PM, Terry Bowman wrote:
> Adding PCI reviewers.
>
> https://lore.kernel.org/all/[email protected]/

>
> On 3/23/23 16:38, Terry Bowman wrote:
>> The CXL driver plans to use cper_print_aer() for restricted CXL host
>> (RCH) logging. cper_print_aer() is not exported and as a result is not
>> available to the CXL driver or other loadable modules. Export
>> cper_print_aer() making it available to CXL and other loadable modules.
>>
>> Signed-off-by: Terry Bowman <[email protected]>
>> Cc: Mahesh J Salgaonkar <[email protected]>
>> Cc: "Oliver O'Halloran" <[email protected]>
>> Cc: Bjorn Helgaas <[email protected]>
>> Cc: [email protected]
>> ---

Looks fine to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>


>> drivers/pci/pcie/aer.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 625f7b2cafe4..7f0f52d094a4 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -809,6 +809,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
>> trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>> aer_severity, tlp_header_valid, &aer->header_log);
>> }
>> +EXPORT_SYMBOL_GPL(cper_print_aer);
>> #endif
>>
>> /**

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2023-03-23 22:28:49

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

Adding PCI reviewers.

https://lore.kernel.org/all/[email protected]/

On 3/23/23 16:38, Terry Bowman wrote:
> |errors (UIE/CIE) with the RCEC's source ID. A CXL handler must then|

2023-03-23 22:30:19

by Terry Bowman

[permalink] [raw]

2023-03-24 05:52:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

Hi Terry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus efi/next cxl/next linus/master v6.3-rc3 next-20230323]
[cannot apply to cxl/pending]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/r/20230323213808.398039-6-terry.bowman%40amd.com
patch subject: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20230324/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
git checkout c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/cxl/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> drivers/cxl/mem.c:51:31: error: no member named 'rcec' in 'struct pci_dev'
struct pci_dev *rcec = pdev->rcec;
~~~~ ^
>> drivers/cxl/mem.c:63:14: error: no member named 'aer_cap' in 'struct pci_dev'
aer = rcec->aer_cap;
~~~~ ^
2 errors generated.


vim +51 drivers/cxl/mem.c

48
49 static int rcec_enable_aer_ints(struct pci_dev *pdev)
50 {
> 51 struct pci_dev *rcec = pdev->rcec;
52 int aer, rc;
53 u32 mask;
54
55 if (!rcec)
56 return -ENODEV;
57
58 /*
59 * Internal errors are masked by default, unmask RCEC's here
60 * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
61 * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
62 */
> 63 aer = rcec->aer_cap;
64 rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
65 if (rc)
66 return rc;
67 mask &= ~PCI_ERR_UNC_INTN;
68 rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
69 if (rc)
70 return rc;
71
72 rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
73 if (rc)
74 return rc;
75 mask &= ~PCI_ERR_COR_INTERNAL;
76 rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
77
78 return rc;
79 }
80

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-24 06:13:18

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

Hi Terry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus efi/next cxl/next linus/master v6.3-rc3 next-20230323]
[cannot apply to cxl/pending]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/r/20230323213808.398039-6-terry.bowman%40amd.com
patch subject: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging
config: riscv-randconfig-r024-20230322 (https://download.01.org/0day-ci/archive/20230324/[email protected]/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
git checkout c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash drivers/cxl/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

drivers/cxl/mem.c: In function 'rcec_enable_aer_ints':
>> drivers/cxl/mem.c:63:21: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'ats_cap'?
63 | aer = rcec->aer_cap;
| ^~~~~~~
| ats_cap


vim +63 drivers/cxl/mem.c

48
49 static int rcec_enable_aer_ints(struct pci_dev *pdev)
50 {
51 struct pci_dev *rcec = pdev->rcec;
52 int aer, rc;
53 u32 mask;
54
55 if (!rcec)
56 return -ENODEV;
57
58 /*
59 * Internal errors are masked by default, unmask RCEC's here
60 * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
61 * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
62 */
> 63 aer = rcec->aer_cap;
64 rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
65 if (rc)
66 return rc;
67 mask &= ~PCI_ERR_UNC_INTN;
68 rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
69 if (rc)
70 return rc;
71
72 rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
73 if (rc)
74 return rc;
75 mask &= ~PCI_ERR_COR_INTERNAL;
76 rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
77
78 return rc;
79 }
80

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-24 06:37:42

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

Hi Terry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus efi/next cxl/next linus/master v6.3-rc3 next-20230323]
[cannot apply to cxl/pending]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/r/20230323213808.398039-6-terry.bowman%40amd.com
patch subject: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging
config: i386-randconfig-a001 (https://download.01.org/0day-ci/archive/20230324/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
git checkout c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

drivers/cxl/mem.c: In function 'rcec_enable_aer_ints':
>> drivers/cxl/mem.c:51:36: error: 'struct pci_dev' has no member named 'rcec'
51 | struct pci_dev *rcec = pdev->rcec;
| ^~
>> drivers/cxl/mem.c:63:21: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'ats_cap'?
63 | aer = rcec->aer_cap;
| ^~~~~~~
| ats_cap


vim +51 drivers/cxl/mem.c

48
49 static int rcec_enable_aer_ints(struct pci_dev *pdev)
50 {
> 51 struct pci_dev *rcec = pdev->rcec;
52 int aer, rc;
53 u32 mask;
54
55 if (!rcec)
56 return -ENODEV;
57
58 /*
59 * Internal errors are masked by default, unmask RCEC's here
60 * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
61 * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
62 */
> 63 aer = rcec->aer_cap;
64 rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
65 if (rc)
66 return rc;
67 mask &= ~PCI_ERR_UNC_INTN;
68 rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
69 if (rc)
70 return rc;
71
72 rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
73 if (rc)
74 return rc;
75 mask &= ~PCI_ERR_COR_INTERNAL;
76 rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
77
78 return rc;
79 }
80

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-24 08:56:09

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] cxl/pci: Add RCH downstream port AER and RAS register discovery

Hi Terry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on pci/next]
[also build test WARNING on pci/for-linus efi/next cxl/next cxl/pending linus/master v6.3-rc3 next-20230324]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/r/20230323213808.398039-2-terry.bowman%40amd.com
patch subject: [PATCH v2 1/5] cxl/pci: Add RCH downstream port AER and RAS register discovery
config: parisc-randconfig-s043-20230322 (https://download.01.org/0day-ci/archive/20230324/[email protected]/config)
compiler: hppa-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/7cbc5c7357504af79c820ad7d0e9369b4a580a65
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
git checkout 7cbc5c7357504af79c820ad7d0e9369b4a580a65
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> drivers/cxl/core/regs.c:340:24: sparse: sparse: Using plain integer as NULL pointer
drivers/cxl/core/regs.c:345:24: sparse: sparse: Using plain integer as NULL pointer

vim +340 drivers/cxl/core/regs.c

338
339 if (!request_mem_region(map->resource, map->max_size, name))
> 340 return 0;
341
342 map->base = ioremap(map->resource, map->max_size);
343 if (!map->base) {
344 release_mem_region(map->resource, map->max_size);
345 return 0;
346 }
347
348 return map->base;
349 }
350

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-24 13:15:09

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] cxl/pci: Add RCH downstream port AER and RAS register discovery



On 3/24/23 03:53, kernel test robot wrote:

> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> sparse warnings: (new ones prefixed by >>)
>>> drivers/cxl/core/regs.c:340:24: sparse: sparse: Using plain integer as NULL pointer
> drivers/cxl/core/regs.c:345:24: sparse: sparse: Using plain integer as NULL pointer
>
> vim +340 drivers/cxl/core/regs.c
>
> 338
> 339 if (!request_mem_region(map->resource, map->max_size, name))
> > 340 return 0;
> 341
> 342 map->base = ioremap(map->resource, map->max_size);
> 343 if (!map->base) {
> 344 release_mem_region(map->resource, map->max_size);
> 345 return 0;
> 346 }
> 347
> 348 return map->base;
> 349 }
> 350
>

Yes, I will change the 0 return value to use NULL instead.

Regards,
Terry

2023-03-24 17:48:13

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

I added a comment below.

On 3/24/23 01:30, kernel test robot wrote:
> Hi Terry,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on pci/next]
> [also build test ERROR on pci/for-linus efi/next cxl/next linus/master v6.3-rc3 next-20230323]
> [cannot apply to cxl/pending]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
> base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
> patch link: https://lore.kernel.org/r/20230323213808.398039-6-terry.bowman%40amd.com
> patch subject: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging
> config: i386-randconfig-a001 (https://download.01.org/0day-ci/archive/20230324/[email protected]/config)
> compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
> reproduce (this is a W=1 build):
> # https://github.com/intel-lab-lkp/linux/commit/c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230324-054044
> git checkout c40ca148e9cff1a1c32cd4c5c9b252bf0cf201b6
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> make W=1 O=build_dir ARCH=i386 olddefconfig
> make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> drivers/cxl/mem.c: In function 'rcec_enable_aer_ints':
>>> drivers/cxl/mem.c:51:36: error: 'struct pci_dev' has no member named 'rcec'
> 51 | struct pci_dev *rcec = pdev->rcec;
> | ^~
>>> drivers/cxl/mem.c:63:21: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'ats_cap'?
> 63 | aer = rcec->aer_cap;
> | ^~~~~~~
> | ats_cap
>
>
> vim +51 drivers/cxl/mem.c
>
> 48
> 49 static int rcec_enable_aer_ints(struct pci_dev *pdev)
> 50 {
> > 51 struct pci_dev *rcec = pdev->rcec;
> 52 int aer, rc;
> 53 u32 mask;
> 54
> 55 if (!rcec)
> 56 return -ENODEV;
> 57
> 58 /*
> 59 * Internal errors are masked by default, unmask RCEC's here
> 60 * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> 61 * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> 62 */
> > 63 aer = rcec->aer_cap;
> 64 rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
> 65 if (rc)
> 66 return rc;
> 67 mask &= ~PCI_ERR_UNC_INTN;
> 68 rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
> 69 if (rc)
> 70 return rc;
> 71
> 72 rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
> 73 if (rc)
> 74 return rc;
> 75 mask &= ~PCI_ERR_COR_INTERNAL;
> 76 rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
> 77
> 78 return rc;
> 79 }
> 80
>

I will add #ifdef checks for CONFIG_PCIEPORTBUS and CONFIG_PCIEAER
around the related code.

Regards,
Terry

2023-03-24 21:44:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging

On Thu, Mar 23, 2023 at 04:38:06PM -0500, Terry Bowman wrote:
> The CXL driver plans to use cper_print_aer() for restricted CXL host
> (RCH) logging. cper_print_aer() is not exported and as a result is not
> available to the CXL driver or other loadable modules. Export
> cper_print_aer() making it available to CXL and other loadable modules.
>
> Signed-off-by: Terry Bowman <[email protected]>
> Cc: Mahesh J Salgaonkar <[email protected]>
> Cc: "Oliver O'Halloran" <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: [email protected]

Acked-by: Bjorn Helgaas <[email protected]>

But please update the subject line to:

PCI/AER: Export cper_print_aer() for use by modules

to match previous history.

> ---
> drivers/pci/pcie/aer.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 625f7b2cafe4..7f0f52d094a4 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -809,6 +809,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
> trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> aer_severity, tlp_header_valid, &aer->header_log);
> }
> +EXPORT_SYMBOL_GPL(cper_print_aer);
> #endif
>
> /**
> --
> 2.34.1
>

2023-03-24 21:55:50

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging



On 3/24/23 16:41, Bjorn Helgaas wrote:
> On Thu, Mar 23, 2023 at 04:38:06PM -0500, Terry Bowman wrote:
>> The CXL driver plans to use cper_print_aer() for restricted CXL host
>> (RCH) logging. cper_print_aer() is not exported and as a result is not
>> available to the CXL driver or other loadable modules. Export
>> cper_print_aer() making it available to CXL and other loadable modules.
>>
>> Signed-off-by: Terry Bowman <[email protected]>
>> Cc: Mahesh J Salgaonkar <[email protected]>
>> Cc: "Oliver O'Halloran" <[email protected]>
>> Cc: Bjorn Helgaas <[email protected]>
>> Cc: [email protected]
>
> Acked-by: Bjorn Helgaas <[email protected]>
>
> But please update the subject line to:
>
> PCI/AER: Export cper_print_aer() for use by modules
>
> to match previous history.
>

Hi Bjorn,

Yes, I'll update the subject line. Thanks.

>> ---
>> drivers/pci/pcie/aer.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 625f7b2cafe4..7f0f52d094a4 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -809,6 +809,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
>> trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>> aer_severity, tlp_header_valid, &aer->header_log);
>> }
>> +EXPORT_SYMBOL_GPL(cper_print_aer);
>> #endif
>>
>> /**
>> --
>> 2.34.1
>>

2023-03-24 22:52:57

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

I'd call this a "PCI/AER: ..." patch since that's where all the
changes are.

On Thu, Mar 23, 2023 at 04:38:07PM -0500, Terry Bowman wrote:
> From: Robert Richter <[email protected]>
>
> In RCD mode a CXL device (RCD) is exposed as an RCiEP, but CXL
> downstream and upstream ports are not enumerated and not visible in
> the PCIe hierarchy. Protocol and link errors are sent to an RCEC.

"RCD" isn't a common term in drivers/pci; can you expand it once here?

> Now, RCH downstream port-detected errors are signaled as internal AER
> errors (UIE/CIE) with the RCEC's source ID. A CXL handler must then

Similarly, "UIE" and "CIE" are new to drivers/pci; can you expand them
before using? I assume Uncorrectable Internal Error (UIE) and
Corrected Internal Error (CIE)? (Annoying that the PCIe spec uses
"Correctable" in general, but "Corrected" for Internal Errors.)

> inspect the error status in various CXL registers residing in the
> dport's component register space (CXL RAS cap) or the dport's RCRB
> (AER ext cap). [1]
>
> This patch connects errors showing up in the RCEC's error handler with

"Connect errors ..." (we already know this text applies to *this
patch*).

> the CXL subsystem. Implement this by forwarding the error to all CXL
> devices below the RCEC. Since the entire CXL device is controlled only
> using PCIe Configuration Space of device 0, Function 0, only pass it
> there [2]. These devices have the Memory Device class code set
> (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver can
> implement the handler.

> The CXL device driver is then responsible to
> enable error reporting in the RCEC's AER cap

I don't know exactly what you mean by "error reporting in the RCEC's
AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
register and should already be enabled by pci_aer_init().

Maybe you mean setting AER mask/severity specifically for Internal
Errors? I'm hoping to get as much of AER management as we can in the
PCI core and out of drivers, so maybe we need a new PCI interface to
do that.

In any event, I assume this sort of configuration would be an
enumeration-time thing, while *this* patch is a run-time thing, so
maybe this information belongs with a different patch?

> (esp. CIE and UIE) and to
> inspect the dport's CXL registers in addition (CXL RAS cap and AER ext
> cap).
>
> The reason for choosing this implementation is that a CXL RCEC device
> is bound to the AER port driver, but the driver does not allow it to
> register a custom specific handler to support CXL. Connecting the RCEC
> hard-wired with a CXL handler does not work, as the CXL subsystem
> might not be present all the time. The alternative to add an
> implementation to the portdrv to allow the registration of a custom
> RCEC error handler isn't worth doing it as CXL would be its only user.
> Instead, just check for an CXL RCEC and pass it down to the connected
> CXL device's error handler.
>
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
>
> Co-developed-by: Terry Bowman <[email protected]>
> Signed-off-by: Terry Bowman <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>

Since you're sending this patch (Terry) your Signed-off-by should be
last.

> Cc: "Oliver O'Halloran" <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> drivers/pci/pcie/aer.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 7f0f52d094a4..d250a4caa85a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -943,6 +943,49 @@ static bool find_source_device(struct pci_dev *parent,
> return true;
> }
>
> +#if IS_ENABLED(CONFIG_CXL_PCI)
> +
> +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info);
> +
> +static int handle_cxl_error_iter(struct pci_dev *dev, void *data)
> +{
> + struct aer_err_info *e_info = (struct aer_err_info *)data;
> +

Thanks for explaining the :00.0 in the commit log. I think a one-line
comment here would be useful too so future readers don't have to dig
out the commit to understand.

> + if (dev->devfn != PCI_DEVFN(0, 0))
> + return 0;
> +
> + /* Right now there is only a CXL.mem driver */
> + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
> + return 0;
> +
> + /* pci_dev_put() in handle_error_source() */
> + dev = pci_dev_get(dev);

I don't see why you need this. Didn't we get here via this path?

aer_isr
aer_isr_one_error
find_source_device
find_device_iter
if (is_error_source())
add_error_device
pci_dev_get <-- existing pci_dev_get()
aer_process_err_devices
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++)
if (aer_get_device_error_info(e_info->dev[i], e_info))
handle_error_source
+ handle_cxl_error
pci_dev_put(dev) <-- existing pci_dev_put()

So it looks like we wouldn't call handle_error_source() unless we had
a valid e_info->dev[i], which has already had pci_dev_get() called on
it.

Oh, I think I see ... handle_cxl_error() itself was called because an
RCEC reported an error on behalf of a CXL RCiEP (?), and then you use
pcie_walk_rcec() to look through all the associated RCiEPs, and
recursively call handle_error_source(), and we haven't acquired a
reference to those RCiEPs. Right?

But I thought the CXL things were not enumerated (first paragraph of
commit log)? But obviously these RCiEPs must be enumerated as PCI
devices or pcie_walk_rcec() and pci_dev_get() wouldn't work.

I haven't worked all the way through this, but I thought Sean Kelley's
and Qiuxu Zhuo's work was along the same line and might cover this,
e.g.,

a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")

But I guess maybe it's not quite the same case?

If you *do* need this, I know pci_dev_get(NULL) is a no-op, but since
you're testing for NULL anyway, I'd put it inside the "if" body.

> + if (dev)
> + handle_error_source(dev, e_info);
> +
> + return 0;
> +}
> +
> +static bool is_internal_error(struct aer_err_info *info)
> +{
> + if (info->severity == AER_CORRECTABLE)
> + return info->status & PCI_ERR_COR_INTERNAL;
> +
> + return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> + is_internal_error(info))

What's unique about Internal Errors? I'm trying to figure out why you
wouldn't do this for *all* CXL errors.

> + pcie_walk_rcec(dev, handle_cxl_error_iter, info);
> +}
> +
> +#else
> +static inline void handle_cxl_error(struct pci_dev *dev,
> + struct aer_err_info *info) { }
> +#endif
> +
> /**
> * handle_error_source - handle logging error into an event log
> * @dev: pointer to pci_dev data structure of error source device
> @@ -954,6 +997,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> {
> int aer = dev->aer_cap;
>
> + handle_cxl_error(dev, info);
> +
> if (info->severity == AER_CORRECTABLE) {
> /*
> * Correctable error does not need software intervention.
> --
> 2.34.1
>

2023-03-27 21:53:58

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

Bjorn,

thanks for you review.

On 24.03.23 17:36:56, Bjorn Helgaas wrote:
> I'd call this a "PCI/AER: ..." patch since that's where all the
> changes are.
>
> On Thu, Mar 23, 2023 at 04:38:07PM -0500, Terry Bowman wrote:
> > From: Robert Richter <[email protected]>
> >
> > In RCD mode a CXL device (RCD) is exposed as an RCiEP, but CXL
> > downstream and upstream ports are not enumerated and not visible in
> > the PCIe hierarchy. Protocol and link errors are sent to an RCEC.
>
> "RCD" isn't a common term in drivers/pci; can you expand it once here?
>
> > Now, RCH downstream port-detected errors are signaled as internal AER
> > errors (UIE/CIE) with the RCEC's source ID. A CXL handler must then
>
> Similarly, "UIE" and "CIE" are new to drivers/pci; can you expand them
> before using? I assume Uncorrectable Internal Error (UIE) and
> Corrected Internal Error (CIE)? (Annoying that the PCIe spec uses
> "Correctable" in general, but "Corrected" for Internal Errors.)
>
> > inspect the error status in various CXL registers residing in the
> > dport's component register space (CXL RAS cap) or the dport's RCRB
> > (AER ext cap). [1]
> >
> > This patch connects errors showing up in the RCEC's error handler with
>
> "Connect errors ..." (we already know this text applies to *this
> patch*).

I will update subject and description.

>
> > the CXL subsystem. Implement this by forwarding the error to all CXL
> > devices below the RCEC. Since the entire CXL device is controlled only
> > using PCIe Configuration Space of device 0, Function 0, only pass it
> > there [2]. These devices have the Memory Device class code set
> > (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver can
> > implement the handler.
>
> > The CXL device driver is then responsible to
> > enable error reporting in the RCEC's AER cap
>
> I don't know exactly what you mean by "error reporting in the RCEC's
> AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> register and should already be enabled by pci_aer_init().
>
> Maybe you mean setting AER mask/severity specifically for Internal
> Errors? I'm hoping to get as much of AER management as we can in the

Richt, this is implemented in patch #5 in function
rcec_enable_aer_ints().

> PCI core and out of drivers, so maybe we need a new PCI interface to
> do that.
>
> In any event, I assume this sort of configuration would be an
> enumeration-time thing, while *this* patch is a run-time thing, so
> maybe this information belongs with a different patch?

Do you mean once a Restricted CXL host (RCH) is detected, the internal
errors should be enabled in the device mask, all this done during
device enumeration? But wouldn't interrupts being enabled then before
the CXL device is ready?

>
> > (esp. CIE and UIE) and to
> > inspect the dport's CXL registers in addition (CXL RAS cap and AER ext
> > cap).
> >
> > The reason for choosing this implementation is that a CXL RCEC device
> > is bound to the AER port driver, but the driver does not allow it to
> > register a custom specific handler to support CXL. Connecting the RCEC
> > hard-wired with a CXL handler does not work, as the CXL subsystem
> > might not be present all the time. The alternative to add an
> > implementation to the portdrv to allow the registration of a custom
> > RCEC error handler isn't worth doing it as CXL would be its only user.
> > Instead, just check for an CXL RCEC and pass it down to the connected
> > CXL device's error handler.
> >
> > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> >
> > Co-developed-by: Terry Bowman <[email protected]>
> > Signed-off-by: Terry Bowman <[email protected]>
> > Signed-off-by: Robert Richter <[email protected]>
>
> Since you're sending this patch (Terry) your Signed-off-by should be
> last.
>
> > Cc: "Oliver O'Halloran" <[email protected]>
> > Cc: Bjorn Helgaas <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > drivers/pci/pcie/aer.c | 45 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 45 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 7f0f52d094a4..d250a4caa85a 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -943,6 +943,49 @@ static bool find_source_device(struct pci_dev *parent,
> > return true;
> > }
> >
> > +#if IS_ENABLED(CONFIG_CXL_PCI)
> > +
> > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info);
> > +
> > +static int handle_cxl_error_iter(struct pci_dev *dev, void *data)
> > +{
> > + struct aer_err_info *e_info = (struct aer_err_info *)data;
> > +
>
> Thanks for explaining the :00.0 in the commit log. I think a one-line
> comment here would be useful too so future readers don't have to dig
> out the commit to understand.

Sure, will add a comment.

>
> > + if (dev->devfn != PCI_DEVFN(0, 0))
> > + return 0;
> > +
> > + /* Right now there is only a CXL.mem driver */
> > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
> > + return 0;
> > +
> > + /* pci_dev_put() in handle_error_source() */
> > + dev = pci_dev_get(dev);

This dev is the endpoint's device handle.

>
> I don't see why you need this. Didn't we get here via this path?
>
> aer_isr
> aer_isr_one_error
> find_source_device
> find_device_iter
> if (is_error_source())
> add_error_device
> pci_dev_get <-- existing pci_dev_get()

This is the RCEC's device handle. Note that downstream port errors
have the RCEC's error source id (the RCEC’s Bus, Device, and Function
Numbers, see CXL 3.0, 12.2.1.1). We pass this error to the CXL
endpoint's driver (RCD, the RCiEP) now. The iterator goes through all
endpoints connected to the RCEC.

> aer_process_err_devices
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++)
> if (aer_get_device_error_info(e_info->dev[i], e_info))
> handle_error_source
> + handle_cxl_error

For the endpoint we further call here now:

pcie_walk_rcec(dev, handle_cxl_error_iter, info)
handle_cxl_error_iter(endpoint_dev)
pci_dev_get(endpoint_dev)
handle_error_source(endpoint_dev)
pci_dev_put(endpoint_dev)

> pci_dev_put(dev) <-- existing pci_dev_put()
>
> So it looks like we wouldn't call handle_error_source() unless we had
> a valid e_info->dev[i], which has already had pci_dev_get() called on
> it.
>
> Oh, I think I see ... handle_cxl_error() itself was called because an
> RCEC reported an error on behalf of a CXL RCiEP (?), and then you use
> pcie_walk_rcec() to look through all the associated RCiEPs, and
> recursively call handle_error_source(), and we haven't acquired a
> reference to those RCiEPs. Right?

Yes, exacly.

>
> But I thought the CXL things were not enumerated (first paragraph of
> commit log)? But obviously these RCiEPs must be enumerated as PCI
> devices or pcie_walk_rcec() and pci_dev_get() wouldn't work.

The dport and uport are not enumerated, there are two RCRB ranges for
them. But errors are reported using the PCI hierarchy, using the RCEC
and the RCiEP. Once an internal error arrives with the RCEC as an
error source, the AER err cap in the RCRB must be checked.

>
> I haven't worked all the way through this, but I thought Sean Kelley's
> and Qiuxu Zhuo's work was along the same line and might cover this,
> e.g.,
>
> a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> 579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
>
> But I guess maybe it's not quite the same case?

Actually, we use this code to handle errors that are reported to the
RCEC and only implement here the CXL specifics. That is, checking if
the RCEC receives something from a CXL downstream port and forwarding
that to a CXL handler (this patch). The handler then checks the AER
err cap in the RCRB of all CXL downstream ports associated to the RCEC
(not visible in the PCI hierarchy), but discovered through the :00.0
RCiEP (patch #5).

>
> If you *do* need this, I know pci_dev_get(NULL) is a no-op, but since
> you're testing for NULL anyway, I'd put it inside the "if" body.
>
> > + if (dev)
> > + handle_error_source(dev, e_info);

The check is more for the (theoretical) case where we cannot increment
the refcount.

> > +
> > + return 0;
> > +}
> > +
> > +static bool is_internal_error(struct aer_err_info *info)
> > +{
> > + if (info->severity == AER_CORRECTABLE)
> > + return info->status & PCI_ERR_COR_INTERNAL;
> > +
> > + return info->status & PCI_ERR_UNC_INTN;
> > +}
> > +
> > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > +{
> > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > + is_internal_error(info))
>
> What's unique about Internal Errors? I'm trying to figure out why you
> wouldn't do this for *all* CXL errors.

Per CXL specification downstream port errors are signaled using
internal errors. All other errors would be device specific, we cannot
handle that in a generic CXL driver. Note these are only errors
reported to the RCEC with the RCEC as a source, the RCiEP's errors are
handled using standard reporting already, see aer_isr_one_error() and
further calling pdrv->err_handler->cor_error_detected() or
pdrv->err_handler->error_detected().

-Robert

>
> > + pcie_walk_rcec(dev, handle_cxl_error_iter, info);
> > +}
> > +
> > +#else
> > +static inline void handle_cxl_error(struct pci_dev *dev,
> > + struct aer_err_info *info) { }
> > +#endif
> > +
> > /**
> > * handle_error_source - handle logging error into an event log
> > * @dev: pointer to pci_dev data structure of error source device
> > @@ -954,6 +997,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> > {
> > int aer = dev->aer_cap;
> >
> > + handle_cxl_error(dev, info);
> > +
> > if (info->severity == AER_CORRECTABLE) {
> > /*
> > * Correctable error does not need software intervention.
> > --
> > 2.34.1
> >

2023-03-27 23:25:23

by Dave Jiang

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging



On 3/23/23 2:38 PM, Terry Bowman wrote:
> RCH downstream port error logging is missing in the current CXL driver. The
> missing AER and RAS error logging is needed for communicating driver error
> details to userspace. Update the driver to include PCIe AER and CXL RAS
> error logging.
>
> Add RCH downstream port error handling into the existing RCiEP handler.
> The downstream port error handler is added to the RCiEP error handler
> because the downstream port is implemented in a RCRB, is not PCI
> enumerable, and as a result is not directly accessible to the PCI AER
> root port driver. The AER root port driver calls the RCiEP handler for
> handling RCD errors and RCH downstream port protocol errors.
>
> Update the cxl_mem driver to map the RCH RAS and AER register discovered
> earlier. The RAS and AER registers will be used in the RCH error handlers.
>
> Disable RCH downstream port's root port cmd interrupts. Enable RCEC AER
> CIE/UIE reporting because they are disabled by default.[1]
>
> Update existing RCiEP correctable and uncorrectable handlers to also call
> the RCH handler. The RCH handler will read the downstream port AER
> registers, check for error severity, and if an error exists will log
> using an existing kernel AER trace routine. The RCH handler will also
> reuse the existing RAS logging routine to log downstream port RAS
> errors if they exist.
>
> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
>
> Co-developed-by: Robert Richter <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> Signed-off-by: Terry Bowman <[email protected]>
> ---
> drivers/cxl/core/pci.c | 126 +++++++++++++++++++++++++++++++++----
> drivers/cxl/core/regs.c | 1 +
> drivers/cxl/cxl.h | 13 ++++
> drivers/cxl/mem.c | 134 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 262 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 7328a2552411..6e36471969ba 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -5,6 +5,7 @@
> #include <linux/delay.h>
> #include <linux/pci.h>
> #include <linux/pci-doe.h>
> +#include <linux/aer.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> #include <cxl.h>
> @@ -605,32 +606,88 @@ void read_cdat_data(struct cxl_port *port)
> }
> EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>
> -void cxl_cor_error_detected(struct pci_dev *pdev)
> +/* Get AER severity. Return false if there is no error. */
> +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
> + int *severity)
> +{
> + if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
> + if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
> + *severity = AER_FATAL;
> + else
> + *severity = AER_NONFATAL;
> + return true;
> + }
> +
> + if (aer_regs->cor_status & ~aer_regs->cor_mask) {
> + *severity = AER_CORRECTABLE;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Copy the AER capability registers to a buffer. This is necessary
> + * because RCRB AER capability is MMIO mapped. Clear the status
> + * after copying.
> + *
> + * @aer_base: base address of AER capability block in RCRB
> + * @aer_regs: destination for copying AER capability
> + */
> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
> + struct aer_capability_regs *aer_regs)
> +{
> + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
> + u32 *aer_regs_buf = (u32 *)aer_regs;
> + int n;
> +
> + if (!aer_base)
> + return false;
> +
> + for (n = 0; n < read_cnt; n++)
> + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
> +
> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
> +
> + return true;
> +}
> +
> +static void __cxl_log_correctable_ras(struct cxl_dev_state *cxlds,
> + void __iomem *ras_base)
> {
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> void __iomem *addr;
> u32 status;
>
> - if (!cxlds->regs.ras)
> + if (!ras_base)
> return;
>
> - addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> + addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> }
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
> +
> +static void cxl_log_correctable_ras_endpoint(struct cxl_dev_state *cxlds)
> +{
> + return __cxl_log_correctable_ras(cxlds, cxlds->regs.ras);
> +}
> +
> +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds)
> +{
> + return __cxl_log_correctable_ras(cxlds, cxlds->regs.dport_ras);
> +}
>
> /* CXL spec rev3.0 8.2.4.16.1 */
> -static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
> +static void header_log_copy(void __iomem *ras_base, u32 *log)
> {
> void __iomem *addr;
> u32 *log_addr;
> int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);
>
> - addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
> + addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
> log_addr = log;
>
> for (i = 0; i < log_u32_size; i++) {
> @@ -644,17 +701,18 @@ static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
> * Log the state of the RAS status registers and prepare them to log the
> * next error status. Return 1 if reset needed.
> */
> -static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
> +static bool __cxl_report_and_clear(struct cxl_dev_state *cxlds,
> + void __iomem *ras_base)
> {
> u32 hl[CXL_HEADERLOG_SIZE_U32];
> void __iomem *addr;
> u32 status;
> u32 fe;
>
> - if (!cxlds->regs.ras)
> + if (!ras_base)
> return false;
>
> - addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
> + addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
> return false;
> @@ -662,7 +720,7 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
> /* If multiple errors, log header points to first error from ctrl reg */
> if (hweight32(status) > 1) {
> void __iomem *rcc_addr =
> - cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
> + ras_base + CXL_RAS_CAP_CONTROL_OFFSET;
>
> fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
> readl(rcc_addr)));
> @@ -670,13 +728,54 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
> fe = status;
> }
>
> - header_log_copy(cxlds, hl);
> + header_log_copy(ras_base, hl);
> trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
> }
>
> +static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
> +{
> + return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
> +}
> +
> +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds)
> +{
> + return __cxl_report_and_clear(cxlds, cxlds->regs.dport_ras);
> +}
> +
> +static void cxl_rch_log_error(struct cxl_dev_state *cxlds)
> +{
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + struct aer_capability_regs aer_regs;
> + int severity;
> +
> + if (!cxl_rch_get_aer_info(cxlds->regs.aer, &aer_regs))
> + return;
> +
> + if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
> + return;
> +
> + cper_print_aer(pdev, severity, &aer_regs);
> +
> + if (severity == AER_CORRECTABLE)
> + cxl_log_correctable_ras_dport(cxlds);
> + else
> + cxl_report_and_clear_dport(cxlds);
> +}
> +
> +void cxl_cor_error_detected(struct pci_dev *pdev)
> +{
> + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> + if (cxlds->rcd)
> + cxl_rch_log_error(cxlds);
> +
> + cxl_log_correctable_ras_endpoint(cxlds);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
> +
> pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> pci_channel_state_t state)
> {
> @@ -685,6 +784,9 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> struct device *dev = &cxlmd->dev;
> bool ue;
>
> + if (cxlds->rcd)
> + cxl_rch_log_error(cxlds);
> +
> /*
> * A frozen channel indicates an impending reset which is fatal to
> * CXL.mem operation, and will likely crash the system. On the off
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index 108a349d8101..7130f35891da 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>
> return ret_val;
> }
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>
> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
> struct cxl_register_map *map, unsigned long map_mask)
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 9fd7df48ce99..7036e34354bc 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -66,6 +66,8 @@
> #define CXL_DECODER_MIN_GRANULARITY 256
> #define CXL_DECODER_MAX_ENCODED_IG 6
>
> +#define PCI_AER_CAPABILITY_LENGTH 56
> +
> static inline int cxl_hdm_decoder_count(u32 cap_hdr)
> {
> int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
> @@ -209,6 +211,15 @@ struct cxl_regs {
> struct_group_tagged(cxl_device_regs, device_regs,
> void __iomem *status, *mbox, *memdev;
> );
> +
> + /*
> + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
> + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
> + */
> + struct_group_tagged(cxl_rch_regs, rch_regs,
> + void __iomem *aer;
> + void __iomem *dport_ras;
> + );
> };
>
> struct cxl_reg_map {
> @@ -249,6 +260,8 @@ struct cxl_register_map {
> };
> };
>
> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
> + resource_size_t length);
> void cxl_probe_component_regs(struct device *dev, void __iomem *base,
> struct cxl_component_reg_map *map);
> void cxl_probe_device_regs(struct device *dev, void __iomem *base,
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 12e8e8ebaac0..e217c44ed749 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -4,6 +4,7 @@
> #include <linux/device.h>
> #include <linux/module.h>
> #include <linux/pci.h>
> +#include <linux/aer.h>
>
> #include "cxlmem.h"
> #include "cxlpci.h"
> @@ -45,6 +46,132 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
> return 0;
> }
>
> +static int rcec_enable_aer_ints(struct pci_dev *pdev)
> +{
> + struct pci_dev *rcec = pdev->rcec;
> + int aer, rc;
> + u32 mask;
> +
> + if (!rcec)
> + return -ENODEV;
> +
> + /*
> + * Internal errors are masked by default, unmask RCEC's here
> + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> + */
> + aer = rcec->aer_cap;
> + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
> + if (rc)
> + return rc;
> + mask &= ~PCI_ERR_UNC_INTN;
> + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
> + if (rc)
> + return rc;
> + mask &= ~PCI_ERR_COR_INTERNAL;
> + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
> +
> + return rc;
> +}
> +
> +static void disable_aer(void *_pdev)
> +{
> + struct pci_dev *pdev = (struct pci_dev *)_pdev;
> +
> + pci_disable_pcie_error_reporting(pdev);
> +
> + /*
> + * Keep the RCEC's internal AER enabled. There
> + * could be other RCiEPs using this RCEC.
> + */
> +}
> +
> +static void rch_disable_root_ints(void __iomem *aer_base)
> +{
> + u32 aer_cmd_mask, aer_cmd;
> +
> + /*
> + * Disable RCH root port command interrupts.
> + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
> + */
> + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
> + PCI_ERR_ROOT_CMD_NONFATAL_EN |
> + PCI_ERR_ROOT_CMD_FATAL_EN);
> + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
> + aer_cmd &= ~aer_cmd_mask;
> + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> +}
> +
> +static int cxl_ras_setup_interrupts(struct cxl_dev_state *cxlds)
> +{
> + struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> + int rc;
> +
> + if (cxlds->rcd) {
> + rch_disable_root_ints(cxlds->regs.aer);
> +
> + rc = rcec_enable_aer_ints(pdev);
> + if (rc)
> + return rc;
> + }
> +
> + rc = pci_enable_pcie_error_reporting(pdev);

Hi Terry, not sure if you saw this thread [1], but with the new changes
upstream [2] to the PCIe subsystem, Bjorn says we no longer need to call
pci_enable_pcie_error_report() by the driver.

[1]:
https://lore.kernel.org/linux-cxl/[email protected]/T/#mef401fb0580ebb4c4bc2a164f87e12b60cf76693
[2]: commit f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
native")

DJ

> + if (rc)
> + return rc;
> +
> + return devm_add_action_or_reset(&pdev->dev, disable_aer, pdev);
> +}
> +
> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds,
> + struct cxl_dport *parent_dport)
> +{
> + struct device *dev = parent_dport->dport;
> + resource_size_t aer_phys, ras_phys;
> + void __iomem *aer, *dport_ras;
> +
> + if (!parent_dport->rch)
> + return 0;
> +
> + if (!parent_dport->aer_cap || !parent_dport->ras_cap ||
> + parent_dport->component_reg_phys == CXL_RESOURCE_NONE)
> + return -ENODEV;
> +
> + aer_phys = parent_dport->aer_cap + parent_dport->rcrb;
> + aer = devm_cxl_iomap_block(dev, aer_phys,
> + PCI_AER_CAPABILITY_LENGTH);
> +
> + if (!aer)
> + return -ENOMEM;
> +
> + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys;
> + dport_ras = devm_cxl_iomap_block(dev, ras_phys,
> + CXL_RAS_CAPABILITY_LENGTH);
> +
> + if (!dport_ras)
> + return -ENOMEM;
> +
> + cxlds->regs.aer = aer;
> + cxlds->regs.dport_ras = dport_ras;
> +
> + return 0;
> +}
> +
> +static int cxl_setup_ras(struct cxl_dev_state *cxlds,
> + struct cxl_dport *parent_dport)
> +{
> + int rc;
> +
> + rc = cxl_rch_map_ras(cxlds, parent_dport);
> + if (rc)
> + return rc;
> +
> + return cxl_ras_setup_interrupts(cxlds);
> +}
> +
> static void cxl_rcrb_setup(struct cxl_dev_state *cxlds,
> struct cxl_dport *parent_dport)
> {
> @@ -93,6 +220,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>
> cxl_rcrb_setup(cxlds, parent_dport);
>
> + rc = cxl_setup_ras(cxlds, parent_dport);
> + /* Continue with RAS setup errors */
> + if (rc)
> + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc);
> + else
> + dev_info(&cxlmd->dev, "CXL error handling enabled\n");
> +
> endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys,
> parent_dport);
> if (IS_ERR(endpoint))

2023-03-28 13:47:43

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

Hi Bjorn,

On 3/24/23 17:36, Bjorn Helgaas wrote:
> I'd call this a "PCI/AER: ..." patch since that's where all the
> changes are.
>
> On Thu, Mar 23, 2023 at 04:38:07PM -0500, Terry Bowman wrote:
>> From: Robert Richter <[email protected]>
>>
>> In RCD mode a CXL device (RCD) is exposed as an RCiEP, but CXL
>> downstream and upstream ports are not enumerated and not visible in
>> the PCIe hierarchy. Protocol and link errors are sent to an RCEC.
>
> "RCD" isn't a common term in drivers/pci; can you expand it once here?
>
>> Now, RCH downstream port-detected errors are signaled as internal AER
>> errors (UIE/CIE) with the RCEC's source ID. A CXL handler must then
>
> Similarly, "UIE" and "CIE" are new to drivers/pci; can you expand them
> before using? I assume Uncorrectable Internal Error (UIE) and
> Corrected Internal Error (CIE)? (Annoying that the PCIe spec uses
> "Correctable" in general, but "Corrected" for Internal Errors.)
>
>> inspect the error status in various CXL registers residing in the
>> dport's component register space (CXL RAS cap) or the dport's RCRB
>> (AER ext cap). [1]
>>
>> This patch connects errors showing up in the RCEC's error handler with
>
> "Connect errors ..." (we already know this text applies to *this
> patch*).
>
>> the CXL subsystem. Implement this by forwarding the error to all CXL
>> devices below the RCEC. Since the entire CXL device is controlled only
>> using PCIe Configuration Space of device 0, Function 0, only pass it
>> there [2]. These devices have the Memory Device class code set
>> (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver can
>> implement the handler.
>
>> The CXL device driver is then responsible to
>> enable error reporting in the RCEC's AER cap
>
> I don't know exactly what you mean by "error reporting in the RCEC's
> AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> register and should already be enabled by pci_aer_init().
>
> Maybe you mean setting AER mask/severity specifically for Internal
> Errors? I'm hoping to get as much of AER management as we can in the
> PCI core and out of drivers, so maybe we need a new PCI interface to
> do that.
>
> In any event, I assume this sort of configuration would be an
> enumeration-time thing, while *this* patch is a run-time thing, so
> maybe this information belongs with a different patch?
>
>> (esp. CIE and UIE) and to
>> inspect the dport's CXL registers in addition (CXL RAS cap and AER ext
>> cap).
>>
>> The reason for choosing this implementation is that a CXL RCEC device
>> is bound to the AER port driver, but the driver does not allow it to
>> register a custom specific handler to support CXL. Connecting the RCEC
>> hard-wired with a CXL handler does not work, as the CXL subsystem
>> might not be present all the time. The alternative to add an
>> implementation to the portdrv to allow the registration of a custom
>> RCEC error handler isn't worth doing it as CXL would be its only user.
>> Instead, just check for an CXL RCEC and pass it down to the connected
>> CXL device's error handler.
>>
>> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
>> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
>>
>> Co-developed-by: Terry Bowman <[email protected]>
>> Signed-off-by: Terry Bowman <[email protected]>
>> Signed-off-by: Robert Richter <[email protected]>
>
> Since you're sending this patch (Terry) your Signed-off-by should be
> last.
>

I'll move my Signed-off-by to the last.

Regards,
Terry

2023-03-28 13:54:21

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] cxl/pci: Add RCH downstream port error logging

Hi dave,


On 3/27/23 18:21, Dave Jiang wrote:
>
>
> On 3/23/23 2:38 PM, Terry Bowman wrote:
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update the cxl_mem driver to map the RCH RAS and AER register discovered
>> earlier. The RAS and AER registers will be used in the RCH error handlers.
>>
>> Disable RCH downstream port's root port cmd interrupts. Enable RCEC AER
>> CIE/UIE reporting because they are disabled by default.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the downstream port AER
>> registers, check for error severity, and if an error exists will log
>> using an existing kernel AER trace routine. The RCH handler will also
>> reuse the existing RAS logging routine to log downstream port RAS
>> errors if they exist.
>>
>> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
>>
>> Co-developed-by: Robert Richter <[email protected]>
>> Signed-off-by: Robert Richter <[email protected]>
>> Signed-off-by: Terry Bowman <[email protected]>
>> ---
>>   drivers/cxl/core/pci.c  | 126 +++++++++++++++++++++++++++++++++----
>>   drivers/cxl/core/regs.c |   1 +
>>   drivers/cxl/cxl.h       |  13 ++++
>>   drivers/cxl/mem.c       | 134 ++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 262 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 7328a2552411..6e36471969ba 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -5,6 +5,7 @@
>>   #include <linux/delay.h>
>>   #include <linux/pci.h>
>>   #include <linux/pci-doe.h>
>> +#include <linux/aer.h>
>>   #include <cxlpci.h>
>>   #include <cxlmem.h>
>>   #include <cxl.h>
>> @@ -605,32 +606,88 @@ void read_cdat_data(struct cxl_port *port)
>>   }
>>   EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>>   -void cxl_cor_error_detected(struct pci_dev *pdev)
>> +/* Get AER severity. Return false if there is no error. */
>> +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
>> +                     int *severity)
>> +{
>> +    if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
>> +        if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
>> +            *severity = AER_FATAL;
>> +        else
>> +            *severity = AER_NONFATAL;
>> +        return true;
>> +    }
>> +
>> +    if (aer_regs->cor_status & ~aer_regs->cor_mask) {
>> +        *severity = AER_CORRECTABLE;
>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/*
>> + * Copy the AER capability registers to a buffer. This is necessary
>> + * because RCRB AER capability is MMIO mapped. Clear the status
>> + * after copying.
>> + *
>> + * @aer_base: base address of AER capability block in RCRB
>> + * @aer_regs: destination for copying AER capability
>> + */
>> +static bool cxl_rch_get_aer_info(void __iomem *aer_base,
>> +                 struct aer_capability_regs *aer_regs)
>> +{
>> +    int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32);
>> +    u32 *aer_regs_buf = (u32 *)aer_regs;
>> +    int n;
>> +
>> +    if (!aer_base)
>> +        return false;
>> +
>> +    for (n = 0; n < read_cnt; n++)
>> +        aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
>> +
>> +    writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
>> +    writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
>> +
>> +    return true;
>> +}
>> +
>> +static void __cxl_log_correctable_ras(struct cxl_dev_state *cxlds,
>> +                      void __iomem *ras_base)
>>   {
>> -    struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>       void __iomem *addr;
>>       u32 status;
>>   -    if (!cxlds->regs.ras)
>> +    if (!ras_base)
>>           return;
>>   -    addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>> +    addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>       status = readl(addr);
>>       if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>>           writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>>           trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
>>       }
>>   }
>> -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
>> +
>> +static void cxl_log_correctable_ras_endpoint(struct cxl_dev_state *cxlds)
>> +{
>> +    return __cxl_log_correctable_ras(cxlds, cxlds->regs.ras);
>> +}
>> +
>> +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds)
>> +{
>> +    return __cxl_log_correctable_ras(cxlds, cxlds->regs.dport_ras);
>> +}
>>     /* CXL spec rev3.0 8.2.4.16.1 */
>> -static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
>> +static void header_log_copy(void __iomem *ras_base, u32 *log)
>>   {
>>       void __iomem *addr;
>>       u32 *log_addr;
>>       int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);
>>   -    addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
>> +    addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
>>       log_addr = log;
>>         for (i = 0; i < log_u32_size; i++) {
>> @@ -644,17 +701,18 @@ static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log)
>>    * Log the state of the RAS status registers and prepare them to log the
>>    * next error status. Return 1 if reset needed.
>>    */
>> -static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>> +static bool __cxl_report_and_clear(struct cxl_dev_state *cxlds,
>> +                  void __iomem *ras_base)
>>   {
>>       u32 hl[CXL_HEADERLOG_SIZE_U32];
>>       void __iomem *addr;
>>       u32 status;
>>       u32 fe;
>>   -    if (!cxlds->regs.ras)
>> +    if (!ras_base)
>>           return false;
>>   -    addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>> +    addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>>       status = readl(addr);
>>       if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
>>           return false;
>> @@ -662,7 +720,7 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>>       /* If multiple errors, log header points to first error from ctrl reg */
>>       if (hweight32(status) > 1) {
>>           void __iomem *rcc_addr =
>> -            cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
>> +            ras_base + CXL_RAS_CAP_CONTROL_OFFSET;
>>             fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
>>                      readl(rcc_addr)));
>> @@ -670,13 +728,54 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>>           fe = status;
>>       }
>>   -    header_log_copy(cxlds, hl);
>> +    header_log_copy(ras_base, hl);
>>       trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
>>       writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>         return true;
>>   }
>>   +static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
>> +{
>> +    return __cxl_report_and_clear(cxlds, cxlds->regs.ras);
>> +}
>> +
>> +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds)
>> +{
>> +    return __cxl_report_and_clear(cxlds, cxlds->regs.dport_ras);
>> +}
>> +
>> +static void cxl_rch_log_error(struct cxl_dev_state *cxlds)
>> +{
>> +    struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> +    struct aer_capability_regs aer_regs;
>> +    int severity;
>> +
>> +    if (!cxl_rch_get_aer_info(cxlds->regs.aer, &aer_regs))
>> +        return;
>> +
>> +    if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
>> +        return;
>> +
>> +    cper_print_aer(pdev, severity, &aer_regs);
>> +
>> +    if (severity == AER_CORRECTABLE)
>> +        cxl_log_correctable_ras_dport(cxlds);
>> +    else
>> +        cxl_report_and_clear_dport(cxlds);
>> +}
>> +
>> +void cxl_cor_error_detected(struct pci_dev *pdev)
>> +{
>> +    struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> +
>> +    if (cxlds->rcd)
>> +        cxl_rch_log_error(cxlds);
>> +
>> +    cxl_log_correctable_ras_endpoint(cxlds);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL);
>> +
>>   pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>                       pci_channel_state_t state)
>>   {
>> @@ -685,6 +784,9 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>       struct device *dev = &cxlmd->dev;
>>       bool ue;
>>   +    if (cxlds->rcd)
>> +        cxl_rch_log_error(cxlds);
>> +
>>       /*
>>        * A frozen channel indicates an impending reset which is fatal to
>>        * CXL.mem operation, and will likely crash the system. On the off
>> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
>> index 108a349d8101..7130f35891da 100644
>> --- a/drivers/cxl/core/regs.c
>> +++ b/drivers/cxl/core/regs.c
>> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>>         return ret_val;
>>   }
>> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL);
>>     int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
>>                  struct cxl_register_map *map, unsigned long map_mask)
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 9fd7df48ce99..7036e34354bc 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -66,6 +66,8 @@
>>   #define CXL_DECODER_MIN_GRANULARITY 256
>>   #define CXL_DECODER_MAX_ENCODED_IG 6
>>   +#define PCI_AER_CAPABILITY_LENGTH 56
>> +
>>   static inline int cxl_hdm_decoder_count(u32 cap_hdr)
>>   {
>>       int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr);
>> @@ -209,6 +211,15 @@ struct cxl_regs {
>>       struct_group_tagged(cxl_device_regs, device_regs,
>>           void __iomem *status, *mbox, *memdev;
>>       );
>> +
>> +    /*
>> +     * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode)
>> +     * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors
>> +     */
>> +    struct_group_tagged(cxl_rch_regs, rch_regs,
>> +        void __iomem *aer;
>> +        void __iomem *dport_ras;
>> +    );
>>   };
>>     struct cxl_reg_map {
>> @@ -249,6 +260,8 @@ struct cxl_register_map {
>>       };
>>   };
>>   +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>> +                   resource_size_t length);
>>   void cxl_probe_component_regs(struct device *dev, void __iomem *base,
>>                     struct cxl_component_reg_map *map);
>>   void cxl_probe_device_regs(struct device *dev, void __iomem *base,
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 12e8e8ebaac0..e217c44ed749 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -4,6 +4,7 @@
>>   #include <linux/device.h>
>>   #include <linux/module.h>
>>   #include <linux/pci.h>
>> +#include <linux/aer.h>
>>     #include "cxlmem.h"
>>   #include "cxlpci.h"
>> @@ -45,6 +46,132 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>>       return 0;
>>   }
>>   +static int rcec_enable_aer_ints(struct pci_dev *pdev)
>> +{
>> +    struct pci_dev *rcec = pdev->rcec;
>> +    int aer, rc;
>> +    u32 mask;
>> +
>> +    if (!rcec)
>> +        return -ENODEV;
>> +
>> +    /*
>> +     * Internal errors are masked by default, unmask RCEC's here
>> +     * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
>> +     * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
>> +     */
>> +    aer = rcec->aer_cap;
>> +    rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
>> +    if (rc)
>> +        return rc;
>> +    mask &= ~PCI_ERR_UNC_INTN;
>> +    rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
>> +    if (rc)
>> +        return rc;
>> +
>> +    rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
>> +    if (rc)
>> +        return rc;
>> +    mask &= ~PCI_ERR_COR_INTERNAL;
>> +    rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
>> +
>> +    return rc;
>> +}
>> +
>> +static void disable_aer(void *_pdev)
>> +{
>> +    struct pci_dev *pdev = (struct pci_dev *)_pdev;
>> +
>> +    pci_disable_pcie_error_reporting(pdev);
>> +
>> +    /*
>> +     * Keep the RCEC's internal AER enabled. There
>> +     * could be other RCiEPs using this RCEC.
>> +     */
>> +}
>> +
>> +static void rch_disable_root_ints(void __iomem *aer_base)
>> +{
>> +    u32 aer_cmd_mask, aer_cmd;
>> +
>> +    /*
>> +     * Disable RCH root port command interrupts.
>> +     * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors
>> +     */
>> +    aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
>> +            PCI_ERR_ROOT_CMD_NONFATAL_EN |
>> +            PCI_ERR_ROOT_CMD_FATAL_EN);
>> +    aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
>> +    aer_cmd &= ~aer_cmd_mask;
>> +    writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>> +}
>> +
>> +static int cxl_ras_setup_interrupts(struct cxl_dev_state *cxlds)
>> +{
>> +    struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> +    int rc;
>> +
>> +    if (cxlds->rcd) {
>> +        rch_disable_root_ints(cxlds->regs.aer);
>> +
>> +        rc = rcec_enable_aer_ints(pdev);
>> +        if (rc)
>> +            return rc;
>> +    }
>> +
>> +    rc = pci_enable_pcie_error_reporting(pdev);
>
> Hi Terry, not sure if you saw this thread [1], but with the new changes upstream [2] to the PCIe subsystem, Bjorn says we no longer need to call pci_enable_pcie_error_report() by the driver.
>
> [1]: https://lore.kernel.org/linux-cxl/[email protected]/T/#mef401fb0580ebb4c4bc2a164f87e12b60cf76693
> [2]: commit f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native")
>
> DJ
>

Ok. I'll remove the call to pci_enable_pcie_error_reporting(). This
will help simplify since it allows removing disable_aer() and call
to devm_add_action_or_reset() as well.

Regards,
Terry

2023-03-28 17:32:49

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

[+cc linux-pci, more error handling folks; beginning of thread at
https://lore.kernel.org/all/[email protected]/]

On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> On 24.03.23 17:36:56, Bjorn Helgaas wrote:

> > > The CXL device driver is then responsible to
> > > enable error reporting in the RCEC's AER cap
> >
> > I don't know exactly what you mean by "error reporting in the RCEC's
> > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > register and should already be enabled by pci_aer_init().
> >
> > Maybe you mean setting AER mask/severity specifically for Internal
> > Errors? I'm hoping to get as much of AER management as we can in the
>
> Richt, this is implemented in patch #5 in function
> rcec_enable_aer_ints().

I think we should add a PCI core interface for this so we can enforce
the AER ownership question (all the crud like pcie_aer_is_native()) in
one place.

> > PCI core and out of drivers, so maybe we need a new PCI interface to
> > do that.
> >
> > In any event, I assume this sort of configuration would be an
> > enumeration-time thing, while *this* patch is a run-time thing, so
> > maybe this information belongs with a different patch?
>
> Do you mean once a Restricted CXL host (RCH) is detected, the internal
> errors should be enabled in the device mask, all this done during
> device enumeration? But wouldn't interrupts being enabled then before
> the CXL device is ready?

I'm not sure what you mean by "before the CXL device is ready." What
makes a CXL device ready, and how do we know when it is ready?

pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
as soon as we enumerate the device, before any driver claims the
device. I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTN fiddling around the same time?

> > I haven't worked all the way through this, but I thought Sean Kelley's
> > and Qiuxu Zhuo's work was along the same line and might cover this,
> > e.g.,
> >
> > a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> > 579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> > af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> >
> > But I guess maybe it's not quite the same case?
>
> Actually, we use this code to handle errors that are reported to the
> RCEC and only implement here the CXL specifics. That is, checking if
> the RCEC receives something from a CXL downstream port and forwarding
> that to a CXL handler (this patch). The handler then checks the AER
> err cap in the RCRB of all CXL downstream ports associated to the RCEC
> (not visible in the PCI hierarchy), but discovered through the :00.0
> RCiEP (patch #5).

There are two calls to pcie_walk_rcec():

1) The existing one in find_source_device()
2) The one you add in handle_cxl_error()

Does the call in handle_cxl_error() look at devices that the existing
call in find_source_device() does not? I'm trying to understand why
we need both calls.

> > > +static bool is_internal_error(struct aer_err_info *info)
> > > +{
> > > + if (info->severity == AER_CORRECTABLE)
> > > + return info->status & PCI_ERR_COR_INTERNAL;
> > > +
> > > + return info->status & PCI_ERR_UNC_INTN;
> > > +}
> > > +
> > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > > +{
> > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > + is_internal_error(info))
> >
> > What's unique about Internal Errors? I'm trying to figure out why you
> > wouldn't do this for *all* CXL errors.
>
> Per CXL specification downstream port errors are signaled using
> internal errors.

Maybe a spec reference here to explain is_internal_error()? Is the
point of the check to *exclude* non-internal errors? Or is basically
documentation that there shouldn't ever *be* any non-internal errors?
I guess the latter wouldn't make sense because at this point we don't
know whether this is a CXL hierarchy.

> All other errors would be device specific, we cannot
> handle that in a generic CXL driver.

I'm missing the point here. We don't have any device-specific error
handling in aer.c; it only connects the generic *reporting* mechanism
(AER log registers and Root Port interrupts) to the drivers that do
the device-specific things via err_handler hooks. I assume we want a
similar model for CXL.

Bjorn

2023-03-29 16:04:38

by Robert Richter

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

On 28.03.23 12:21:04, Bjorn Helgaas wrote:
> [+cc linux-pci, more error handling folks; beginning of thread at
> https://lore.kernel.org/all/[email protected]/]
>
> On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> > On 24.03.23 17:36:56, Bjorn Helgaas wrote:
>
> > > > The CXL device driver is then responsible to
> > > > enable error reporting in the RCEC's AER cap
> > >
> > > I don't know exactly what you mean by "error reporting in the RCEC's
> > > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > > register and should already be enabled by pci_aer_init().
> > >
> > > Maybe you mean setting AER mask/severity specifically for Internal
> > > Errors? I'm hoping to get as much of AER management as we can in the
> >
> > Richt, this is implemented in patch #5 in function
> > rcec_enable_aer_ints().
>
> I think we should add a PCI core interface for this so we can enforce
> the AER ownership question (all the crud like pcie_aer_is_native()) in
> one place.

Do you mean, code around functions rcec_enable_aer_ints() should be
moved to aer.c and the cxl handler then just assumes it is enabled
already? That looks feasible.

>
> > > PCI core and out of drivers, so maybe we need a new PCI interface to
> > > do that.
> > >
> > > In any event, I assume this sort of configuration would be an
> > > enumeration-time thing, while *this* patch is a run-time thing, so
> > > maybe this information belongs with a different patch?
> >
> > Do you mean once a Restricted CXL host (RCH) is detected, the internal
> > errors should be enabled in the device mask, all this done during
> > device enumeration? But wouldn't interrupts being enabled then before
> > the CXL device is ready?
>
> I'm not sure what you mean by "before the CXL device is ready." What
> makes a CXL device ready, and how do we know when it is ready?

The cxl_pci driver must be bound to a device which then further
creates a CXL mem dev. With that binding we can determine the
connected CXL dports from the cxl endpoints (which are seen as PCIe
endpoints) to inspect the CXL RAS caps (in the CXL component reg
space) and the PCIe AER caps (in the RCRB of the dport).

>
> pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
> as soon as we enumerate the device, before any driver claims the
> device. I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
> PCI_ERR_UNC_INTN fiddling around the same time?

Yes, if the CXL device is not yet bound, there is no handler attached
and AER errors are only handled on a PCI level. Though, we need to
make sure the status is cleared.

>
> > > I haven't worked all the way through this, but I thought Sean Kelley's
> > > and Qiuxu Zhuo's work was along the same line and might cover this,
> > > e.g.,
> > >
> > > a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> > > 579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> > > af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> > >
> > > But I guess maybe it's not quite the same case?
> >
> > Actually, we use this code to handle errors that are reported to the
> > RCEC and only implement here the CXL specifics. That is, checking if
> > the RCEC receives something from a CXL downstream port and forwarding
> > that to a CXL handler (this patch). The handler then checks the AER
> > err cap in the RCRB of all CXL downstream ports associated to the RCEC
> > (not visible in the PCI hierarchy), but discovered through the :00.0
> > RCiEP (patch #5).
>
> There are two calls to pcie_walk_rcec():
>
> 1) The existing one in find_source_device()
> 2) The one you add in handle_cxl_error()
>
> Does the call in handle_cxl_error() look at devices that the existing
> call in find_source_device() does not? I'm trying to understand why
> we need both calls.

In case of a dport error, e_info will only contain the RCEC's id after
running find_source_device(). Thus, only the RCEC's handler would be
called. The portdrv is already bound to the device and currently
doesn't have a handler attached.

As described, due to cross dependencies between cxl and the portdrv,
instead of implementing a handler in the portdrv, we decided to
forward errors to the CXL endpoint driver and handle it there. So now,
in handle_cxl_error(), we check if the error source is an RCEC
attached to a CXL bus and we forward everything directly to the CXL
endpoint handler. pcie_walk_rcec() is used for that.

>
> > > > +static bool is_internal_error(struct aer_err_info *info)
> > > > +{
> > > > + if (info->severity == AER_CORRECTABLE)
> > > > + return info->status & PCI_ERR_COR_INTERNAL;
> > > > +
> > > > + return info->status & PCI_ERR_UNC_INTN;
> > > > +}
> > > > +
> > > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info *info)
> > > > +{
> > > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > > + is_internal_error(info))
> > >
> > > What's unique about Internal Errors? I'm trying to figure out why you
> > > wouldn't do this for *all* CXL errors.
> >
> > Per CXL specification downstream port errors are signaled using
> > internal errors.
>
> Maybe a spec reference here to explain is_internal_error()? Is the
> point of the check to *exclude* non-internal errors? Or is basically
> documentation that there shouldn't ever *be* any non-internal errors?
> I guess the latter wouldn't make sense because at this point we don't
> know whether this is a CXL hierarchy.

It is described in CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected
Errors.

We do not handle errors other than internal ones, this is what the
check is for. In theory, an RCEC could also throw other kind of
errors. But, as per spec, once internal error are received from the
RCEC, the CXL dports need to be inspected.

>
> > All other errors would be device specific, we cannot
> > handle that in a generic CXL driver.
>
> I'm missing the point here. We don't have any device-specific error
> handling in aer.c; it only connects the generic *reporting* mechanism
> (AER log registers and Root Port interrupts) to the drivers that do
> the device-specific things via err_handler hooks. I assume we want a
> similar model for CXL.

With device specific I mean implementation defined and not described
in a specification. The CXL handler is sort of generic as it is
(solely) implementing the CXL spec. Hope that makes sense.

Thanks,

-Robert

2023-04-14 20:48:08

by Terry Bowman

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] pci/aer: Export cper_print_aer() for CXL driver logging

Thanks Sathya.

On 3/23/23 17:26, Sathyanarayanan Kuppuswamy wrote:
> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>