2024-04-07 21:06:01

by PJ Waskiewicz

[permalink] [raw]
Subject: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

From: PJ Waskiewicz <[email protected]>

Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
drivers on Emerald Rapids systems. However, on some production
systems from some vendors, a buggy BIOS exists that improperly
populates the ACPI => PCI mappings. This leads to the cxl_acpi
driver to fail probe when it cannot find the root port's _UID, in
order to look up the device's CXL attributes in the CEDT.

Add a bit more of a descriptive message that the lookup failure
could be a bad BIOS, rather than just "failed."

Signed-off-by: PJ Waskiewicz <[email protected]>
---
drivers/cxl/acpi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index af5cb818f84d..56019466a09c 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,

rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
if (rc != AE_OK) {
- dev_err(dev, "unable to retrieve _UID\n");
+ dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
return -ENOENT;
}

--
2.40.1



2024-04-07 21:28:41

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
>
> rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> if (rc != AE_OK) {
> - dev_err(dev, "unable to retrieve _UID\n");
> + dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> return -ENOENT;
> }

dev_err(dev, FW_BUG "unable to retrieve _UID\n");
^^^^^^

There's a macro for that.

2024-04-08 02:03:34

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On 24/04/07 11:28PM, Lukas Wunner wrote:

Hi Lukas,

> On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> > --- a/drivers/cxl/acpi.c
> > +++ b/drivers/cxl/acpi.c
> > @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> >
> > rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> > if (rc != AE_OK) {
> > - dev_err(dev, "unable to retrieve _UID\n");
> > + dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> > return -ENOENT;
> > }
>
> dev_err(dev, FW_BUG "unable to retrieve _UID\n");
> ^^^^^^
>
> There's a macro for that.

Doh...it's been awhile since I've crossed buggy BIOS's. Thanks for the
review and comment.

Updated patch:

cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

From: PJ Waskiewicz <[email protected]>

Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
drivers on Emerald Rapids systems. However, on some production
systems from some vendors, a buggy BIOS exists that improperly
populates the ACPI => PCI mappings. This leads to the cxl_acpi
driver to fail probe when it cannot find the root port's _UID, in
order to look up the device's CXL attributes in the CEDT.

Add a bit more of a descriptive message that the lookup failure
could be a bad BIOS, rather than just "failed."

v2: Updated message to use existing FW_BUG macro

Signed-off-by: PJ Waskiewicz <[email protected]>
---
drivers/cxl/acpi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index af5cb818f84d..d321cfbd4682 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,

rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
if (rc != AE_OK) {
- dev_err(dev, "unable to retrieve _UID\n");
+ dev_err(dev, FW_BUG "unable to retrieve _UID\n");
return -ENOENT;
}

--
2.40.1

2024-04-08 08:34:45

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Sun, 7 Apr 2024 19:03:23 -0700
PJ Waskiewicz <[email protected]> wrote:

> On 24/04/07 11:28PM, Lukas Wunner wrote:
>
> Hi Lukas,
>
> > On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> > > --- a/drivers/cxl/acpi.c
> > > +++ b/drivers/cxl/acpi.c
> > > @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> > >
> > > rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> > > if (rc != AE_OK) {
> > > - dev_err(dev, "unable to retrieve _UID\n");
> > > + dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> > > return -ENOENT;
> > > }
> >
> > dev_err(dev, FW_BUG "unable to retrieve _UID\n");
> > ^^^^^^
> >
> > There's a macro for that.
>
> Doh...it's been awhile since I've crossed buggy BIOS's. Thanks for the
> review and comment.
>
> Updated patch:
>
> cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure
>
> From: PJ Waskiewicz <[email protected]>
>
> Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> drivers on Emerald Rapids systems. However, on some production
> systems from some vendors, a buggy BIOS exists that improperly
> populates the ACPI => PCI mappings. This leads to the cxl_acpi
> driver to fail probe when it cannot find the root port's _UID, in
> order to look up the device's CXL attributes in the CEDT.
>
> Add a bit more of a descriptive message that the lookup failure
> could be a bad BIOS, rather than just "failed."
>
> v2: Updated message to use existing FW_BUG macro
Move the change log "v2..." etc below the ---
as we don't want it in the long term git log + better to send a fresh
patch in a separate thread.

Other than that seems reasonable to hint it is probably a bios
bug - however I wonder how many other cases we should do this for and
whether it is worth the effort of marking them all?

Jonathan



>
> Signed-off-by: PJ Waskiewicz <[email protected]>
> ---
> drivers/cxl/acpi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index af5cb818f84d..d321cfbd4682 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
>
> rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> if (rc != AE_OK) {
> - dev_err(dev, "unable to retrieve _UID\n");
> + dev_err(dev, FW_BUG "unable to retrieve _UID\n");
> return -ENOENT;
> }
>


2024-04-08 16:55:07

by Dan Williams

[permalink] [raw]
Subject: RE: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

ppwaskie@ wrote:
> From: PJ Waskiewicz <[email protected]>
>
> Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> drivers on Emerald Rapids systems. However, on some production
> systems from some vendors, a buggy BIOS exists that improperly
> populates the ACPI => PCI mappings. This leads to the cxl_acpi
> driver to fail probe when it cannot find the root port's _UID, in
> order to look up the device's CXL attributes in the CEDT.
>
> Add a bit more of a descriptive message that the lookup failure
> could be a bad BIOS, rather than just "failed."

Makes sense, but is the goal here to name and shame the BIOS, or find a
potential quirk workaround? Presumably we could fall back to parsing
_UID instead of a string and then get some guidance from said BIOS about
how to lookup the corresponding ACPI0016 device from that identifier.

In other words, I see this patch as a warning shot of, "hey,
$platform_vendor if you
don't want folks to RMA these platforms please tell us how to do the
association Linux expects per the spec". Otherwise, this can escalate to
a loud WARN_TAINT(TAINT_FIRMWARE_WORKAROUND...), but I first want more
details from this platform like an acpidump and the exact error code
acpi_evaluate_integer() is returning.

2024-04-08 19:25:12

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On 24/04/08 09:54AM, Dan Williams wrote:
> ppwaskie@ wrote:
> > From: PJ Waskiewicz <[email protected]>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems. However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings. This leads to the cxl_acpi
> > driver to fail probe when it cannot find the root port's _UID, in
> > order to look up the device's CXL attributes in the CEDT.
> >
> > Add a bit more of a descriptive message that the lookup failure
> > could be a bad BIOS, rather than just "failed."
>
> Makes sense, but is the goal here to name and shame the BIOS, or find a
> potential quirk workaround? Presumably we could fall back to parsing
> _UID instead of a string and then get some guidance from said BIOS about
> how to lookup the corresponding ACPI0016 device from that identifier.

In this particular case, I tried making sense of what was the _UID
value, and what was actually in the CEDT. There was no sense to be
made.

For this device, it was ACPI0016:02 with a _UID of CX02. For this
particular vendor BIOS, all ACPI0016:* devices' _UID's counted up from
CX01 => CX* sequentially. But what was actually in the CEDT in this
particular case for ACPI0016:02 was 49. I attempted hex, octal, atoi(),
literal string interpretation per-character, etc. It was just plain
wrong.

> In other words, I see this patch as a warning shot of, "hey,
> $platform_vendor if you
> don't want folks to RMA these platforms please tell us how to do the
> association Linux expects per the spec". Otherwise, this can escalate to
> a loud WARN_TAINT(TAINT_FIRMWARE_WORKAROUND...), but I first want more
> details from this platform like an acpidump and the exact error code
> acpi_evaluate_integer() is returning.

Pasting an acpidump is difficult... It'll be tricky since this particular
host is walled off from the world. And moving data in and out of this
environment is quite challenging due to regulatory reasons.

acpi_evaluate_integer() in this case was returning AE_BUFFER_OVERFLOW.

In the meantime, I'm fine either fixing up the commit message per
Jonathan's review, or I'm fine shelving it in favor of a broader effort
to fix the underlying BIOS's with the vendors. I don't have a strong
preference. I've been in the weeds with this for awhile, so I know why
it's breaking. But someone new to CXL with shiny new hardware may be
left scratching their heads.

-PJ

2024-04-08 19:30:06

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On 24/04/08 09:34AM, Jonathan Cameron wrote:
> On Sun, 7 Apr 2024 19:03:23 -0700
> PJ Waskiewicz <[email protected]> wrote:
>
> > On 24/04/07 11:28PM, Lukas Wunner wrote:
> >
> > Hi Lukas,
> >
> > > On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> > > > --- a/drivers/cxl/acpi.c
> > > > +++ b/drivers/cxl/acpi.c
> > > > @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> > > >
> > > > rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> > > > if (rc != AE_OK) {
> > > > - dev_err(dev, "unable to retrieve _UID\n");
> > > > + dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> > > > return -ENOENT;
> > > > }
> > >
> > > dev_err(dev, FW_BUG "unable to retrieve _UID\n");
> > > ^^^^^^
> > >
> > > There's a macro for that.
> >
> > Doh...it's been awhile since I've crossed buggy BIOS's. Thanks for the
> > review and comment.
> >
> > Updated patch:
> >
> > cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure
> >
> > From: PJ Waskiewicz <[email protected]>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems. However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings. This leads to the cxl_acpi
> > driver to fail probe when it cannot find the root port's _UID, in
> > order to look up the device's CXL attributes in the CEDT.
> >
> > Add a bit more of a descriptive message that the lookup failure
> > could be a bad BIOS, rather than just "failed."
> >
> > v2: Updated message to use existing FW_BUG macro
> Move the change log "v2..." etc below the ---
> as we don't want it in the long term git log + better to send a fresh
> patch in a separate thread.

Thanks, it's been awhile, and my normal (i.e. old) workflow isn't
available to me just quite yet. I can fix and send a new patch, but
I'll hold off and see what Dan's thoughts are after my reply to his
reply.

> Other than that seems reasonable to hint it is probably a bios
> bug - however I wonder how many other cases we should do this for and
> whether it is worth the effort of marking them all?

I can confirm this was definitely a BIOS bug in this particular case.
The vendor spun a quick test BIOS for us to test on an EMR and SPR host,
and the _UID's were finally correct. I could successfully walk the CEDT
and get to the CAPS structs I was after (link speed, bus width, etc.).

I'd be fine also marking the others, but I don't have any easy way to
validate that I'd hit those cases. My BIOS for this platform is only
minorly broken. I suppose it could be mocked in QEMU to cause those to
fail...

-PJ

2024-04-08 20:45:40

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

PJ Waskiewicz wrote:
[..]
> > Other than that seems reasonable to hint it is probably a bios
> > bug - however I wonder how many other cases we should do this for and
> > whether it is worth the effort of marking them all?
>
> I can confirm this was definitely a BIOS bug in this particular case.
> The vendor spun a quick test BIOS for us to test on an EMR and SPR host,
> and the _UID's were finally correct. I could successfully walk the CEDT
> and get to the CAPS structs I was after (link speed, bus width, etc.).

Oh, in that case I think there's no need to worry about a Linux quirk.

I do think cxl_acpi has multiple overlapping failure cases when what you
really want to know is whether:

"CXL host bridge can be found in CEDT.CHBS"

Turns out that straightforward message is aleady a driver message, but
it gets skipped in this case. So, I am thinking of cleanup /
clarification along the following lines:

1/ Lean on the existing cxl_get_chbs() validation paths to report on
errors

2/ Include the device-name rather than the UID since if UID is
unreliable it does not help you communicate with your BIOS vendor. I.e.
give a breadcrumb for the BIOS engineer to match the AML device name
with the CEDT content.

3/ Do not fail driver load on a single host-bridge parsing failure

4/ These are all cxl_acpi driver init events, so consistently use the
ACPI0017 device, and the cxl_acpi driver, as the originator of the error
message.

Would this clarification have saved you time with the debug?

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index 32091379a97b..5a70d7312c64 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -511,29 +511,26 @@ static int cxl_get_chbs_iter(union acpi_subtable_headers *header, void *arg,
return 0;
}

-static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
- struct cxl_chbs_context *ctx)
+static void cxl_get_chbs(struct device *dev, struct acpi_device *hb,
+ struct cxl_chbs_context *ctx)
{
- unsigned long long uid;
int rc;

- rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
- if (rc != AE_OK) {
- dev_err(dev, "unable to retrieve _UID\n");
- return -ENOENT;
- }
-
- dev_dbg(dev, "UID found: %lld\n", uid);
- *ctx = (struct cxl_chbs_context) {
+ *ctx = (struct cxl_chbs_context){
.dev = dev,
- .uid = uid,
.base = CXL_RESOURCE_NONE,
.cxl_version = UINT_MAX,
};

- acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
+ rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL,
+ &ctx->uid);
+ if (rc != AE_OK) {
+ dev_dbg(dev, "unable to retrieve _UID\n");
+ return;
+ }

- return 0;
+ dev_dbg(dev, "UID found: %lld\n", ctx->uid);
+ acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
}

static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
@@ -561,7 +558,6 @@ static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
static int add_host_bridge_dport(struct device *match, void *arg)
{
int ret;
- acpi_status rc;
struct device *bridge;
struct cxl_dport *dport;
struct cxl_chbs_context ctx;
@@ -573,19 +569,16 @@ static int add_host_bridge_dport(struct device *match, void *arg)
if (!hb)
return 0;

- rc = cxl_get_chbs(match, hb, &ctx);
- if (rc)
- return rc;
-
+ cxl_get_chbs(match, hb, &ctx);
if (ctx.cxl_version == UINT_MAX) {
- dev_warn(match, "No CHBS found for Host Bridge (UID %lld)\n",
- ctx.uid);
+ dev_err(host, FW_BUG "No CHBS found for Host Bridge (%s)\n",
+ dev_name(match));
return 0;
}

if (ctx.base == CXL_RESOURCE_NONE) {
- dev_warn(match, "CHBS invalid for Host Bridge (UID %lld)\n",
- ctx.uid);
+ dev_err(host, FW_BUG "CHBS invalid for Host Bridge (%s)\n",
+ dev_name(match));
return 0;
}

@@ -650,13 +643,11 @@ static int add_host_bridge_uport(struct device *match, void *arg)
return 0;
}

- rc = cxl_get_chbs(match, hb, &ctx);
- if (rc)
- return rc;
-
+ cxl_get_chbs(match, hb, &ctx);
if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) {
- dev_warn(bridge,
- "CXL CHBS version mismatch, skip port registration\n");
+ dev_err(host,
+ FW_BUG "CXL CHBS version mismatch, skip port registration for %s\n",
+ dev_name(match));
return 0;
}


2024-04-08 21:43:01

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On 24/04/08 01:45PM, Dan Williams wrote:
> PJ Waskiewicz wrote:
> [..]
> > > Other than that seems reasonable to hint it is probably a bios
> > > bug - however I wonder how many other cases we should do this for and
> > > whether it is worth the effort of marking them all?
> >
> > I can confirm this was definitely a BIOS bug in this particular case.
> > The vendor spun a quick test BIOS for us to test on an EMR and SPR host,
> > and the _UID's were finally correct. I could successfully walk the CEDT
> > and get to the CAPS structs I was after (link speed, bus width, etc.).
>
> Oh, in that case I think there's no need to worry about a Linux quirk.

Copy that.

> I do think cxl_acpi has multiple overlapping failure cases when what you
> really want to know is whether:
>
> "CXL host bridge can be found in CEDT.CHBS"

Yes.

> Turns out that straightforward message is aleady a driver message, but
> it gets skipped in this case. So, I am thinking of cleanup /
> clarification along the following lines:
>
> 1/ Lean on the existing cxl_get_chbs() validation paths to report on
> errors
>
> 2/ Include the device-name rather than the UID since if UID is
> unreliable it does not help you communicate with your BIOS vendor. I.e.
> give a breadcrumb for the BIOS engineer to match the AML device name
> with the CEDT content.
>
> 3/ Do not fail driver load on a single host-bridge parsing failure
>
> 4/ These are all cxl_acpi driver init events, so consistently use the
> ACPI0017 device, and the cxl_acpi driver, as the originator of the error
> message.
>
> Would this clarification have saved you time with the debug?

Inline below, but yes, this would have helped sped up the debug quite a
bit. Since the initial debug was happening on SPR, and I couldn't use
the host drivers, I was pulling the logic from the CXL host drivers into
my drivers to skip the need for ACPI0017.

> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 32091379a97b..5a70d7312c64 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -511,29 +511,26 @@ static int cxl_get_chbs_iter(union acpi_subtable_headers *header, void *arg,
> return 0;
> }
>
> -static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> - struct cxl_chbs_context *ctx)
> +static void cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> + struct cxl_chbs_context *ctx)
> {
> - unsigned long long uid;
> int rc;
>
> - rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> - if (rc != AE_OK) {
> - dev_err(dev, "unable to retrieve _UID\n");
> - return -ENOENT;
> - }
> -
> - dev_dbg(dev, "UID found: %lld\n", uid);
> - *ctx = (struct cxl_chbs_context) {
> + *ctx = (struct cxl_chbs_context){
> .dev = dev,
> - .uid = uid,
> .base = CXL_RESOURCE_NONE,
> .cxl_version = UINT_MAX,
> };
>
> - acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
> + rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL,
> + &ctx->uid);
> + if (rc != AE_OK) {
> + dev_dbg(dev, "unable to retrieve _UID\n");
> + return;
> + }

Before I read the changes below, I didn't see why this was still
useful...but I do see it now in full context.

>
> - return 0;
> + dev_dbg(dev, "UID found: %lld\n", ctx->uid);
> + acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
> }
>
> static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
> @@ -561,7 +558,6 @@ static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
> static int add_host_bridge_dport(struct device *match, void *arg)
> {
> int ret;
> - acpi_status rc;
> struct device *bridge;
> struct cxl_dport *dport;
> struct cxl_chbs_context ctx;
> @@ -573,19 +569,16 @@ static int add_host_bridge_dport(struct device *match, void *arg)
> if (!hb)
> return 0;
>
> - rc = cxl_get_chbs(match, hb, &ctx);
> - if (rc)
> - return rc;
> -
> + cxl_get_chbs(match, hb, &ctx);
> if (ctx.cxl_version == UINT_MAX) {
> - dev_warn(match, "No CHBS found for Host Bridge (UID %lld)\n",
> - ctx.uid);
> + dev_err(host, FW_BUG "No CHBS found for Host Bridge (%s)\n",
> + dev_name(match));
> return 0;
> }

This would have been more obvious that the lookup failed for "reasons"
instead of AE_BUFFER_OVERFLOW (which led to the fun Alice-style LXR
spelunking).

>
> if (ctx.base == CXL_RESOURCE_NONE) {
> - dev_warn(match, "CHBS invalid for Host Bridge (UID %lld)\n",
> - ctx.uid);
> + dev_err(host, FW_BUG "CHBS invalid for Host Bridge (%s)\n",
> + dev_name(match));

I think this is a good catch-all too in case the lookup is good, but the
vendor borked filling it in properly.

> return 0;
> }
>
> @@ -650,13 +643,11 @@ static int add_host_bridge_uport(struct device *match, void *arg)
> return 0;
> }
>
> - rc = cxl_get_chbs(match, hb, &ctx);
> - if (rc)
> - return rc;
> -
> + cxl_get_chbs(match, hb, &ctx);
> if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) {
> - dev_warn(bridge,
> - "CXL CHBS version mismatch, skip port registration\n");
> + dev_err(host,
> + FW_BUG "CXL CHBS version mismatch, skip port registration for %s\n",
> + dev_name(match));
> return 0;
> }

I like your version here much better than mine. If you want to merge
that, then:

Reviewed-by: PJ Waskiewicz <[email protected]>

2024-04-09 04:23:46

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Mon, 2024-04-08 at 14:32 -0700, PJ Waskiewicz wrote:
> >
> > Turns out that straightforward message is aleady a driver message,
> > but
> > it gets skipped in this case. So, I am thinking of cleanup /
> > clarification along the following lines:
> >
> > 1/ Lean on the existing cxl_get_chbs() validation paths to report
> > on
> > errors
> >
> > 2/ Include the device-name rather than the UID since if UID is
> > unreliable it does not help you communicate with your BIOS vendor.
> > I.e.
> > give a breadcrumb for the BIOS engineer to match the AML device
> > name
> > with the CEDT content.
> >
> > 3/ Do not fail driver load on a single host-bridge parsing failure
> >
> > 4/ These are all cxl_acpi driver init events, so consistently use
> > the
> > ACPI0017 device, and the cxl_acpi driver, as the originator of the
> > error
> > message.
> >
> > Would this clarification have saved you time with the debug?

I guess I should have asked: would you like me to pull this patch in
and give it a test on a known broken host? I'm happy to do it.

-PJ


2024-04-09 13:34:18

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> From: PJ Waskiewicz <[email protected]>
>
> Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> drivers on Emerald Rapids systems. However, on some production
> systems from some vendors, a buggy BIOS exists that improperly
> populates the ACPI => PCI mappings.

Can you be more specific about what this ACPI => PCI mapping is?
If you already know what the problem is, I'm sure this is obvious, but
otherwise it's not.

> This leads to the cxl_acpi
> driver to fail probe when it cannot find the root port's _UID, in
> order to look up the device's CXL attributes in the CEDT.
>
> Add a bit more of a descriptive message that the lookup failure
> could be a bad BIOS, rather than just "failed."
>
> Signed-off-by: PJ Waskiewicz <[email protected]>
> ---
> drivers/cxl/acpi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index af5cb818f84d..56019466a09c 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
>
> rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> if (rc != AE_OK) {
> - dev_err(dev, "unable to retrieve _UID\n");
> + dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> return -ENOENT;
> }
>
> --
> 2.40.1
>

2024-04-29 05:57:26

by PJ Waskiewicz

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Tue, 2024-04-09 at 08:22 -0500, Bjorn Helgaas wrote:
> On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> > From: PJ Waskiewicz <[email protected]>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems.  However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings.
>
> Can you be more specific about what this ACPI => PCI mapping is?
> If you already know what the problem is, I'm sure this is obvious,
> but
> otherwise it's not.

Apologies for the delay in response. Things got a bit busy with travel
and whatnot...

On one of these particular hosts, in /sys/bus/acpi/devices/ACPI0016:00,
for example, the UID would be something like CX01. It isn't an u64 at
all, and there's no atoi() or other conversions that would match what
the UID should be.

In my case, /sys/bus/acpi/devices/ACPI0016:02/ is my CXL device in
question. The UID that is presented from enumeration was CX02.
However, if I scour the CEDT manually, the UID of my particular CXL
device is really UID 49.

So, if I went from the PCI/CXL device side, and called something along
the lines of to_cxl_host_bridge() and tried to go from the pci_dev to
the acpi_handle, I'd get CX02 back. Then trying to use that to call
acpi_table_parse_cedt() would fail.

The BIOS fix from the vendor corrected the UID enumeration on the ACPI
side. This allowed things to properly line up when traversing through
the kernel APIs and parsing the ACPI tables.

Let me know if this helps clarify! If not, I can try and get more
detailed info.

-PJ

2024-04-29 15:31:49

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

On Sun, Apr 28, 2024 at 10:57:13PM -0700, PJ Waskiewicz wrote:
> On Tue, 2024-04-09 at 08:22 -0500, Bjorn Helgaas wrote:
> > On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected] wrote:
> > > From: PJ Waskiewicz <[email protected]>
> > >
> > > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > > drivers on Emerald Rapids systems.  However, on some production
> > > systems from some vendors, a buggy BIOS exists that improperly
> > > populates the ACPI => PCI mappings.
> >
> > Can you be more specific about what this ACPI => PCI mapping is?
> > If you already know what the problem is, I'm sure this is obvious,
> > but
> > otherwise it's not.
>
> Apologies for the delay in response. Things got a bit busy with travel
> and whatnot...
>
> On one of these particular hosts, in /sys/bus/acpi/devices/ACPI0016:00,
> for example, the UID would be something like CX01. It isn't an u64 at
> all, and there's no atoi() or other conversions that would match what
> the UID should be.
>
> In my case, /sys/bus/acpi/devices/ACPI0016:02/ is my CXL device in
> question. The UID that is presented from enumeration was CX02.
> However, if I scour the CEDT manually, the UID of my particular CXL
> device is really UID 49.
>
> So, if I went from the PCI/CXL device side, and called something along
> the lines of to_cxl_host_bridge() and tried to go from the pci_dev to
> the acpi_handle, I'd get CX02 back. Then trying to use that to call
> acpi_table_parse_cedt() would fail.
>
> The BIOS fix from the vendor corrected the UID enumeration on the ACPI
> side. This allowed things to properly line up when traversing through
> the kernel APIs and parsing the ACPI tables.

IIUC, _HID ACPI0016 indicates a CXL host bridge. ACPI r6.5, sec
6.5.11, says "The _UID object is required in order to allow OSPM to
match entries in the CEDT to devices present in the ACPI namespace."

I don't see anything about a requirement to map an ACPI0016 devices to
a PCI device. At least in the non-CXL world, there *is* no way to map
a PNP0A08 device to a PCI device because a host bridge is not a PCI
devices itself (it has an unspecified non-PCI primary interface and a
PCI secondary interface).

So from the patch and the ACPI/CXL specs, it looks like the problem
doesn't involve PCI at all; it just looks like an ACPI0016 object is
required to contain a _UID, and on this buggy BIOS it doesn't.

My question was just prompted by the "ACPI => PCI mapping" in the
commit log. Since PCI doesn't seem involved, maybe just drop that
reference?

It's just a buggy BIOS that doesn't supply _UID for an ACPI0016
object, so you can't locate the corresponding CEDT entry, right?

Bjorn

2024-04-29 18:35:32

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

Bjorn Helgaas wrote:
> On Sun, Apr 28, 2024 at 10:57:13PM -0700, PJ Waskiewicz wrote:
> > On Tue, 2024-04-09 at 08:22 -0500, Bjorn Helgaas wrote:
> > > On Sun, Apr 07, 2024 at 02:05:26PM -0700, [email protected]?wrote:
> > > > From: PJ Waskiewicz <[email protected]>
> > > >
> > > > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > > > drivers on Emerald Rapids systems.? However, on some production
> > > > systems from some vendors, a buggy BIOS exists that improperly
> > > > populates the ACPI => PCI mappings.
> > >
> > > Can you be more specific about what this ACPI => PCI mapping is?
> > > If you already know what the problem is, I'm sure this is obvious,
> > > but otherwise it's not.
[..]
> It's just a buggy BIOS that doesn't supply _UID for an ACPI0016
> object, so you can't locate the corresponding CEDT entry, right?

Correct, the problem is 100% contained to ACPI, and PCI is innocent. The
ACPI bug leads to failures to associate ACPI host-bridge objects with
CEDT.CHBS entries.

ACPI to PCI association is then typical pci_root lookup, i.e.:

pci_root = acpi_pci_find_root(hb->handle);
bridge = pci_root->bus->bridge;