2020-05-15 10:28:18

by Xiaochun Lee

Subject: [PATCH v2] x86/PCI: Mark Power Control Unit as having non-compliant BARs

From: Xiaochun Lee <[email protected]>

The device [8086:a26c] is a Power Control Unit of the Intel Ice Lake
Server Processor, and devices [8086:a1ec,a1ed] are Power Control Units
of the Intel Xeon Scalable Processor. The kernel tries to size their
registers as standard PCI BARs, which leads to a boot failure like:
"pci 0000:00:11.0: [Firmware Bug]: reg 0x30: invalid BAR (can't size)".

The symptom appears on this Ice Lake processor:
"QU99 ICE LAKE ES1 HCC 24C 185W 3200 L-0"

Information for device [8086:a26c] is listed below:
00:11.0 Unassigned class [ff00]: Intel Corporation Device a26c (rev 03)
        Subsystem: Lenovo Device 7811
        Flags: fast devsel, NUMA node 0
        Expansion ROM at <ignored> [disabled]
        Capabilities: [80] Power Management version 3

The symptom appears on these Xeon Scalable Processors:
"Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz"
"Intel(R) Xeon(R) Gold 6252 CPU @ 2.00GHz"

Information for device [8086:a1ec] is listed below:
00:11.0 Unassigned class [ff00]: Intel Corporation C620 Series Chipset Family MROM 0 [8086:a1ec] (rev 09)
        Subsystem: Lenovo Device [17aa:7805]
        Latency: 0, Cache Line Size: 64 bytes
        NUMA node: 0
        Expansion ROM at <ignored> [disabled]
        Capabilities: [80] Power Management version 3

There are no other BARs on these devices, so mark the PCU as having
non-compliant BARs so that we don't try to probe any of them.

Signed-off-by: Xiaochun Lee <[email protected]>
---
arch/x86/pci/fixup.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index e723559..d9abc67 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -563,6 +563,9 @@ static void twinhead_reserve_killing_zone(struct pci_dev *dev)
* Erratum BDF2
* PCI BARs in the Home Agent Will Return Non-Zero Values During Enumeration
* http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v4-spec-update.html
+ *
+ * Device [8086:a26c]
+ * Devices [8086:a1ec,a1ed]
*/
static void pci_invalid_bar(struct pci_dev *dev)
{
@@ -572,6 +575,9 @@ static void pci_invalid_bar(struct pci_dev *dev)
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6f60, pci_invalid_bar);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fa0, pci_invalid_bar);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ec, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ed, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa26c, pci_invalid_bar);

/*
* Device [1022:7808]
--
1.8.3.1
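
As a rough illustration of what the added DECLARE_PCI_FIXUP_EARLY() lines
achieve, here is a minimal user-space sketch: an early fixup matches on
vendor and device ID and sets a flag that makes enumeration skip BAR
probing. Every name prefixed fake_, plus run_early_fixups() and
would_probe_bars(), is a hypothetical stand-in for illustration and not a
kernel API; only the three device IDs are taken from the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_INTEL 0x8086

/* Hypothetical stand-in for struct pci_dev; only the fields used here. */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	bool non_compliant_bars;
};

/* Same effect as the kernel's pci_invalid_bar(): flag the device. */
static void fake_invalid_bar(struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	dev->non_compliant_bars = true;
}

/* Hypothetical fixup table entry: ID pair plus the hook to run. */
struct fake_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct fake_pci_dev *dev);
};

/* The IDs added by this patch. */
static const struct fake_fixup early_fixups[] = {
	{ VENDOR_INTEL, 0xa1ec, fake_invalid_bar },
	{ VENDOR_INTEL, 0xa1ed, fake_invalid_bar },
	{ VENDOR_INTEL, 0xa26c, fake_invalid_bar },
};

/* Run every hook whose vendor/device pair matches the device. */
static void run_early_fixups(struct fake_pci_dev *dev)
{
	size_t i;

	for (i = 0; i < sizeof(early_fixups) / sizeof(early_fixups[0]); i++) {
		if (early_fixups[i].vendor == dev->vendor &&
		    early_fixups[i].device == dev->device)
			early_fixups[i].hook(dev);
	}
}

/* Returns true if enumeration would go on to size the device's BARs. */
static bool would_probe_bars(struct fake_pci_dev *dev)
{
	run_early_fixups(dev);
	return !dev->non_compliant_bars;
}
```

In this model, a device with ID [8086:a1ec] is flagged and its BARs are
left alone, while an unlisted device is probed as usual.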


2020-05-15 19:26:21

by Bjorn Helgaas

Subject: Re: [PATCH v2] x86/PCI: Mark Power Control Unit as having non-compliant BARs

On Fri, May 15, 2020 at 06:07:51AM -0400, Xiaochun Lee wrote:
> From: Xiaochun Lee <[email protected]>
>
> The device [8086:a26c] is a Power Control Unit of the Intel Ice Lake
> Server Processor, and devices [8086:a1ec,a1ed] are Power Control Units
> of the Intel Xeon Scalable Processor. The kernel tries to size their
> registers as standard PCI BARs, which leads to a boot failure like:
> "pci 0000:00:11.0: [Firmware Bug]: reg 0x30: invalid BAR (can't size)".

Do you have a spec that says these are Power Control Units? The spec
I found for the C620 PCH claims these are all "MROM" devices related
to "Enterprise Value Add", "Intel Management Engine", and "Innovation
Engine" configuration.

I updated the commit log, added [8086:a26d] as mentioned in that spec,
added a stable tag, and applied the patch below to pci/misc for v5.8.
Let me know if that doesn't look right.

> The symptom appears on this Ice Lake processor:
> "QU99 ICE LAKE ES1 HCC 24C 185W 3200 L-0"
>
> Information for device [8086:a26c] is listed below:
> 00:11.0 Unassigned class [ff00]: Intel Corporation Device a26c (rev 03)
>         Subsystem: Lenovo Device 7811
>         Flags: fast devsel, NUMA node 0
>         Expansion ROM at <ignored> [disabled]
>         Capabilities: [80] Power Management version 3
>
> The symptom appears on these Xeon Scalable Processors:
> "Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz"
> "Intel(R) Xeon(R) Gold 6252 CPU @ 2.00GHz"
>
> Information for device [8086:a1ec] is listed below:
> 00:11.0 Unassigned class [ff00]: Intel Corporation C620 Series Chipset Family MROM 0 [8086:a1ec] (rev 09)
>         Subsystem: Lenovo Device [17aa:7805]
>         Latency: 0, Cache Line Size: 64 bytes
>         NUMA node: 0
>         Expansion ROM at <ignored> [disabled]
>         Capabilities: [80] Power Management version 3
>
> There are no other BARs on these devices, so mark the PCU as having
> non-compliant BARs so that we don't try to probe any of them.
>
> Signed-off-by: Xiaochun Lee <[email protected]>
> ---
> arch/x86/pci/fixup.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
> index e723559..d9abc67 100644
> --- a/arch/x86/pci/fixup.c
> +++ b/arch/x86/pci/fixup.c
> @@ -563,6 +563,9 @@ static void twinhead_reserve_killing_zone(struct pci_dev *dev)
> * Erratum BDF2
> * PCI BARs in the Home Agent Will Return Non-Zero Values During Enumeration
> * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v4-spec-update.html
> + *
> + * Device [8086:a26c]
> + * Devices [8086:a1ec,a1ed]
> */
> static void pci_invalid_bar(struct pci_dev *dev)
> {
> @@ -572,6 +575,9 @@ static void pci_invalid_bar(struct pci_dev *dev)
> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6f60, pci_invalid_bar);
> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fa0, pci_invalid_bar);
> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, pci_invalid_bar);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ec, pci_invalid_bar);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ed, pci_invalid_bar);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa26c, pci_invalid_bar);
>
> /*
> * Device [1022:7808]
> --
> 1.8.3.1

commit 1574051e52cb ("x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs")
Author: Xiaochun Lee <[email protected]>
Date: Thu May 14 23:31:07 2020 -0400

x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs

The Intel C620 Platform Controller Hub has MROM functions that have non-PCI
registers (undocumented in the public spec) where BAR 0 is supposed to be,
which results in messages like this:

pci 0000:00:11.0: [Firmware Bug]: reg 0x30: invalid BAR (can't size)

Mark these MROM functions as having non-compliant BARs so we don't try to
probe any of them. There are no other BARs on these devices.

See the Intel C620 Series Chipset Platform Controller Hub Datasheet,
May 2019, Document Number 336067-007US, sec 2.1, 35.5, 35.6.

[bhelgaas: commit log, add 0xa26d]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Xiaochun Lee <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Cc: [email protected]

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index e723559c386a..0c67a5a94de3 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -572,6 +572,10 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, pci_invalid_bar);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6f60, pci_invalid_bar);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fa0, pci_invalid_bar);
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ec, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa1ed, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa26c, pci_invalid_bar);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0xa26d, pci_invalid_bar);

/*
* Device [1022:7808]
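
To make the "invalid BAR (can't size)" message more concrete, here is a
simplified user-space model of BAR sizing: save the register, write all
ones, read back which bits stuck, restore the original value, and derive
the size from the lowest writable address bit. This is a sketch under
stated assumptions, not the kernel's actual sizing code; struct fake_reg,
model_reg_read(), model_reg_write(), model_size_bar(), and
MODEL_MEM_ADDR_MASK are hypothetical names, and the validity check is
stricter than what the kernel does. A register that is not really a BAR is
modelled here as read-only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit config register: only writable_mask bits change. */
struct fake_reg {
	uint32_t value;
	uint32_t writable_mask;
};

static uint32_t model_reg_read(const struct fake_reg *r)
{
	return r->value;
}

static void model_reg_write(struct fake_reg *r, uint32_t v)
{
	r->value = (r->value & ~r->writable_mask) | (v & r->writable_mask);
}

/* Address bits of a 32-bit memory BAR; the low four bits are flags. */
#define MODEL_MEM_ADDR_MASK 0xfffffff0u

/* Returns the decoded size in bytes, or 0 if the register can't be sized. */
static uint32_t model_size_bar(struct fake_reg *r)
{
	uint32_t orig = model_reg_read(r);
	uint32_t probe, addr_bits, size;

	model_reg_write(r, 0xffffffffu);   /* ask which bits are writable */
	probe = model_reg_read(r);
	model_reg_write(r, orig);          /* restore the original value */

	addr_bits = probe & MODEL_MEM_ADDR_MASK;
	if (addr_bits == 0)
		return 0;                  /* no address bits respond */

	/* The lowest set address bit gives the size of the decoded region. */
	size = addr_bits & (~addr_bits + 1u);

	/* In this model, every address bit above the size bit must be set. */
	if (((addr_bits | (size - 1u)) & MODEL_MEM_ADDR_MASK) !=
	    MODEL_MEM_ADDR_MASK)
		return 0;                  /* not shaped like a BAR */

	return size;
}
```

In this model, a register whose upper 20 bits are writable sizes as a 4 KiB
BAR, while a read-only register holding an arbitrary nonzero value returns
0, which corresponds to the "can't size" case that the fixup avoids by not
probing the device at all.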