2020-06-17 16:50:08

by Mika Westerberg

[permalink] [raw]
Subject: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
space") made the resource allocation code to avoid all regions that are
in E820 table. This prevents the kernel to assign MMIO resources to
regions that may be real RAM for example.

However, at least with Lenovo Yoca C940 and S740 this causes problems
when allocating resources for PCIe devices behind Thunderbolt port(s).

On Yoga S740 the E820 table contains an entry like this:

BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved

and ACPI _CRS method for the host bridge returns these windows:

pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]

Note that the 0x45400000-0xbfffffff entry is also included in the E820
table and marked as "reserved".

When Thunderbolt device is connected and the PCIe gets tunneled PCI core
tries to allocate memory for the new devices but it fails because all
the resources are inside this reserved region so arch_remove_reservations()
clips them which makes the resource assignment fail as in below log:

pci 0000:00:07.0: PCI bridge to [bus 01-2a]
pci 0000:00:07.0: bridge window [mem 0x46000000-0x521fffff]
pci 0000:00:07.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
...
pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
pci 0000:01:00.0: bridge window [io 0x1000-0x5fff] shrunken by 0x0000000000004000
pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
pci 0000:01:00.0: BAR 7: assigned [io 0x4000-0x4fff]

The 01:00.0 is the upstream port of the PCIe switch that is connected to
the PCIe root port (00:07.1) over Thunderbolt link.

If I add "efi=debug" to the command line I can see that the EFI memory
map actually contains several entries:

[Reserved | | | | | | | | | |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
[Reserved | | | | | | | | | |WB| | |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
[Reserved | | | | | | | | | | | | | ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
[Memory Mapped I/O |RUN| | | | | | | | | | | |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)

I think the EFI stub merges these consecutive entries into that single
E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
actually covers the PCI host bridge window entirely. However, since
there is corresponding E820 type for this it is simply marked as
E820_TYPE_RESERVED.

All in all, I think we can fix this by modifying arch_remove_reservations()
to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
clipping in that case.

Reported-by: Benoit Grégoire <[email protected]>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
Signed-off-by: Mika Westerberg <[email protected]>
---
arch/x86/kernel/resource.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 9b9fb7882c20..c0bc9117dd7d 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/efi.h>
#include <linux/ioport.h>
#include <asm/e820/api.h>

@@ -36,6 +37,36 @@ static void remove_e820_regions(struct resource *avail)
}
}

+#ifdef CONFIG_EFI
+static bool efi_mmio_mem(const struct resource *avail)
+{
+ resource_size_t start, end;
+ efi_memory_desc_t desc;
+
+ if (!efi_enabled(EFI_MEMMAP) ||
+ efi_mem_desc_lookup(avail->start, &desc))
+ return false;
+
+ start = desc.phys_addr;
+ end = desc.phys_addr + (desc.num_pages << EFI_PAGE_SHIFT) - 1;
+
+ /*
+ * No need to clip the resource if it is fully contained in an
+ * EFI memory mapped region.
+ */
+ if (avail->start >= start && avail->end <= end &&
+ desc.type == EFI_MEMORY_MAPPED_IO)
+ return true;
+
+ return false;
+}
+#else
+static inline bool efi_mmio_mem(const struct resource *avail)
+{
+ return false;
+}
+#endif
+
void arch_remove_reservations(struct resource *avail)
{
/*
@@ -46,6 +77,7 @@ void arch_remove_reservations(struct resource *avail)
if (avail->flags & IORESOURCE_MEM) {
resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);

- remove_e820_regions(avail);
+ if (!efi_mmio_mem(avail))
+ remove_e820_regions(avail);
}
}
--
2.27.0.rc2


2021-10-08 08:57:58

by Werner Sembach

[permalink] [raw]
Subject: Re: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

> Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
> space") made the resource allocation code to avoid all regions that are
> in E820 table. This prevents the kernel to assign MMIO resources to
> regions that may be real RAM for example.
>
> However, at least with Lenovo Yoca C940 and S740 this causes problems
> when allocating resources for PCIe devices behind Thunderbolt port(s).
>
> On Yoga S740 the E820 table contains an entry like this:
>
> BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved
>
> and ACPI _CRS method for the host bridge returns these windows:
>
> pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
> pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
> pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
> pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
>
> Note that the 0x45400000-0xbfffffff entry is also included in the E820
> table and marked as "reserved".
>
> When Thunderbolt device is connected and the PCIe gets tunneled PCI core
> tries to allocate memory for the new devices but it fails because all
> the resources are inside this reserved region so arch_remove_reservations()
> clips them which makes the resource assignment fail as in below log:
>
> pci 0000:00:07.0: PCI bridge to [bus 01-2a]
> pci 0000:00:07.0: bridge window [mem 0x46000000-0x521fffff]
> pci 0000:00:07.0: bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
> ...
> pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
> pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
> pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
> pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
> pci 0000:01:00.0: bridge window [io 0x1000-0x5fff] shrunken by 0x0000000000004000
> pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
> pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
> pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
> pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
> pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
> pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
> pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
> pci 0000:01:00.0: BAR 7: assigned [io 0x4000-0x4fff]
>
> The 01:00.0 is the upstream port of the PCIe switch that is connected to
> the PCIe root port (00:07.1) over Thunderbolt link.
>
> If I add "efi=debug" to the command line I can see that the EFI memory
> map actually contains several entries:
>
> [Reserved | | | | | | | | | |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
> [Reserved | | | | | | | | | |WB| | |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
> [Reserved | | | | | | | | | | | | | ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
> [Memory Mapped I/O |RUN| | | | | | | | | | | |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)
>
> I think the EFI stub merges these consecutive entries into that single
> E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
> actually covers the PCI host bridge window entirely. However, since
> there is corresponding E820 type for this it is simply marked as
> E820_TYPE_RESERVED.
>
> All in all, I think we can fix this by modifying arch_remove_reservations()
> to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
> clipping in that case.
>
> Reported-by: Benoit Grégoire <[email protected]>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459 <https://bugzilla.kernel.org/show_bug.cgi?id=206459>
> Signed-off-by: Mika Westerberg <[email protected]>
> ---
> arch/x86/kernel/resource.c <https://lore.kernel.org/lkml/[email protected]/#Z31arch:x86:kernel:resource.c> | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff
> <https://lore.kernel.org/lkml/[email protected]/#iZ31arch:x86:kernel:resource.c>
> --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c index 9b9fb7882c20..c0bc9117dd7d 100644 ---
> a/arch/x86/kernel/resource.c +++ b/arch/x86/kernel/resource.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0
> +#include <linux/efi.h> #include <linux/ioport.h>
> #include <asm/e820/api.h>
>
> @@ -36,6 +37,36 @@ static void remove_e820_regions(struct resource *avail) }
> }
>
> +#ifdef CONFIG_EFI +static bool efi_mmio_mem(const struct resource *avail) +{ + resource_size_t start, end; +
> efi_memory_desc_t desc; + + if (!efi_enabled(EFI_MEMMAP) || + efi_mem_desc_lookup(avail->start, &desc)) + return
> false; + + start = desc.phys_addr; + end = desc.phys_addr + (desc.num_pages << EFI_PAGE_SHIFT) - 1; + + /* + * No need
> to clip the resource if it is fully contained in an + * EFI memory mapped region. + */ + if (avail->start >= start &&
> avail->end <= end && + desc.type == EFI_MEMORY_MAPPED_IO) + return true; + + return false; +} +#else +static inline
> bool efi_mmio_mem(const struct resource *avail) +{ + return false; +} +#endif + void arch_remove_reservations(struct resource *avail)
> {
> /*
> @@ -46,6 +77,7 @@ void arch_remove_reservations(struct resource *avail) if (avail->flags & IORESOURCE_MEM) {
> resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
>
> - remove_e820_regions(avail); + if (!efi_mmio_mem(avail)) + remove_e820_regions(avail); }
> }
> --
> 2.27.0.rc2
Greetings,

Is there any update on this matter? Also happens on discrete Thunderbolt 4 chips:
https://bugzilla.kernel.org/show_bug.cgi?id=214259

Kind Regards,

Werner Sembach

2021-10-08 09:26:12

by Mika Westerberg

[permalink] [raw]
Subject: Re: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

Hi,

On Fri, Oct 08, 2021 at 10:55:49AM +0200, Werner Sembach wrote:
> Is there any update on this matter? Also happens on discrete Thunderbolt 4 chips:
> https://bugzilla.kernel.org/show_bug.cgi?id=214259

AFAICT no updates.

@Bjorn, x86 maintainers,

If there are no alternatives can we get this patch merged so that people
don't need to carry out-of-tree patches to get their systems working?

2021-10-08 09:47:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

On Fri, Oct 08, 2021 at 12:23:31PM +0300, Mika Westerberg wrote:
> Hi,
>
> On Fri, Oct 08, 2021 at 10:55:49AM +0200, Werner Sembach wrote:
> > Is there any update on this matter? Also happens on discrete Thunderbolt 4 chips:
> > https://bugzilla.kernel.org/show_bug.cgi?id=214259
>
> AFAICT no updates.
>
> @Bjorn, x86 maintainers,
>
> If there are no alternatives can we get this patch merged so that people
> don't need to carry out-of-tree patches to get their systems working?

Just my 2¢ from briefly skimming over this:

So this reads yet again as BIOS is to blame but what else is new?

"All in all, I think we can fix this by modifying
arch_remove_reservations() to check the EFI type as well and if it is
EFI_MEMORY_MAPPED_IO skip the clipping in that case."

And this like we should trust EFI to mark those regions properly, which
is more of the same but in different color.

That original commit talks about windoze doing a different allocation
scheme and thus not trusting the untrustworthy firmware anyway and that
sounds like something we should do too. But WTH do I know?!

So I'd prefer if Bjorn chimed in here.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-10-08 10:59:20

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

[+cc linux-pci]

On Fri, Oct 08, 2021 at 11:45:38AM +0200, Borislav Petkov wrote:
> On Fri, Oct 08, 2021 at 12:23:31PM +0300, Mika Westerberg wrote:
> > Hi,
> >
> > On Fri, Oct 08, 2021 at 10:55:49AM +0200, Werner Sembach wrote:
> > > Is there any update on this matter? Also happens on discrete Thunderbolt 4 chips:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=214259
> >
> > AFAICT no updates.
> >
> > @Bjorn, x86 maintainers,
> >
> > If there are no alternatives can we get this patch merged so that people
> > don't need to carry out-of-tree patches to get their systems working?
>
> Just my 2? from briefly skimming over this:
>
> So this reads yet again as BIOS is to blame but what else is new?
>
> "All in all, I think we can fix this by modifying
> arch_remove_reservations() to check the EFI type as well and if it is
> EFI_MEMORY_MAPPED_IO skip the clipping in that case."
>
> And this like we should trust EFI to mark those regions properly, which
> is more of the same but in different color.
>
> That original commit talks about windoze doing a different allocation
> scheme and thus not trusting the untrustworthy firmware anyway and that
> sounds like something we should do too. But WTH do I know?!

There are a couple other threads reporting similar issues:

https://lore.kernel.org/r/[email protected]
https://lore.kernel.org/r/[email protected]

I think 4dc2287c1805 ("x86: avoid E820 regions when allocating address
space") was a mistake and we should remove that instead of adding more
complexity to it.

But that requires another approach to fix the issue that 4dc2287c1805
addressed.

Bjorn

2021-10-08 15:45:27

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH RESEND] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap

On Fri, Oct 08, 2021 at 05:56:40AM -0500, Bjorn Helgaas wrote:
> On Fri, Oct 08, 2021 at 11:45:38AM +0200, Borislav Petkov wrote:
> > On Fri, Oct 08, 2021 at 12:23:31PM +0300, Mika Westerberg wrote:
> > > On Fri, Oct 08, 2021 at 10:55:49AM +0200, Werner Sembach wrote:
> > > > Is there any update on this matter? Also happens on discrete Thunderbolt 4 chips:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=214259
> > >
> > > AFAICT no updates.
> > >
> > > @Bjorn, x86 maintainers,
> > >
> > > If there are no alternatives can we get this patch merged so that people
> > > don't need to carry out-of-tree patches to get their systems working?
> >
> > Just my 2? from briefly skimming over this:
> >
> > So this reads yet again as BIOS is to blame but what else is new?
> >
> > "All in all, I think we can fix this by modifying
> > arch_remove_reservations() to check the EFI type as well and if it is
> > EFI_MEMORY_MAPPED_IO skip the clipping in that case."
> >
> > And this like we should trust EFI to mark those regions properly, which
> > is more of the same but in different color.
> >
> > That original commit talks about windoze doing a different allocation
> > scheme and thus not trusting the untrustworthy firmware anyway and that
> > sounds like something we should do too. But WTH do I know?!
>
> There are a couple other threads reporting similar issues:

Bug reports from these threads:

> https://lore.kernel.org/r/[email protected]

http://bugs.launchpad.net/bugs/1931715
http://bugs.launchpad.net/bugs/1932069
http://bugs.launchpad.net/bugs/1921649

> https://lore.kernel.org/r/[email protected]

https://bugzilla.redhat.com/show_bug.cgi?id=1868899
https://bugzilla.redhat.com/show_bug.cgi?id=1871793
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1878279

Most of these bug reports mention "pci=nocrs" as being a workaround.
Obviously not a solution, but may be a way to limp along in the
meantime.