2010-11-29 18:37:27

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G


The last 1M before 4G contains the processor restart vector and usually
the system ROM. We don't know the actual ROM size; I chose 1M because
that's how much Windows 7 appears to avoid.

Without this check, we can allocate PCI space that will never work. On
Matthew's HP 2530p, we put the Intel GTT "Flush Page" at the very last
page, which causes a spontaneous power-off:

pci_root PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
fffff000-ffffffff : Intel Flush Page (assigned by intel-gtt)

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
Reported-by: Matthew Garrett <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/include/asm/e820.h | 3 +++
arch/x86/pci/i386.c | 10 +++++++++-
2 files changed, 12 insertions(+), 1 deletions(-)


diff --git a/arch/x86/include/asm/e820.h b/arch/x86/include/asm/e820.h
index 5be1542..c1e908f 100644
--- a/arch/x86/include/asm/e820.h
+++ b/arch/x86/include/asm/e820.h
@@ -72,6 +72,9 @@ struct e820map {
#define BIOS_BEGIN 0x000a0000
#define BIOS_END 0x00100000

+#define BIOS_ROM_BASE 0xfff00000
+#define BIOS_ROM_END 0x100000000ULL
+
#ifdef __KERNEL__
/* see comment in arch/x86/kernel/e820.c */
extern struct e820map e820;
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index c4bb261..6890241 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -65,8 +65,14 @@ pcibios_align_resource(void *data, const struct resource *res,
resource_size_t size, resource_size_t align)
{
struct pci_dev *dev = data;
- resource_size_t start = round_down(res->end - size + 1, align);
+ resource_size_t start, end = res->end;

+ /* Make sure we don't allocate from the last 1M before 4G */
+ if (res->flags & IORESOURCE_MEM) {
+ if (end >= BIOS_ROM_BASE && end < BIOS_ROM_END)
+ end = BIOS_ROM_BASE - 1;
+ }
+ start = round_down(end - size + 1, align);
if (res->flags & IORESOURCE_IO) {

/*
@@ -80,6 +86,8 @@ pcibios_align_resource(void *data, const struct resource *res,
} else if (res->flags & IORESOURCE_MEM) {
if (start < BIOS_END)
start = res->end; /* fail; no space */
+ if (start >= BIOS_ROM_BASE && start < BIOS_ROM_END)
+ start = ALIGN(BIOS_ROM_END, align);
}
return start;
}


2010-11-29 18:35:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Monday, November 29, 2010 11:30:09 am Bjorn Helgaas wrote:
>
> The last 1M before 4G contains the processor restart vector and usually
> the system ROM. We don't know the actual ROM size; I chose 1M because
> that's how much Windows 7 appears to avoid.
>
> Without this check, we can allocate PCI space that will never work. On
> Matthew's HP 2530p, we put the Intel GTT "Flush Page" at the very last
> page, which causes a spontaneous power-off:
>
> pci_root PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
> fffff000-ffffffff : Intel Flush Page (assigned by intel-gtt)
>
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
> Reported-by: Matthew Garrett <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>
> ---
>
> arch/x86/include/asm/e820.h | 3 +++
> arch/x86/pci/i386.c | 10 +++++++++-
> 2 files changed, 12 insertions(+), 1 deletions(-)
>
>
> diff --git a/arch/x86/include/asm/e820.h b/arch/x86/include/asm/e820.h
> index 5be1542..c1e908f 100644
> --- a/arch/x86/include/asm/e820.h
> +++ b/arch/x86/include/asm/e820.h
> @@ -72,6 +72,9 @@ struct e820map {
> #define BIOS_BEGIN 0x000a0000
> #define BIOS_END 0x00100000
>
> +#define BIOS_ROM_BASE 0xfff00000
> +#define BIOS_ROM_END 0x100000000ULL

I'm really not thrilled about hard-coding these addresses, so I'd
love it if somebody could suggest a way to discover them from the
BIOS.

The E820 map doesn't reserve the last page:

BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
BIOS-e820: 00000000fffa0000 - 00000000fffa7000 (reserved)

and I don't think there's any ACPI device that does either.

Bjorn

2010-11-29 18:36:39

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Mon, Nov 29, 2010 at 11:30:09AM -0700, Bjorn Helgaas wrote:
>
> The last 1M before 4G contains the processor restart vector and usually
> the system ROM. We don't know the actual ROM size; I chose 1M because
> that's how much Windows 7 appears to avoid.
>
> Without this check, we can allocate PCI space that will never work. On
> Matthew's HP 2530p, we put the Intel GTT "Flush Page" at the very last
> page, which causes a spontaneous power-off:
>
> pci_root PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
> fffff000-ffffffff : Intel Flush Page (assigned by intel-gtt)
>
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
> Reported-by: Matthew Garrett <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>

Tested-by: Matthew Garrett <[email protected]>

--
Matthew Garrett | [email protected]

2010-11-29 18:52:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On 11/29/2010 12:34 PM, Bjorn Helgaas wrote:
> On Monday, November 29, 2010 11:30:09 am Bjorn Helgaas wrote:
>>
>> The last 1M before 4G contains the processor restart vector and usually
>> the system ROM. We don't know the actual ROM size; I chose 1M because
>> that's how much Windows 7 appears to avoid.
>>
>> Without this check, we can allocate PCI space that will never work. On
>> Matthew's HP 2530p, we put the Intel GTT "Flush Page" at the very last
>> page, which causes a spontaneous power-off:
>>
>> pci_root PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
>> fffff000-ffffffff : Intel Flush Page (assigned by intel-gtt)
>>
>> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
>> Reported-by: Matthew Garrett <[email protected]>
>> Signed-off-by: Bjorn Helgaas <[email protected]>
>> ---
>>
>> arch/x86/include/asm/e820.h | 3 +++
>> arch/x86/pci/i386.c | 10 +++++++++-
>> 2 files changed, 12 insertions(+), 1 deletions(-)
>>
>>
>> diff --git a/arch/x86/include/asm/e820.h b/arch/x86/include/asm/e820.h
>> index 5be1542..c1e908f 100644
>> --- a/arch/x86/include/asm/e820.h
>> +++ b/arch/x86/include/asm/e820.h
>> @@ -72,6 +72,9 @@ struct e820map {
>> #define BIOS_BEGIN 0x000a0000
>> #define BIOS_END 0x00100000
>>
>> +#define BIOS_ROM_BASE 0xfff00000
>> +#define BIOS_ROM_END 0x100000000ULL
>
> I'm really not thrilled about hard-coding these addresses, so I'd
> love it if somebody could suggest a way to discover them from the
> BIOS.
>
> The E820 map doesn't reserve the last page:
>
> BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
> BIOS-e820: 00000000fffa0000 - 00000000fffa7000 (reserved)
>
> and I don't think there's any ACPI device that does either.
>

It is certainly reasonable to block off the last chunk of the 32-bit
address space. Some systems double-decode it to avoid issues with
A20M#, so I would argue that we should avoid at least 2 MiB.

As far as discovering them from the BIOS, there is a way to do it --
E820. This is a fallback for the case where the BIOS has plain and
simply failed to provide it, and so a heuristic is probably the best we
can do. Probing is extremely unsafe.

-hpa

2010-11-29 21:32:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Monday, November 29, 2010 11:51:27 am H. Peter Anvin wrote:
> On 11/29/2010 12:34 PM, Bjorn Helgaas wrote:
> > On Monday, November 29, 2010 11:30:09 am Bjorn Helgaas wrote:
> >>
> >> The last 1M before 4G contains the processor restart vector and usually
> >> the system ROM. We don't know the actual ROM size; I chose 1M because
> >> that's how much Windows 7 appears to avoid.
> >>
> >> Without this check, we can allocate PCI space that will never work. On
> >> Matthew's HP 2530p, we put the Intel GTT "Flush Page" at the very last
> >> page, which causes a spontaneous power-off:
> >>
> >> pci_root PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
> >> fffff000-ffffffff : Intel Flush Page (assigned by intel-gtt)
> >>
> >> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
> >> Reported-by: Matthew Garrett <[email protected]>
> >> Signed-off-by: Bjorn Helgaas <[email protected]>
> >> ---
> >>
> >> arch/x86/include/asm/e820.h | 3 +++
> >> arch/x86/pci/i386.c | 10 +++++++++-
> >> 2 files changed, 12 insertions(+), 1 deletions(-)
> >>
> >>
> >> diff --git a/arch/x86/include/asm/e820.h b/arch/x86/include/asm/e820.h
> >> index 5be1542..c1e908f 100644
> >> --- a/arch/x86/include/asm/e820.h
> >> +++ b/arch/x86/include/asm/e820.h
> >> @@ -72,6 +72,9 @@ struct e820map {
> >> #define BIOS_BEGIN 0x000a0000
> >> #define BIOS_END 0x00100000
> >>
> >> +#define BIOS_ROM_BASE 0xfff00000
> >> +#define BIOS_ROM_END 0x100000000ULL
> >
> > I'm really not thrilled about hard-coding these addresses, so I'd
> > love it if somebody could suggest a way to discover them from the
> > BIOS.
> >
> > The E820 map doesn't reserve the last page:
> >
> > BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
> > BIOS-e820: 00000000fffa0000 - 00000000fffa7000 (reserved)
> >
> > and I don't think there's any ACPI device that does either.

Oops, egg on my face. In this case, there *is* an ACPI INT0800 device
at 0xff000000-0xffffffff, which should prevent us from allocating that
space for anything else. Only problem is, we IGNORE that useful bit of
information.

> It is certainly reasonable to block off the last chunk of the 32-bit
> address space. Some systems double-decode it to avoid issues with
> A20M#, so I would argue that we should avoid at least 2 MiB.
>
> As far as discovering them from the BIOS, there is a way to do it --
> E820. This is a fallback for the case where the BIOS has plain and
> simply failed to provide it, and so a heuristic is probably the best we
> can do. Probing is extremely unsafe.

I think it's clearly a bug that Linux ignores ACPI resource information
(except PNP0C01/PNP0C02 motherboard devices). If we fix that bug, it
will fix Matthew's 2530p.

We might still want a patch like this current one because it could
work around some BIOS defects, and because I think it's too late to
fix the ACPI resource problem for .37. But I'm not convinced we
should reserve more than Windows does, because that may keep us from
discovering other important Linux problems.

Bjorn

2010-11-29 21:36:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On 11/29/2010 01:32 PM, Bjorn Helgaas wrote:
>
> Oops, egg on my face. In this case, there *is* an ACPI INT0800 device
> at 0xff000000-0xffffffff, which should prevent us from allocating that
> space for anything else. Only problem is, we IGNORE that useful bit of
> information.
>

Eep... why?

>> It is certainly reasonable to block off the last chunk of the 32-bit
>> address space. Some systems double-decode it to avoid issues with
>> A20M#, so I would argue that we should avoid at least 2 MiB.
>>
>> As far as discovering them from the BIOS, there is a way to do it --
>> E820. This is a fallback for the case where the BIOS has plain and
>> simply failed to provide it, and so a heuristic is probably the best we
>> can do. Probing is extremely unsafe.
>
> I think it's clearly a bug that Linux ignores ACPI resource information
> (except PNP0C01/PNP0C02 motherboard devices). If we fix that bug, it
> will fix Matthew's 2530p.

I would definitely agree with that.

> We might still want a patch like this current one because it could
> work around some BIOS defects, and because I think it's too late to
> fix the ACPI resource problem for .37. But I'm not convinced we
> should reserve more than Windows does, because that may keep us from
> discovering other important Linux problems.

I'm not so sure about that... it feels like a pretty weak argument that
we might work on some machines even though our code isn't perfect.

-hpa

2010-11-29 22:04:43

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Monday, November 29, 2010 02:35:05 pm H. Peter Anvin wrote:
> On 11/29/2010 01:32 PM, Bjorn Helgaas wrote:
> >
> > Oops, egg on my face. In this case, there *is* an ACPI INT0800 device
> > at 0xff000000-0xffffffff, which should prevent us from allocating that
> > space for anything else. Only problem is, we IGNORE that useful bit of
> > information.
>
> Eep... why?

We have always ignored ACPI device resource information (except for
PNP0C01 and PNP0C02 motherboard devices). There are a number of issues
in the way: conflicts with the hardcoded assumptions about legacy
devices, init ordering (e.g., we currently initialize PCI before
PNPACPI), ACPI resource interpretation (e.g., we don't do it the way
Windows does so we trip over BIOS bugs), the E820 mess (E820 is non-
hierarchical, but we wedge it into the iomem_resource tree anyway),
etc.

I've been working on these for the last few years, but we aren't
there yet.

> >> It is certainly reasonable to block off the last chunk of the 32-bit
> >> address space. Some systems double-decode it to avoid issues with
> >> A20M#, so I would argue that we should avoid at least 2 MiB.
> >>
> >> As far as discovering them from the BIOS, there is a way to do it --
> >> E820. This is a fallback for the case where the BIOS has plain and
> >> simply failed to provide it, and so a heuristic is probably the best we
> >> can do. Probing is extremely unsafe.
> >
> > I think it's clearly a bug that Linux ignores ACPI resource information
> > (except PNP0C01/PNP0C02 motherboard devices). If we fix that bug, it
> > will fix Matthew's 2530p.
>
> I would definitely agree with that.
>
> > We might still want a patch like this current one because it could
> > work around some BIOS defects, and because I think it's too late to
> > fix the ACPI resource problem for .37. But I'm not convinced we
> > should reserve more than Windows does, because that may keep us from
> > discovering other important Linux problems.
>
> I'm not so sure about that... it feels like a pretty weak argument that
> we might work on some machines even though our code isn't perfect.

I think we're talking about whether to reserve the top 1MB or top 2MB.
I freely admit I don't know the right answer. My point is merely that
since we're using a heuristic anyway, copying Windows is a pretty good
starting point. In my mind, doing something different requires a
stronger argument than "it might fix some machines where Windows is
broken."

Bjorn

2010-11-29 22:11:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On 11/29/2010 02:04 PM, Bjorn Helgaas wrote:
>>
>>> We might still want a patch like this current one because it could
>>> work around some BIOS defects, and because I think it's too late to
>>> fix the ACPI resource problem for .37. But I'm not convinced we
>>> should reserve more than Windows does, because that may keep us from
>>> discovering other important Linux problems.
>>
>> I'm not so sure about that... it feels like a pretty weak argument that
>> we might work on some machines even though our code isn't perfect.
>
> I think we're talking about whether to reserve the top 1MB or top 2MB.
> I freely admit I don't know the right answer. My point is merely that
> since we're using a heuristic anyway, copying Windows is a pretty good
> starting point. In my mind, doing something different requires a
> stronger argument than "it might fix some machines where Windows is
> broken."
>

Of course. I did, however, point out the reason *why* in this case:
there are a lot of platforms known (including quite probably *ALL*
pre-E820 systems) to decode 2 MiB for the ROM, due to A20 masking.
Windows doesn't care about those older systems.

-hpa

2010-12-03 01:19:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Mon, Nov 29, 2010 at 2:04 PM, Bjorn Helgaas <[email protected]> wrote:
>
> I think we're talking about whether to reserve the top 1MB or top 2MB.
> I freely admit I don't know the right answer. ?My point is merely that
> since we're using a heuristic anyway, copying Windows is a pretty good
> starting point. ?In my mind, doing something different requires a
> stronger argument than "it might fix some machines where Windows is
> broken."

What's the status of this? The original patch is pretty nasty, and I
think that hack to put things in the bios_align_resource() function is
just disgusting.

And why is the fix not the really _trivial_ one, which does just this:

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index ca0437c..aef9f77 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -141,7 +141,7 @@ void dma32_reserve_bootmem(void);

/* generic pci stuff */
#include <asm-generic/pci.h>
-#define PCIBIOS_MAX_MEM_32 0xffffffff
+#define PCIBIOS_MAX_MEM_32 0xfff00000

#ifdef CONFIG_NUMA
/* Returns the node based on pci bus */

Hmm?

(Ok, so that doesn't protect a 64-bit resource that just happens to be
inside a window that ends at 0xffffffff, but if you have those kinds
of bus windows, that means that there's nothing there at the 4GB mark
anyway, no?)

Linus

Linus

2010-12-03 15:16:00

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Thursday, December 02, 2010 06:19:20 pm Linus Torvalds wrote:
> On Mon, Nov 29, 2010 at 2:04 PM, Bjorn Helgaas <[email protected]> wrote:
> >
> > I think we're talking about whether to reserve the top 1MB or top 2MB.
> > I freely admit I don't know the right answer. My point is merely that
> > since we're using a heuristic anyway, copying Windows is a pretty good
> > starting point. In my mind, doing something different requires a
> > stronger argument than "it might fix some machines where Windows is
> > broken."
>
> What's the status of this? The original patch is pretty nasty, and I
> think that hack to put things in the bios_align_resource() function is
> just disgusting.
>
> And why is the fix not the really _trivial_ one, which does just this:
>
> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> index ca0437c..aef9f77 100644
> --- a/arch/x86/include/asm/pci.h
> +++ b/arch/x86/include/asm/pci.h
> @@ -141,7 +141,7 @@ void dma32_reserve_bootmem(void);
>
> /* generic pci stuff */
> #include <asm-generic/pci.h>
> -#define PCIBIOS_MAX_MEM_32 0xffffffff
> +#define PCIBIOS_MAX_MEM_32 0xfff00000
>
> #ifdef CONFIG_NUMA
> /* Returns the node based on pci bus */
>
> Hmm?
>
> (Ok, so that doesn't protect a 64-bit resource that just happens to be
> inside a window that ends at 0xffffffff, but if you have those kinds
> of bus windows, that means that there's nothing there at the 4GB mark
> anyway, no?)

Huh. That is a nice simple change, and I suspect it would work for
Matthew.

But it's a PCI-specific solution to a general problem. We want to
protect that resource from all allocate_resource() calls, and PCI is
the most important but not the only caller.

Maybe cluttering bios_align_resource() is less disgusting if we think
of it as "arch_allocate_resource()" instead of merely "align_resource()"?

I think we're going to need even more stuff there soon ...the current
way we handle E820 "reserved" areas is broken. (I have some evidence
that Windows ignores E820 reservations when allocating device resources,
so I'm not 100% convinced that Linux should pay attention to them, but
historically, we've certainly tried to.)

We try to put E820 reservations in iomem_resource, but that's not a
good fit because the resource tree is strictly hierarchical, and E820
reserved areas are not. An E820 entry might cover part of, all of, or
even several device resources.

This leads to ugliness like reserve_region_with_split(),
insert_resource_expand_to_fit(), and the confusing situation with
e820_reserve_resources() and e820_reserve_resources_late(). If we're
going to look at E820 reservations, I think we'd be better off if we
left them out of iomem_resource and had an arch hook in allocate_resource()
that could trim them out.

Bjorn

2010-12-09 16:54:40

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] x86/PCI: never allocate PCI space from the last 1M below 4G

On Friday, December 03, 2010 08:15:48 am Bjorn Helgaas wrote:
> On Thursday, December 02, 2010 06:19:20 pm Linus Torvalds wrote:
> > On Mon, Nov 29, 2010 at 2:04 PM, Bjorn Helgaas <[email protected]> wrote:
> > >
> > > I think we're talking about whether to reserve the top 1MB or top 2MB.
> > > I freely admit I don't know the right answer. My point is merely that
> > > since we're using a heuristic anyway, copying Windows is a pretty good
> > > starting point. In my mind, doing something different requires a
> > > stronger argument than "it might fix some machines where Windows is
> > > broken."
> >
> > What's the status of this? The original patch is pretty nasty, and I
> > think that hack to put things in the bios_align_resource() function is
> > just disgusting.

OK, let's drop this approach. My next proposal is the
"PCI, PNP, resources: effectively reserve ACPI device resources"
thread I posted yesterday (with mangled cover letter) here:

https://lkml.org/lkml/2010/12/8/352

Bjorn