2010-12-08 21:36:14

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas


This adds arch_remove_reservations(), which an arch can implement if it
needs to protect part of the address space from allocation.

Sometimes that can be done by just requesting a resource. This hook is to
cover cases where protected area doesn't fit well in the hierarchical
resource tree. For example, x86 BIOS E820 reservations are not related
to devices, so they may overlap part of, all of, or more than a device
resource.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

include/linux/ioport.h | 1 +
kernel/resource.c | 8 ++++++++
2 files changed, 9 insertions(+), 0 deletions(-)


diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index d377ea8..b92b8b4 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -124,6 +124,7 @@ extern void reserve_region_with_split(struct resource *root,
extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
extern int insert_resource(struct resource *parent, struct resource *new);
extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
+extern void arch_remove_reservations(struct resource *avail);
extern int allocate_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
resource_size_t max, resource_size_t align,
diff --git a/kernel/resource.c b/kernel/resource.c
index 9fad33e..246957e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -374,6 +374,10 @@ int __weak page_is_ram(unsigned long pfn)
return walk_system_ram_range(pfn, 1, NULL, __is_ram) == 1;
}

+void __weak arch_remove_reservations(struct resource *avail)
+{
+}
+
static resource_size_t simple_align_resource(void *data,
const struct resource *avail,
resource_size_t size,
@@ -426,6 +430,7 @@ static int find_resource_from_top(struct resource *root, struct resource *new,
struct resource *this;
struct resource tmp, avail, alloc;

+ tmp.flags = new->flags;
tmp.start = root->end;
tmp.end = root->end;

@@ -438,6 +443,7 @@ static int find_resource_from_top(struct resource *root, struct resource *new,
tmp.start = root->start;

resource_clip(&tmp, min, max);
+ arch_remove_reservations(&tmp);

/* Check for overflow after ALIGN() */
avail = *new;
@@ -478,6 +484,7 @@ static int find_resource(struct resource *root, struct resource *new,
struct resource *this = root->child;
struct resource tmp = *new, avail, alloc;

+ tmp.flags = new->flags;
tmp.start = root->start;
/*
* Skip past an allocated resource that starts at 0, since the
@@ -495,6 +502,7 @@ static int find_resource(struct resource *root, struct resource *new,
tmp.end = root->end;

resource_clip(&tmp, min, max);
+ arch_remove_reservations(&tmp);

/* Check for overflow after ALIGN() */
avail = *new;


2010-12-08 21:36:20

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 2/5] x86: avoid BIOS area when allocating address space


This implements arch_remove_reservations() so allocate_resource() can
avoid any arch-specific reserved areas. This currently just avoids the
BIOS area (the first 1MB), but could be used for E820 reserved areas if
that turns out to be necessary.

We previously avoided this area in pcibios_align_resource(). This patch
moves the test from that PCI-specific path to a generic path, so *all*
resource allocations will avoid this area.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/resource.c | 14 ++++++++++++++
arch/x86/pci/i386.c | 3 ---
3 files changed, 15 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/kernel/resource.c


diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9e13763..1e99475 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -45,6 +45,7 @@ obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
obj-y += tsc.o io_delay.o rtc.o
obj-y += pci-iommu_table.o
+obj-y += resource.o

obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o
obj-y += process.o
diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
new file mode 100644
index 0000000..5dd6473
--- /dev/null
+++ b/arch/x86/kernel/resource.c
@@ -0,0 +1,14 @@
+#include <linux/ioport.h>
+#include <asm/e820.h>
+
+void arch_remove_reservations(struct resource *avail)
+{
+ /*
+ * Trim out the area reserved for BIOS (low 1MB). We could also remove
+ * E820 "reserved" areas here.
+ */
+ if (avail->flags & IORESOURCE_MEM) {
+ if (avail->start < BIOS_END)
+ avail->start = BIOS_END;
+ }
+}
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index c4bb261..b261011 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -77,9 +77,6 @@ pcibios_align_resource(void *data, const struct resource *res,
*/
if (!skip_isa_ioresource_align(dev))
start &= ~0x300;
- } else if (res->flags & IORESOURCE_MEM) {
- if (start < BIOS_END)
- start = res->end; /* fail; no space */
}
return start;
}

2010-12-08 21:36:32

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources


The HP nx6325 BIOS doesn't report any devices in the [0xf8000000-0xfbffffff]
region via ACPI devices or the E820 memory map, but when we assign it to the
00:14.4 bridge as a prefetchable memory window, the machine hangs.

I determined experimentally that there are only three 1MB regions in
that area that cause trouble, so this fixup builds a fake PNP device
that consumes those regions.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23332
Reported-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
---

drivers/pnp/quirks.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)


diff --git a/drivers/pnp/quirks.c b/drivers/pnp/quirks.c
index f18bb69..e7de402 100644
--- a/drivers/pnp/quirks.c
+++ b/drivers/pnp/quirks.c
@@ -343,7 +343,37 @@ static struct pnp_protocol pnp_fixup_protocol = {
.name = "Plug and Play fixup",
};

+static int __init hp_nx6325_fixup(const struct dmi_system_id *d)
+{
+ struct pnp_dev *dev;
+
+ /*
+ * The BIOS apparently forgot to describe some regions in the
+ * address map. See https://bugzilla.kernel.org/show_bug.cgi?id=23332
+ */
+
+ dev = pnp_alloc_dev(&pnp_fixup_protocol, 0, "LNXHAZRD");
+ if (!dev)
+ return 0;
+
+ dev->active = 1;
+ pnp_add_mem_resource(dev, 0xf8300000, 0xf83fffff, 0);
+ pnp_add_mem_resource(dev, 0xf8500000, 0xf85fffff, 0);
+ pnp_add_mem_resource(dev, 0xf9100000, 0xf91fffff, 0);
+ pnp_add_device(dev);
+ dev_info(&dev->dev, "added to work around BIOS defect\n");
+ return 0;
+}
+
static const struct dmi_system_id pnp_fixup_table[] __initconst = {
+ {
+ .callback = hp_nx6325_fixup,
+ .ident = "HP nx6325 laptop",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6325"),
+ },
+ },
{}
};

2010-12-08 21:36:27

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 4/5] PNP: add framework for platform PNP quirks


This allows platform quirks to fabricate PNP devices to describe things
that should have been described via ACPI. For example, if the BIOS writer
forgets to describe a device, we may assign its address space to another
device, causing a conflict. We can avoid the conflict by making a fake PNP
device to stand in for the one the BIOS missed.

In that case, there's no ACPI or PNPBIOS device, so we need a new
pnp_protocol that doesn't go back to firmware for get/set/wakeup/
suspend/etc.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

drivers/pnp/base.h | 2 ++
drivers/pnp/core.c | 9 ++++++++-
drivers/pnp/quirks.c | 15 +++++++++++++++
3 files changed, 25 insertions(+), 1 deletions(-)


diff --git a/drivers/pnp/base.h b/drivers/pnp/base.h
index 19bc736..dca301e 100644
--- a/drivers/pnp/base.h
+++ b/drivers/pnp/base.h
@@ -7,6 +7,8 @@ extern spinlock_t pnp_lock;
extern struct device_attribute pnp_interface_attrs[];
void *pnp_alloc(long size);

+void platform_pnp_fixups(void);
+
int pnp_register_protocol(struct pnp_protocol *protocol);
void pnp_unregister_protocol(struct pnp_protocol *protocol);

diff --git a/drivers/pnp/core.c b/drivers/pnp/core.c
index 0f34d96..5076493 100644
--- a/drivers/pnp/core.c
+++ b/drivers/pnp/core.c
@@ -212,7 +212,14 @@ void __pnp_remove_device(struct pnp_dev *dev)

static int __init pnp_init(void)
{
- return bus_register(&pnp_bus_type);
+ int ret;
+
+ ret = bus_register(&pnp_bus_type);
+ if (ret)
+ return ret;
+
+ platform_pnp_fixups();
+ return 0;
}

subsys_initcall(pnp_init);
diff --git a/drivers/pnp/quirks.c b/drivers/pnp/quirks.c
index dfbd5a6..f18bb69 100644
--- a/drivers/pnp/quirks.c
+++ b/drivers/pnp/quirks.c
@@ -15,6 +15,7 @@

#include <linux/types.h>
#include <linux/kernel.h>
+#include <linux/dmi.h>
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/pnp.h>
@@ -337,3 +338,17 @@ void pnp_fixup_device(struct pnp_dev *dev)
f->quirk_function(dev);
}
}
+
+static struct pnp_protocol pnp_fixup_protocol = {
+ .name = "Plug and Play fixup",
+};
+
+static const struct dmi_system_id pnp_fixup_table[] __initconst = {
+ {}
+};
+
+void __init platform_pnp_fixups(void)
+{
+ pnp_register_protocol(&pnp_fixup_protocol);
+ dmi_check_system(pnp_fixup_table);
+}

2010-12-08 21:37:14

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 3/5] x86: avoid PNP resources when allocating address space


Remove PNP resources from "available space" used by allocate_resource().
This would be better done by putting the PNP resources directly in the
resource maps (iomem_resource, etc), but there are some issues that
keep us from doing that yet.

This patch keeps us from handing out PNP device address space to other
callers of allocate_resource().

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23332
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23542
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23802
Reported-by: Matthew Garrett <[email protected]>
Reported-by: Dan Williams <[email protected]>
Reported-by: Jiri Slaby <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/kernel/resource.c | 78 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 78 insertions(+), 0 deletions(-)


diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 5dd6473..1daee92 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -1,6 +1,80 @@
#include <linux/ioport.h>
+#include <linux/pnp.h>
#include <asm/e820.h>

+#ifdef CONFIG_PNP
+static bool resource_conflict(struct resource *res, resource_size_t start,
+ resource_size_t end)
+{
+ /*
+ * Return true if and only if "res" conflicts with the [start-end]
+ * range.
+ */
+ return res->start <= end && res->end >= start;
+}
+
+static void resource_split(struct resource *res, resource_size_t start,
+ resource_size_t end, struct resource *low,
+ struct resource *high)
+{
+ /*
+ * If "res" conflicts with [start-end], split "res" into the
+ * part below "start" (low) and the part above "end" (high),
+ * either (or both) of which may be empty.
+ *
+ * If there's no conflict, return the entire "res" as "low".
+ */
+ *low = *res;
+ low->start = res->start;
+ low->end = res->start - 1; /* default to empty (size 0) */
+
+ *high = *res;
+ high->end = res->end;
+ high->start = res->end + 1; /* default to empty (size 0) */
+
+ if (!resource_conflict(res, start, end)) {
+ low->end = res->end;
+ return;
+ }
+
+ if (res->start < start)
+ low->end = start - 1;
+
+ if (res->end > end)
+ high->start = end + 1;
+}
+
+static void pnp_remove_reservations(struct resource *avail)
+{
+ unsigned long type = resource_type(avail);
+ struct pnp_dev *dev;
+ int i;
+ struct resource *res, low, high;
+
+ /*
+ * Clip the available region to avoid PNP devices. The PNP
+ * resources really should be in the resource map to begin with,
+ * but there are still some issues preventing that.
+ */
+ pnp_for_each_dev(dev) {
+ i = 0;
+ res = pnp_get_resource(dev, type, i++);
+ while (res) {
+ if (!(res->flags & IORESOURCE_WINDOW)) {
+ resource_split(avail, res->start, res->end,
+ &low, &high);
+ if (resource_size(&low) > resource_size(&high))
+ *avail = low;
+ else
+ *avail = high;
+ }
+
+ res = pnp_get_resource(dev, type, i++);
+ }
+ }
+}
+#endif
+
void arch_remove_reservations(struct resource *avail)
{
/*
@@ -11,4 +85,8 @@ void arch_remove_reservations(struct resource *avail)
if (avail->start < BIOS_END)
avail->start = BIOS_END;
}
+
+#ifdef CONFIG_PNP
+ pnp_remove_reservations(avail);
+#endif
}

2010-12-08 21:37:49

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 0/5] resources: add arch hook for preventing allocation in reserved areas

Crap, here's the cover letter I meant to include.

[RFC] PCI, PNP, resources: effectively reserve ACPI device resources

2.6.37-rc includes some patches of mine to change the way we allocate address
space for devices, and these caused some regressions.

Most device resource allocation is done by the BIOS, but Linux does allocate
space if the BIOS doesn't do it (e.g., for CardBus or other hotplug devices)
or if the BIOS did it wrong (e.g., it put a PCI device outside the host bridge
windows).

My original 2.6.37-rc patches make Linux allocate space from the top down
instead of from the bottom up. We did this because that's what Windows does,
and doing it the same way helps us avoid some BIOS defects, such as
https://bugzilla.kernel.org/show_bug.cgi?id=16228 .

The regressions caused by these changes include:

https://bugzilla.kernel.org/show_bug.cgi?id=23332

Here, we allocated a 64MB bridge window for a subtractive-decode bridge
leading to a CardBus controller, and the window conflicts with devices not
reported by the BIOS. Windows doesn't see this problem because it relies
on the fact that the bridge is in subtractive-decode mode and doesn't
program the window at all.

https://bugzilla.kernel.org/show_bug.cgi?id=23542
https://bugzilla.kernel.org/show_bug.cgi?id=23802

In both cases, Linux allocated space that conflicts with an ACPI device
because it ignores resource usage reported by ACPI. Windows avoids this
problem because it pays attention to the ACPI _CRS methods that report
device resource usage.

I think the best way to address these problems would be to put ACPI device
resources in the resource maps, just like we do for PCI devices. We tried this
a couple years ago (http://lkml.org/lkml/2007/11/1/214) and tripped over some
subtle issues. We should try again, but not at this stage of the -rc.

Instead, these patches take the approach of adding a filter in the
allocate_resource() path so each arch can prevent allocation of any address
ranges it cares about. For x86, that means looking through all the PNP
devices and protecting the address space they use.

In addition, for 23332, I added a quirk that builds a fake PNP device to
protect the unreported devices. I'm a little nervous about this because I've
been told that the WHQL tests look for unreported devices, and the HP nx6325
should have passwd. But it's the best I can think of for now.

Bjorn

2010-12-10 20:30:18

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Wed, 08 Dec 2010 14:36:06 -0700
Bjorn Helgaas <[email protected]> wrote:

>
> This adds arch_remove_reservations(), which an arch can implement if it
> needs to protect part of the address space from allocation.
>
> Sometimes that can be done by just requesting a resource. This hook is to
> cover cases where protected area doesn't fit well in the hierarchical
> resource tree. For example, x86 BIOS E820 reservations are not related
> to devices, so they may overlap part of, all of, or more than a device
> resource.
>
> Signed-off-by: Bjorn Helgaas <[email protected]>
> ---

Hm, this is bigger than the simple change of just avoiding the high 2M;
Linus have you checked it out yet? It's nicer than simply adjusting
PCIBIOS_MAX_MEM since it will affect all resource callers rather than
just PCI, but it's definitely bigger.

If you want just the simple change for 2.6.37 I can push that, but
we'll need to get a tested-by from Matthew:

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index ca0437c..aef9f77 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -141,7 +141,7 @@ void dma32_reserve_bootmem(void);

/* generic pci stuff */
#include <asm-generic/pci.h>
-#define PCIBIOS_MAX_MEM_32 0xffffffff
+#define PCIBIOS_MAX_MEM_32 0xfff00000

#ifdef CONFIG_NUMA
/* Returns the node based on pci bus */

and I'll queue up this set for 2.6.38.

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2010-12-10 20:36:15

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

[Actually cc'ing Matthew this time]

On Fri, 10 Dec 2010 12:30:08 -0800
Jesse Barnes <[email protected]> wrote:

> On Wed, 08 Dec 2010 14:36:06 -0700
> Bjorn Helgaas <[email protected]> wrote:
>
> >
> > This adds arch_remove_reservations(), which an arch can implement if it
> > needs to protect part of the address space from allocation.
> >
> > Sometimes that can be done by just requesting a resource. This hook is to
> > cover cases where protected area doesn't fit well in the hierarchical
> > resource tree. For example, x86 BIOS E820 reservations are not related
> > to devices, so they may overlap part of, all of, or more than a device
> > resource.
> >
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > ---
>
> Hm, this is bigger than the simple change of just avoiding the high 2M;
> Linus have you checked it out yet? It's nicer than simply adjusting
> PCIBIOS_MAX_MEM since it will affect all resource callers rather than
> just PCI, but it's definitely bigger.
>
> If you want just the simple change for 2.6.37 I can push that, but
> we'll need to get a tested-by from Matthew:
>
> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> index ca0437c..aef9f77 100644
> --- a/arch/x86/include/asm/pci.h
> +++ b/arch/x86/include/asm/pci.h
> @@ -141,7 +141,7 @@ void dma32_reserve_bootmem(void);
>
> /* generic pci stuff */
> #include <asm-generic/pci.h>
> -#define PCIBIOS_MAX_MEM_32 0xffffffff
> +#define PCIBIOS_MAX_MEM_32 0xfff00000
>
> #ifdef CONFIG_NUMA
> /* Returns the node based on pci bus */
>
> and I'll queue up this set for 2.6.38.
>
> Thanks,


--
Jesse Barnes, Intel Open Source Technology Center

2010-12-10 21:24:21

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Friday, December 10, 2010 01:36:09 pm Jesse Barnes wrote:
> [Actually cc'ing Matthew this time]
>
> On Fri, 10 Dec 2010 12:30:08 -0800
> Jesse Barnes <[email protected]> wrote:
>
> > On Wed, 08 Dec 2010 14:36:06 -0700
> > Bjorn Helgaas <[email protected]> wrote:
> >
> > >
> > > This adds arch_remove_reservations(), which an arch can implement if it
> > > needs to protect part of the address space from allocation.
> > >
> > > Sometimes that can be done by just requesting a resource. This hook is to
> > > cover cases where protected area doesn't fit well in the hierarchical
> > > resource tree. For example, x86 BIOS E820 reservations are not related
> > > to devices, so they may overlap part of, all of, or more than a device
> > > resource.
> > >
> > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > ---
> >
> > Hm, this is bigger than the simple change of just avoiding the high 2M;
> > Linus have you checked it out yet? It's nicer than simply adjusting
> > PCIBIOS_MAX_MEM since it will affect all resource callers rather than
> > just PCI, but it's definitely bigger.

Dan Williams also reproduced the problem on a 2530p and tested this
series (though I think he actually tested a backport in Fedora):
https://bugzilla.kernel.org/show_bug.cgi?id=23542#c7
https://bugzilla.kernel.org/show_bug.cgi?id=23542#c23

The simple PCIBIOS_MAX_MEM_32 change below will fix the 2530p problem,
but it won't help fix the nx6325 problem Rafael reported; details at:
https://bugzilla.kernel.org/show_bug.cgi?id=23332#c20

Rafael did confirm on #acpi Wednesday that the full series fixed
his nx6325.

Bjorn

> > If you want just the simple change for 2.6.37 I can push that, but
> > we'll need to get a tested-by from Matthew:
> >
> > diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> > index ca0437c..aef9f77 100644
> > --- a/arch/x86/include/asm/pci.h
> > +++ b/arch/x86/include/asm/pci.h
> > @@ -141,7 +141,7 @@ void dma32_reserve_bootmem(void);
> >
> > /* generic pci stuff */
> > #include <asm-generic/pci.h>
> > -#define PCIBIOS_MAX_MEM_32 0xffffffff
> > +#define PCIBIOS_MAX_MEM_32 0xfff00000
> >
> > #ifdef CONFIG_NUMA
> > /* Returns the node based on pci bus */
> >
> > and I'll queue up this set for 2.6.38.
> >
> > Thanks,
>
>
>

2010-12-11 01:37:54

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Fri, 10 Dec 2010 14:07:24 -0700
Bjorn Helgaas <[email protected]> wrote:

> On Friday, December 10, 2010 01:36:09 pm Jesse Barnes wrote:
> > [Actually cc'ing Matthew this time]
> >
> > On Fri, 10 Dec 2010 12:30:08 -0800
> > Jesse Barnes <[email protected]> wrote:
> >
> > > On Wed, 08 Dec 2010 14:36:06 -0700
> > > Bjorn Helgaas <[email protected]> wrote:
> > >
> > > >
> > > > This adds arch_remove_reservations(), which an arch can implement if it
> > > > needs to protect part of the address space from allocation.
> > > >
> > > > Sometimes that can be done by just requesting a resource. This hook is to
> > > > cover cases where protected area doesn't fit well in the hierarchical
> > > > resource tree. For example, x86 BIOS E820 reservations are not related
> > > > to devices, so they may overlap part of, all of, or more than a device
> > > > resource.
> > > >
> > > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > > ---
> > >
> > > Hm, this is bigger than the simple change of just avoiding the high 2M;
> > > Linus have you checked it out yet? It's nicer than simply adjusting
> > > PCIBIOS_MAX_MEM since it will affect all resource callers rather than
> > > just PCI, but it's definitely bigger.
>
> Dan Williams also reproduced the problem on a 2530p and tested this
> series (though I think he actually tested a backport in Fedora):
> https://bugzilla.kernel.org/show_bug.cgi?id=23542#c7
> https://bugzilla.kernel.org/show_bug.cgi?id=23542#c23
>
> The simple PCIBIOS_MAX_MEM_32 change below will fix the 2530p problem,
> but it won't help fix the nx6325 problem Rafael reported; details at:
> https://bugzilla.kernel.org/show_bug.cgi?id=23332#c20
>
> Rafael did confirm on #acpi Wednesday that the full series fixed
> his nx6325.

Thanks, I'll add Dan and Rafael's tested-bys to the patches (they're
already in my for-linus tree). Unless Linus has a problem with them
I'll send them over to him this weekend or Monday.

--
Jesse Barnes, Intel Open Source Technology Center

2010-12-12 03:31:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Wed, Dec 8, 2010 at 1:36 PM, Bjorn Helgaas <[email protected]> wrote:
>
> The HP nx6325 BIOS doesn't report any devices in the [0xf8000000-0xfbffffff]
> region via ACPI devices or the E820 memory map, but when we assign it to the
> 00:14.4 bridge as a prefetchable memory window, the machine hangs.

Quite frankly, I think this patch sucks.

It sucks because these kinds of hw-specific patches are fundamentally
a sign of something else being wrong. Why didn't windows hit this? Why
do we need this total hack?

And is there any reason at all to believe that that one particular
laptop is really special? I doubt it. And what happens for the next
random machine that comes along an hits this?

Maybe we should just say that if we know the bridge is negative
decode, and it hasn't been set up by the BIOS, we just don't allocate
it at all. And try to look like Windows.

Or figure out what else Windows is doing differently.

The whole "allocate bottom up" old PCI allocation has _years_ of
testing and quirk that have been gathered over a long time. We can't
just say "we'll do the same thing for the top-down allocator".

The WHOLE AND ONLY POINT of the top-down allocator was to act lik
Windows and not need crap like this. If that doesn't work, then I
seriously don't think we should change bottom-up to top-down at all,
and for 2.6.37 we should just revert the "set to top-down by default".

Seriously. That "whole and only point" thing is important. If we need
hacks like this, then we shouldn't do it. We're much better off with
the model that has year of testing an not the upheaval. Top-down
allocation is in _no_ way inherently better, the only excuse for it
was supposed to be "we don't need these kinds of hooks".

Linus

2010-12-12 03:34:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Fri, Dec 10, 2010 at 5:37 PM, Jesse Barnes <[email protected]> wrote:
>
> Thanks, I'll add Dan and Rafael's tested-bys to the patches (they're
> already in my for-linus tree). ?Unless Linus has a problem with them
> I'll send them over to him this weekend or Monday.

See my other email I just sent out.

I really am not going to take some totally new experimental and hacky
major PCI resource management thing this late in the -rc game. No way,
no how.

If the top-down allocator is causing regressions that cannot be fixed
by _simple_ patches, we're simply going to have to undo it. What's the
advantage of top-down? None. Not if we then need all this crap, which
we could as easily do on top of the bottom-up one WITHOUT any
regressions.

Why isn't anybody else questioning the whole basic premise here?

Linus

2010-12-12 04:16:21

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Sat, 11 Dec 2010 19:34:05 -0800
Linus Torvalds <[email protected]> wrote:

> On Fri, Dec 10, 2010 at 5:37 PM, Jesse Barnes <[email protected]> wrote:
> >
> > Thanks, I'll add Dan and Rafael's tested-bys to the patches (they're
> > already in my for-linus tree).  Unless Linus has a problem with them
> > I'll send them over to him this weekend or Monday.
>
> See my other email I just sent out.
>
> I really am not going to take some totally new experimental and hacky
> major PCI resource management thing this late in the -rc game. No way,
> no how.
>
> If the top-down allocator is causing regressions that cannot be fixed
> by _simple_ patches, we're simply going to have to undo it. What's the
> advantage of top-down? None. Not if we then need all this crap, which
> we could as easily do on top of the bottom-up one WITHOUT any
> regressions.
>
> Why isn't anybody else questioning the whole basic premise here?

Questioning the whole premise is fine, but so far we've gone in (or at
least think we're going in) a consistent direction: behave like Windows
on platforms designed for Windows to avoid bugs that Windows doesn't
hit and enable all the same devices Windows allows.

But yes, I really don't like the nx6325 patch either; there's obviously
something we're still missing that's preventing us from doing the right
thing on that platform. Quirking it isn't a good long term answer.

--
Jesse Barnes, Intel Open Source Technology Center

2010-12-12 05:23:26

by Dave Airlie

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Sun, Dec 12, 2010 at 1:30 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Dec 8, 2010 at 1:36 PM, Bjorn Helgaas <[email protected]> wrote:
>>
>> The HP nx6325 BIOS doesn't report any devices in the [0xf8000000-0xfbffffff]
>> region via ACPI devices or the E820 memory map, but when we assign it to the
>> 00:14.4 bridge as a prefetchable memory window, the machine hangs.
>
> Quite frankly, I think this patch sucks.
>
> It sucks because these kinds of hw-specific patches are fundamentally
> a sign of something else being wrong. Why didn't windows hit this? Why
> do we need this total hack?
>
> And is there any reason at all to believe that that one particular
> laptop is really special? I doubt it. And what happens for the next
> random machine that comes along an hits this?
>
> Maybe we should just say that if we know the bridge is negative
> decode, and it hasn't been set up by the BIOS, we just don't allocate
> it at all. And try to look like Windows.
>
> Or figure out what else Windows is doing differently.
>
> The whole "allocate bottom up" old PCI allocation has _years_ of
> testing and quirk that have been gathered over a long time. We can't
> just say "we'll do the same thing for the top-down allocator".
>
> The WHOLE AND ONLY POINT of the top-down allocator was to act lik
> Windows and not need crap like this. If that doesn't work, then I
> seriously don't think we should change bottom-up to top-down at all,
> and for 2.6.37 we should just revert the "set to top-down by default".
>
> Seriously. That "whole and only point" thing is important. If we need
> hacks like this, then we shouldn't do it. We're much better off with
> the model that has year of testing an not the upheaval. Top-down
> allocation is in _no_ way inherently better, the only excuse for it
> was supposed to be "we don't need these kinds of hooks".

I agree, I've got an NX6125 predecessor of this and I'll take any
money that is has an equally insane BIOS, considering its the worst
laptop in my pile in nearly every way imaginable.

and I suspect the HP laptop braindamage won't end with those two.

Dave.

2010-12-12 06:17:37

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Sat, Dec 11, 2010 at 07:30:54PM -0800, Linus Torvalds wrote:
> On Wed, Dec 8, 2010 at 1:36 PM, Bjorn Helgaas <[email protected]> wrote:
> >
> > The HP nx6325 BIOS doesn't report any devices in the [0xf8000000-0xfbffffff]
> > region via ACPI devices or the E820 memory map, but when we assign it to the
> > 00:14.4 bridge as a prefetchable memory window, the machine hangs.
>
> Quite frankly, I think this patch sucks.
>
> It sucks because these kinds of hw-specific patches are fundamentally
> a sign of something else being wrong. Why didn't windows hit this? Why
> do we need this total hack?

I agree, it *does* suck, and I *am* quite worried about how many
issues like this we might trip over.

Windows didn't hit this because of other differences:
- Windows relies on subtractive decode; Linux programs a window
- Windows gives the downstream CardBus bridge a 4K and a 64M mem window;
Linux gives it two 64M windows
- Windows doesn't align the CardBus windows on their size; Linux does

Under Windows, the CardBus windows don't conflict with the unreported
devices. The larger windows allocated by Linux do. So relying on
subtractive decode would work until we plug in a CardBus device that
expects to *use* that area, and then the device won't work.

> And is there any reason at all to believe that that one particular
> laptop is really special? I doubt it. And what happens for the next
> random machine that comes along an hits this?
>
> Maybe we should just say that if we know the bridge is negative
> decode, and it hasn't been set up by the BIOS, we just don't allocate
> it at all. And try to look like Windows.

I do like this idea more than I did at first, even though it's not a
complete fix, because it's much better to have a non-working CardBus
device than a hanging machine. But I guess we'd still want to fix
that device, and then we're back at a hw-specific quirk like this one.
Maybe we should do both (leave the bridge alone and keep the quirk).

> Or figure out what else Windows is doing differently.
>
> The whole "allocate bottom up" old PCI allocation has _years_ of
> testing and quirk that have been gathered over a long time. We can't
> just say "we'll do the same thing for the top-down allocator".

True (although most of the quirks I can think of are in the form of
hard-coded legacy device reservations, and we're keeping those).

If we didn't care about host bridge _CRS, there'd be no reason to
change the old PCI allocation scheme. But I think we *do* care and
will care more in the future. We frequently assign resources to
devices the BIOS didn't configure, and that only works if we're lucky
or there's only one host bridge.

> The WHOLE AND ONLY POINT of the top-down allocator was to act lik
> Windows and not need crap like this. If that doesn't work, then I
> seriously don't think we should change bottom-up to top-down at all,
> and for 2.6.37 we should just revert the "set to top-down by default".
>
> Seriously. That "whole and only point" thing is important. If we need
> hacks like this, then we shouldn't do it. We're much better off with
> the model that has year of testing an not the upheaval. Top-down
> allocation is in _no_ way inherently better, the only excuse for it
> was supposed to be "we don't need these kinds of hooks".

Not really -- the main point here is to make multi-host bridge
machines work reliably, and I really don't see a way to do that
without using _CRS.

If we're going to use _CRS, I think in the long run we'll be better
off if we do it similarly to Windows, despite these early problems.

Bjorn

2010-12-12 13:21:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Sunday, December 12, 2010, Jesse Barnes wrote:
> On Sat, 11 Dec 2010 19:34:05 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Fri, Dec 10, 2010 at 5:37 PM, Jesse Barnes <[email protected]> wrote:
> > >
> > > Thanks, I'll add Dan and Rafael's tested-bys to the patches (they're
> > > already in my for-linus tree). Unless Linus has a problem with them
> > > I'll send them over to him this weekend or Monday.
> >
> > See my other email I just sent out.
> >
> > I really am not going to take some totally new experimental and hacky
> > major PCI resource management thing this late in the -rc game. No way,
> > no how.
> >
> > If the top-down allocator is causing regressions that cannot be fixed
> > by _simple_ patches, we're simply going to have to undo it. What's the
> > advantage of top-down? None. Not if we then need all this crap, which
> > we could as easily do on top of the bottom-up one WITHOUT any
> > regressions.
> >
> > Why isn't anybody else questioning the whole basic premise here?
>
> Questioning the whole premise is fine, but so far we've gone in (or at
> least think we're going in) a consistent direction: behave like Windows
> on platforms designed for Windows to avoid bugs that Windows doesn't
> hit and enable all the same devices Windows allows.
>
> But yes, I really don't like the nx6325 patch either; there's obviously
> something we're still missing that's preventing us from doing the right
> thing on that platform. Quirking it isn't a good long term answer.

OK, so I guess the best thing we can do for 2.6.37 is to revert
1af3c2e (x86: allocate space within a region top-down), right?

Rafael

2010-12-13 05:49:18

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Sun, Dec 12, 2010 at 02:20:56PM +0100, Rafael J. Wysocki wrote:
> On Sunday, December 12, 2010, Jesse Barnes wrote:
> > On Sat, 11 Dec 2010 19:34:05 -0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > > On Fri, Dec 10, 2010 at 5:37 PM, Jesse Barnes <[email protected]> wrote:
> > > >
> > > > Thanks, I'll add Dan and Rafael's tested-bys to the patches (they're
> > > > already in my for-linus tree). Unless Linus has a problem with them
> > > > I'll send them over to him this weekend or Monday.
> > >
> > > See my other email I just sent out.
> > >
> > > I really am not going to take some totally new experimental and hacky
> > > major PCI resource management thing this late in the -rc game. No way,
> > > no how.
> > >
> > > If the top-down allocator is causing regressions that cannot be fixed
> > > by _simple_ patches, we're simply going to have to undo it. What's the
> > > advantage of top-down? None. Not if we then need all this crap, which
> > > we could as easily do on top of the bottom-up one WITHOUT any
> > > regressions.
> > >
> > > Why isn't anybody else questioning the whole basic premise here?
> >
> > Questioning the whole premise is fine, but so far we've gone in (or at
> > least think we're going in) a consistent direction: behave like Windows
> > on platforms designed for Windows to avoid bugs that Windows doesn't
> > hit and enable all the same devices Windows allows.
> >
> > But yes, I really don't like the nx6325 patch either; there's obviously
> > something we're still missing that's preventing us from doing the right
> > thing on that platform. Quirking it isn't a good long term answer.
>
> OK, so I guess the best thing we can do for 2.6.37 is to revert
> 1af3c2e (x86: allocate space within a region top-down), right?

That's a possibility, but I'm not sure it's the best. It would avoid
the nx6325 landmine, but we could easily step on a different one on
other machines.

If Windows uses the end of available space first, that's the best-tested
territory, and I think it's better to try to stay in it rather revert
1af3c2e and start mapping different territory that really isn't tested
by Windows at all.

On the nx6325, we stray out of that tested territory because Linux opens
windows on subtractive-decode bridges and allocates more space to
CardBus bridges.

I think it'd be fairly simple to leave subtractive-decode bridges alone,
and that might make the nx6325 work well enough for .37, even without
the quirk. This feels like a change that might be the right thing to do
in general. Using subtractive rather than positive decode would give up
a little performance, but I doubt it's important

There's another regression,
https://bugzilla.kernel.org/show_bug.cgi?id=23132, that seems similar,
though we haven't debugged it yet. It also has to do with assigning
resources that aren't strictly needed to bridges, so there might be more
changes we should make to that strategy.

Bjorn

2010-12-13 13:48:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas


* Bjorn Helgaas <[email protected]> wrote:

> > OK, so I guess the best thing we can do for 2.6.37 is to revert 1af3c2e (x86:
> > allocate space within a region top-down), right?
>
> That's a possibility, but I'm not sure it's the best. It would avoid the nx6325
> landmine, but we could easily step on a different one on other machines.
>
> If Windows uses the end of available space first, that's the best-tested
> territory, and I think it's better to try to stay in it rather revert 1af3c2e and
> start mapping different territory that really isn't tested by Windows at all.
>
> On the nx6325, we stray out of that tested territory because Linux opens windows
> on subtractive-decode bridges and allocates more space to CardBus bridges.
>
> I think it'd be fairly simple to leave subtractive-decode bridges alone, and that
> might make the nx6325 work well enough for .37, even without the quirk. This
> feels like a change that might be the right thing to do in general. Using
> subtractive rather than positive decode would give up a little performance, but I
> doubt it's important

Is there a patch for that that people could try?

The regression needs to be addressed one way or another: either by improving the
"compatible with Windows" approach to actually work (without quirks), or by
reverting to the tested-for-years Linux solution.

( What we always try to avoid is trading in a set of regressions for another set of
regressions. That way lies madness, it can not result in reliable, provable
progress. )

Thanks,

Ingo

2010-12-14 20:35:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Sat, Dec 11, 2010 at 10:17 PM, Bjorn Helgaas <[email protected]> wrote:
>
> Not really -- the main point here is to make multi-host bridge
> machines work reliably, and I really don't see a way to do that
> without using _CRS.
>
> If we're going to use _CRS, I think in the long run we'll be better
> off if we do it similarly to Windows, despite these early problems.

It's not about any "despite these early problems".

It's about "clearly we're not doing things at all like Windows, and
it's just broken".

The thing is, we will never be able to match Windows exactly. It may
well have random hardcoded quirks we simply don't know about.

I'm perfectly happy with you aiming to use _CRS. I am _not_ happy with
you then using that as an excuse to then do things that don't work.

We will NOT start doing random BIOS-specific quirks just because
top-down allocations hit other bugs than bottom-up ones do. Just no.
We'll continue doing that we have tried to do, which is to perhaps
have quirks that are specific to *hardware* (like the ones in
drivers/pci/quirks.c) and just filling in stuff that some BIOSes are
known to get wrong.

That's a maintainable approach. But it's maintainable ONLY if we then
don't do other random changes that invalidates all the years of
testing we've had.

Linus

2010-12-14 20:45:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Tue, Dec 14, 2010 at 12:34 PM, Linus Torvalds
<[email protected]> wrote:
>
> That's a maintainable approach. But it's maintainable ONLY if we then
> don't do other random changes that invalidates all the years of
> testing we've had.

Btw, looking at all the x86-specific commits that have gone in, I'm
*extremely* unhappy that they apparently stopped honoring that
"resource_alloc_from_bottom" flag that I explicitly asked for.

So it looks like it's not enough to just set that flag. We have to
actually revert all the commits in this area as broken.

Which is sad, but since they clearly *are* broken and don't honor the
flag that was there explicitly to avoid this problem and make it easy
to test reverting it, I'm really pissed off. The WHOLE POINT of that
flag was to give people an option to say "use the old resource
allocation order because the new one doesn't work for me".

So at this point the only question is whether I should just revert the
whole effing lot, or whether there are patches to fix the code to
honor the "allocate from bottom" bit and then just set it by default
again.

Bjorn? Preferences?

Linus

2010-12-14 23:57:37

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Tuesday, December 14, 2010 01:44:51 pm Linus Torvalds wrote:
> On Tue, Dec 14, 2010 at 12:34 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > That's a maintainable approach. But it's maintainable ONLY if we then
> > don't do other random changes that invalidates all the years of
> > testing we've had.
>
> Btw, looking at all the x86-specific commits that have gone in, I'm
> *extremely* unhappy that they apparently stopped honoring that
> "resource_alloc_from_bottom" flag that I explicitly asked for.

In 20-20 hindsight, I should have made that switch affect more things.
I tried to do what you asked; I obviously just didn't do enough, and
I am sorry.

> So it looks like it's not enough to just set that flag. We have to
> actually revert all the commits in this area as broken.
>
> Which is sad, but since they clearly *are* broken and don't honor the
> flag that was there explicitly to avoid this problem and make it easy
> to test reverting it, I'm really pissed off. The WHOLE POINT of that
> flag was to give people an option to say "use the old resource
> allocation order because the new one doesn't work for me".
>
> So at this point the only question is whether I should just revert the
> whole effing lot, or whether there are patches to fix the code to
> honor the "allocate from bottom" bit and then just set it by default
> again.
>
> Bjorn? Preferences?

Let me identify the set of reversion candidates and the consequences,
and then we can figure out whether it's better to retreat or push
forward.

Bjorn

2010-12-15 00:12:42

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/5] resources: add arch hook for preventing allocation in reserved areas

On Monday, December 13, 2010 06:47:39 am Ingo Molnar wrote:
>
> * Bjorn Helgaas <[email protected]> wrote:
>
> > > OK, so I guess the best thing we can do for 2.6.37 is to revert 1af3c2e (x86:
> > > allocate space within a region top-down), right?
> >
> > That's a possibility, but I'm not sure it's the best. It would avoid the nx6325
> > landmine, but we could easily step on a different one on other machines.
> >
> > If Windows uses the end of available space first, that's the best-tested
> > territory, and I think it's better to try to stay in it rather revert 1af3c2e and
> > start mapping different territory that really isn't tested by Windows at all.
> >
> > On the nx6325, we stray out of that tested territory because Linux opens windows
> > on subtractive-decode bridges and allocates more space to CardBus bridges.
> >
> > I think it'd be fairly simple to leave subtractive-decode bridges alone, and that
> > might make the nx6325 work well enough for .37, even without the quirk. This
> > feels like a change that might be the right thing to do in general. Using
> > subtractive rather than positive decode would give up a little performance, but I
> > doubt it's important
>
> Is there a patch for that that people could try?

Yes. https://bugzilla.kernel.org/show_bug.cgi?id=23332#c22

It works for me, but it's such a new idea to me that I'm not quite
used to it and don't know how confident to be yet.

Bjorn

2010-12-15 06:02:35

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Tue, Dec 14, 2010 at 12:44:51PM -0800, Linus Torvalds wrote:
> On Tue, Dec 14, 2010 at 12:34 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > That's a maintainable approach. But it's maintainable ONLY if we then
> > don't do other random changes that invalidates all the years of
> > testing we've had.
>
> Btw, looking at all the x86-specific commits that have gone in, I'm
> *extremely* unhappy that they apparently stopped honoring that
> "resource_alloc_from_bottom" flag that I explicitly asked for.
>
> So it looks like it's not enough to just set that flag. We have to
> actually revert all the commits in this area as broken.
>
> Which is sad, but since they clearly *are* broken and don't honor the
> flag that was there explicitly to avoid this problem and make it easy
> to test reverting it, I'm really pissed off. The WHOLE POINT of that
> flag was to give people an option to say "use the old resource
> allocation order because the new one doesn't work for me".
>
> So at this point the only question is whether I should just revert the
> whole effing lot, or whether there are patches to fix the code to
> honor the "allocate from bottom" bit and then just set it by default
> again.

I'm not sure there's value in having both "pci=nocrs" and
"resource_alloc_from_bottom" flags. What if I made it so
we allocate top-down if and only if we're using _CRS?

Then old boxes where we don't look at _CRS would use the same
bottom-up behavior they always have (this is another major
screwup in the current tree -- we currently do top-down on
these boxes for no good reason).

And on new boxes, we default to using _CRS and allocating
top-down, but if that doesn't work, we can use "pci=nocrs"
and go back to the old "ignore _CRS and allocate bottom-up"
behavior.

Here's a sample patch which I will test and document if you
think it's a reasonable approach:


commit e39250083dbdd0b42e886aa858d0ffbc86e618c4
Author: Bjorn Helgaas <[email protected]>
Date: Tue Dec 14 22:10:16 2010 -0700

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 21c6746..85268f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -769,7 +769,6 @@ void __init setup_arch(char **cmdline_p)

x86_init.oem.arch_setup();

- resource_alloc_from_bottom = 0;
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
setup_memory_map();
parse_setup_data();
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 0972315..ca770fc 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -68,6 +68,9 @@ void __init pci_acpi_crs_quirks(void)
"if necessary, use \"pci=%s\" and report a bug\n",
pci_use_crs ? "Using" : "Ignoring",
pci_use_crs ? "nocrs" : "use_crs");
+
+ if (pci_use_crs)
+ resource_alloc_from_top = 1;
}

static acpi_status
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index c4bb261..57a1374 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -65,9 +65,16 @@ pcibios_align_resource(void *data, const struct resource *res,
resource_size_t size, resource_size_t align)
{
struct pci_dev *dev = data;
- resource_size_t start = round_down(res->end - size + 1, align);
+ resource_size_t start;
+
+ if (resource_alloc_from_top)
+ start = round_down(res->end - size + 1, align);
+ else
+ start = res->start;

if (res->flags & IORESOURCE_IO) {
+ if (skip_isa_ioresource_align(dev))
+ return start;

/*
* If we're avoiding ISA aliases, the largest contiguous I/O
@@ -75,11 +82,18 @@ pcibios_align_resource(void *data, const struct resource *res,
* all 256-byte and smaller alignments, so the result will
* still be correctly aligned.
*/
- if (!skip_isa_ioresource_align(dev))
+ if (resource_alloc_from_top)
start &= ~0x300;
+ else if (start & 0x300)
+ start = (start + 0x3ff) & ~0x3ff;
+
} else if (res->flags & IORESOURCE_MEM) {
- if (start < BIOS_END)
- start = res->end; /* fail; no space */
+ if (start < BIOS_END) {
+ if (resource_alloc_from_top)
+ start = res->end; /* fail; no space */
+ else
+ start = BIOS_END;
+ }
}
return start;
}
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 003170e..bd59bc5 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -160,7 +160,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
resource_size_t),
void *alignf_data)
{
- int ret = -ENOMEM;
+ int i, ret = -ENOMEM;
struct resource *r;
resource_size_t max = -1;
unsigned int type = res->flags & IORESOURCE_TYPE_BITS;
@@ -171,26 +171,51 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
if (!(res->flags & IORESOURCE_MEM_64))
max = PCIBIOS_MAX_MEM_32;

- /* Look for space at highest addresses first */
- r = pci_bus_find_resource_prev(bus, type, NULL);
- for ( ; r; r = pci_bus_find_resource_prev(bus, type, r)) {
- /* type_mask must match */
- if ((res->flags ^ r->flags) & type_mask)
- continue;
-
- /* We cannot allocate a non-prefetching resource
- from a pre-fetching area */
- if ((r->flags & IORESOURCE_PREFETCH) &&
- !(res->flags & IORESOURCE_PREFETCH))
- continue;
-
- /* Ok, try it out.. */
- ret = allocate_resource(r, res, size,
- r->start ? : min,
- max, align,
- alignf, alignf_data);
- if (ret == 0)
- break;
+ if (resource_alloc_from_top) {
+ /* Look for space at highest addresses first */
+ r = pci_bus_find_resource_prev(bus, type, NULL);
+ for ( ; r; r = pci_bus_find_resource_prev(bus, type, r)) {
+ /* type_mask must match */
+ if ((res->flags ^ r->flags) & type_mask)
+ continue;
+
+ /* We cannot allocate a non-prefetching resource
+ from a pre-fetching area */
+ if ((r->flags & IORESOURCE_PREFETCH) &&
+ !(res->flags & IORESOURCE_PREFETCH))
+ continue;
+
+ /* Ok, try it out.. */
+ ret = allocate_resource(r, res, size,
+ r->start ? : min,
+ max, align,
+ alignf, alignf_data);
+ if (ret == 0)
+ break;
+ }
+ } else {
+ pci_bus_for_each_resource(bus, r, i) {
+ if (!r)
+ continue;
+
+ /* type_mask must match */
+ if ((res->flags ^ r->flags) & type_mask)
+ continue;
+
+ /* We cannot allocate a non-prefetching resource
+ from a pre-fetching area */
+ if ((r->flags & IORESOURCE_PREFETCH) &&
+ !(res->flags & IORESOURCE_PREFETCH))
+ continue;
+
+ /* Ok, try it out.. */
+ ret = allocate_resource(r, res, size,
+ r->start ? : min,
+ max, align,
+ alignf, alignf_data);
+ if (ret == 0)
+ break;
+ }
}
return ret;
}
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index d377ea8..d427cd5 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -112,7 +112,7 @@ struct resource_list {
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
-extern int resource_alloc_from_bottom;
+extern int resource_alloc_from_top;

extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
extern int request_resource(struct resource *root, struct resource *new);
diff --git a/kernel/resource.c b/kernel/resource.c
index 9fad33e..275b414 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -41,21 +41,13 @@ EXPORT_SYMBOL(iomem_resource);
static DEFINE_RWLOCK(resource_lock);

/*
- * By default, we allocate free space bottom-up. The architecture can request
- * top-down by clearing this flag. The user can override the architecture's
- * choice with the "resource_alloc_from_bottom" kernel boot option, but that
- * should only be a debugging tool.
+ * By default, we allocate free space bottom-up, as we have done in the past.
+ * Architectures can request top-down by setting "resource_alloc_from_top".
+ * On x86, we do this when we use PCI host bridge ACPI _CRS information.
+ * If the user turns off _CRS with "pci=nocrs", we use the default bottom-up
+ * allocation strategy.
*/
-int resource_alloc_from_bottom = 1;
-
-static __init int setup_alloc_from_bottom(char *s)
-{
- printk(KERN_INFO
- "resource: allocating from bottom-up; please report a bug\n");
- resource_alloc_from_bottom = 1;
- return 0;
-}
-early_param("resource_alloc_from_bottom", setup_alloc_from_bottom);
+int resource_alloc_from_top;

static void *r_next(struct seq_file *m, void *v, loff_t *pos)
{
@@ -545,10 +537,10 @@ int allocate_resource(struct resource *root, struct resource *new,
alignf = simple_align_resource;

write_lock(&resource_lock);
- if (resource_alloc_from_bottom)
- err = find_resource(root, new, size, min, max, align, alignf, alignf_data);
- else
+ if (resource_alloc_from_top)
err = find_resource_from_top(root, new, size, min, max, align, alignf, alignf_data);
+ else
+ err = find_resource(root, new, size, min, max, align, alignf, alignf_data);
if (err >= 0 && __request_resource(root, new))
err = -EBUSY;
write_unlock(&resource_lock);

2010-12-15 06:26:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Tue, Dec 14, 2010 at 12:34:20PM -0800, Linus Torvalds wrote:
> On Sat, Dec 11, 2010 at 10:17 PM, Bjorn Helgaas <[email protected]> wrote:
> >
> > Not really -- the main point here is to make multi-host bridge
> > machines work reliably, and I really don't see a way to do that
> > without using _CRS.
> >
> > If we're going to use _CRS, I think in the long run we'll be better
> > off if we do it similarly to Windows, despite these early problems.
>
> It's not about any "despite these early problems".
>
> It's about "clearly we're not doing things at all like Windows, and
> it's just broken".
>
> The thing is, we will never be able to match Windows exactly. It may
> well have random hardcoded quirks we simply don't know about.

Granted.

> I'm perfectly happy with you aiming to use _CRS. I am _not_ happy with
> you then using that as an excuse to then do things that don't work.

I don't want to do things that make you unhappy :)

> We will NOT start doing random BIOS-specific quirks just because
> top-down allocations hit other bugs than bottom-up ones do. Just no.
> We'll continue doing that we have tried to do, which is to perhaps
> have quirks that are specific to *hardware* (like the ones in
> drivers/pci/quirks.c) and just filling in stuff that some BIOSes are
> known to get wrong.

I've only proposed one BIOS-specific quirk, which is the one for the
nx6325 unreported regions, and I identified things we do differently
than Windows that explain why we see the problem and Windows doesn't.

If we stop opening windows on subtractive-decode bridges, we don't
need that quirk to avoid the hang. We will still need it if we
want to use more than 40-odd MB of space on a PC Card.

I'm pretty confident that if we could find PC Cards that require
enough space, they wouldn't work under Windows either.

I don't know whether the other patches in this series make you
unhappy. I'm really not happy with how I implemented the avoidance
of ACPI devices when doing PCI allocation, but I do think we need
to avoid them *somehow*, and I was looking for a minimal quick
fix at this point in the cycle.

Avoiding ACPI devices fixes the Matthew's 2530p problem. We can
also avoid that particular problem with the simple PCIBIOS_MAX_MEM_32
change you proposed. However, avoiding ACPI devices fixes other
problems at the same time, such as this one:
https://bugzilla.kernel.org/show_bug.cgi?id=23802
where we put the intel-gtt "flush page" on top of an ACPI TPM
device.

Bjorn

2010-12-15 07:11:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Tue, Dec 14, 2010 at 10:26 PM, Bjorn Helgaas <[email protected]> wrote:
>
> I don't know whether the other patches in this series make you
> unhappy. ?I'm really not happy with how I implemented the avoidance
> of ACPI devices when doing PCI allocation, but I do think we need
> to avoid them *somehow*, and I was looking for a minimal quick
> fix at this point in the cycle.

So the "avoid ACPI devices" part makes sense, and doesn't involved
quirks, so I don't hate it at all the same way I hated the HP quirk.

However, I hate how it makes the allocation logic opaque. You can no
longer tell from the regular non-debug dmesg and the /proc/iomem _why_
something got allocated the way it did, because there are hidden
rules. That makes things awkward, methinks.

Also, quite frankly, I wonder what happens after release when somebody
shows another machine that simply stopped working because the
allocation strategy didn't work for it. The hw coverage that -rc6 gets
is tiny compared to a real release.

IOW, what's the long-term strategy for this? The only sane long-term
strategy I can see is the one we have _always_ done, which is to try
to populate the memory resource tree with what simply matches reality.
The whole "ok, we know the hardware better than the BIOS does" is a
_stable_ strategy. In contrast, the things you propose are NOT stable
strategies, they all depend on basically "we match windows exactly
and/or trust ACPI". Both of which are *known* to be failing models.

That's why I'm somewhat upset. Your whole strategy seems to depend on
a known broken model. We _know_ ACPI tables are crap much of the time.
So we know that "avoiding ACPI resources" is inevitably insufficient.

And that's why I hate the "switch everything around" model. Yes, we
have a known way to fix things up - namely to actually detect the
hardware itself properly when firmware inevitably screws up - but the
very act of switching things around will pretty much guarantee that
all our years of effort is of dubious value, and we'll end up finding
other laptops that used to work and no longer does.

Only switching around when _CRS is used is possible, and shouldn't
cause any regressions if we continue to default to not using _CRS. But
you want to switch that default around at some point, don't you? At
which point we'll be up sh*t creek again. See what I'm saying?

Which all makes me suspect that we'd be much better off just doing the
bottom-up allocation even for _CRS. And maybe CRS works fine then when
we combine our hardware knowledge with the ACPI region avoidance.

Linus

2010-12-15 18:18:46

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Wednesday, December 15, 2010 12:03:15 am Linus Torvalds wrote:
> On Tue, Dec 14, 2010 at 10:26 PM, Bjorn Helgaas <[email protected]> wrote:
> >
> > I don't know whether the other patches in this series make you
> > unhappy. I'm really not happy with how I implemented the avoidance
> > of ACPI devices when doing PCI allocation, but I do think we need
> > to avoid them *somehow*, and I was looking for a minimal quick
> > fix at this point in the cycle.
>
> So the "avoid ACPI devices" part makes sense, and doesn't involved
> quirks, so I don't hate it at all the same way I hated the HP quirk.
>
> However, I hate how it makes the allocation logic opaque. You can no
> longer tell from the regular non-debug dmesg and the /proc/iomem _why_
> something got allocated the way it did, because there are hidden
> rules. That makes things awkward, methinks.
>
> Also, quite frankly, I wonder what happens after release when somebody
> shows another machine that simply stopped working because the
> allocation strategy didn't work for it. The hw coverage that -rc6 gets
> is tiny compared to a real release.
>
> IOW, what's the long-term strategy for this? The only sane long-term
> strategy I can see is the one we have _always_ done, which is to try
> to populate the memory resource tree with what simply matches reality.
> The whole "ok, we know the hardware better than the BIOS does" is a
> _stable_ strategy. In contrast, the things you propose are NOT stable
> strategies, they all depend on basically "we match windows exactly
> and/or trust ACPI". Both of which are *known* to be failing models.
>
> That's why I'm somewhat upset. Your whole strategy seems to depend on
> a known broken model. We _know_ ACPI tables are crap much of the time.
> So we know that "avoiding ACPI resources" is inevitably insufficient.
>
> And that's why I hate the "switch everything around" model. Yes, we
> have a known way to fix things up - namely to actually detect the
> hardware itself properly when firmware inevitably screws up - but the
> very act of switching things around will pretty much guarantee that
> all our years of effort is of dubious value, and we'll end up finding
> other laptops that used to work and no longer does.
>
> Only switching around when _CRS is used is possible, and shouldn't
> cause any regressions if we continue to default to not using _CRS. But
> you want to switch that default around at some point, don't you? At
> which point we'll be up sh*t creek again. See what I'm saying?
>
> Which all makes me suspect that we'd be much better off just doing the
> bottom-up allocation even for _CRS. And maybe CRS works fine then when
> we combine our hardware knowledge with the ACPI region avoidance.

ACPI devices tend to be at high addresses, so allocating top-down
is definitely more dangerous unless we explicitly avoid them. I
should have realized that and done something like patches 1-3 of
this series before the top-down patches.

Doing it bottom-up would very likely work better than the "top-down
without avoiding ACPI regions" model we currently have, at least in
the short term. We *would* have to do something to avoid E820
reservations to fix this:
https://bugzilla.kernel.org/show_bug.cgi?id=16228,
but that's doable.

So here's my proposal for .37:
- Keep the current state of _CRS enabled by default (for 2008
and newer machines).
- Allocate bottom-up always
- Avoid E820 reservations

That should fix all the regressions I'm aware of. I'll work on
the patches this afternoon.

Bjorn

2010-12-15 18:28:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On 12/15/2010 10:18 AM, Bjorn Helgaas wrote:
>
> ACPI devices tend to be at high addresses, so allocating top-down
> is definitely more dangerous unless we explicitly avoid them. I
> should have realized that and done something like patches 1-3 of
> this series before the top-down patches.
>
> Doing it bottom-up would very likely work better than the "top-down
> without avoiding ACPI regions" model we currently have, at least in
> the short term. We *would* have to do something to avoid E820
> reservations to fix this:
> https://bugzilla.kernel.org/show_bug.cgi?id=16228,
> but that's doable.
>
> So here's my proposal for .37:
> - Keep the current state of _CRS enabled by default (for 2008
> and newer machines).
> - Allocate bottom-up always
> - Avoid E820 reservations
>
> That should fix all the regressions I'm aware of. I'll work on
> the patches this afternoon.
>

At the same time, I would like to see a few things done as a matter of
course:

a) reserve the top 2 MiB of the 32-bit address space. There *will* be
ROM at the top of the 32-bit address space; it's a fact of the
architecture, and on at least older systems it was common to have a
shadow 1 MiB below.

b) we may want to consider doing special things in the 0xFExxxxxx memory
range, which is used by the CPU-APIC-MSI system in recent processors.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-12-15 19:22:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 5/5] PNP: HP nx6325 fixup: reserve unreported resources

On Wed, Dec 15, 2010 at 10:18 AM, Bjorn Helgaas <[email protected]> wrote:
>
> So here's my proposal for .37:
> ?- Keep the current state of _CRS enabled by default (for 2008
> ? ?and newer machines).
> ?- Allocate bottom-up always
> ?- Avoid E820 reservations

Sounds sane to me. And yes, the fact that the BIOS tends to allocate
things top-down probably does mean that the whole bottom-up approach
tends to be the safer model.

Linus