2010-12-16 17:38:21

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

[This is 2.6.37 material]

In 2.6.34 and later, we automatically use "pci=use_crs" on recent (2008 or
newer) machines, which means we assign PCI resources from the host bridge
windows described by ACPI _CRS.

In many cases, _CRS claims the windows are bigger than they really are, e.g.,
they may overlap system memory, ROM, or other ACPI devices.

Linux allocates PCI memory bottom-up, ignores most ACPI devices, and does an
incomplete job of avoiding E820 reserved areas, which caused problems like
https://bugzilla.kernel.org/show_bug.cgi?id=16228 .

As far as I can tell, Windows deals with this situation by removing all ACPI
resources and allocating the remaining space top-down.

I tried to make Linux allocate top-down as well, but since we don't have a good
mechanism for avoiding the ACPI devices that clutter the top of memory, we
tripped over several of those, which resulted in these regressions:

https://bugzilla.kernel.org/show_bug.cgi?id=23132 (Dell 1545)
https://bugzilla.kernel.org/show_bug.cgi?id=23332 (HP nx6325)
https://bugzilla.kernel.org/show_bug.cgi?id=23542 (HP 2530p)
https://bugzilla.kernel.org/show_bug.cgi?id=23802 (PCI / TPM conflict)

So I think the best band-aid for now is to revert to bottom-up allocation
and do a better job of avoiding E820 reserved areas. Obviously, it's not
complete because we could still trip over ACPI devices (as in 23802), but
that's a problem we've always had and it's not as likely if we go bottom-up.

Patches 1-5 of this series are straight reverts ("git revert -n") of these
patches which were merged in 2.6.37-rc1:

82e3e76 PCI: fix pci_bus_alloc_resource() hang, prefer positive decode
1af3c2e x86: allocate space within a region top-down
dc9887d x86/PCI: allocate space from the end of a region, not the beginning
b126b47 PCI: allocate bus resources from the top down
e7f8567 resources: support allocating space within a region from the top down

Patches 6-9 add the ability to avoid E820 regions and the top 2MB below 4GB.
The last patch (avoid top 2MB) is not strictly necessary in that it doesn't
fix any current issues I'm aware of.

Bjorn
---

Bjorn Helgaas (9):
Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode"
Revert "x86: allocate space within a region top-down"
Revert "x86/PCI: allocate space from the end of a region, not the beginning"
Revert "PCI: allocate bus resources from the top down"
Revert "resources: support allocating space within a region from the top down"
resources: add arch hook for preventing allocation in reserved areas
x86: avoid low BIOS area when allocating address space
x86: avoid E820 regions when allocating address space
x86: avoid high BIOS area when allocating address space


Documentation/kernel-parameters.txt | 5 --
arch/x86/include/asm/e820.h | 3 +
arch/x86/kernel/Makefile | 1
arch/x86/kernel/resource.c | 48 ++++++++++++++++
arch/x86/kernel/setup.c | 1
arch/x86/pci/i386.c | 18 ++----
drivers/pci/bus.c | 81 ++-------------------------
include/linux/ioport.h | 2 -
kernel/resource.c | 104 +++--------------------------------
9 files changed, 73 insertions(+), 190 deletions(-)
create mode 100644 arch/x86/kernel/resource.c


2010-12-16 17:38:29

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 1/9] Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode"


This reverts commit 82e3e767c21fef2b1b38868e20eb4e470a1e38e3.

We're going back to considering bus resources in the order we found
them (in _CRS order, when we're using _CRS), so we don't need to
define any ordering.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

drivers/pci/bus.c | 70 ++++++++++++++++-------------------------------------
1 files changed, 21 insertions(+), 49 deletions(-)


diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 003170e..5624db8 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -64,57 +64,17 @@ void pci_bus_remove_resources(struct pci_bus *bus)
}
}

-static bool pci_bus_resource_better(struct resource *res1, bool pos1,
- struct resource *res2, bool pos2)
-{
- /* If exactly one is positive decode, always prefer that one */
- if (pos1 != pos2)
- return pos1 ? true : false;
-
- /* Prefer the one that contains the highest address */
- if (res1->end != res2->end)
- return (res1->end > res2->end) ? true : false;
-
- /* Otherwise, prefer the one with highest "center of gravity" */
- if (res1->start != res2->start)
- return (res1->start > res2->start) ? true : false;
-
- /* Otherwise, choose one arbitrarily (but consistently) */
- return (res1 > res2) ? true : false;
-}
-
-static bool pci_bus_resource_positive(struct pci_bus *bus, struct resource *res)
-{
- struct pci_bus_resource *bus_res;
-
- /*
- * This relies on the fact that pci_bus.resource[] refers to P2P or
- * CardBus bridge base/limit registers, which are always positively
- * decoded. The pci_bus.resources list contains host bridge or
- * subtractively decoded resources.
- */
- list_for_each_entry(bus_res, &bus->resources, list) {
- if (bus_res->res == res)
- return (bus_res->flags & PCI_SUBTRACTIVE_DECODE) ?
- false : true;
- }
- return true;
-}
-
/*
- * Find the next-best bus resource after the cursor "res". If the cursor is
- * NULL, return the best resource. "Best" means that we prefer positive
- * decode regions over subtractive decode, then those at higher addresses.
+ * Find the highest-address bus resource below the cursor "res". If the
+ * cursor is NULL, return the highest resource.
*/
static struct resource *pci_bus_find_resource_prev(struct pci_bus *bus,
unsigned int type,
struct resource *res)
{
- bool res_pos, r_pos, prev_pos = false;
struct resource *r, *prev = NULL;
int i;

- res_pos = pci_bus_resource_positive(bus, res);
pci_bus_for_each_resource(bus, r, i) {
if (!r)
continue;
@@ -122,14 +82,26 @@ static struct resource *pci_bus_find_resource_prev(struct pci_bus *bus,
if ((r->flags & IORESOURCE_TYPE_BITS) != type)
continue;

- r_pos = pci_bus_resource_positive(bus, r);
- if (!res || pci_bus_resource_better(res, res_pos, r, r_pos)) {
- if (!prev || pci_bus_resource_better(r, r_pos,
- prev, prev_pos)) {
- prev = r;
- prev_pos = r_pos;
- }
+ /* If this resource is at or past the cursor, skip it */
+ if (res) {
+ if (r == res)
+ continue;
+ if (r->end > res->end)
+ continue;
+ if (r->end == res->end && r->start > res->start)
+ continue;
}
+
+ if (!prev)
+ prev = r;
+
+ /*
+ * A small resource is higher than a large one that ends at
+ * the same address.
+ */
+ if (r->end > prev->end ||
+ (r->end == prev->end && r->start > prev->start))
+ prev = r;
}

return prev;

2010-12-16 17:38:35

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 3/9] Revert "x86/PCI: allocate space from the end of a region, not the beginning"


This reverts commit dc9887dc02e37bcf83f4e792aa14b07782ef54cf.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/pci/i386.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)


diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index c4bb261..8379c2c 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -65,21 +65,16 @@ pcibios_align_resource(void *data, const struct resource *res,
resource_size_t size, resource_size_t align)
{
struct pci_dev *dev = data;
- resource_size_t start = round_down(res->end - size + 1, align);
+ resource_size_t start = res->start;

if (res->flags & IORESOURCE_IO) {
-
- /*
- * If we're avoiding ISA aliases, the largest contiguous I/O
- * port space is 256 bytes. Clearing bits 9 and 10 preserves
- * all 256-byte and smaller alignments, so the result will
- * still be correctly aligned.
- */
- if (!skip_isa_ioresource_align(dev))
- start &= ~0x300;
+ if (skip_isa_ioresource_align(dev))
+ return start;
+ if (start & 0x300)
+ start = (start + 0x3ff) & ~0x3ff;
} else if (res->flags & IORESOURCE_MEM) {
if (start < BIOS_END)
- start = res->end; /* fail; no space */
+ start = BIOS_END;
}
return start;
}

2010-12-16 17:38:46

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 5/9] Revert "resources: support allocating space within a region from the top down"


This reverts commit e7f8567db9a7f6b3151b0b275e245c1cef0d9c70.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

Documentation/kernel-parameters.txt | 5 --
include/linux/ioport.h | 1
kernel/resource.c | 98 +----------------------------------
3 files changed, 4 insertions(+), 100 deletions(-)


diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index cdd2a6e..8b61c93 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2175,11 +2175,6 @@ and is between 256 and 4096 characters. It is defined in the file
reset_devices [KNL] Force drivers to reset the underlying device
during initialization.

- resource_alloc_from_bottom
- Allocate new resources from the beginning of available
- space, not the end. If you need to use this, please
- report a bug.
-
resume= [SWSUSP]
Specify the partition device for software suspend

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index d377ea8..b227902 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -112,7 +112,6 @@ struct resource_list {
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
-extern int resource_alloc_from_bottom;

extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
extern int request_resource(struct resource *root, struct resource *new);
diff --git a/kernel/resource.c b/kernel/resource.c
index 9fad33e..560659f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -40,23 +40,6 @@ EXPORT_SYMBOL(iomem_resource);

static DEFINE_RWLOCK(resource_lock);

-/*
- * By default, we allocate free space bottom-up. The architecture can request
- * top-down by clearing this flag. The user can override the architecture's
- * choice with the "resource_alloc_from_bottom" kernel boot option, but that
- * should only be a debugging tool.
- */
-int resource_alloc_from_bottom = 1;
-
-static __init int setup_alloc_from_bottom(char *s)
-{
- printk(KERN_INFO
- "resource: allocating from bottom-up; please report a bug\n");
- resource_alloc_from_bottom = 1;
- return 0;
-}
-early_param("resource_alloc_from_bottom", setup_alloc_from_bottom);
-
static void *r_next(struct seq_file *m, void *v, loff_t *pos)
{
struct resource *p = v;
@@ -397,74 +380,7 @@ static bool resource_contains(struct resource *res1, struct resource *res2)
}

/*
- * Find the resource before "child" in the sibling list of "root" children.
- */
-static struct resource *find_sibling_prev(struct resource *root, struct resource *child)
-{
- struct resource *this;
-
- for (this = root->child; this; this = this->sibling)
- if (this->sibling == child)
- return this;
-
- return NULL;
-}
-
-/*
- * Find empty slot in the resource tree given range and alignment.
- * This version allocates from the end of the root resource first.
- */
-static int find_resource_from_top(struct resource *root, struct resource *new,
- resource_size_t size, resource_size_t min,
- resource_size_t max, resource_size_t align,
- resource_size_t (*alignf)(void *,
- const struct resource *,
- resource_size_t,
- resource_size_t),
- void *alignf_data)
-{
- struct resource *this;
- struct resource tmp, avail, alloc;
-
- tmp.start = root->end;
- tmp.end = root->end;
-
- this = find_sibling_prev(root, NULL);
- for (;;) {
- if (this) {
- if (this->end < root->end)
- tmp.start = this->end + 1;
- } else
- tmp.start = root->start;
-
- resource_clip(&tmp, min, max);
-
- /* Check for overflow after ALIGN() */
- avail = *new;
- avail.start = ALIGN(tmp.start, align);
- avail.end = tmp.end;
- if (avail.start >= tmp.start) {
- alloc.start = alignf(alignf_data, &avail, size, align);
- alloc.end = alloc.start + size - 1;
- if (resource_contains(&avail, &alloc)) {
- new->start = alloc.start;
- new->end = alloc.end;
- return 0;
- }
- }
-
- if (!this || this->start == root->start)
- break;
-
- tmp.end = this->start - 1;
- this = find_sibling_prev(root, this);
- }
- return -EBUSY;
-}
-
-/*
* Find empty slot in the resource tree given range and alignment.
- * This version allocates from the beginning of the root resource first.
*/
static int find_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
@@ -480,15 +396,14 @@ static int find_resource(struct resource *root, struct resource *new,

tmp.start = root->start;
/*
- * Skip past an allocated resource that starts at 0, since the
- * assignment of this->start - 1 to tmp->end below would cause an
- * underflow.
+ * Skip past an allocated resource that starts at 0, since the assignment
+ * of this->start - 1 to tmp->end below would cause an underflow.
*/
if (this && this->start == 0) {
tmp.start = this->end + 1;
this = this->sibling;
}
- for (;;) {
+ for(;;) {
if (this)
tmp.end = this->start - 1;
else
@@ -509,10 +424,8 @@ static int find_resource(struct resource *root, struct resource *new,
return 0;
}
}
-
if (!this)
break;
-
tmp.start = this->end + 1;
this = this->sibling;
}
@@ -545,10 +458,7 @@ int allocate_resource(struct resource *root, struct resource *new,
alignf = simple_align_resource;

write_lock(&resource_lock);
- if (resource_alloc_from_bottom)
- err = find_resource(root, new, size, min, max, align, alignf, alignf_data);
- else
- err = find_resource_from_top(root, new, size, min, max, align, alignf, alignf_data);
+ err = find_resource(root, new, size, min, max, align, alignf, alignf_data);
if (err >= 0 && __request_resource(root, new))
err = -EBUSY;
write_unlock(&resource_lock);

2010-12-16 17:38:53

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 6/9] resources: add arch hook for preventing allocation in reserved areas


This adds arch_remove_reservations(), which an arch can implement if it
needs to protect part of the address space from allocation.

Sometimes that can be done by just putting a region in the resource tree,
but there are cases where that doesn't work well. For example, x86 BIOS
E820 reservations are not related to devices, so they may overlap part of,
all of, or more than a device resource, so they may not end up at the
correct spot in the resource tree.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

include/linux/ioport.h | 1 +
kernel/resource.c | 6 ++++++
2 files changed, 7 insertions(+), 0 deletions(-)


diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b227902..e9bb22c 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -123,6 +123,7 @@ extern void reserve_region_with_split(struct resource *root,
extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
extern int insert_resource(struct resource *parent, struct resource *new);
extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
+extern void arch_remove_reservations(struct resource *avail);
extern int allocate_resource(struct resource *root, struct resource *new,
resource_size_t size, resource_size_t min,
resource_size_t max, resource_size_t align,
diff --git a/kernel/resource.c b/kernel/resource.c
index 560659f..798e2fa 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -357,6 +357,10 @@ int __weak page_is_ram(unsigned long pfn)
return walk_system_ram_range(pfn, 1, NULL, __is_ram) == 1;
}

+void __weak arch_remove_reservations(struct resource *avail)
+{
+}
+
static resource_size_t simple_align_resource(void *data,
const struct resource *avail,
resource_size_t size,
@@ -394,6 +398,7 @@ static int find_resource(struct resource *root, struct resource *new,
struct resource *this = root->child;
struct resource tmp = *new, avail, alloc;

+ tmp.flags = new->flags;
tmp.start = root->start;
/*
* Skip past an allocated resource that starts at 0, since the assignment
@@ -410,6 +415,7 @@ static int find_resource(struct resource *root, struct resource *new,
tmp.end = root->end;

resource_clip(&tmp, min, max);
+ arch_remove_reservations(&tmp);

/* Check for overflow after ALIGN() */
avail = *new;

2010-12-16 17:38:42

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 4/9] Revert "PCI: allocate bus resources from the top down"


This reverts commit b126b4703afa4010b161784a43650337676dd03b.

We're going back to the old behavior of allocating from bus resources
in _CRS order.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

drivers/pci/bus.c | 53 +++++------------------------------------------------
1 files changed, 5 insertions(+), 48 deletions(-)


diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 5624db8..69546e9 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -64,49 +64,6 @@ void pci_bus_remove_resources(struct pci_bus *bus)
}
}

-/*
- * Find the highest-address bus resource below the cursor "res". If the
- * cursor is NULL, return the highest resource.
- */
-static struct resource *pci_bus_find_resource_prev(struct pci_bus *bus,
- unsigned int type,
- struct resource *res)
-{
- struct resource *r, *prev = NULL;
- int i;
-
- pci_bus_for_each_resource(bus, r, i) {
- if (!r)
- continue;
-
- if ((r->flags & IORESOURCE_TYPE_BITS) != type)
- continue;
-
- /* If this resource is at or past the cursor, skip it */
- if (res) {
- if (r == res)
- continue;
- if (r->end > res->end)
- continue;
- if (r->end == res->end && r->start > res->start)
- continue;
- }
-
- if (!prev)
- prev = r;
-
- /*
- * A small resource is higher than a large one that ends at
- * the same address.
- */
- if (r->end > prev->end ||
- (r->end == prev->end && r->start > prev->start))
- prev = r;
- }
-
- return prev;
-}
-
/**
* pci_bus_alloc_resource - allocate a resource from a parent bus
* @bus: PCI bus
@@ -132,10 +89,9 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
resource_size_t),
void *alignf_data)
{
- int ret = -ENOMEM;
+ int i, ret = -ENOMEM;
struct resource *r;
resource_size_t max = -1;
- unsigned int type = res->flags & IORESOURCE_TYPE_BITS;

type_mask |= IORESOURCE_IO | IORESOURCE_MEM;

@@ -143,9 +99,10 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
if (!(res->flags & IORESOURCE_MEM_64))
max = PCIBIOS_MAX_MEM_32;

- /* Look for space at highest addresses first */
- r = pci_bus_find_resource_prev(bus, type, NULL);
- for ( ; r; r = pci_bus_find_resource_prev(bus, type, r)) {
+ pci_bus_for_each_resource(bus, r, i) {
+ if (!r)
+ continue;
+
/* type_mask must match */
if ((res->flags ^ r->flags) & type_mask)
continue;

2010-12-16 17:38:59

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 7/9] x86: avoid low BIOS area when allocating address space


This implements arch_remove_reservations() so allocate_resource() can
avoid any arch-specific reserved areas. This currently just avoids the
BIOS area (the first 1MB), but could be used for E820 reserved areas if
that turns out to be necessary.

We previously avoided this area in pcibios_align_resource(). This patch
moves the test from that PCI-specific path to a generic path, so *all*
resource allocations will avoid this area.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/resource.c | 11 +++++++++++
arch/x86/pci/i386.c | 3 ---
3 files changed, 12 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/kernel/resource.c


diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9e13763..1e99475 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -45,6 +45,7 @@ obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
obj-y += tsc.o io_delay.o rtc.o
obj-y += pci-iommu_table.o
+obj-y += resource.o

obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o
obj-y += process.o
diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
new file mode 100644
index 0000000..407a900
--- /dev/null
+++ b/arch/x86/kernel/resource.c
@@ -0,0 +1,11 @@
+#include <linux/ioport.h>
+#include <asm/e820.h>
+
+void arch_remove_reservations(struct resource *avail)
+{
+ /* Trim out BIOS area (low 1MB) */
+ if (avail->flags & IORESOURCE_MEM) {
+ if (avail->start < BIOS_END)
+ avail->start = BIOS_END;
+ }
+}
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 8379c2c..b1805b7 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -72,9 +72,6 @@ pcibios_align_resource(void *data, const struct resource *res,
return start;
if (start & 0x300)
start = (start + 0x3ff) & ~0x3ff;
- } else if (res->flags & IORESOURCE_MEM) {
- if (start < BIOS_END)
- start = BIOS_END;
}
return start;
}

2010-12-16 17:39:07

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 9/9] x86: avoid high BIOS area when allocating address space


This prevents allocation of the last 2MB before 4GB.

The experiment described here shows Windows 7 ignoring the last 1MB:
https://bugzilla.kernel.org/show_bug.cgi?id=23542#c27

This patch ignores the top 2MB instead of just 1MB because H. Peter Anvin
says "There will be ROM at the top of the 32-bit address space; it's a fact
of the architecture, and on at least older systems it was common to have a
shadow 1 MiB below."

Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/include/asm/e820.h | 3 +++
arch/x86/kernel/resource.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletions(-)


diff --git a/arch/x86/include/asm/e820.h b/arch/x86/include/asm/e820.h
index 5be1542..e99d55d 100644
--- a/arch/x86/include/asm/e820.h
+++ b/arch/x86/include/asm/e820.h
@@ -72,6 +72,9 @@ struct e820map {
#define BIOS_BEGIN 0x000a0000
#define BIOS_END 0x00100000

+#define BIOS_ROM_BASE 0xffe00000
+#define BIOS_ROM_END 0xffffffff
+
#ifdef __KERNEL__
/* see comment in arch/x86/kernel/e820.c */
extern struct e820map e820;
diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 89638af..2a26819 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -37,10 +37,11 @@ static void remove_e820_regions(struct resource *avail)

void arch_remove_reservations(struct resource *avail)
{
- /* Trim out BIOS area (low 1MB) and E820 regions */
+ /* Trim out BIOS areas (low 1MB and high 2MB) and E820 regions */
if (avail->flags & IORESOURCE_MEM) {
if (avail->start < BIOS_END)
avail->start = BIOS_END;
+ resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);

remove_e820_regions(avail);
}

2010-12-16 17:39:24

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 8/9] x86: avoid E820 regions when allocating address space


When we allocate address space, e.g., to assign it to a PCI device, don't
allocate anything mentioned in the BIOS E820 memory map.

On recent machines (2008 and newer), we assign PCI resources from the
windows described by the ACPI PCI host bridge _CRS. On many Dell
machines, these windows overlap some E820 reserved areas, e.g.,

BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]

If we put devices at 0xbff00000, they don't work, probably because
that's really RAM, not I/O memory. This patch prevents that by removing
the 0xbfe4dc00-0xbfffffff area from the "available" resource.

I'm not very happy with this solution because Windows solves the problem
differently (it seems to ignore E820 reserved areas and it allocates
top-down instead of bottom-up; details at comment 45 of the bugzilla
below). That means we're vulnerable to BIOS defects that Windows would not
trip over. For example, if BIOS described a device in ACPI but didn't
mention it in E820, Windows would work fine but Linux would fail.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228
Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/kernel/resource.c | 38 +++++++++++++++++++++++++++++++++++++-
1 files changed, 37 insertions(+), 1 deletions(-)


diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 407a900..89638af 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -1,11 +1,47 @@
#include <linux/ioport.h>
#include <asm/e820.h>

+static void resource_clip(struct resource *res, resource_size_t start,
+ resource_size_t end)
+{
+ resource_size_t low = 0, high = 0;
+
+ if (res->end < start || res->start > end)
+ return; /* no conflict */
+
+ if (res->start < start)
+ low = start - res->start;
+
+ if (res->end > end)
+ high = res->end - end;
+
+ /* Keep the area above or below the conflict, whichever is larger */
+ if (low > high)
+ res->end = start - 1;
+ else
+ res->start = end + 1;
+}
+
+static void remove_e820_regions(struct resource *avail)
+{
+ int i;
+ struct e820entry *entry;
+
+ for (i = 0; i < e820.nr_map; i++) {
+ entry = &e820.map[i];
+
+ resource_clip(avail, entry->addr,
+ entry->addr + entry->size - 1);
+ }
+}
+
void arch_remove_reservations(struct resource *avail)
{
- /* Trim out BIOS area (low 1MB) */
+ /* Trim out BIOS area (low 1MB) and E820 regions */
if (avail->flags & IORESOURCE_MEM) {
if (avail->start < BIOS_END)
avail->start = BIOS_END;
+
+ remove_e820_regions(avail);
}
}

2010-12-16 17:40:07

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH 2/9] Revert "x86: allocate space within a region top-down"


This reverts commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9.

Signed-off-by: Bjorn Helgaas <[email protected]>
---

arch/x86/kernel/setup.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)


diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 21c6746..85268f8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -769,7 +769,6 @@ void __init setup_arch(char **cmdline_p)

x86_init.oem.arch_setup();

- resource_alloc_from_bottom = 0;
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
setup_memory_map();
parse_setup_data();

2010-12-16 22:01:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
>
> So I think the best band-aid for now is to revert to bottom-up allocation
> and do a better job of avoiding E820 reserved areas. ?Obviously, it's not
> complete because we could still trip over ACPI devices (as in 23802), but
> that's a problem we've always had and it's not as likely if we go bottom-up.

Ok, ack for this series. Have we got testers for it? In particular,
have the people who have seen regressions been notified of this new
series?

That said, let's just try to merge it quickly so that we get maximal
coverage before the holiday season starts. Jesse, it makes sense
through your tree, but if you'd rather me take it directly, I can do
that.

Linus

2010-12-16 22:03:29

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

On Thu, 16 Dec 2010 13:59:33 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
> >
> > So I think the best band-aid for now is to revert to bottom-up allocation
> > and do a better job of avoiding E820 reserved areas.  Obviously, it's not
> > complete because we could still trip over ACPI devices (as in 23802), but
> > that's a problem we've always had and it's not as likely if we go bottom-up.
>
> Ok, ack for this series. Have we got testers for it? In particular,
> have the people who have seen regressions been notified of this new
> series?
>
> That said, let's just try to merge it quickly so that we get maximal
> coverage before the holiday season starts. Jesse, it makes sense
> through your tree, but if you'd rather me take it directly, I can do
> that.

I'll pull it into my tree and send you a pull req tomorrow if that's
ok. I have a couple of fixes besides this, but I have to drop Bjorn's
earlier series...

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2010-12-16 22:12:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 6/9] resources: add arch hook for preventing allocation in reserved areas

On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
>
> This adds arch_remove_reservations(), which an arch can implement if it
> needs to protect part of the address space from allocation.

Oh, I just realized - we've had gcc (or maybe "as") bugs when the weak
symbol is in the same compilation unit as the caller, and some part of
the toolchain just ends up short-circuiting it (either gcc inlines it,
or maybe it was that as resolves it early). Making the "weak" part not
work, because it binds strongly.

It may be that we don't support those gcc versions any more, but I
thought I'd bring the issue up. It's safer to put the weak functions
in some other file if possible.

Linus

2010-12-16 22:56:18

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 6/9] resources: add arch hook for preventing allocation in reserved areas

On Thursday, December 16, 2010 03:03:21 pm Linus Torvalds wrote:
> On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
> >
> > This adds arch_remove_reservations(), which an arch can implement if it
> > needs to protect part of the address space from allocation.
>
> Oh, I just realized - we've had gcc (or maybe "as") bugs when the weak
> symbol is in the same compilation unit as the caller, and some part of
> the toolchain just ends up short-circuiting it (either gcc inlines it,
> or maybe it was that as resolves it early). Making the "weak" part not
> work, because it binds strongly.
>
> It may be that we don't support those gcc versions any more, but I
> thought I'd bring the issue up. It's safer to put the weak functions
> in some other file if possible.

I'd be glad to repost and do that if necessary. Does anybody
know for sure if it is?

I did find these cases where we have a weak symbol definition in
the same file as the caller, as well as another version elsewhere,
so maybe those toolchains are indeed no longer supported?

arch_disable_smp_support()
defined and called in init/main.c (used for "nosmp" arg)
also defined in arch/x86/kernel/apic/io_apic.c

smp_setup_processor_id()
defined and called in init/main.c
also defined in arch/s390/kernel/smp.c
also defined in arch/sparc/kernel/smp_64.c

arch_enable_nonboot_cpus_begin()
defined and called in kernel/cpu.c
also defined in arch/x86/kernel/smpboot.c

arch_irq_work_raise()
defined and called in kernel/irq_work.c
also defined in arch/sparc/kernel/pcr.c
also defined in arch/x86/kernel/irq_work.c

arch_jump_label_text_poke_early()
defined and called in kernel/jump_label.c
also defined in arch/sparc/kernel/jump_label.c
also defined in arch/x86/kernel/jump_label.c

arch_deref_entry_point()
defined and called in kernel/kprobes.c
also defined in arch/ia64/kernel/kprobes.c
also defined in arch/powerpc/kernel/kprobes.c

arch_scale_smt_power()
defined and called in kernel/sched_fair.c
also defined in arch/x86/kernel/cpu/sched.c

memblock_nid_range()
defined and called in mm/memblock.c
also defined in arch/sparc/mm/init_64.c

If this toolchain bug is still a problem, I would think all of the
above would be defects, so my guess is that we don't have to worry
about it any more.

Bjorn

2010-12-16 23:17:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 6/9] resources: add arch hook for preventing allocation in reserved areas

On Thu, Dec 16, 2010 at 2:56 PM, Bjorn Helgaas <[email protected]> wrote:
>
> I did find these cases where we have a weak symbol definition in
> the same file as the caller, as well as another version elsewhere,
> so maybe those toolchains are indeed no longer supported?

Ok, so we already depend on it not breaking. I forget which
gcc/binutils version we had trouble at, so I guess we can forget about
this one.

Linus

2010-12-17 03:00:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

On 12/16/2010 01:59 PM, Linus Torvalds wrote:
> On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
>>
>> So I think the best band-aid for now is to revert to bottom-up allocation
>> and do a better job of avoiding E820 reserved areas. Obviously, it's not
>> complete because we could still trip over ACPI devices (as in 23802), but
>> that's a problem we've always had and it's not as likely if we go bottom-up.
>
> Ok, ack for this series. Have we got testers for it? In particular,
> have the people who have seen regressions been notified of this new
> series?
>

Acked-by: H. Peter Anvin <[email protected]>

... for the series as well. For what it's worth.

-hpa

2010-12-17 16:01:09

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

On Thursday, December 16, 2010 02:59:33 pm Linus Torvalds wrote:
> On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
> >
> > So I think the best band-aid for now is to revert to bottom-up allocation
> > and do a better job of avoiding E820 reserved areas. Obviously, it's not
> > complete because we could still trip over ACPI devices (as in 23802), but
> > that's a problem we've always had and it's not as likely if we go bottom-up.
>
> Ok, ack for this series. Have we got testers for it? In particular,
> have the people who have seen regressions been notified of this new
> series?

I put the patches in these bugzillas:

https://bugzilla.kernel.org/show_bug.cgi?id=16228 (Dell T3500)
https://bugzilla.kernel.org/show_bug.cgi?id=23132 (Dell 1545)
https://bugzilla.kernel.org/show_bug.cgi?id=23542 (HP 2530p)
https://bugzilla.kernel.org/show_bug.cgi?id=23802 (PCI / TPM conflict)

Bernard Wiedemann reported success for 16228 on a Dell T3500. No
news yet on the others.

For this one:

https://bugzilla.kernel.org/show_bug.cgi?id=23332 (HP nx6325)

I have an nx6325 and tested it myself.

Bjorn

2010-12-19 09:50:54

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 8/9] x86: avoid E820 regions when allocating address space

On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
>
> When we allocate address space, e.g., to assign it to a PCI device, don't
> allocate anything mentioned in the BIOS E820 memory map.
>
> On recent machines (2008 and newer), we assign PCI resources from the
> windows described by the ACPI PCI host bridge _CRS. ?On many Dell
> machines, these windows overlap some E820 reserved areas, e.g.,
>
> ? ?BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> ? ?pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
>
> If we put devices at 0xbff00000, they don't work, probably because
> that's really RAM, not I/O memory. ?This patch prevents that by removing
> the 0xbfe4dc00-0xbfffffff area from the "available" resource.
>
> I'm not very happy with this solution because Windows solves the problem
> differently (it seems to ignore E820 reserved areas and it allocates
> top-down instead of bottom-up; details at comment 45 of the bugzilla
> below). ?That means we're vulnerable to BIOS defects that Windows would not
> trip over. ?For example, if BIOS described a device in ACPI but didn't
> mention it in E820, Windows would work fine but Linux would fail.
>
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228
> Signed-off-by: Bjorn Helgaas <[email protected]>
> ---
>
> ?arch/x86/kernel/resource.c | ? 38 +++++++++++++++++++++++++++++++++++++-
> ?1 files changed, 37 insertions(+), 1 deletions(-)
>
>
> diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> index 407a900..89638af 100644
> --- a/arch/x86/kernel/resource.c
> +++ b/arch/x86/kernel/resource.c
> @@ -1,11 +1,47 @@
> ?#include <linux/ioport.h>
> ?#include <asm/e820.h>
>
> +static void resource_clip(struct resource *res, resource_size_t start,
> + ? ? ? ? ? ? ? ? ? ? ? ? resource_size_t end)
> +{
> + ? ? ? resource_size_t low = 0, high = 0;
> +
> + ? ? ? if (res->end < start || res->start > end)
> + ? ? ? ? ? ? ? return; ? ? ? ? /* no conflict */
> +
> + ? ? ? if (res->start < start)
> + ? ? ? ? ? ? ? low = start - res->start;
> +
> + ? ? ? if (res->end > end)
> + ? ? ? ? ? ? ? high = res->end - end;
> +
> + ? ? ? /* Keep the area above or below the conflict, whichever is larger */
> + ? ? ? if (low > high)
> + ? ? ? ? ? ? ? res->end = start - 1;
> + ? ? ? else
> + ? ? ? ? ? ? ? res->start = end + 1;
> +}
> +
> +static void remove_e820_regions(struct resource *avail)
> +{
> + ? ? ? int i;
> + ? ? ? struct e820entry *entry;
> +
> + ? ? ? for (i = 0; i < e820.nr_map; i++) {
> + ? ? ? ? ? ? ? entry = &e820.map[i];
> +
> + ? ? ? ? ? ? ? resource_clip(avail, entry->addr,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? entry->addr + entry->size - 1);
> + ? ? ? }
> +}
> +
> ?void arch_remove_reservations(struct resource *avail)
> ?{
> - ? ? ? /* Trim out BIOS area (low 1MB) */
> + ? ? ? /* Trim out BIOS area (low 1MB) and E820 regions */
> ? ? ? ?if (avail->flags & IORESOURCE_MEM) {
> ? ? ? ? ? ? ? ?if (avail->start < BIOS_END)
> ? ? ? ? ? ? ? ? ? ? ? ?avail->start = BIOS_END;
> +
> + ? ? ? ? ? ? ? remove_e820_regions(avail);
> ? ? ? ?}
> ?}

that looks expensive. it will keep going through e820 tables...

but e820 should have been reserved in resource tree...

So why we need to check e820 again?

Yinghai

2010-12-19 14:31:56

by David John

[permalink] [raw]
Subject: Re: [PATCH 0/9] PCI: revert to allocating bottom-up, avoid E820 areas

Hi Bjorn,

On 12/17/2010 09:30 PM, Bjorn Helgaas wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=23132 (Dell 1545)

I can confirm that the latest PCI pull fixes this bug.

Regards,
David.

2010-12-19 23:33:25

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 8/9] x86: avoid E820 regions when allocating address space

On Sun, Dec 19, 2010 at 01:50:50AM -0800, Yinghai Lu wrote:
> On Thu, Dec 16, 2010 at 9:38 AM, Bjorn Helgaas <[email protected]> wrote:
> >
> > When we allocate address space, e.g., to assign it to a PCI device, don't
> > allocate anything mentioned in the BIOS E820 memory map.
> >
> > On recent machines (2008 and newer), we assign PCI resources from the
> > windows described by the ACPI PCI host bridge _CRS. ?On many Dell
> > machines, these windows overlap some E820 reserved areas, e.g.,
> >
> > ? ?BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> > ? ?pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
> >
> > If we put devices at 0xbff00000, they don't work, probably because
> > that's really RAM, not I/O memory. ?This patch prevents that by removing
> > the 0xbfe4dc00-0xbfffffff area from the "available" resource.
> >
> > I'm not very happy with this solution because Windows solves the problem
> > differently (it seems to ignore E820 reserved areas and it allocates
> > top-down instead of bottom-up; details at comment 45 of the bugzilla
> > below). ?That means we're vulnerable to BIOS defects that Windows would not
> > trip over. ?For example, if BIOS described a device in ACPI but didn't
> > mention it in E820, Windows would work fine but Linux would fail.
> >
> > Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > ---
> >
> > ?arch/x86/kernel/resource.c | ? 38 +++++++++++++++++++++++++++++++++++++-
> > ?1 files changed, 37 insertions(+), 1 deletions(-)
> >
> >
> > diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> > index 407a900..89638af 100644
> > --- a/arch/x86/kernel/resource.c
> > +++ b/arch/x86/kernel/resource.c
> > @@ -1,11 +1,47 @@
> > ?#include <linux/ioport.h>
> > ?#include <asm/e820.h>
> >
> > +static void resource_clip(struct resource *res, resource_size_t start,
> > + ? ? ? ? ? ? ? ? ? ? ? ? resource_size_t end)
> > +{
> > + ? ? ? resource_size_t low = 0, high = 0;
> > +
> > + ? ? ? if (res->end < start || res->start > end)
> > + ? ? ? ? ? ? ? return; ? ? ? ? /* no conflict */
> > +
> > + ? ? ? if (res->start < start)
> > + ? ? ? ? ? ? ? low = start - res->start;
> > +
> > + ? ? ? if (res->end > end)
> > + ? ? ? ? ? ? ? high = res->end - end;
> > +
> > + ? ? ? /* Keep the area above or below the conflict, whichever is larger */
> > + ? ? ? if (low > high)
> > + ? ? ? ? ? ? ? res->end = start - 1;
> > + ? ? ? else
> > + ? ? ? ? ? ? ? res->start = end + 1;
> > +}
> > +
> > +static void remove_e820_regions(struct resource *avail)
> > +{
> > + ? ? ? int i;
> > + ? ? ? struct e820entry *entry;
> > +
> > + ? ? ? for (i = 0; i < e820.nr_map; i++) {
> > + ? ? ? ? ? ? ? entry = &e820.map[i];
> > +
> > + ? ? ? ? ? ? ? resource_clip(avail, entry->addr,
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? entry->addr + entry->size - 1);
> > + ? ? ? }
> > +}
> > +
> > ?void arch_remove_reservations(struct resource *avail)
> > ?{
> > - ? ? ? /* Trim out BIOS area (low 1MB) */
> > + ? ? ? /* Trim out BIOS area (low 1MB) and E820 regions */
> > ? ? ? ?if (avail->flags & IORESOURCE_MEM) {
> > ? ? ? ? ? ? ? ?if (avail->start < BIOS_END)
> > ? ? ? ? ? ? ? ? ? ? ? ?avail->start = BIOS_END;
> > +
> > + ? ? ? ? ? ? ? remove_e820_regions(avail);
> > ? ? ? ?}
> > ?}
>
> that looks expensive. it will keep going through e820 tables...

It's expensive when we do it, but we do it very rarely, i.e., only
when we call allocate_resource().

> but e820 should have been reserved in resource tree...

E820 regions don't fit very well in the resource tree. The
tree normally contains devices, which are mutually exclusive
and fit nicely in a hierarchy.

The E820 RAM regions fit that description (they can't conflict
with anything else). But generic "reserved" regions do not.
A reserved region might cover several devices, it might cover
part of a device, it might cover a piece of RAM, or it might
not be related to a device at all. We do try to wedge those
reservations into the resource tree, but I think that's a
mistake because it doesn't work very well.

In this case, the E820 reservation *is* in the resource tree,
but we took the reasonable 0xbfe4dc00 - 0xc0000000 E820 entry
and turned it into the bogus 0xbfe4dc00-0xf7ffffff range:

bfe4dc00-f7ffffff reserved (expanded)
bff00000-f7ffffff PCI Bus 0000:00

More details here:

https://bugzilla.kernel.org/show_bug.cgi?id=16228#c45

Bjorn