2009-04-10 20:29:39

by Yinghai Lu

Subject: Re: [Bug 11103] Can't use framebuffer or vesa Xorg with two memory modules

[email protected] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11103
>
>
>
>
>
> --- Comment #16 from Yannick <[email protected]> 2009-04-10 19:13:43 ---
> Created an attachment (id=20928)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=20928)
> Boot log (from dmesg) with "debug pci=earlydump" passed to kernel 2.6.29.1
>
root cause:
with 4G installed, the BIOS puts the ACPI regions etc. right next to the hole:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
[ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
[ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
so the kernel reserves resources for 0xbffa0000 - 0xbfff0000 (ACPI),
0x100000 - 0xbffa0000 (RAM), ...

then the BIOS sets
[ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
[ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
which conflicts with those reserved resources,
so the kernel cannot reserve it.

the kernel then tries to get a range starting at 0x140000000 (above the RAM, i.e. at 5G and above 4G)
and makes both the bridge and your ATI card use it.

but the problem is that your ATI card only supports 32bit ...
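
To make the two failures concrete, here is a small sketch in C. It only restates the addresses from the log above; the struct and helper names are made up for illustration and are not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, as printed in the PCI lines of the boot log. */
struct range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* True if the two inclusive ranges share at least one address. */
static bool ranges_overlap(struct range a, struct range b)
{
    return a.start <= b.end && b.start <= a.end;
}

/* True if every address in the range can be expressed in 32 bits. */
static bool fits_in_32bit(struct range r)
{
    return r.end <= UINT32_MAX;
}

/* Values copied from the log above; e820 end addresses made inclusive. */
static const struct range bridge_pref       = { 0xbdf00000ULL, 0xddefffffULL };
static const struct range acpi_and_reserved = { 0xbffa0000ULL, 0xbfffffffULL };
static const struct range bios_bar0         = { 0xc0000000ULL, 0xcfffffffULL };
static const struct range kernel_choice     = { 0x140000000ULL, 0x14fffffffULL };
```

The bridge window set by the BIOS overlaps the ACPI/reserved area, and the 256M range the kernel picked instead does not fit in 32 bits, while the BIOS value for the card itself did.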

solution:
1. get an updated BIOS?
2. kernel side:
a. have the reserve step return a usable (shrunk) range, and write it back to the pci bridge?
b. or add something like pci=earlyset to hack the pci conf at an early stage.

also on the kernel side, we should not assign a 64bit range to a pci device that only takes 32bit pref; it messes up the pci conf.

01:00.0 VGA compatible controller: ATI Technologies Inc Mobility Radeon HD 3400 Series (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 19d3
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 10
Region 0: Memory at 140000000 (32-bit, prefetchable) [size=256M]
Region 1: I/O ports at 9000 [size=256]
Region 2: Memory at fddf0000 (32-bit, non-prefetchable) [size=64K]
Expansion ROM at fddc0000 [disabled] [size=128K]
..
00: 02 10 c4 95 07 01 10 00 00 00 00 03 08 00 00 00
10: 08 00 00 40 01 90 00 00 00 00 df fd 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 d3 19
30: 00 00 dc fd 50 00 00 00 00 00 00 00 0a 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 d3 19
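
The dump shows why lspci reports 140000000 for a 32-bit BAR: the register at offset 0x10 can only hold the low 32 bits. A minimal decoding sketch, assuming the standard PCI BAR bit layout (bit 0 = I/O, bits 2:1 = memory type, bit 3 = prefetchable); the struct and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool is_io;
    bool is_64bit;
    bool prefetchable;
    uint32_t base;      /* address bits held in this one register */
};

/* Assemble a little-endian dword from four config-space bytes. */
static uint32_t le32(const uint8_t b[4])
{
    return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Decode one Base Address Register. */
static struct bar_info decode_bar(uint32_t raw)
{
    struct bar_info info = { 0 };

    if (raw & 0x1) {                            /* bit 0 set: I/O space */
        info.is_io = true;
        info.base = raw & ~0x3u;
        return info;
    }
    info.is_64bit = ((raw >> 1) & 0x3) == 0x2;  /* bits 2:1 == 10b */
    info.prefetchable = (raw & 0x8) != 0;       /* bit 3 */
    info.base = raw & ~0xfu;
    return info;
}

/* Offsets 0x10 and 0x14 of the dump above. */
static const uint8_t bar0_bytes[4] = { 0x08, 0x00, 0x00, 0x40 };
static const uint8_t bar1_bytes[4] = { 0x01, 0x90, 0x00, 0x00 };
```

BAR0 decodes as 32-bit, prefetchable memory at 0x40000000, which is 0x140000000 with the top bit cut off; BAR1 decodes as I/O at 0x9000, matching the lspci text.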


YH


2009-04-11 03:31:23

by Yinghai Lu

Subject: Re: [Bug 11103] Can't use framebuffer or vesa Xorg with two memory modules

please check

[PATCH] pci: don't assume pref mem io are 64bit

Impact: fix bug with some devices

seen on one system with 4g installed (there is a 1g hole)

with 4G installed, the BIOS puts the ACPI regions etc. right next to the hole:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
[ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
[ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
so the kernel reserves resources for 0xbffa0000 - 0xbfff0000 (ACPI),
0x100000 - 0xbffa0000 (RAM), ...

and the BIOS sets
[ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
[ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
which conflicts with those reserved resources, so the kernel cannot reserve it.

the kernel then tries to get a range starting at 0x140000000 (above the RAM, i.e. at 5G and above 4G)
and makes both the bridge and the ATI card use it.

but the problem is that the ATI card only supports 32bit ...

we should not assign a 64bit range to a pci device that only takes 32bit pref

so set PCI_PREF_RANGE_TYPE_64 in the 64bit resources of a pci_device (besides the pci_bridge),
and make the bus resource have that bit set only when all devices under it support
64bit pref mem,
then use that flag to decide the max limit for find/request.

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/bus.c | 8 +++++++-
drivers/pci/probe.c | 8 ++++++--
drivers/pci/setup-bus.c | 38 ++++++++++++++++++++++++++++----------
3 files changed, 41 insertions(+), 13 deletions(-)

Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -41,9 +41,15 @@ pci_bus_alloc_resource(struct pci_bus *b
void *alignf_data)
{
int i, ret = -ENOMEM;
+ resource_size_t max = -1;

type_mask |= IORESOURCE_IO | IORESOURCE_MEM;

+ /* don't allocate too high if the pref mem doesn't support 64bit*/
+ if ((res->flags & (IORESOURCE_PREFETCH | PCI_PREF_RANGE_TYPE_64)) ==
+ IORESOURCE_PREFETCH)
+ max = 0xffffffff;
+
for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
struct resource *r = bus->resource[i];
if (!r)
@@ -62,7 +68,7 @@ pci_bus_alloc_resource(struct pci_bus *b
/* Ok, try it out.. */
ret = allocate_resource(r, res, size,
r->start ? : min,
- -1, align,
+ max, align,
alignf, alignf_data);
if (ret == 0)
break;
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -193,7 +193,7 @@ int __pci_read_base(struct pci_dev *dev,
res->flags |= pci_calc_resource_flags(l) | IORESOURCE_SIZEALIGN;
if (type == pci_bar_io) {
l &= PCI_BASE_ADDRESS_IO_MASK;
- mask = PCI_BASE_ADDRESS_IO_MASK & 0xffff;
+ mask = PCI_BASE_ADDRESS_IO_MASK & IO_SPACE_LIMIT;
} else {
l &= PCI_BASE_ADDRESS_MEM_MASK;
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
@@ -237,6 +237,9 @@ int __pci_read_base(struct pci_dev *dev,
dev_printk(KERN_DEBUG, &dev->dev,
"reg %x 64bit mmio: %pR\n", pos, res);
}
+
+ if (res->flags & IORESOURCE_PREFETCH)
+ res->flags |= PCI_PREF_RANGE_TYPE_64;
} else {
sz = pci_size(l, sz, mask);

@@ -362,7 +365,8 @@ void __devinit pci_read_bridge_bases(str
}
}
if (base <= limit) {
- res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ res->flags = (mem_base_lo & PCI_PREF_RANGE_TYPE_MASK) |
+ IORESOURCE_MEM | IORESOURCE_PREFETCH;
res->start = base;
res->end = limit + 0xfffff;
dev_printk(KERN_DEBUG, &dev->dev, "bridge %sbit mmio pref: %pR\n",
Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -143,6 +143,7 @@ static void pci_setup_bridge(struct pci_
struct pci_dev *bridge = bus->self;
struct pci_bus_region region;
u32 l, bu, lu, io_upper16;
+ int pref_mem64;

if (pci_is_enabled(bridge))
return;
@@ -198,16 +199,22 @@ static void pci_setup_bridge(struct pci_
pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, 0);

/* Set up PREF base/limit. */
+ pref_mem64 = 0;
bu = lu = 0;
pcibios_resource_to_bus(bridge, &region, bus->resource[2]);
if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
+ int width = 8;
l = (region.start >> 16) & 0xfff0;
l |= region.end & 0xfff00000;
- bu = upper_32_bits(region.start);
- lu = upper_32_bits(region.end);
- dev_info(&bridge->dev, " PREFETCH window: %#016llx-%#016llx\n",
- (unsigned long long)region.start,
- (unsigned long long)region.end);
+ if (bus->resource[2]->flags & PCI_PREF_RANGE_TYPE_64) {
+ pref_mem64 = 1;
+ bu = upper_32_bits(region.start);
+ lu = upper_32_bits(region.end);
+ width = 16;
+ }
+ dev_info(&bridge->dev, " PREFETCH window: %#0*llx-%#0*llx\n",
+ width, (unsigned long long)region.start,
+ width, (unsigned long long)region.end);
}
else {
l = 0x0000fff0;
@@ -215,9 +222,11 @@ static void pci_setup_bridge(struct pci_
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);

- /* Set the upper 32 bits of PREF base & limit. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
- pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ if (pref_mem64) {
+ /* Set the upper 32 bits of PREF base & limit. */
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
+ pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ }

pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
@@ -255,8 +264,11 @@ static void pci_bridge_check_ranges(stru
pci_read_config_dword(bridge, PCI_PREF_MEMORY_BASE, &pmem);
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, 0x0);
}
- if (pmem)
+ if (pmem) {
b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if ((pmem & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64)
+ b_res[2].flags |= PCI_PREF_RANGE_TYPE_64;
+ }
}

/* Helper function for sizing routines: find first available
@@ -336,6 +348,7 @@ static int pbus_size_mem(struct pci_bus
resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */
int order, max_order;
struct resource *b_res = find_free_bus_resource(bus, type);
+ unsigned int mem64_mask = 0;

if (!b_res)
return 0;
@@ -344,6 +357,9 @@ static int pbus_size_mem(struct pci_bus
max_order = 0;
size = 0;

+ if (type & IORESOURCE_PREFETCH)
+ mem64_mask = PCI_PREF_RANGE_TYPE_64;
+
list_for_each_entry(dev, &bus->devices, bus_list) {
int i;

@@ -372,6 +388,8 @@ static int pbus_size_mem(struct pci_bus
aligns[order] += align;
if (order > max_order)
max_order = order;
+ if (r->flags & IORESOURCE_PREFETCH)
+ mem64_mask &= r->flags & PCI_PREF_RANGE_TYPE_64;
}
}

@@ -395,7 +413,7 @@ static int pbus_size_mem(struct pci_bus
}
b_res->start = min_align;
b_res->end = size + min_align - 1;
- b_res->flags |= IORESOURCE_STARTALIGN;
+ b_res->flags |= IORESOURCE_STARTALIGN | mem64_mask;
return 1;
}
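
The bus.c hunk above boils down to one flag test. A stand-alone sketch of that test follows; the SK_* values are stand-ins chosen for this example and are not claimed to match the kernel headers.

```c
#include <stdint.h>

/* Stand-in flag values for this sketch only. */
#define SK_PREFETCH  0x1000u   /* plays the role of IORESOURCE_PREFETCH */
#define SK_PREF_64   0x0001u   /* plays the role of PCI_PREF_RANGE_TYPE_64 */

/* Highest address the allocator may hand out for a resource. */
static uint64_t alloc_max_for(unsigned int flags)
{
    /* Prefetchable but not marked 64-bit capable: stay below 4G. */
    if ((flags & (SK_PREFETCH | SK_PREF_64)) == SK_PREFETCH)
        return 0xffffffffULL;
    return UINT64_MAX;   /* no extra limit, like the old "-1" argument */
}
```

Only a prefetchable resource without the 64-bit mark gets the 4G ceiling; everything else keeps the previous unlimited behaviour.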

2009-04-14 20:51:35

by Yinghai Lu

Subject: [PATCH] pci: don't assume pref memio are 64bit -v2


Impact: fix bug with some devices

seen on one system with 4g installed (there is a 1g hole)

with 4G installed, the BIOS puts the ACPI regions etc. right next to the hole:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
[ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
[ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
so the kernel reserves resources for 0xbffa0000 - 0xbfff0000 (ACPI),
0x100000 - 0xbffa0000 (RAM), ...

and the BIOS sets
[ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
[ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
which conflicts with those reserved resources, so the kernel cannot reserve it.

the kernel then tries to get a range starting at 0x140000000 (above the RAM, i.e. at 5G and above 4G)
and makes both the bridge and the ATI card use it.

but the problem is that the ATI card only supports 32bit ...

we should not assign a 64bit range to a pci device that only takes 32bit pref

so set PCI_PREF_RANGE_TYPE_64 in the 64bit resources of a pci_device (besides the pci_bridge),
and make the bus resource have that bit set only when all devices under it support
64bit pref mem,
then use that flag to decide the max limit for find/request.

v2: fix b_res->flags and logic and passing result.

Reported-and-tested-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/bus.c | 8 +++++++-
drivers/pci/probe.c | 8 ++++++--
drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
3 files changed, 44 insertions(+), 12 deletions(-)

Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -41,9 +41,15 @@ pci_bus_alloc_resource(struct pci_bus *b
void *alignf_data)
{
int i, ret = -ENOMEM;
+ resource_size_t max = -1;

type_mask |= IORESOURCE_IO | IORESOURCE_MEM;

+ /* don't allocate too high if the pref mem doesn't support 64bit*/
+ if ((res->flags & (IORESOURCE_PREFETCH | PCI_PREF_RANGE_TYPE_64)) ==
+ IORESOURCE_PREFETCH)
+ max = 0xffffffff;
+
for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
struct resource *r = bus->resource[i];
if (!r)
@@ -62,7 +68,7 @@ pci_bus_alloc_resource(struct pci_bus *b
/* Ok, try it out.. */
ret = allocate_resource(r, res, size,
r->start ? : min,
- -1, align,
+ max, align,
alignf, alignf_data);
if (ret == 0)
break;
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -193,7 +193,7 @@ int __pci_read_base(struct pci_dev *dev,
res->flags |= pci_calc_resource_flags(l) | IORESOURCE_SIZEALIGN;
if (type == pci_bar_io) {
l &= PCI_BASE_ADDRESS_IO_MASK;
- mask = PCI_BASE_ADDRESS_IO_MASK & 0xffff;
+ mask = PCI_BASE_ADDRESS_IO_MASK & IO_SPACE_LIMIT;
} else {
l &= PCI_BASE_ADDRESS_MEM_MASK;
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
@@ -237,6 +237,9 @@ int __pci_read_base(struct pci_dev *dev,
dev_printk(KERN_DEBUG, &dev->dev,
"reg %x 64bit mmio: %pR\n", pos, res);
}
+
+ if (res->flags & IORESOURCE_PREFETCH)
+ res->flags |= PCI_PREF_RANGE_TYPE_64;
} else {
sz = pci_size(l, sz, mask);

@@ -362,7 +365,8 @@ void __devinit pci_read_bridge_bases(str
}
}
if (base <= limit) {
- res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ res->flags = (mem_base_lo & PCI_PREF_RANGE_TYPE_MASK) |
+ IORESOURCE_MEM | IORESOURCE_PREFETCH;
res->start = base;
res->end = limit + 0xfffff;
dev_printk(KERN_DEBUG, &dev->dev, "bridge %sbit mmio pref: %pR\n",
Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -143,6 +143,7 @@ static void pci_setup_bridge(struct pci_
struct pci_dev *bridge = bus->self;
struct pci_bus_region region;
u32 l, bu, lu, io_upper16;
+ int pref_mem64;

if (pci_is_enabled(bridge))
return;
@@ -198,16 +199,22 @@ static void pci_setup_bridge(struct pci_
pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, 0);

/* Set up PREF base/limit. */
+ pref_mem64 = 0;
bu = lu = 0;
pcibios_resource_to_bus(bridge, &region, bus->resource[2]);
if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
+ int width = 8;
l = (region.start >> 16) & 0xfff0;
l |= region.end & 0xfff00000;
- bu = upper_32_bits(region.start);
- lu = upper_32_bits(region.end);
- dev_info(&bridge->dev, " PREFETCH window: %#016llx-%#016llx\n",
- (unsigned long long)region.start,
- (unsigned long long)region.end);
+ if (bus->resource[2]->flags & PCI_PREF_RANGE_TYPE_64) {
+ pref_mem64 = 1;
+ bu = upper_32_bits(region.start);
+ lu = upper_32_bits(region.end);
+ width = 16;
+ }
+ dev_info(&bridge->dev, " PREFETCH window: %#0*llx-%#0*llx\n",
+ width, (unsigned long long)region.start,
+ width, (unsigned long long)region.end);
}
else {
l = 0x0000fff0;
@@ -215,9 +222,11 @@ static void pci_setup_bridge(struct pci_
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);

- /* Set the upper 32 bits of PREF base & limit. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
- pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ if (pref_mem64) {
+ /* Set the upper 32 bits of PREF base & limit. */
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
+ pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ }

pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
@@ -255,8 +264,11 @@ static void pci_bridge_check_ranges(stru
pci_read_config_dword(bridge, PCI_PREF_MEMORY_BASE, &pmem);
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, 0x0);
}
- if (pmem)
+ if (pmem) {
b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if ((pmem & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64)
+ b_res[2].flags |= PCI_PREF_RANGE_TYPE_64;
+ }
}

/* Helper function for sizing routines: find first available
@@ -336,6 +348,7 @@ static int pbus_size_mem(struct pci_bus
resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */
int order, max_order;
struct resource *b_res = find_free_bus_resource(bus, type);
+ unsigned int mem64_mask = 0;

if (!b_res)
return 0;
@@ -344,6 +357,11 @@ static int pbus_size_mem(struct pci_bus
max_order = 0;
size = 0;

+ if (type & IORESOURCE_PREFETCH) {
+ mem64_mask = b_res->flags & PCI_PREF_RANGE_TYPE_64;
+ b_res->flags &= ~PCI_PREF_RANGE_TYPE_64;
+ }
+
list_for_each_entry(dev, &bus->devices, bus_list) {
int i;

@@ -372,6 +390,8 @@ static int pbus_size_mem(struct pci_bus
aligns[order] += align;
if (order > max_order)
max_order = order;
+ if (r->flags & IORESOURCE_PREFETCH)
+ mem64_mask &= r->flags & PCI_PREF_RANGE_TYPE_64;
}
}

@@ -396,6 +416,8 @@ static int pbus_size_mem(struct pci_bus
b_res->start = min_align;
b_res->end = size + min_align - 1;
b_res->flags |= IORESOURCE_STARTALIGN;
+ if (type & IORESOURCE_PREFETCH)
+ b_res->flags |= mem64_mask;
return 1;
}
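
As far as the two diffs show, the difference between v1 and v2 in pbus_size_mem() is where the 64-bit mask starts and whether the bridge's own bit gets cleared first. The sketch below is one reading of that change, reduced to plain flag arithmetic; the SK_* values and function names are stand-ins for illustration, not kernel definitions.

```c
#include <stddef.h>

/* Stand-in flag values for this sketch only. */
#define SK_PREFETCH  0x1000u
#define SK_PREF_64   0x0001u

/* v1 style: start from "64-bit", AND in each device, OR into the bridge. */
static unsigned int size_flags_v1(unsigned int bridge_flags,
                                  const unsigned int *dev_flags, size_t n)
{
    unsigned int mem64_mask = SK_PREF_64;

    for (size_t i = 0; i < n; i++)
        if (dev_flags[i] & SK_PREFETCH)
            mem64_mask &= dev_flags[i] & SK_PREF_64;
    return bridge_flags | mem64_mask;   /* cannot clear an already-set bit */
}

/* v2 style: start from what the bridge reported, clear it on the bridge,
 * and put it back only if every prefetchable device kept it. */
static unsigned int size_flags_v2(unsigned int bridge_flags,
                                  const unsigned int *dev_flags, size_t n)
{
    unsigned int mem64_mask = bridge_flags & SK_PREF_64;

    bridge_flags &= ~SK_PREF_64;
    for (size_t i = 0; i < n; i++)
        if (dev_flags[i] & SK_PREFETCH)
            mem64_mask &= dev_flags[i] & SK_PREF_64;
    return bridge_flags | mem64_mask;
}

static const unsigned int only_32bit_dev[] = { SK_PREFETCH };
static const unsigned int only_64bit_dev[] = { SK_PREFETCH | SK_PREF_64 };
```

With a 64-bit capable bridge and a 32-bit-only device, the v1 style leaves the 64-bit bit set on the bridge resource while the v2 style drops it; with a 32-bit-only bridge, the v2 style never turns the bit on.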

2009-04-14 20:52:36

by Yinghai Lu

Subject: [PATCH] x86/pci: make pci_mem_start to be aligned only


Impact: leave more space below 4g for assigning to unassigned pci devices

no need to reserve one extra round after gapstart.

Reported-and-tested-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index ef2c356..a0ba9b1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -642,7 +642,7 @@ __init void e820_setup_gap(void)
while ((gapsize >> 4) > round)
round += round;
/* Fun with two's complement */
- pci_mem_start = (gapstart + round) & -round;
+ pci_mem_start = roundup(gapstart, round);

printk(KERN_INFO
"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",

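For reference, the one-line change can be tried outside the kernel. The sketch below assumes the gap chosen on the reporter's machine is the hole between 0xc0000000 and 0xfee00000 from the e820 map earlier in the thread; the function names are illustrative.

```c
#include <stdint.h>

/* Doubling loop from e820_setup_gap(), with the starting value as a parameter. */
static uint64_t pick_round(uint64_t gapsize, uint64_t min_round)
{
    uint64_t round = min_round;

    while ((gapsize >> 4) > round)
        round += round;
    return round;
}

/* Old formula: always moves to the next boundary, even if already aligned. */
static uint64_t next_boundary(uint64_t gapstart, uint64_t round)
{
    return (gapstart + round) & -round;
}

/* Patched formula: plain round-up, a no-op for an aligned gapstart. */
static uint64_t round_up_only(uint64_t gapstart, uint64_t round)
{
    return (gapstart + round - 1) / round * round;
}
```

Under that assumption, gapsize is 0x3ee00000, round grows from 1MB to 0x4000000, the old formula yields 0xc4000000 and the patched one yields 0xc0000000.
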
2009-04-14 21:17:20

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only



On Tue, 14 Apr 2009, Yinghai Lu wrote:
>
> Impact: make more big space below 4g for assigning to unassigned pci devices
>
> don't need to reserved one round after the gapstart.
>
> Reported-and-tested-by: Yannick <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index ef2c356..a0ba9b1 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -642,7 +642,7 @@ __init void e820_setup_gap(void)
> while ((gapsize >> 4) > round)
> round += round;
> /* Fun with two's complement */
> - pci_mem_start = (gapstart + round) & -round;
> + pci_mem_start = roundup(gapstart, round);

That thing is called "e820_setup_gap()" for a reason. It's supposed to
create a _gap_. That's why it historically doesn't just round up (no
"+round-1" as in round_up()), it rounds up to the _next_ boundary
("+round") in order to guarantee a gap.

The reason? We've definitely seen ACPI code or integrated graphics stuff
that steals a lot of memory at the end, which means that end-of-RAM might
be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
16MB of "stolen video ram").

Now, the BIOS _hopefully_ marks those areas clearly reserved, and as a
result we don't end up allocating PCI data in there, but the gap was there
literally to make sure we always leave that gap, very much on purpose.

So I'm very nervous about this.

At a minimum, if we do this, I'd like to make sure we round up to a big
boundary (eg 32MB or something - exactly because a missing 16MB can easily
be some integrated stolen video memory).

Sure, we do that whole

while ((gapsize >> 4) > round)
round += round;

thing, so that if the gap is large, then we'll certainly get to 32MB too,
but I think your patch matters the most exactly when the gap is small.
Maybe we could just raise the initial minimum rounding from 1MB to 32MB?
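
A quick way to see the small-gap concern is to compute how much address space a plain round-up leaves between the reported end of RAM and the first PCI address. This is only a sketch using the 2GB-16MB-1MB example from above; the helper names are made up.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Plain round-up to a multiple of `round` (no extra boundary skipped). */
static uint64_t round_up_to(uint64_t addr, uint64_t round)
{
    return (addr + round - 1) / round * round;
}

/* Address space left untouched between reported end of RAM and PCI start. */
static uint64_t guard_bytes(uint64_t end_of_ram, uint64_t round)
{
    return round_up_to(end_of_ram, round) - end_of_ram;
}

/* The example above: RAM appears to end at 2GB - 16MB - 1MB. */
static const uint64_t example_end_of_ram = 2048 * MB - 16 * MB - 1 * MB;
```

With 1MB rounding the guard is zero, because the end of RAM is already 1MB aligned; with 32MB rounding it is 17MB, which covers both the stolen video RAM and the ACPI data in the example.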

Alternatively, maybe we can make sure that we round up to at least X bytes
from the end of RAM, and to at least Y bytes from the end of some RESERVED
thing.

I dunno. Maybe your patch is fine as-is. But I do get nervous.

Linus

2009-04-14 21:30:27

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only

Linus Torvalds wrote:
>

> Alternatively, maybe we can make sure that we round up to at least X bytes
> from the end of RAM, and to at least Y bytes from the end of some RESERVED
> thing.

ok, will try to have one updated one with those check...

YH

2009-04-14 21:30:59

by H. Peter Anvin

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only

Linus Torvalds wrote:
> The reason? We've definitely seen ACPI code or integrated graphics stuff
> that steals a lot of memory at the end, which means that end-of-RAM might
> be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
> 16MB of "stolen video ram").

This is pretty much standard these days. It's hard to implement ACPI
without doing so. Throw in the SMI T-seg for even more fun.

> Now, the BIOS _hopefully_ marks those areas clearly reserved, and as a
> result we don't end up allocating PCI data in there, but the gap was there
> literally to make sure we always leave that gap, very much on purpose.

It would be nice if we would mark that memory reserved ourselves.

> thing, so that if the gap is large, then we'll certainly get to 32MB too,
> but I think your patch matters the most exactly when the gap is small.
> Maybe we could just raise the initial minimum rounding from 1MB to 32MB?

Since we're talking about address space, not actual memory, it seems
rather hard to end up in a situation where either one of these is not true:

- we will have real hardware demand for a large alignment datum.
- we will have so much address space available that it doesn't matter.

The latter case would be e.g. a machine with a today-anemic handful of
megabytes of RAM.

-hpa[1]

[1] who remembers running a Linux server on a 0.59 bogomips i386/16 with
3 MB of half-speed memory...

2009-04-14 22:37:27

by Yannick Roehlly

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only

On Tuesday 14 April 2009 23:10:38, Linus Torvalds wrote:
> The reason? We've definitely seen ACPI code or integrated graphics stuff
> that steals a lot of memory at the end, which means that end-of-RAM might
> be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
> 16MB of "stolen video ram").

Good evening (UTC+0200),

First, I must say that I don't understand 99% of the technical details of this
discussion. I'm only the bug reporter. ;-)

There's one thing about my computer that may affect, or maybe cause this bug.
There's no need to make possibly harmful change to the kernel only to fix it on
a few machines with buggy bioses.

My computer is an Asus M51Se laptop with an HD3470 graphics card. This card is
a "hypermemory" one with 256MB dedicated RAM and up to 1GB with shared
memory. The problem is that I don't know how the computer assigns main memory
to the card. There is nothing in the bios to indicate the amount of shared
memory (and Asus did not answer my questions).

Once again, I'm not an expert, so if these details are not important, please
forgive me.

Sincerely,

Yannick

--
"...Unix, MS-DOS, and Windows NT (also known as the Good, the Bad, and
the Ugly)."
(By Matt Welsh)

2009-04-15 00:31:45

by Yinghai Lu

Subject: [PATCH] x86/pci: make pci_mem_start to be aligned only -v2


Impact: leave more space below 4g for assigning to unassigned pci devices

no need to reserve one extra round after gapstart.

v2: Linus said: "
We've definitely seen ACPI code or integrated graphics stuff
that steals a lot of memory at the end, which means that end-of-RAM might
be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
16MB of "stolen video ram").

At a minimum, if we do this, I'd like to make sure we round up to a big
boundary (eg 32MB or something - exactly because a missing 16MB can easily
be some integrated stolen video memory).

Sure, we do that whole

while ((gapsize >> 4) > round)
round += round;

thing, so that if the gap is large, then we'll certainly get to 32MB too,
but I think your patch matters the most exactly when the gap is small.
Maybe we could just raise the initial minimum rounding from 1MB to 32MB?
...
Alternatively, maybe we can make sure that we round up to at least X bytes
from the end of RAM, and to at least Y bytes from the end of some RESERVED
thing."


Reported-and-tested-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/e820.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -619,6 +619,7 @@ __init void e820_setup_gap(void)
{
unsigned long gapstart, gapsize, round;
int found;
+ unsigned long low_top_ram;

gapstart = 0x10000000;
gapsize = 0x400000;
@@ -636,14 +637,32 @@ __init void e820_setup_gap(void)

/*
* See how much we want to round up: start off with
- * rounding to the next 1MB area.
+ * rounding to the next 32MB area.
*/
- round = 0x100000;
+ round = 0x2000000;
while ((gapsize >> 4) > round)
round += round;
+
+ pci_mem_start = roundup(gapstart, round);
+
+ low_top_ram = e820_end_of_low_ram_pfn() << PAGE_SHIFT;
+ /* try to check if there is gap between last RAM below 4g to that start */
+ if (pci_mem_start > low_top_ram) {
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_RESERVED))
+ goto out;
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_ACPI))
+ goto out;
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_NVS))
+ goto out;
+
+ if ((pci_mem_start - low_top_ram) > round)
+ goto out;
+ }
+
/* Fun with two's complement */
pci_mem_start = (gapstart + round) & -round;

+out:
printk(KERN_INFO
"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
pci_mem_start, gapstart, gapsize);

2009-04-15 00:43:22

by Yinghai Lu

Subject: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3


Impact: leave more space below 4g for assigning to unassigned pci devices

no need to reserve one extra round after gapstart.

v2: Linus said: "
We've definitely seen ACPI code or integrated graphics stuff
that steals a lot of memory at the end, which means that end-of-RAM might
be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
16MB of "stolen video ram").

At a minimum, if we do this, I'd like to make sure we round up to a big
boundary (eg 32MB or something - exactly because a missing 16MB can easily
be some integrated stolen video memory).

Sure, we do that whole

while ((gapsize >> 4) > round)
round += round;

thing, so that if the gap is large, then we'll certainly get to 32MB too,
but I think your patch matters the most exactly when the gap is small.
Maybe we could just raise the initial minimum rounding from 1MB to 32MB?
...
Alternatively, maybe we can make sure that we round up to at least X bytes
from the end of RAM, and to at least Y bytes from the end of some RESERVED
thing."
v3: require pci_mem_start - low_top_ram to be bigger than half of round, i.e. at least 16M


Reported-and-tested-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/e820.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -619,6 +619,7 @@ __init void e820_setup_gap(void)
{
unsigned long gapstart, gapsize, round;
int found;
+ unsigned long low_top_ram;

gapstart = 0x10000000;
gapsize = 0x400000;
@@ -636,14 +637,32 @@ __init void e820_setup_gap(void)

/*
* See how much we want to round up: start off with
- * rounding to the next 1MB area.
+ * rounding to the next 32MB area.
*/
- round = 0x100000;
+ round = 0x2000000;
while ((gapsize >> 4) > round)
round += round;
+
+ pci_mem_start = roundup(gapstart, round);
+
+ low_top_ram = e820_end_of_low_ram_pfn() << PAGE_SHIFT;
+ /* check if there is gap between last RAM below 4g to that start */
+ if (pci_mem_start > low_top_ram) {
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_RESERVED))
+ goto out;
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_ACPI))
+ goto out;
+ if (e820_any_mapped(low_top_ram, pci_mem_start, E820_NVS))
+ goto out;
+
+ if ((pci_mem_start - low_top_ram) > (round>>1))
+ goto out;
+ }
+
/* Fun with two's complement */
pci_mem_start = (gapstart + round) & -round;

+out:
printk(KERN_INFO
"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
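
The decision flow of the v2/v3 hunk can be followed outside the kernel with a small model. This is a sketch, not the patch itself: the map is the part of the e820 table from this thread near the gap, any_non_ram() folds the three e820_any_mapped() checks into one, and min_guard stands for `round` in v2 or `round >> 1` in v3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sk_type { SK_RAM = 1, SK_RESERVED, SK_ACPI, SK_NVS };

struct sk_entry {
    uint64_t start;
    uint64_t end;       /* exclusive, as in the BIOS-e820 printout */
    enum sk_type type;
};

/* Entries below 4G around the gap, copied from the boot log in this thread. */
static const struct sk_entry sk_map[] = {
    { 0x00100000ULL, 0xbffa0000ULL, SK_RAM },
    { 0xbffa0000ULL, 0xbffae000ULL, SK_ACPI },
    { 0xbffae000ULL, 0xbfff0000ULL, SK_NVS },
    { 0xbfff0000ULL, 0xc0000000ULL, SK_RESERVED },
    { 0xfee00000ULL, 0xfee01000ULL, SK_RESERVED },
};

/* True if any reserved, ACPI or NVS entry overlaps [start, end). */
static bool any_non_ram(uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < sizeof(sk_map) / sizeof(sk_map[0]); i++) {
        const struct sk_entry *e = &sk_map[i];

        if (e->type != SK_RAM && e->start < end && e->end > start)
            return true;
    }
    return false;
}

static uint64_t pick_pci_mem_start(uint64_t gapstart, uint64_t round,
                                   uint64_t low_top_ram, uint64_t min_guard)
{
    uint64_t candidate = (gapstart + round - 1) / round * round;

    if (candidate > low_top_ram) {
        /* The BIOS already marked something between RAM and the start. */
        if (any_non_ram(low_top_ram, candidate))
            return candidate;
        /* Or the distance from the end of RAM is large enough by itself. */
        if (candidate - low_top_ram > min_guard)
            return candidate;
    }
    return (gapstart + round) & -round;   /* fall back to the old formula */
}
```

Assuming low RAM ends at 0xbffa0000 and the gap starts at 0xc0000000 with round 0x4000000, the reserved entries in between let the aligned gapstart be used directly; when RAM ends exactly at an aligned gapstart, the old extra boundary is kept.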

2009-04-15 00:43:54

by Yinghai Lu

Subject: [PATCH] x86/pci: fix -1 calling to e820_all_mapped with mmconfig


Impact: fix callers

e820_all_mapped needs end to be (addr + size), not (addr + size - 1)

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/pci/mmconfig-shared.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/pci/mmconfig-shared.c
===================================================================
--- linux-2.6.orig/arch/x86/pci/mmconfig-shared.c
+++ linux-2.6/arch/x86/pci/mmconfig-shared.c
@@ -375,7 +375,7 @@ static acpi_status __init check_mcfg_res
if (!fixmem32)
return AE_OK;
if ((mcfg_res->start >= fixmem32->address) &&
- (mcfg_res->end < (fixmem32->address +
+ (mcfg_res->end <= (fixmem32->address +
fixmem32->address_length))) {
mcfg_res->flags = 1;
return AE_CTRL_TERMINATE;
@@ -392,7 +392,7 @@ static acpi_status __init check_mcfg_res
return AE_OK;

if ((mcfg_res->start >= address.minimum) &&
- (mcfg_res->end < (address.minimum + address.address_length))) {
+ (mcfg_res->end <= (address.minimum + address.address_length))) {
mcfg_res->flags = 1;
return AE_CTRL_TERMINATE;
}
@@ -439,7 +439,7 @@ static int __init is_mmconf_reserved(che
u64 old_size = size;
int valid = 0;

- while (!is_reserved(addr, addr + size - 1, E820_RESERVED)) {
+ while (!is_reserved(addr, addr + size, E820_RESERVED)) {
size >>= 1;
if (size < (16UL<<20))
break;
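
The off-by-one matters because an exclusive end of (addr + size - 1) never checks the last byte. A minimal illustration with a single hypothetical region; all_mapped() here is a simplified stand-in, not the kernel's e820_all_mapped().

```c
#include <stdbool.h>
#include <stdint.h>

/* A single reserved region [start, end), end exclusive. */
struct sk_region {
    uint64_t start;
    uint64_t end;
};

/* True if [start, end) lies entirely inside the region (end exclusive). */
static bool all_mapped(struct sk_region r, uint64_t start, uint64_t end)
{
    return start >= r.start && end <= r.end;
}

/* Hypothetical: a reserved region one byte short of a 256MB window. */
static const struct sk_region short_region = { 0xe0000000ULL, 0xefffffffULL };
static const uint64_t win_addr = 0xe0000000ULL;
static const uint64_t win_size = 0x10000000ULL;
```

Called with win_addr + win_size - 1 the window is accepted although its last byte is not reserved; called with win_addr + win_size it is rejected.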

2009-04-16 16:32:18

by Jesse Barnes

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3

On Tue, 14 Apr 2009 17:41:35 -0700
Yinghai Lu <[email protected]> wrote:

>
> Impact: make more big space below 4g for assigning to unassigned pci
> devices
>
> don't need to reserved one round after the gapstart.
>
> v2: Linus said: "
> We've definitely seen ACPI code or integrated graphics stuff
> that steals a lot of memory at the end, which means that
> end-of-RAM might be not at 2GB, but at 2GB-16MB-1MB, for example (1MB
> of "ACPI data", and 16MB of "stolen video ram").
>
> At a minimum, if we do this, I'd like to make sure we round
> up to a big boundary (eg 32MB or something - exactly because a
> missing 16MB can easily be some integrated stolen video memory).
>
> Sure, we do that whole
>
> while ((gapsize >> 4) > round)
> round += round;
>
> thing, so that if the gap is large, then we'll certainly get
> to 32MB too, but I think your patch matters the most exactly when the
> gap is small. Maybe we could just raise the initial minimum rounding
> from 1MB to 32MB? ...
> Alternatively, maybe we can make sure that we round up to at
> least X bytes from the end of RAM, and to at least Y bytes from the
> end of some RESERVED thing."
> v3: require pci_mem_start - low_top_ram to be bigger than half the
> round, i.e. at least 16M

Any comments on this one, Linus? Should I include your ack?

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2009-04-16 16:52:06

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3



On Thu, 16 Apr 2009, Jesse Barnes wrote:
>
> Any comments on this one, Linus? Should I include your ack?

I'm not ready to ack it, no. I don't think the suggested patch is very
clean or necessarily sensible as-is. It seems very ad-hoc.

I was literally thinking of something like
"round up from the last RAM by X"
"round up from the last reserved region by Y"
"pick the bigger of the two"

with helper functions for the two cases and comments along the lines of
why we do it. Something that was a bit more obvious about what it's doing
and why.

And no, I realize that the old code isn't that way. But the old code isn't
the issue - the old code is proven over _years_ and years of testing, and
works wonderfully well for a ton of very different machines. It has _one_
single known failure case, and while there clearly must be others, the
point is, the old code is not what needs to be worried about.

So when changing that code that has all that testing, and when the
failures are so nasty and hard to debug and likely only happen on some
random old laptop that has crap e820 tables and _just_ the right amount
of memory, I'd really like the replacement code to be better.

Linus

2009-04-16 16:58:16

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3


* Linus Torvalds <[email protected]> wrote:

> On Thu, 16 Apr 2009, Jesse Barnes wrote:
> >
> > Any comments on this one, Linus? Should I include your ack?
>
> I'm not ready to ack it, no. I don't think the suggested patch is very
> clean or necessarily sensible as-is. It seems very ad-hoc.
>
> I was literally thinking of something like
> "round up from the last RAM by X"
> "round up from the last reserved region by Y"
> "pick the bigger of the two"
>
> with helper functions for the two cases and comments along the
> lines of why we do it. Something that was a bit more obvious about
> what it's doing and why.

That's sensible - but i'd also like to inject hpa's add-on idea: if
we do that then we should do it _explicitly_ and _visibly_, by
injecting an artificial e820 reservation range to all expected
"vulnerable" holes we cannot fully trust.

We'd do that after all the fixed resources are allocated, but before
dynamic PCI allocations.

That prevents the PCI layer from dynamically allocating anything
into that protective zone, and documents it as well (and makes it
visible in boot logs, etc.) - instead of just a silent rule
somewhere that no-one will really see if it breaks.

Or would this be a bad idea for some obvious reason i missed?

Ingo

2009-04-16 17:21:10

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> On Thu, 16 Apr 2009, Jesse Barnes wrote:
>>> Any comments on this one, Linus? Should I include your ack?
>> I'm not ready to ack it, no. I don't think the suggested patch is very
>> clean or necessarily sensible as-is. It seems very ad-hoc.
>>
>> I was literally thinking of something like
>> "round up from the last RAM by X"
>> "round up from the last reserved region by Y"
>> "pick the bigger of the two"
>>
>> with helper functions for the two cases and comments along the
>> lines of why we do it. Something that was a bit more obvious about
>> what it's doing and why.
>
> That's sensible - but i'd also like to inject hpa's add-on idea: if
> we do that then we should do it _explicitly_ and _visibly_, by
> injecting an artificial e820 reservation range to all expected
> "vulnerable" holes we cannot fully trust.
>
> We'd do that after all the fixed resources are allocated, but before
> dynamic PCI allocations.
>
> That prevents the PCI layer from dynamically allocating anything
> into that protective zone, and documents it as well (and makes it
> visible in boot logs, etc.) - instead of just a silent rule
> somewhere that no-one will really see if it breaks.

That needs to be done much earlier, and it is much simpler: we just need to make that range reserved in e820.
Then e820_setup_gap doesn't even need to align again later.

YH

2009-04-16 17:29:29

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3


* Yinghai Lu <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Linus Torvalds <[email protected]> wrote:
> >
> >> On Thu, 16 Apr 2009, Jesse Barnes wrote:
> >>> Any comments on this one, Linus? Should I include your ack?
> >> I'm not ready to ack it, no. I don't think the suggested patch is very
> >> clean or necessarily sensible as-is. It seems very ad-hoc.
> >>
> >> I was literally thinking of something like
> >> "round up from the last RAM by X"
> >> "round up from the last reserved region by Y"
> >> "pick the bigger of the two"
> >>
> >> with helper functions for the two cases and comments along the
> >> lines of why we do it. Something that was a bit more obvious about
> >> what it's doing and why.
> >
> > That's sensible - but i'd also like to inject hpa's add-on idea: if
> > we do that then we should do it _explicitly_ and _visibly_, by
> > injecting an artificial e820 reservation range to all expected
> > "vulnerable" holes we cannot fully trust.
> >
> > We'd do that after all the fixed resources are allocated, but before
> > dynamic PCI allocations.
> >
> > That prevents the PCI layer from dynamically allocating anything
> > into that protective zone, and documents it as well (and makes it
> > visible in boot logs, etc.) - instead of just a silent rule
> > somewhere that no-one will really see if it breaks.
>
> That needs to be done much earlier, and it is much simpler: we just
> need to make that range reserved in e820. Then e820_setup_gap
> doesn't even need to align again later.

Well, an alignment _check_ could still be added with a
WARN_ONCE(), to make sure these assumptions hold true in
future as well.

This kind of stuff is generally not testable and won't break on many
systems - but it can easily cripple a random 0.5% of systems,
creating a lot of unhappy users.

So pretty much the only solution is to be careful, robust and
redundant all along.

Ingo

2009-04-16 17:30:28

by H. Peter Anvin

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3

Yinghai Lu wrote:
>
> That needs to be done much earlier, and it is much simpler: we just need to make that range reserved in e820.
> Then e820_setup_gap doesn't even need to align again later.
>

As long as that doesn't cause the PCI layer to move devices already
assigned in this range out of it. What we want is really a "weak
reserve". On the other hand, that may very well be the semantics of the
existing reserved space, too (I honestly haven't looked lately.)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-04-16 17:39:32

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v3


* H. Peter Anvin <[email protected]> wrote:

> Yinghai Lu wrote:
> >
> > That needs to be done much earlier, and it is much simpler: we just need to make that range reserved in e820.
> > Then e820_setup_gap doesn't even need to align again later.
> >
>
> As long as that doesn't cause the PCI layer to move devices
> already assigned in this range out of it. What we want is really
> a "weak reserve". On the other hand, that may very well be the
> semantics of the existing reserved space, too (I honestly haven't
> looked lately.)

We have reserve_region_with_split(), which 'wraps around' existing
resources non-intrusively by creating split-up resources -
preventing their forced reallocation (and preventing their possible
breakage - a number of BARs don't like dynamic reallocations at all).

Ingo

2009-04-16 20:15:29

by Yinghai Lu

Subject: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

please check.

[PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Impact: leave more space below 4G for assigning to unassigned PCI devices

We don't need to reserve one full rounding unit after gapstart.

v2: Linus said: "
We've definitely seen ACPI code or integrated graphics stuff
that steals a lot of memory at the end, which means that end-of-RAM might
be not at 2GB, but at 2GB-16MB-1MB, for example (1MB of "ACPI data", and
16MB of "stolen video ram").

At a minimum, if we do this, I'd like to make sure we round up to a big
boundary (eg 32MB or something - exactly because a missing 16MB can easily
be some integrated stolen video memory).

Sure, we do that whole

while ((gapsize >> 4) > round)
round += round;

thing, so that if the gap is large, then we'll certainly get to 32MB too,
but I think your patch matters the most exactly when the gap is small.
Maybe we could just raise the initial minimum rounding from 1MB to 32MB?
...
Alternatively, maybe we can make sure that we round up to at least X bytes
from the end of RAM, and to at least Y bytes from the end of some RESERVED
thing."
v3: require pci_mem_start - low_top_ram to be bigger than half the round, i.e. at least 16M
v4: check e820 early to see if we need to reserve stolen RAM,
and only do one simple round-up in e820_setup_gap

Reported-and-tested-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/e820.h | 1
arch/x86/kernel/e820.c | 54 ++++++++++++++++++++++++++++++++++++++------
arch/x86/kernel/setup.c | 6 ++++
3 files changed, 54 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -636,13 +636,11 @@ __init void e820_setup_gap(void)

/*
* See how much we want to round up: start off with
- * rounding to the next 1MB area.
+ * rounding to the next 32MB area.
*/
- round = 0x100000;
- while ((gapsize >> 4) > round)
- round += round;
- /* Fun with two's complement */
- pci_mem_start = (gapstart + round) & -round;
+ round = 0x2000000;
+
+ pci_mem_start = roundup(gapstart, round);

printk(KERN_INFO
"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
@@ -1143,7 +1141,9 @@ static unsigned long __init e820_end_pfn
if (last_pfn > max_arch_pfn)
last_pfn = max_arch_pfn;

- printk(KERN_INFO "last_pfn = %#lx max_arch_pfn = %#lx\n",
+ printk(KERN_INFO "limit_pfn = %#lx ", limit_pfn);
+ e820_print_type(type);
+ printk(KERN_CONT " last_pfn = %#lx max_arch_pfn = %#lx\n",
last_pfn, max_arch_pfn);
return last_pfn;
}
@@ -1314,6 +1314,46 @@ void __init finish_e820_parsing(void)
}
}

+static unsigned long __init real_end(unsigned long low_top_ram,
+ unsigned long round,
+ unsigned long real_end_ram, int type)
+{
+ unsigned long low_top_x;
+ unsigned long end_x;
+
+ low_top_x = e820_end_pfn((low_top_ram + round)>>PAGE_SHIFT, type)
+ << PAGE_SHIFT;
+ end_x = roundup(low_top_x, round);
+ if (end_x > real_end_ram)
+ real_end_ram = end_x;
+
+ return real_end_ram;
+}
+
+void __init e820_reserve_stolen_range(void)
+{
+ unsigned long round;
+ unsigned long low_top_ram;
+ unsigned long real_end_ram;
+
+ /* 32M is enough ?*/
+ round = 0x2000000;
+ low_top_ram = e820_end_of_low_ram_pfn() << PAGE_SHIFT;
+ real_end_ram = roundup(low_top_ram, round);
+ if (low_top_ram == real_end_ram)
+ return;
+
+ real_end_ram = real_end(low_top_ram, round, real_end_ram,
+ E820_RESERVED);
+ real_end_ram = real_end(low_top_ram, round, real_end_ram, E820_ACPI);
+ real_end_ram = real_end(low_top_ram, round, real_end_ram, E820_NVS);
+
+ e820_add_region(low_top_ram, real_end_ram - low_top_ram, E820_RESERVED);
+ sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
+ printk(KERN_INFO "fixed physical RAM map:\n");
+ e820_print_map("reserve_stolen_range");
+}
+
static inline const char *e820_type_to_string(int e820_type)
{
switch (e820_type) {
Index: linux-2.6/arch/x86/include/asm/e820.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/e820.h
+++ linux-2.6/arch/x86/include/asm/e820.h
@@ -78,6 +78,7 @@ extern u64 e820_update_range(u64 start,
extern u64 e820_remove_range(u64 start, u64 size, unsigned old_type,
int checktype);
extern void update_e820(void);
+extern void e820_reserve_stolen_range(void);
extern void e820_setup_gap(void);
extern int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,
unsigned long start_addr, unsigned long long end_addr);
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -812,6 +812,12 @@ void __init setup_arch(char **cmdline_p)
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);

+ /*
+ * Some systems use the end of RAM for ACPI or video RAM
+ * but do not mark it as reserved in e820.
+ * Round up the end of RAM and reserve that range.
+ */
+ e820_reserve_stolen_range();

#ifdef CONFIG_X86_32
if (ppro_with_ram_bug()) {

2009-04-16 23:24:55

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Thu, 16 Apr 2009, Yinghai Lu wrote:
>
> please check.
>
> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

I like the approach. That said, I think that rather than do the "modify
the e820 array" thing, why not just do it in the resource tree, and
do it at "e820_reserve_resources_late()" time?

IOW, something like this.

TOTALLY UNTESTED! The point is to take all RAM resources we have, and
_after_ we've added all the resources we've seen in the E820 tree, we then
_also_ try to add fake reserved entries for any "round up to X" at the end
of the RAM resources.

NOTE! I really didn't want to use "reserve_region_with_split()". I didn't
want to recurse into any conflicting resources, I really wanted to just do
the other failure cases.

THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind
of thing. That includes the rough draft of how much to round things up to
based on where the end of RAM region is etc. I'm really throwing this out
more as a "wouldn't this be a readable way to handle any missing reserved
entries" kind of thing..

Linus

---
arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++
1 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index ef2c356..e8b8d33 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
}
}

+/* How much should we pad RAM ending depending on where it is? */
+static unsigned long ram_alignment(resource_size_t pos)
+{
+ unsigned long mb = pos >> 20;
+
+ /* To 64kB in the first megabyte */
+ if (!mb)
+ return 64*1024;
+
+ /* To 1MB in the first 16MB */
+ if (mb < 16)
+ return 1024*1024;
+
+ /* To 32MB for anything above that */
+ return 32*1024*1024;
+}
+
void __init e820_reserve_resources_late(void)
{
int i;
@@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
insert_resource_expand_to_fit(&iomem_resource, res);
res++;
}
+
+ /*
+ * Try to bump up RAM regions to reasonable boundaries to
+ * avoid stolen RAM
+ */
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *entry = &e820_saved.map[i];
+ resource_size_t start, end;
+
+ if (entry->type != E820_RAM)
+ continue;
+ start = entry->addr + entry->size;
+ end = round_up(start, ram_alignment(start));
+ if (start == end)
+ continue;
+ reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
+ }
}

char *__init default_machine_specific_memory_setup(void)

2009-04-16 23:56:20

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Linus Torvalds <[email protected]> wrote:

> On Thu, 16 Apr 2009, Yinghai Lu wrote:
> >
> > please check.
> >
> > [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
>
> I like the approach. That said, I think that rather than do the "modify
> the e820 array" thing, why not just do it in the resource tree, and
> do it at "e820_reserve_resources_late()" time?
>
> IOW, something like this.
>
> TOTALLY UNTESTED! The point is to take all RAM resources we have, and
> _after_ we've added all the resources we've seen in the E820 tree, we then
> _also_ try to add fake reserved entries for any "round up to X" at the end
> of the RAM resources.
>
> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't
> want to recurse into any conflicting resources, I really wanted to just do
> the other failure cases.
>
> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind
> of thing. That includes the rough draft of how much to round things up to
> based on where the end of RAM region is etc. I'm really throwing this out
> more as a "wouldn't this be a readable way to handle any missing reserved
> entries" kind of thing..
>
> Linus
>
> ---
> arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++
> 1 files changed, 34 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index ef2c356..e8b8d33 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
> }
> }
>
> +/* How much should we pad RAM ending depending on where it is? */
> +static unsigned long ram_alignment(resource_size_t pos)
> +{
> + unsigned long mb = pos >> 20;
> +
> + /* To 64kB in the first megabyte */
> + if (!mb)
> + return 64*1024;

Could we perhaps round up to 1MB in this case too?

This would mean that we would never dynamically allocate into the
640k..1MB hole - but even if it's free it's probably a good idea to
avoid that range - it's usually quite special.

So we'd just have two granularities for round-up: 1MB and 32MB. Nice,
predictable scheme.

> +
> + /* To 1MB in the first 16MB */
> + if (mb < 16)
> + return 1024*1024;
> +
> + /* To 32MB for anything above that */
> + return 32*1024*1024;
> +}
> +
> void __init e820_reserve_resources_late(void)
> {
> int i;
> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
> insert_resource_expand_to_fit(&iomem_resource, res);
> res++;
> }
> +
> + /*
> + * Try to bump up RAM regions to reasonable boundaries to
> + * avoid stolen RAM
> + */
> + for (i = 0; i < e820.nr_map; i++) {
> + struct e820entry *entry = &e820_saved.map[i];
> + resource_size_t start, end;
> +
> + if (entry->type != E820_RAM)
> + continue;

Would it make sense to round up everything that is listed in the
E820 map? Just in case the BIOS is not entirely honest about the
true extent of that area.

It might even make sense to add a small 'guard' area next to such
ranges (even if they are aligned well): to prevent the BIOS from
accidentally overrunning into a device BAR we allocate next to it.

> + start = entry->addr + entry->size;
> + end = round_up(start, ram_alignment(start));
> + if (start == end)
> + continue;

Hm, indeed, the continue is needed - reserve_region_with_split()
lets zero-sized resources be inserted silently. I'd have missed this
case. Do zero-sized memory resources have a special role somewhere?

[ Plus i don't see any protection against negative-size resources in
kernel/resource.c either. OTOH inserting a negative size resource
just locks down the tree for all future resources, so it would be
noticed for sure.

Zero-size is not an issue it appears, it goes in and prevents only
the exactly same-position zero-size resource to be inserted.

But it might make sense to silently ignore zero-size resources (if we
don't rely on them elsewhere), or to WARN() about them and about
negative size resources.
]

> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");

Ingo

2009-04-17 00:31:48

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Fri, 17 Apr 2009, Ingo Molnar wrote:
>
> Could we perhaps round up to 1MB in this case too?

(The below 1MB one).

I'd argue against it, at least in this incarnation. I can well imagine
somebody wanting to do resource management in the 640k-1M window, so..

> Would it make sense to round up everything that is listed in the
> E820 map? Just in case the BIOS is not entirely honest about the
> true extent of that area.

Well, it would probably work, but on the other hand, when we see
"E820_RAM", that means that we really _can_ trust that that E820 entry is
right, since we're going to use it as RAM (and Windows would too), and if
it wasn't RAM, really bad things would happen.

So E820_RAM is a _lot_ more trustworthy than the other cases. If we're
rounding up by reasonably large amounts like 32MB or even more, I really
think we should do so for the things we really know are there, and that we
really fundamentally know come in big granularities.

The other entries in the e820 map can reasonably be 4kB or something,
because they are an IO-APIC or whatever. I can't say that I'd feel happy
putting a guard area around something like that. But RAM? Sure, it can
come in 384kB chunks (think RAM remapping for the low 1MB area), but it
doesn't tend to happen when we're talking gigs any more.

> > + start = entry->addr + entry->size;
> > + end = round_up(start, ram_alignment(start));
> > + if (start == end)
> > + continue;
>
> Hm, indeed, the continue is needed - reserve_region_with_split()
> lets zero-sized resources be inserted silently. I'd have missed this
> case. Do zero-sized memory resources have a special role somewhere?

No. But it wouldn't be a zero-size region, it would be a one-byte sized
region. It's just that my patch was missing the "-1" from the end that I
meant to put there:

> > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");

That 'end' there should be 'end-1', and that also explains why "start ==
end" must have a continue.

The 'end' in a resource region is the last byte, not the 'byte after'.

So there was a small buglet in the patch, but as I mentioned, using
"reserve_region_with_split()" is really wrong anyway, because we do not
want to recurse into existing regions, just split _around_ them. So the
patch was meant as

Linus

2009-04-17 13:17:50

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Linus Torvalds <[email protected]> wrote:

> On Fri, 17 Apr 2009, Ingo Molnar wrote:
> >
> > Could we perhaps round up to 1MB in this case too?
>
> (The below 1MB one).
>
> I'd argue against it, at least in this incarnation. I can well
> imagine somebody wanting to do resource management in the 640k-1M
> window, so..

ok - indeed - if there's some super-small system with limited
address lines and all physical addresses tightly packed with RAM?

> > Would it make sense to round up everything that is listed in the
> > E820 map? Just in case the BIOS is not entirely honest about the
> > true extent of that area.
>
> Well, it would probably work, but on the other hand, when we see
> "E820_RAM", that means that we really _can_ trust that that E820
> entry is right, since we're going to use it as RAM (and Windows
> would too), and if it wasn't RAM, really bad things would happen.
>
> So E820_RAM is a _lot_ more trustworthy than the other cases. If
> we're rounding up by reasonably large amounts like 32MB or even
> more, I really think we should do so for the things we really know
> are there, and that we really fundamentally know come in big
> granularities.
>
> The other entries in the e820 map can reasonably be 4kB or
> something, because they are an IO-APIC or whatever. I can't say
> that I'd feel happy putting a guard area around something like
> that. But RAM? Sure, it can come in 384kB chunks (think RAM
> remapping for the low 1MB area), but it doesn't tend to happen
> when we're talking gigs any more.

One of my systems is a bit weird, with such a checkered RAM map:

BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB
[ hole ] 0.250 MB
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB
BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM
BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB
BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM
BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB
BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM
BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data) 0.07 MB
BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM
[ hole ] 1.0 MB
[ hole ] 3072.0 MB

On this map, using your scheme, we'd fill up that small 1MB hole up
to 1GB [mockup]:

BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer)

I guess that's a good thing not just for robustness: a chipset might
be faster when DMA or mmio is on some well-isolated physical memory
range, not too close to real RAM or other devices?

Bits of the low hole:

00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000c0000-000dffff : pnp 00:01
000e0000-000fffff : reserved
00100000-3ed93fff : System RAM

would still be available to dynamic PCI resources - as the 64K
rounding would leave it alone.

Ingo

2009-04-17 21:59:26

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

While testing Linus's patch, I found that some resource names get corrupted,
even on bare tip:

cat /proc/iomem
00000000-000973ff : System RAM
00097400-0009ffff : reserved
000a0000-000bffff : PCI Bus #00
000c0000-000cffff : pnp 00:0c
000e0000-000fffff : reserved
00100000-b7f9ffff : System RAM
00200000-00c67b4b : Kernel code
00c67b4c-01331edf : Kernel data
015a5000-01fc9657 : Kernel bss
20000000-23ffffff : GART
b7fae000-b7faffff : System RAM
b7fb0000-b7fbdfff : ACPI Tables
b7fbe000-b7feffff : ACPI Non-volatile Storage
b7ff0000-b7ffffff : reserved
b8000000-beffffff : PCI Bus #00
bf000000-bfffffff : PCI Bus #80
bfe80000-bfebffff : pnp 00:0e
bfef9000-bfef9fff :
bfef9000-bfef9fff : forcedeth
bfefa000-bfefa00f :
bfefa000-bfefa00f : forcedeth
bfefa400-bfefa4ff :
bfefa400-bfefa4ff : forcedeth
bfefa800-bfefa80f :
bfefa800-bfefa80f : forcedeth
bfefac00-bfefacff :
bfefac00-bfefacff : forcedeth
bfefb000-bfefbfff :
bfefb000-bfefbfff : forcedeth
bfefc000-bfefcfff :
bfefc000-bfefcfff : sata_nv
bfefd000-bfefdfff :
bfefd000-bfefdfff : sata_nv
bfefe000-bfefefff :
bfefe000-bfefefff : sata_nv
bfeff000-bfefffff : IOAPIC 1
bfeff000-bfefffff :
bff00000-bfffffff : PCI Bus 0000:83
bff80000-bffbffff : 0000:83:00.0
bfffec00-bfffecff : 0000:83:00.0
bfffec00-bfffecff : lpfc
bffff000-bfffffff : 0000:83:00.0
bffff000-bfffffff : lpfc
c0000000-dfffffff : PCI Bus #00
c0000000-d9ffffff : PCI Bus 0000:03
c0000000-cfffffff : 0000:03:00.0
d9800000-d9ffffff : 0000:03:00.0
da6b8c00-da6b8c0f : 0000:00:09.0
da6b8c00-da6b8c0f : forcedeth
da6b9000-da6b9fff : 0000:00:09.0
da6b9000-da6b9fff : forcedeth
da6ba000-da6bafff : 0000:00:08.0
da6ba000-da6bafff : forcedeth
da6bb000-da6bbfff : 0000:00:05.2
da6bb000-da6bbfff : sata_nv
da6bc000-da6bcfff : 0000:00:05.1
da6bc000-da6bcfff : sata_nv
da6bd000-da6bdfff : 0000:00:05.0
da6bd000-da6bdfff : sata_nv
da6be000-da6be0ff : 0000:00:09.0
da6be000-da6be0ff : forcedeth
da6be400-da6be40f : 0000:00:08.0
da6be400-da6be40f : forcedeth
da6be800-da6be8ff : 0000:00:08.0
da6be800-da6be8ff : forcedeth
da6bec00-da6becff : 0000:00:02.1
da6bec00-da6becff : ehci_hcd
da6bf000-da6bffff : 0000:00:02.0
da6bf000-da6bffff : ohci_hcd
da6c0000-da6fffff : pnp 00:05
da700000-daffffff : PCI Bus 0000:01
da7e0000-da7fffff : 0000:01:05.0
da800000-daffffff : 0000:01:05.0
db000000-dfbfffff : PCI Bus 0000:02
db000000-dbffffff : 0000:02:00.3
dc000000-dcffffff : 0000:02:00.2
dd000000-ddffffff : 0000:02:00.1
dd000000-ddffffff : niu
de000000-deffffff : 0000:02:00.0
de000000-deffffff : niu
df700000-df7fffff : 0000:02:00.3
df800000-df8fffff : 0000:02:00.2
df900000-df9fffff : 0000:02:00.1
dfa00000-dfafffff : 0000:02:00.0
dfbc0000-dfbc7fff : 0000:02:00.3
dfbc8000-dfbcffff : 0000:02:00.3
dfbd0000-dfbd7fff : 0000:02:00.2
dfbd8000-dfbdffff : 0000:02:00.2
dfbe0000-dfbe7fff : 0000:02:00.1
dfbe0000-dfbe7fff : niu
dfbe8000-dfbeffff : 0000:02:00.1
dfbe8000-dfbeffff : niu
dfbf0000-dfbf7fff : 0000:02:00.0
dfbf0000-dfbf7fff : niu
dfbf8000-dfbfffff : 0000:02:00.0
dfbf8000-dfbfffff : niu
dfc00000-dfcfffff : PCI Bus 0000:03
dfc00000-dfcfffff : 0000:03:00.0
dfd00000-dfffffff : PCI Bus 0000:04
dfd80000-dfdfffff : 0000:04:00.0
dfe00000-dfffffff : 0000:04:00.0
e0000000-efffffff : PCI MMCONFIG 0 [00-ff]
e0000000-efffffff : reserved
e0000000-efffffff : pnp 00:0b
f0000000-ffffffff : PCI Bus #00
fec00000-fec00fff : IOAPIC 0
fec00000-fec00fff : reserved
fec00000-fec00fff : pnp 00:0a
fed00000-fed003ff : HPET 0
fee00000-feefffff : reserved
fee00000-fee00fff : Local APIC
fee00000-fee00fff : pnp 00:0a
fee01000-feefffff : pnp 00:05
ff700000-ffffffff : reserved
100000000-2047ffffff : System RAM
2048000000-fcffffffff : PCI Bus #00

2009-04-17 22:08:21

by H. Peter Anvin

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> On Fri, 17 Apr 2009, Ingo Molnar wrote:
>>> Could we perhaps round up to 1MB in this case too?
>> (The below 1MB one).
>>
>> I'd argue against it, at least in this incarnation. I can well
>> imagine somebody wanting to do resource management in the 640k-1M
>> window, so..
>
> ok - indeed - if there's some super-small system with limited
> address lines and all physical addresses tightly packed with RAM?
>

No, much more likely that you're having PCI 2.x or PnP devices which
have 20-bit resources. It's probably worth noting that at least right
now, Linux mishandles 20-bit BARs and treats them like 32-bit BARs. It
turns out to actually work on a majority of the (quite few) known
devices which do have 20-bit BARs.

>
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB
> [ hole ] 0.250 MB
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB
> BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM
> BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB
> BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM
> BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB
> BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM
> BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data) 0.07 MB
> BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM
> [ hole ] 1.0 MB
> [ hole ] 3072.0 MB
>
> On this map, using your scheme, we'd fill up that small 1MB hole up
> to 1GB [mockup]:
>
> BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer)
>
> I guess that's a good thing not just for robustness: a chipset might
> be faster when DMA or mmio is on some well-isolated physical memory
> range, not too close to real RAM or other devices?
>

Realistically, there probably is RAM there, probably consumed by the SMM
T-seg.

-hpa

2009-04-18 05:40:30

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH] pci: keep pci device resource name pointer right.


Impact: fix bug

On one system, some pci_device entries in /proc/iomem are missing their names:

# cat /proc/iomem
00000000-000973ff : System RAM
00097400-0009ffff : reserved
000a0000-000bffff : PCI Bus #00
000c0000-000cffff : pnp 00:0c
000e0000-000fffff : reserved
00100000-b7f9ffff : System RAM
00200000-00c67b4b : Kernel code
00c67b4c-01331edf : Kernel data
015a5000-01fc9657 : Kernel bss
20000000-23ffffff : GART
b7fae000-b7faffff : System RAM
b7fb0000-b7fbdfff : ACPI Tables
b7fbe000-b7feffff : ACPI Non-volatile Storage
b7ff0000-b7ffffff : reserved
b8000000-beffffff : PCI Bus #00
bf000000-bfffffff : PCI Bus #80
bfe80000-bfebffff : pnp 00:0e
bfef9000-bfef9fff :
bfef9000-bfef9fff : forcedeth
bfefa000-bfefa00f :
bfefa000-bfefa00f : forcedeth
bfefa400-bfefa4ff :
bfefa400-bfefa4ff : forcedeth
bfefa800-bfefa80f :
bfefa800-bfefa80f : forcedeth
bfefac00-bfefacff :
bfefac00-bfefacff : forcedeth
bfefb000-bfefbfff :
bfefb000-bfefbfff : forcedeth
bfefc000-bfefcfff :
bfefc000-bfefcfff : sata_nv
bfefd000-bfefdfff :
bfefd000-bfefdfff : sata_nv
bfefe000-bfefefff :
bfefe000-bfefefff : sata_nv
bfeff000-bfefffff : IOAPIC 1
bfeff000-bfefffff :
...

It turns out that we need to re-fetch res->name, because the dev->dev.kobj name
is changed after device_add.

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/bus.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -70,6 +70,19 @@ pci_bus_alloc_resource(struct pci_bus *b
 	return ret;
 }

+static void pci_dev_update_res_name(struct pci_dev *dev)
+{
+	int idx;
+
+	/* after device_add will get new name, reget it */
+	for (idx = 0; idx <= PCI_ROM_RESOURCE; idx++) {
+		struct resource *res = &dev->resource[idx];
+
+		if (res->name)
+			res->name = pci_name(dev);
+	}
+}
+
 /**
  * pci_bus_add_device - add a single device
  * @dev: device to add
@@ -84,6 +97,7 @@ int pci_bus_add_device(struct pci_dev *d
 	if (retval)
 		return retval;

+	pci_dev_update_res_name(dev);
 	dev->is_added = 1;
 	pci_proc_attach_device(dev);
 	pci_create_sysfs_dev_files(dev);

2009-04-18 07:52:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] pci: keep pci device resource name pointer right.


* Yinghai Lu <[email protected]> wrote:

> Impact: fix bug

i think this needs to be marked Cc: <[email protected]> as well, for
2.6.29.x, maybe even 2.6.28.x ?

( Please note a small commit log detail: a few days ago we started
putting impact lines to the end of the commit as 'footers', in
square brackets - right before the signoff lines. We do this to
move them closer to other mechanical-looking tags and to not intrude
on the flow of the natural-language story line of the commit.

Also note that 'fix bug' is not a good impact line even if it was
a footer, because it does not really summarize the effects of a
patch specifically enough. A better variant would be:

[ Impact: fix corrupted names in /proc/iomem ]

I've inserted this impact line into your commit below, to show the
exact placement we started using. Note, this impact line would
also be a perfect summary line, if the 'pci: ' tag is added before
it:

pci: fix corrupted names in /proc/iomem

Jesse or Linus might opt to remove the impact line - it's a per
subsystem discretion thing. )

Ingo

> notice one system /proc/iomem some entries missed the name for pci_devices
>
> # cat /proc/iomem
> 00000000-000973ff : System RAM
> 00097400-0009ffff : reserved
> 000a0000-000bffff : PCI Bus #00
> 000c0000-000cffff : pnp 00:0c
> 000e0000-000fffff : reserved
> 00100000-b7f9ffff : System RAM
> 00200000-00c67b4b : Kernel code
> 00c67b4c-01331edf : Kernel data
> 015a5000-01fc9657 : Kernel bss
> 20000000-23ffffff : GART
> b7fae000-b7faffff : System RAM
> b7fb0000-b7fbdfff : ACPI Tables
> b7fbe000-b7feffff : ACPI Non-volatile Storage
> b7ff0000-b7ffffff : reserved
> b8000000-beffffff : PCI Bus #00
> bf000000-bfffffff : PCI Bus #80
> bfe80000-bfebffff : pnp 00:0e
> bfef9000-bfef9fff :
> bfef9000-bfef9fff : forcedeth
> bfefa000-bfefa00f :
> bfefa000-bfefa00f : forcedeth
> bfefa400-bfefa4ff :
> bfefa400-bfefa4ff : forcedeth
> bfefa800-bfefa80f :
> bfefa800-bfefa80f : forcedeth
> bfefac00-bfefacff :
> bfefac00-bfefacff : forcedeth
> bfefb000-bfefbfff :
> bfefb000-bfefbfff : forcedeth
> bfefc000-bfefcfff :
> bfefc000-bfefcfff : sata_nv
> bfefd000-bfefdfff :
> bfefd000-bfefdfff : sata_nv
> bfefe000-bfefefff :
> bfefe000-bfefefff : sata_nv
> bfeff000-bfefffff : IOAPIC 1
> bfeff000-bfefffff :
> ...
>
> it turns that we need to reget res->name because dev->dev.kobj name is changed
> after device_add.
>
> [ Impact: fix corrupted names in /proc/iomem ]
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> drivers/pci/bus.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> Index: linux-2.6/drivers/pci/bus.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/bus.c
> +++ linux-2.6/drivers/pci/bus.c
> @@ -70,6 +70,19 @@ pci_bus_alloc_resource(struct pci_bus *b
> return ret;
> }
>
> +static void pci_dev_update_res_name(struct pci_dev *dev)
> +{
> + int idx;
> +
> + /* after device_add will get new name, reget it */
> + for (idx = 0; idx <= PCI_ROM_RESOURCE; idx++) {
> + struct resource *res = &dev->resource[idx];
> +
> + if (res->name)
> + res->name = pci_name(dev);
> + }
> +}
> +
> /**
> * pci_bus_add_device - add a single device
> * @dev: device to add
> @@ -84,6 +97,7 @@ int pci_bus_add_device(struct pci_dev *d
> if (retval)
> return retval;
>
> + pci_dev_update_res_name(dev);
> dev->is_added = 1;
> pci_proc_attach_device(dev);
> pci_create_sysfs_dev_files(dev);

2009-04-18 08:36:04

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Linus Torvalds wrote:
>
> On Thu, 16 Apr 2009, Yinghai Lu wrote:
>> please check.
>>
>> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
>
> I like the approach. That said, I think that rather than do the "modify
> the e820 array" thing, why not just do it in the in the resource tree, and
> do it at "e820_reserve_resources_late()" time?
>
> IOW, something like this.
>
> TOTALLY UNTESTED! The point is to take all RAM resources we haev, and
> _after_ we've added all the resources we've seen in the E820 tree, we then
> _also_ try to add fake reserved entries for any "round up to X" at the end
> of the RAM resources.
>
> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't
> want to recurse into any conflicting resources, I really wanted to just do
> the other failure cases.
>
> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind
> of thing. That includes the rough draft of how much to round things up to
> based on where the end of RAM region is etc. I'm really throwing this out
> more as a "wouldn't this be a readable way to handle any missing reserved
> entries" kind of thing..
>
> Linus
>
> ---
> arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++
> 1 files changed, 34 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index ef2c356..e8b8d33 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
> }
> }
>
> +/* How much should we pad RAM ending depending on where it is? */
> +static unsigned long ram_alignment(resource_size_t pos)
> +{
> + unsigned long mb = pos >> 20;
> +
> + /* To 64kB in the first megabyte */
> + if (!mb)
> + return 64*1024;
> +
> + /* To 1MB in the first 16MB */
> + if (mb < 16)
> + return 1024*1024;
> +
> + /* To 32MB for anything above that */
> + return 32*1024*1024;
> +}
> +
> void __init e820_reserve_resources_late(void)
> {
> int i;
> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
> insert_resource_expand_to_fit(&iomem_resource, res);
> res++;
> }
> +
> + /*
> + * Try to bump up RAM regions to reasonable boundaries to
> + * avoid stolen RAM
> + */
> + for (i = 0; i < e820.nr_map; i++) {
> + struct e820entry *entry = &e820_saved.map[i];
> + resource_size_t start, end;
> +
> + if (entry->type != E820_RAM)
> + continue;
> + start = entry->addr + entry->size;
> + end = round_up(start, ram_alignment(start));
> + if (start == end)
> + continue;
> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> + }
> }
>
> char *__init default_machine_specific_memory_setup(void)

Except that we need to change
> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");

It will make sure the dynamic allocation code will not use those ranges.
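
The tiered rounding and the "end - 1" correction can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: ram_alignment() copies the thresholds from the patch above, while round_up_pow2() and ram_buffer_range() are hypothetical helpers introduced here, and the inclusive-end convention is assumed to match what reserve_region_with_split() expects.

```c
#include <assert.h>
#include <stdint.h>

/* Padding granularity after the end of a RAM region (thresholds from the patch). */
static uint64_t ram_alignment(uint64_t pos)
{
	uint64_t mb = pos >> 20;

	if (!mb)
		return 64 * 1024;          /* first megabyte: 64kB */
	if (mb < 16)
		return 1024 * 1024;        /* first 16MB: 1MB */
	return 32 * 1024 * 1024;           /* above that: 32MB */
}

/* Hypothetical helper: round x up to a power-of-two alignment. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: compute the inclusive [*first, *last] range of the
 * "RAM buffer" that follows a RAM region ending (exclusively) at ram_end.
 * Returns 0 when ram_end is already aligned and no buffer is needed.
 */
static int ram_buffer_range(uint64_t ram_end, uint64_t *first, uint64_t *last)
{
	uint64_t end = round_up_pow2(ram_end, ram_alignment(ram_end));

	if (end == ram_end)
		return 0;
	*first = ram_end;
	*last = end - 1;   /* resource ends are inclusive, hence the -1 */
	return 1;
}
```

With the e820 map quoted earlier in the thread, RAM ends at 0x3ff00000, so the buffer would cover 0x3ff00000 through 0x3fffffff; without the -1 it would also claim the first byte at 0x40000000.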

This could also make e820_setup_gap much simpler:

---
arch/x86/kernel/e820.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
 #endif

 	/*
-	 * See how much we want to round up: start off with
-	 * rounding to the next 1MB area.
+	 * e820_reserve_resources_late will protect stolen RAM
+	 * so just round it to 1M
 	 */
 	round = 0x100000;
-	while ((gapsize >> 4) > round)
-		round += round;
-	/* Fun with two's complement */
-	pci_mem_start = (gapstart + round) & -round;
+
+	pci_mem_start = roundup(gapstart, round);

 	printk(KERN_INFO
 		"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",

Ingo, can you put those two patches in tip?

YH

2009-04-18 09:23:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Yinghai Lu <[email protected]> wrote:

> Linus Torvalds wrote:
> >
> > On Thu, 16 Apr 2009, Yinghai Lu wrote:
> >> please check.
> >>
> >> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
> >
> > I like the approach. That said, I think that rather than do the "modify
> > the e820 array" thing, why not just do it in the in the resource tree, and
> > do it at "e820_reserve_resources_late()" time?
> >
> > IOW, something like this.
> >
> > TOTALLY UNTESTED! The point is to take all RAM resources we haev, and
> > _after_ we've added all the resources we've seen in the E820 tree, we then
> > _also_ try to add fake reserved entries for any "round up to X" at the end
> > of the RAM resources.
> >
> > NOTE! I really didn't want to use "reserve_region_with_split()". I didn't
> > want to recurse into any conflicting resources, I really wanted to just do
> > the other failure cases.
> >
> > THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind
> > of thing. That includes the rough draft of how much to round things up to
> > based on where the end of RAM region is etc. I'm really throwing this out
> > more as a "wouldn't this be a readable way to handle any missing reserved
> > entries" kind of thing..
> >
> > Linus
> >
> > ---
> > arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++
> > 1 files changed, 34 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index ef2c356..e8b8d33 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
> > }
> > }
> >
> > +/* How much should we pad RAM ending depending on where it is? */
> > +static unsigned long ram_alignment(resource_size_t pos)
> > +{
> > + unsigned long mb = pos >> 20;
> > +
> > + /* To 64kB in the first megabyte */
> > + if (!mb)
> > + return 64*1024;
> > +
> > + /* To 1MB in the first 16MB */
> > + if (mb < 16)
> > + return 1024*1024;
> > +
> > + /* To 32MB for anything above that */
> > + return 32*1024*1024;
> > +}
> > +
> > void __init e820_reserve_resources_late(void)
> > {
> > int i;
> > @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
> > insert_resource_expand_to_fit(&iomem_resource, res);
> > res++;
> > }
> > +
> > + /*
> > + * Try to bump up RAM regions to reasonable boundaries to
> > + * avoid stolen RAM
> > + */
> > + for (i = 0; i < e820.nr_map; i++) {
> > + struct e820entry *entry = &e820_saved.map[i];
> > + resource_size_t start, end;
> > +
> > + if (entry->type != E820_RAM)
> > + continue;
> > + start = entry->addr + entry->size;
> > + end = round_up(start, ram_alignment(start));
> > + if (start == end)
> > + continue;
> > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> > + }
> > }
> >
> > char *__init default_machine_specific_memory_setup(void)
>
> except need to change
> > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> ==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
>
> it will make sure dynmical allocating code will not use those range.
>
> and could make e820_setup_gap much simple.
>
> ---
> arch/x86/kernel/e820.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
> #endif
>
> /*
> - * See how much we want to round up: start off with
> - * rounding to the next 1MB area.
> + * e820_reserve_resources_late will protect stolen RAM
> + * so just round it to 1M
> */
> round = 0x100000;
> - while ((gapsize >> 4) > round)
> - round += round;
> - /* Fun with two's complement */
> - pci_mem_start = (gapstart + round) & -round;
> +
> + pci_mem_start = roundup(gapstart, round);
>
> printk(KERN_INFO
> "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
>
> Ingo, can you put those two patches in tip?

I think the point would be to explore the possibility to have no
'gap' logic at all - we should extend the e820 table with Linus's
scheme to add 'RAM buffer' entries.

That way, if you search for a sufficient size hole, it will be
correctly aligned straight away - with no rounding/gap logic at all.
(Maybe add a debug warning to catch when this is not the case.)

Am i missing something?

Ingo

2009-04-18 16:05:36

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

On Sat, 18 Apr 2009 09:51:30 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Yinghai Lu <[email protected]> wrote:
>
> > Impact: fix bug
>
> i think this needs to be marked Cc: <[email protected]> as well, for
> 2.6.29.x, maybe even 2.6.28.x ?
>
> ( Please note a small commit log detail: a few days go we started
> putting impact lines to the end of the commit as 'footers', in
> square brackets - right before the signoff lines. We do this to
> move them closer to other mechanic-looking tags and to not intrude
> the flow of the natural-language story line of the commit.
>
> Also note that 'fix bug' is not a good impact line even if it was
> a footer, because it does not really summarize the effects of a
> patch specifically enough. A better variant would be:
>
> [ Impact: fix corrupted names in /proc/iomem ]
>
> I've inserted this impact line into your commit below, to show the
> exact placement we started using. Note, this impact line would
> also be a perfect summary line, if the 'pci: ' tag is added before
> it:
>
> pci: fix corrupted names in /proc/iomem
>
> Jesse or Linus might opt to remove the impact line - it's a per
> subsystem discretion thing. )

Yeah I noticed the x86 patches seem to have those "Impact" lines these
days, but I couldn't figure out what they meant. Sometimes they
indicate the symptom being addressed, other times they act as a sort of
summary subject. What's the intention? Is it really "user
visible impact"? Or something else? Patch subjects generally suffer
from similar ambiguity (sometimes describing what the patch is doing to
the code, other times what issue the patch is addressing), so it would
be nice if "Impact" was something separate and well defined.

--
Jesse Barnes, Intel Open Source Technology Center

2009-04-18 17:09:38

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ingo Molnar wrote:
> * Yinghai Lu <[email protected]> wrote:
>
>> Linus Torvalds wrote:
>>> On Thu, 16 Apr 2009, Yinghai Lu wrote:
>>>> please check.
>>>>
>>>> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
>>> I like the approach. That said, I think that rather than do the "modify
>>> the e820 array" thing, why not just do it in the in the resource tree, and
>>> do it at "e820_reserve_resources_late()" time?
>>>
>>> IOW, something like this.
>>>
>>> TOTALLY UNTESTED! The point is to take all RAM resources we haev, and
>>> _after_ we've added all the resources we've seen in the E820 tree, we then
>>> _also_ try to add fake reserved entries for any "round up to X" at the end
>>> of the RAM resources.
>>>
>>> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't
>>> want to recurse into any conflicting resources, I really wanted to just do
>>> the other failure cases.
>>>
>>> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind
>>> of thing. That includes the rough draft of how much to round things up to
>>> based on where the end of RAM region is etc. I'm really throwing this out
>>> more as a "wouldn't this be a readable way to handle any missing reserved
>>> entries" kind of thing..
>>>
>>> Linus
>>>
>>> ---
>>> arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++
>>> 1 files changed, 34 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index ef2c356..e8b8d33 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
>>> }
>>> }
>>>
>>> +/* How much should we pad RAM ending depending on where it is? */
>>> +static unsigned long ram_alignment(resource_size_t pos)
>>> +{
>>> + unsigned long mb = pos >> 20;
>>> +
>>> + /* To 64kB in the first megabyte */
>>> + if (!mb)
>>> + return 64*1024;
>>> +
>>> + /* To 1MB in the first 16MB */
>>> + if (mb < 16)
>>> + return 1024*1024;
>>> +
>>> + /* To 32MB for anything above that */
>>> + return 32*1024*1024;
>>> +}
>>> +
>>> void __init e820_reserve_resources_late(void)
>>> {
>>> int i;
>>> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
>>> insert_resource_expand_to_fit(&iomem_resource, res);
>>> res++;
>>> }
>>> +
>>> + /*
>>> + * Try to bump up RAM regions to reasonable boundaries to
>>> + * avoid stolen RAM
>>> + */
>>> + for (i = 0; i < e820.nr_map; i++) {
>>> + struct e820entry *entry = &e820_saved.map[i];
>>> + resource_size_t start, end;
>>> +
>>> + if (entry->type != E820_RAM)
>>> + continue;
>>> + start = entry->addr + entry->size;
>>> + end = round_up(start, ram_alignment(start));
>>> + if (start == end)
>>> + continue;
>>> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>>> + }
>>> }
>>>
>>> char *__init default_machine_specific_memory_setup(void)
>> except need to change
>>> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>> ==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
>>
>> it will make sure dynmical allocating code will not use those range.
>>
>> and could make e820_setup_gap much simple.
>>
>> ---
>> arch/x86/kernel/e820.c | 10 ++++------
>> 1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> Index: linux-2.6/arch/x86/kernel/e820.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/e820.c
>> +++ linux-2.6/arch/x86/kernel/e820.c
>> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
>> #endif
>>
>> /*
>> - * See how much we want to round up: start off with
>> - * rounding to the next 1MB area.
>> + * e820_reserve_resources_late will protect stolen RAM
>> + * so just round it to 1M
>> */
>> round = 0x100000;
>> - while ((gapsize >> 4) > round)
>> - round += round;
>> - /* Fun with two's complement */
>> - pci_mem_start = (gapstart + round) & -round;
>> +
>> + pci_mem_start = roundup(gapstart, round);
>>
>> printk(KERN_INFO
>> "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
>>
>> Ingo, can you put those two patches in tip?
>
> I think the point would be to explore the possibility to have no
> 'gap' logic at all - we should extend the e820 table with Linus's
> scheme to add 'RAM buffer' entries.
>
So you prefer the old one, aka the -v4, and adding a new entry type for the RAM buffer?

YH

2009-04-18 18:49:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] pci: keep pci device resource name pointer right.



On Fri, 17 Apr 2009, Yinghai Lu wrote:
>
> it turns that we need to reget res->name because dev->dev.kobj name is changed
> after device_add.

Can we not make the rule be that the name should just be set before?

IOW, there is something else broken, I think. Rather than put this ugly
band-aid, why not make sure that whoever had that broken name fixes it?

Linus

2009-04-18 18:59:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Yinghai Lu wrote:
>
> except need to change
> > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> ==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");

Yes, I sent out a later email pointing that out.

> it will make sure dynmical allocating code will not use those range.
>
> and could make e820_setup_gap much simple.

ACK. In fact:

> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
> #endif
>
> /*
> - * See how much we want to round up: start off with
> - * rounding to the next 1MB area.
> + * e820_reserve_resources_late will protect stolen RAM
> + * so just round it to 1M
> */
> round = 0x100000;
> - while ((gapsize >> 4) > round)
> - round += round;
> - /* Fun with two's complement */
> - pci_mem_start = (gapstart + round) & -round;
> +
> + pci_mem_start = roundup(gapstart, round);

You can just remove "round" entirely. It's no longer a variable, it's just
an odd way of saying 1M ;)
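
To see what the simplification changes, the two roundings can be compared in isolation. This is a userspace sketch under stated assumptions: the function names are invented for the example, 64-bit arithmetic replaces the kernel's unsigned long, and the second function open-codes what roundup() is understood to compute.

```c
#include <stdint.h>

/* Old e820_setup_gap() rounding: the granularity grows with the gap size. */
static uint64_t pci_mem_start_old(uint64_t gapstart, uint64_t gapsize)
{
	uint64_t round = 0x100000;

	while ((gapsize >> 4) > round)
		round += round;
	/* Always advances past gapstart, even if it is already aligned. */
	return (gapstart + round) & -round;
}

/* New rounding from the patch: plain round-up to the next 1MB boundary. */
static uint64_t pci_mem_start_new(uint64_t gapstart)
{
	const uint64_t mb = 0x100000;

	return (gapstart + mb - 1) / mb * mb;
}
```

Taking a gap that starts at 0x3ff00000 and is assumed to run up to 4GB, the old code yields 0x40000000 while the new code yields 0x3ff00000. The new value is only safe because the "RAM buffer" reservation keeps allocations out of 0x3ff00000-0x3fffffff.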

> Ingo, can you put those two patches in tip?

I would suggest that we first change "reserve_region_with_split()" to not
recurse into the region.

That function isn't used by anything else (we ended up using
"expand_to_fit()" instead in the one place that might have used it), and
now the one caller we do have would not want the recursion - if there
already exists a resource at the top level, we want to just avoid it.

This - again TOTALLY UNTESTED - patch removes the "recurse into conflicts"
code. Comments? Testing?

Linus
---
kernel/resource.c | 46 ++++++++++++----------------------------------
1 files changed, 12 insertions(+), 34 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index fd5d7d5..ac5f3a3 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -533,43 +533,21 @@ static void __init __reserve_region_with_split(struct resource *root,
 	res->end = end;
 	res->flags = IORESOURCE_BUSY;

-	for (;;) {
-		conflict = __request_resource(parent, res);
-		if (!conflict)
-			break;
-		if (conflict != parent) {
-			parent = conflict;
-			if (!(conflict->flags & IORESOURCE_BUSY))
-				continue;
-		}
-
-		/* Uhhuh, that didn't work out.. */
-		kfree(res);
-		res = NULL;
-		break;
-	}
-
-	if (!res) {
-		/* failed, split and try again */
-
-		/* conflict covered whole area */
-		if (conflict->start <= start && conflict->end >= end)
-			return;
+	conflict = __request_resource(parent, res);
+	if (!conflict)
+		return;

-		if (conflict->start > start)
-			__reserve_region_with_split(root, start, conflict->start-1, name);
-		if (!(conflict->flags & IORESOURCE_BUSY)) {
-			resource_size_t common_start, common_end;
+	/* failed, split and try again */
+	kfree(res);

-			common_start = max(conflict->start, start);
-			common_end = min(conflict->end, end);
-			if (common_start < common_end)
-				__reserve_region_with_split(root, common_start, common_end, name);
-		}
-		if (conflict->end < end)
-			__reserve_region_with_split(root, conflict->end+1, end, name);
-	}
+	/* conflict covered whole area */
+	if (conflict->start <= start && conflict->end >= end)
+		return;

+	if (conflict->start > start)
+		__reserve_region_with_split(root, start, conflict->start-1, name);
+	if (conflict->end < end)
+		__reserve_region_with_split(root, conflict->end+1, end, name);
 }

 void __init reserve_region_with_split(struct resource *root,
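
The non-recursing behaviour can be modelled with a flat table instead of the kernel's resource tree. The sketch below is only an approximation under that assumption: struct range, res_table, find_conflict(), insert_range() and reserve_with_split() are names invented here, and a fixed-size array stands in for the top level of iomem_resource.

```c
#include <stdint.h>

#define MAX_RES 16

/* One top-level resource; start and end are both inclusive. */
struct range {
	uint64_t start, end;
	const char *name;
};

static struct range res_table[MAX_RES];
static int res_count;

/* Return any existing entry that overlaps [start, end], or 0 if none. */
static struct range *find_conflict(uint64_t start, uint64_t end)
{
	for (int i = 0; i < res_count; i++)
		if (res_table[i].start <= end && res_table[i].end >= start)
			return &res_table[i];
	return 0;
}

static int insert_range(uint64_t start, uint64_t end, const char *name)
{
	if (res_count == MAX_RES)
		return -1;
	res_table[res_count++] = (struct range){ start, end, name };
	return 0;
}

/*
 * Reserve whatever part of [start, end] is still free. On a conflict the
 * request is split around the conflicting entry; nothing is ever placed
 * inside an existing entry.
 */
static void reserve_with_split(uint64_t start, uint64_t end, const char *name)
{
	struct range *conflict = find_conflict(start, end);
	uint64_t cs, ce;

	if (!conflict) {
		insert_range(start, end, name);
		return;
	}
	cs = conflict->start;
	ce = conflict->end;

	/* conflict covered whole area */
	if (cs <= start && ce >= end)
		return;
	if (cs > start)
		reserve_with_split(start, cs - 1, name);
	if (ce < end)
		reserve_with_split(ce + 1, end, name);
}
```

If a made-up entry occupies 0x3ff40000-0x3ff7ffff, reserving 0x3ff00000-0x3fffffff as "RAM buffer" leaves that entry alone and adds one piece on each side of it.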

2009-04-18 19:04:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Ingo Molnar wrote:
>
> Am i missing something?

We also try to avoid random motherboard resources etc that aren't reserved
or documented by the BIOS. It's better to go into big holes. It's also
better to try to keep as close to the old (tested) behavior.

Now, admittedly those undocumented resources are _much_ more common in IO
space, but still. They're _very_ common. On my modern Nehalem thing with
an Intel BIOS (supposedly "good" and not from some random manufacturer), I
have, for example:

[ 26.533771] pci 0000:00:1f.0: ICH7 LPC Generic IO decode 2 PIO at 0810 (mask 007f)

but that one isn't covered by any PnP range or anything else.

[ Now, it's possible that it's bogus: "0x810" has a bit set in the same
bits that cover the mask, and I don't know if the mask is a "ignore
these bits" (and the range would thus match all of 0x0800-0x087f) or if
the mask is a "port & ~mask == base" in which case nothing would ever
match.

But I _think_ the BIOS literally set up something to answer in the
0x08?? range, and didn't document it anywhere. The same can be true of
MMIO too, and so we should try to avoid using random memory areas if we
can ]
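
The two readings of the mask can be compared directly. This is a small sketch of the arithmetic only; it makes no claim about how the ICH7 actually decodes, and the function names are invented for the example.

```c
#include <stdint.h>

/* Reading 1: mask bits are "don't care"; compare only the remaining bits. */
static int decode_ignore_mask(uint16_t port, uint16_t base, uint16_t mask)
{
	return (port & ~mask) == (base & ~mask);
}

/* Reading 2: port & ~mask must equal the programmed base exactly. */
static int decode_strict(uint16_t port, uint16_t base, uint16_t mask)
{
	return (port & ~mask) == base;
}

/* Count how many of the 65536 I/O ports a given reading would claim. */
static unsigned count_ports(int (*decode)(uint16_t, uint16_t, uint16_t),
			    uint16_t base, uint16_t mask)
{
	unsigned n = 0;

	for (uint32_t port = 0; port <= 0xffff; port++)
		n += decode((uint16_t)port, base, mask) ? 1 : 0;
	return n;
}
```

For base 0x0810 and mask 0x007f, the first reading claims the 128 ports 0x0800-0x087f, while the second claims none, because bit 4 of the base lies inside the mask.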

Linus

2009-04-18 19:15:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Linus Torvalds <[email protected]> wrote:

> On Sat, 18 Apr 2009, Ingo Molnar wrote:
> >
> > Am i missing something?
>
> We also try to avoid random motherboard resources etc that aren't
> reserved or documented by the BIOS. It's better to go into big
> holes. It's also better to try to keep as close to the old
> (tested) behavior.

Yeah - i'm not suggesting any change in behavior, nor am i
suggesting any risky behavior. The current code seems to work quite
well.

I'm just suggesting (maybe foolishly) that instead of having any
gap-rounding logic at all, add artificial entries to the e820 map to
'extend' and round up any odd ending entries.

I.e. explicitly manage all the 'hole' space to be nicely rounded and
to be far away from any T-Seg or other sekrit motherboard resource
danger area.

We'd do this after PCI static allocations (so we dont ever stomp on
real, known resources) but before PCI dynamic allocations.

The e820 printout would look literally like this:

BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB
[ hole ] 0.250 MB
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB
BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM
BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB
BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM
BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB
BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM
BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data) 0.07 MB
BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM
BIOS-e820: 000000003ff00000 - 0000000040000000 (guard) 1.0 MB
[ hole ] 3072.0 MB

The '(guard)' entry at the end i added above.

This way we intentionally create a 'free physical address space'
hole space that is the same as the rounding logic. No rounding
needed anywhere - as all the remaining address space is well-rounded
already. Plus we'd also _see_ all our rounding logic by looking at
the '(guard)' entries.
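
A rough sketch of how such guard entries could be derived from a sorted map follows. It is a userspace illustration under assumptions, not a proposal for e820.c: struct map_entry, the MAP_* types and build_guards() are hypothetical, and a single power-of-two alignment is used instead of any tiered scheme.

```c
#include <stdint.h>

enum { MAP_RAM = 1, MAP_OTHER = 2 };  /* simplified entry types (assumption) */

struct map_entry {
	uint64_t addr;
	uint64_t size;
	int type;
};

/*
 * For every RAM entry in a map sorted by address, emit a guard entry that
 * pads the end of RAM up to the next multiple of align (a power of two),
 * clipped so that it never overlaps the following map entry. Returns the
 * number of guards written to out (at most max).
 */
static int build_guards(const struct map_entry *map, int nr, uint64_t align,
			struct map_entry *out, int max)
{
	int n = 0;

	for (int i = 0; i < nr && n < max; i++) {
		uint64_t end, limit;

		if (map[i].type != MAP_RAM)
			continue;
		end = map[i].addr + map[i].size;
		limit = (end + align - 1) & ~(align - 1);
		if (i + 1 < nr && map[i + 1].addr < limit)
			limit = map[i + 1].addr;
		if (limit > end) {
			out[n].addr = end;
			out[n].size = limit - end;
			out[n].type = MAP_OTHER;
			n++;
		}
	}
	return n;
}
```

With a 32MB alignment, RAM regions that are directly followed by another entry get no guard, and the last usable region ending at 0x3ff00000 gets the single 1MB guard shown in the printout.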

Or maybe there's some aspect of gap-rounding that cannot be
expressed in such a static way?

Ingo

2009-04-18 19:22:07

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

Linus Torvalds wrote:
>
> On Fri, 17 Apr 2009, Yinghai Lu wrote:
>> it turns that we need to reget res->name because dev->dev.kobj name is changed
>> after device_add.
>
> Can we not make the rule be that the name should just be set before?
>
> IOW, there is something else broken, I think. Rather than put this ugly
> band-aid, why now make sure that whoever had that broken name fixes it?
>

The driver core guys are enforcing the use of dev_name(device) instead of
referring to the name directly.

For the pci code:

acpi_pci_root_driver.ops.add (aka acpi_pci_root_add) ==> pci_acpi_scan_root
scans the pci buses/devices, and at the same time we read the resources for
each pci_dev. At this point we still need to use bus->devices to go over all
pci_devices if needed.
In pci_read_bases we have res->name = pci_name(pci_dev); pci_name calls
dev_name.

Later, acpi_pci_root_driver.ops.start (aka acpi_pci_root_start) ==>
pci_bus_add_device adds all pci_devs to the kobj tree.

pci_bus_add_device calls device_add.

In device_add,

	/* first, register with generic layer. */
	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
	if (error)
		goto Error;

allocates a new name for that kobj, and the old name is freed.

I will try to make kobject_add smarter so that it reuses the old one.
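
The sequence above amounts to a cached pointer outliving the string it points to. The following userspace model shows that pattern and the re-fetch from the patch; it is a sketch only, and struct fake_dev, struct fake_res and the fake_* functions are invented stand-ins, not the driver core API.

```c
#include <stdlib.h>
#include <string.h>

struct fake_dev {
	char *name;           /* owned by the device, like the kobject name */
};

struct fake_res {
	const char *name;     /* borrowed pointer, like res->name */
};

static const char *fake_dev_name(const struct fake_dev *dev)
{
	return dev->name;
}

/* Model of a rename: allocate a new string, then free the old one. */
static int fake_dev_set_name(struct fake_dev *dev, const char *name)
{
	char *copy = malloc(strlen(name) + 1);

	if (!copy)
		return -1;
	strcpy(copy, name);
	free(dev->name);      /* any cached pointer to the old name is now stale */
	dev->name = copy;
	return 0;
}

/* Same idea as pci_dev_update_res_name(): re-fetch the name after a rename. */
static void fake_update_res_name(struct fake_res *res,
				 const struct fake_dev *dev)
{
	if (res->name)
		res->name = fake_dev_name(dev);
}
```

In this model, anything that caches fake_dev_name() before fake_dev_set_name() runs again has to call fake_update_res_name() afterwards, or it keeps pointing at freed memory.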

YH

2009-04-18 19:26:15

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

On Sat, Apr 18, 2009 at 12:19:21PM -0700, Yinghai Lu wrote:
> Linus Torvalds wrote:
> >
> > On Fri, 17 Apr 2009, Yinghai Lu wrote:
> >> it turns that we need to reget res->name because dev->dev.kobj name is changed
> >> after device_add.
> >
> > Can we not make the rule be that the name should just be set before?
> >
> > IOW, there is something else broken, I think. Rather than put this ugly
> > band-aid, why now make sure that whoever had that broken name fixes it?
> >
>
> driver core guys are enforcing to use dev_name(device) instead of referring it.

By "enforcing" you now mean, "there is no other way" :)

> for pci code:
>
> via acpi_pci_root_driver.ops.add (aka acpi_pci_root_add) ==>
> pci_acpi_scan_root is used to scan pci bus/device, and at the same we
> read the resource for pci_dev at this point still need to use
> bus->devices to go over all pci_devices if needed.
> in the pci_read_bases, we have res->name = pci_name(pci_dev); pci_name
> is calling dev_name.
>
> later via acpi_pci_root_driver.ops.start (aka acpi_pci_root_start) ==>
> pci_bus_add_device to add all pci_dev in kobj tree.
>
> pci_bus_add_device will call device_add.
>
> actually in device_add
>
> /* first, register with generic layer. */
> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
> if (error)
> goto Error;
>
> will get one new name for that kobj, old name is freed.
>
> Will try to make kobject_add more smart to reuse the old one.

I don't understand the problem here, how are you going to change the
kobject core? Is this just because you aren't getting a name for the
resource? If so, why would the driver core care about this?

confused,

greg k-h

2009-04-18 19:29:27

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> On Sat, 18 Apr 2009, Ingo Molnar wrote:
>>> Am i missing something?
>> We also try to avoid random motherboard resources etc that aren't
>> reserved or documented by the BIOS. It's better to go into big
>> holes. It's also better to try to keep as close to the old
>> (tested) behavior.
>
> Yeah - i'm not suggesting any change in behavior, nor am i
> suggesting any risky behavior. The current code seems to work quite
> well.
>
> I'm just suggesting (maybe foolishly) that instead of having any
> gap-rounding logic at all, add artificial entries to the e820 map to
> 'extend' and round up any odd ending entries.
>
> I.e. explicitly manage all the 'hole' space to be nicely rounded and
> to be far away from any T-Seg or other sekrit motherboard resource
> danger area.
>
> We'd do this after PCI static allocations (so we dont ever stomp on
> real, known resources) but before PCI dynamic allocations.
>
> The e820 printout would look literally like this:
>
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB
> [ hole ] 0.250 MB
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB
> BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM
> BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB
> BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM
> BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB
> BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM
> BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data) 0.07 MB
> BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM
> BIOS-e820: 000000003ff00000 - 0000000040000000 (guard) 1.0 MB
> [ hole ] 3072.0 MB
>
> The '(guard)' entry at the end i added above.
>
> This way we intentionally create a 'free physical address space'
> hole space that is the same as the rounding logic. No rounding
> needed anywhere - as all the remaining address space is well-rounded
> already. Plus we'd also _see_ all our rounding logic by looking at
> the '(guard)' entries.
>
> Or maybe there's some aspect of gap-rounding that cannot be
> expressed in such a static way?
>

Please check the following patch.

From: Linus Torvalds <[email protected]>

[PATCH] x86: reserve range near the ram -v2

Some BIOSes use RAM near the end of a RAM range without declaring it, so
try to reserve those ranges as RAM buffer.

v2: do it early in the e820 table instead of in the resource tree.

[Impact: protect stolen RAM]

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/e820.h | 2 +
arch/x86/kernel/e820.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/setup.c | 6 +++++
3 files changed, 60 insertions(+)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -150,6 +150,9 @@ static void __init e820_print_type(u32 t
 	case E820_UNUSABLE:
 		printk(KERN_CONT "(unusable)");
 		break;
+	case E820_RAM_BUFFER:
+		printk(KERN_CONT "(RAM buffer)");
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -1314,6 +1317,54 @@ void __init finish_e820_parsing(void)
 	}
 }

+/* How much should we pad RAM ending depending on where it is? */
+static unsigned long __init ram_alignment(resource_size_t pos)
+{
+	unsigned long mb = pos >> 20;
+
+	/* To 64kB in the first megabyte */
+	if (!mb)
+		return 64*1024;
+
+	/* To 1MB in the first 16MB */
+	if (mb < 16)
+		return 1024*1024;
+
+	/* To 32MB for anything above that */
+	return 32*1024*1024;
+}
+
+void __init e820_reserve_stolen_ram(void)
+{
+	int i;
+	int changed = 0;
+
+	/*
+	 * Try to bump up RAM regions to reasonable boundaries to
+	 * avoid stolen RAM
+	 */
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *entry = &e820_saved.map[i];
+		resource_size_t start, end;
+
+		if (entry->type != E820_RAM)
+			continue;
+		start = entry->addr + entry->size;
+		end = round_up(start, ram_alignment(start));
+		if (start == end)
+			continue;
+		e820_add_region(start, end - start, E820_RAM_BUFFER);
+		changed = 1;
+	}
+
+	if (!changed)
+		return;
+
+	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
+	printk(KERN_INFO "fixed physical RAM map:\n");
+	e820_print_map("reserve_stolen_range");
+}
+
 static inline const char *e820_type_to_string(int e820_type)
 {
 	switch (e820_type) {
@@ -1322,6 +1373,7 @@ static inline const char *e820_type_to_s
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_RAM_BUFFER:	return "RAM Buffer";
 	default:	return "reserved";
 	}
 }
Index: linux-2.6/arch/x86/include/asm/e820.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/e820.h
+++ linux-2.6/arch/x86/include/asm/e820.h
@@ -44,6 +44,7 @@
 #define E820_ACPI	3
 #define E820_NVS	4
 #define E820_UNUSABLE	5
+#define E820_RAM_BUFFER	6

 /* reserved RAM used by kernel itself */
 #define E820_RESERVED_KERN	128
@@ -78,6 +79,7 @@ extern u64 e820_update_range(u64 start,
 extern u64 e820_remove_range(u64 start, u64 size, unsigned old_type,
 			     int checktype);
 extern void update_e820(void);
+extern void e820_reserve_stolen_ram(void);
 extern void e820_setup_gap(void);
 extern int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,
 			unsigned long start_addr, unsigned long long end_addr);
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -812,6 +812,12 @@ void __init setup_arch(char **cmdline_p)
 	insert_resource(&iomem_resource, &data_resource);
 	insert_resource(&iomem_resource, &bss_resource);

+	/*
+	 * Some systems use the end of RAM for ACPI or video RAM
+	 * but don't mark it as reserved in e820.
+	 * Try to round up the end of RAM and reserve the padding.
+	 */
+	e820_reserve_stolen_ram();

 #ifdef CONFIG_X86_32
 	if (ppro_with_ram_bug()) {

2009-04-18 20:01:00

by Kay Sievers

Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

On Sat, Apr 18, 2009 at 21:23, Greg KH <[email protected]> wrote:
> On Sat, Apr 18, 2009 at 12:19:21PM -0700, Yinghai Lu wrote:
>> Linus Torvalds wrote:

>> > Can we not make the rule be that the name should just be set before?

I guess, that's what they do already, otherwise they could not use dev_name().

>> pci_bus_add_device will call device_add.
>>
>> actually in device_add
>>
>>         /* first, register with generic layer. */
>>         error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
>>         if (error)
>>                 goto Error;
>>
>> will get one new name for that kobj, old name is freed.
>>
>> Will try to make kobject_add more smart to reuse the old one.
>
> I don't understand the problem here, how are you going to change the
> kobject core?  Is this just because you aren't getting a name for the
> resource?  If so, why would the driver core care about this?

Seems a bit odd what we do when registering a device.

Usually one sets the name with dev_set_name() which will name the
kobject. When one later registers the device, we reallocate the same
name, so the pointer changes. I guess the problem here is that the pci
code remembers the name pointer after it was set, but it will not be
the same after the device is live.

We should probably make device_add() pass a NULL as the name, and
kobject_add() in that case use the name that is properly set already.

Thanks,
Kay

2009-04-18 20:02:24

by Yinghai Lu

Subject: [PATCH] driver: dont update dev_name if it is not changed


Noticed on one system that some /proc/iomem entries are missing the name for pci_devices.

It turns out that the dev->dev.kobj name is changed after device_add.

[Impact: fix corrupted names in /proc/iomem ]

Signed-off-by: Yinghai Lu <[email protected]>

---
lib/kobject.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/lib/kobject.c
===================================================================
--- linux-2.6.orig/lib/kobject.c
+++ linux-2.6/lib/kobject.c
@@ -216,12 +216,21 @@ int kobject_set_name_vargs(struct kobjec
 				  va_list vargs)
 {
 	const char *old_name = kobj->name;
+	char *new_name;
 	char *s;

-	kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
-	if (!kobj->name)
+	new_name = kvasprintf(GFP_KERNEL, fmt, vargs);
+	if (!new_name)
 		return -ENOMEM;

+	/* different ? */
+	if (!strcmp(new_name, old_name)) {
+		kfree(new_name);
+		return 0;
+	}
+
+	kobj->name = new_name;
+
 	/* ewww... some of these buggers have '/' in the name ... */
 	while ((s = strchr(kobj->name, '/')))
 		s[0] = '!';

2009-04-18 20:12:41

by Ingo Molnar

Subject: Re: [PATCH] driver: dont update dev_name if it is not changed


* Yinghai Lu <[email protected]> wrote:

> notice one system /proc/iomem some entries missed the name for pci_devices
>
> it turns that dev->dev.kobj name is changed after device_add.
>
> [Impact: fix corrupted names in /proc/iomem ]
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> lib/kobject.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/lib/kobject.c
> ===================================================================
> --- linux-2.6.orig/lib/kobject.c
> +++ linux-2.6/lib/kobject.c
> @@ -216,12 +216,21 @@ int kobject_set_name_vargs(struct kobjec
> va_list vargs)
> {
> const char *old_name = kobj->name;
> + char *new_name;
> char *s;
>
> - kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
> - if (!kobj->name)
> + new_name = kvasprintf(GFP_KERNEL, fmt, vargs);
> + if (!new_name)
> return -ENOMEM;
>
> + /* different ? */
> + if (!strcmp(new_name, old_name)) {
> + kfree(new_name);
> + return 0;
> + }
> +
> + kobj->name = new_name;
> +
> /* ewww... some of these buggers have '/' in the name ... */
> while ((s = strchr(kobj->name, '/')))
> s[0] = '!';

So we used the old name in the resource code, and the kfree() here
corrupted the /proc/iomem output?

This still looks fragile to me ...

Ingo

2009-04-18 20:13:19

by Ivan Kokshaysky

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sat, Apr 18, 2009 at 09:14:25PM +0200, Ingo Molnar wrote:
> This way we intentionally create a 'free physical address space'
> hole space that is the same as the rounding logic. No rounding
> needed anywhere - as all the remaining address space is well-rounded
> already. Plus we'd also _see_ all our rounding logic by looking at
> the '(guard)' entries.
>
> Or maybe there's some aspect of gap-rounding that cannot be
> expressed in such a static way?

My gut feeling is that you guys do overcomplicate a simple issue
which can be fixed with a one-liner like this:

pci_mem_start = pci_mem_start < 0xc0000000 ? : 0xc0000000;

This 0xc0000000 (3G) seems to be a pretty fundamental thing for certain
32-bit OS. ;-)

Ivan.

2009-04-18 20:23:07

by Yinghai Lu

Subject: Re: [PATCH] driver: dont update dev_name if it is not changed

Ingo Molnar wrote:
> * Yinghai Lu <[email protected]> wrote:
>
>> notice one system /proc/iomem some entries missed the name for pci_devices
>>
>> it turns that dev->dev.kobj name is changed after device_add.
>>
>> [Impact: fix corrupted names in /proc/iomem ]
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>> ---
>> lib/kobject.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> Index: linux-2.6/lib/kobject.c
>> ===================================================================
>> --- linux-2.6.orig/lib/kobject.c
>> +++ linux-2.6/lib/kobject.c
>> @@ -216,12 +216,21 @@ int kobject_set_name_vargs(struct kobjec
>> va_list vargs)
>> {
>> const char *old_name = kobj->name;
>> + char *new_name;
>> char *s;
>>
>> - kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
>> - if (!kobj->name)
>> + new_name = kvasprintf(GFP_KERNEL, fmt, vargs);
>> + if (!new_name)
>> return -ENOMEM;
>>
>> + /* different ? */
>> + if (!strcmp(new_name, old_name)) {
>> + kfree(new_name);
>> + return 0;
>> + }
>> +
>> + kobj->name = new_name;
>> +
>> /* ewww... some of these buggers have '/' in the name ... */
>> while ((s = strchr(kobj->name, '/')))
>> s[0] = '!';
>
> So we used the old name in the resource code, and the kfree() here
> corrupted the /proc/iomem output?
>
If the name is not changed, we keep using old_name, and new_name gets freed.

YH

2009-04-18 20:27:46

by Kay Sievers

Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

On Sat, 2009-04-18 at 22:00 +0200, Kay Sievers wrote:
> On Sat, Apr 18, 2009 at 21:23, Greg KH <[email protected]> wrote:
> > On Sat, Apr 18, 2009 at 12:19:21PM -0700, Yinghai Lu wrote:
> >> Linus Torvalds wrote:
>
> >> > Can we not make the rule be that the name should just be set before?
>
> I guess, that's what they do already, otherwise they could not use dev_name().
>
> >> pci_bus_add_device will call device_add.
> >>
> >> actually in device_add
> >>
> >> /* first, register with generic layer. */
> >> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
> >> if (error)
> >> goto Error;
> >>
> >> will get one new name for that kobj, old name is freed.
> >>
> >> Will try to make kobject_add more smart to reuse the old one.
> >
> > I don't understand the problem here, how are you going to change the
> > kobject core? Is this just because you aren't getting a name for the
> > resource? If so, why would the driver core care about this?
>
> Seems a bit odd what we do when registering a device.
>
> Usually one sets the name with dev_set_name() which will name the
> kobject. When one later registers the device, we reallocate the same
> name, so the pointer changes. I guess the problem here is that the pci
> code remembers the name pointer after it was set, but it will not be
> the same after the device is live.
>
> We should probably make device_add() pass a NULL as the name, and
> kobject_add() in that case use the name that is properly set already.

Only to see if we are on the right track, does that fix the problem? If
yes, we will fix it properly without the NULL hack deep in the kobject
call.

Thanks,
Kay

diff --git a/drivers/base/core.c b/drivers/base/core.c
index d230ff4..1969b20 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -890,8 +890,8 @@ int device_add(struct device *dev)
 	if (parent)
 		set_dev_node(dev, dev_to_node(parent));

-	/* first, register with generic layer. */
-	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
+	/* we require the name to be set before, and pass NULL */
+	error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
 	if (error)
 		goto Error;

diff --git a/lib/kobject.c b/lib/kobject.c
index a6dec32..48565d6 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -218,6 +218,9 @@ int kobject_set_name_vargs(struct kobject *kobj, const char *fmt,
 	const char *old_name = kobj->name;
 	char *s;

+	if (kobj->name && !fmt)
+		return 0;
+
 	kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
 	if (!kobj->name)
 		return -ENOMEM;

2009-04-18 20:28:29

by Ingo Molnar

Subject: Re: [PATCH] driver: dont update dev_name if it is not changed


* Yinghai Lu <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Yinghai Lu <[email protected]> wrote:
> >
> >> notice one system /proc/iomem some entries missed the name for pci_devices
> >>
> >> it turns that dev->dev.kobj name is changed after device_add.
> >>
> >> [Impact: fix corrupted names in /proc/iomem ]
> >>
> >> Signed-off-by: Yinghai Lu <[email protected]>
> >>
> >> ---
> >> lib/kobject.c | 13 +++++++++++--
> >> 1 file changed, 11 insertions(+), 2 deletions(-)
> >>
> >> Index: linux-2.6/lib/kobject.c
> >> ===================================================================
> >> --- linux-2.6.orig/lib/kobject.c
> >> +++ linux-2.6/lib/kobject.c
> >> @@ -216,12 +216,21 @@ int kobject_set_name_vargs(struct kobjec
> >> va_list vargs)
> >> {
> >> const char *old_name = kobj->name;
> >> + char *new_name;
> >> char *s;
> >>
> >> - kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
> >> - if (!kobj->name)
> >> + new_name = kvasprintf(GFP_KERNEL, fmt, vargs);
> >> + if (!new_name)
> >> return -ENOMEM;
> >>
> >> + /* different ? */
> >> + if (!strcmp(new_name, old_name)) {
> >> + kfree(new_name);
> >> + return 0;
> >> + }
> >> +
> >> + kobj->name = new_name;
> >> +
> >> /* ewww... some of these buggers have '/' in the name ... */
> >> while ((s = strchr(kobj->name, '/')))
> >> s[0] = '!';
> >
> > So we used the old name in the resource code, and the kfree() here
> > corrupted the /proc/iomem output?
> >
> if it is not changed, we still use old_name. and new_name get freed

I know - you are talking about your patch.

But i'm talking about the current, unpatched upstream code. It got a
string displayed in /proc/iomem that got kfree()d? [and this is why
the entries vanished?]

That is badness somewhere _else_ (not in the kobject core i think),
and your patch does not solve the real badness, it works around its
symptoms AFAICS - like the first patch.

Ingo

2009-04-18 20:38:35

by Ingo Molnar

Subject: Re: [PATCH] pci: keep pci device resource name pointer right.


* Kay Sievers <[email protected]> wrote:

> On Sat, 2009-04-18 at 22:00 +0200, Kay Sievers wrote:
> > On Sat, Apr 18, 2009 at 21:23, Greg KH <[email protected]> wrote:
> > > On Sat, Apr 18, 2009 at 12:19:21PM -0700, Yinghai Lu wrote:
> > >> Linus Torvalds wrote:
> >
> > >> > Can we not make the rule be that the name should just be set before?
> >
> > I guess, that's what they do already, otherwise they could not use dev_name().
> >
> > >> pci_bus_add_device will call device_add.
> > >>
> > >> actually in device_add
> > >>
> > >> /* first, register with generic layer. */
> > >> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
> > >> if (error)
> > >> goto Error;
> > >>
> > >> will get one new name for that kobj, old name is freed.
> > >>
> > >> Will try to make kobject_add more smart to reuse the old one.
> > >
> > > I don't understand the problem here, how are you going to change the
> > > kobject core? Is this just because you aren't getting a name for the
> > > resource? If so, why would the driver core care about this?
> >
> > Seems a bit odd what we do when registering a device.
> >
> > Usually one sets the name with dev_set_name() which will name the
> > kobject. When one later registers the device, we reallocate the same
> > name, so the pointer changes. I guess the problem here is that the pci
> > code remembers the name pointer after it was set, but it will not be
> > the same after the device is live.
> >
> > We should probably make device_add() pass a NULL as the name, and
> > kobject_add() in that case use the name that is properly set already.
>
> Only to see if we are on the right track, does that fix the problem? If
> yes, we will fix it properly without the NULL hack deep in the kobject
> call.
>
> Thanks,
> Kay
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index d230ff4..1969b20 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -890,8 +890,8 @@ int device_add(struct device *dev)
> if (parent)
> set_dev_node(dev, dev_to_node(parent));
>
> - /* first, register with generic layer. */
> - error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
> + /* we require the name to be set before, and pass NULL */
> + error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
> if (error)
> goto Error;
>
> diff --git a/lib/kobject.c b/lib/kobject.c
> index a6dec32..48565d6 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -218,6 +218,9 @@ int kobject_set_name_vargs(struct kobject *kobj, const char *fmt,
> const char *old_name = kobj->name;
> char *s;
>
> + if (kobj->name && !fmt)
> + return 0;
> +
> kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
> if (!kobj->name)
> return -ENOMEM;

This looks much cleaner to me - a NULL format string looks like a
natural extension.

Ingo

2009-04-18 21:10:20

by Yinghai Lu

Subject: Re: [PATCH] pci: keep pci device resource name pointer right.

Kay Sievers wrote:
> On Sat, 2009-04-18 at 22:00 +0200, Kay Sievers wrote:
>> On Sat, Apr 18, 2009 at 21:23, Greg KH <[email protected]> wrote:
>>> On Sat, Apr 18, 2009 at 12:19:21PM -0700, Yinghai Lu wrote:
>>>> Linus Torvalds wrote:
>>>>> Can we not make the rule be that the name should just be set before?
>> I guess, that's what they do already, otherwise they could not use dev_name().
>>
>>>> pci_bus_add_device will call device_add.
>>>>
>>>> actually in device_add
>>>>
>>>> /* first, register with generic layer. */
>>>> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
>>>> if (error)
>>>> goto Error;
>>>>
>>>> will get one new name for that kobj, old name is freed.
>>>>
>>>> Will try to make kobject_add more smart to reuse the old one.
>>> I don't understand the problem here, how are you going to change the
>>> kobject core? Is this just because you aren't getting a name for the
>>> resource? If so, why would the driver core care about this?
>> Seems a bit odd what we do when registering a device.
>>
>> Usually one sets the name with dev_set_name() which will name the
>> kobject. When one later registers the device, we reallocate the same
>> name, so the pointer changes. I guess the problem here is that the pci
>> code remembers the name pointer after it was set, but it will not be
>> the same after the device is live.
>>
>> We should probably make device_add() pass a NULL as the name, and
>> kobject_add() in that case use the name that is properly set already.
>
> Only to see if we are on the right track, does that fix the problem? If
> yes, we will fix it properly without the NULL hack deep in the kobject
> call.
>
> Thanks,
> Kay
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index d230ff4..1969b20 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -890,8 +890,8 @@ int device_add(struct device *dev)
> if (parent)
> set_dev_node(dev, dev_to_node(parent));
>
> - /* first, register with generic layer. */
> - error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
> + /* we require the name to be set before, and pass NULL */
> + error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
> if (error)
> goto Error;
>
> diff --git a/lib/kobject.c b/lib/kobject.c
> index a6dec32..48565d6 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -218,6 +218,9 @@ int kobject_set_name_vargs(struct kobject *kobj, const char *fmt,
> const char *old_name = kobj->name;
> char *s;
>
> + if (kobj->name && !fmt)
> + return 0;
> +
> kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
> if (!kobj->name)
> return -ENOMEM;
>

it works well. Thanks

YH

2009-04-18 22:08:39

by Yinghai Lu

Subject: [PATCH] driver: dont update dev_name via device_add path

From: Kay Sievers <[email protected]>

Noticed on one system that some /proc/iomem entries are missing the name for pci_devices.

It turns out that the dev->dev.kobj name is changed after device_add.

For the pci code:
acpi_pci_root_driver.ops.add (aka acpi_pci_root_add) ==> pci_acpi_scan_root
scans the pci bus/devices and, at the same time, reads the resources for each pci_dev.
In pci_read_bases we have res->name = pci_name(pci_dev); pci_name calls dev_name.

Later, acpi_pci_root_driver.ops.start (aka acpi_pci_root_start) ==> pci_bus_add_device
adds all pci_devs to the kobj tree.
pci_bus_add_device calls device_add.

In device_add,

	/* first, register with generic layer. */
	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
	if (error)
		goto Error;

allocates a new name for that kobj and frees the old one, so res->name is left pointing at freed memory.

[Impact: fix corrupted names in /proc/iomem ]

Signed-off-by: Yinghai Lu <[email protected]>
---
drivers/base/core.c | 3 ++-
lib/kobject.c | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -891,7 +891,8 @@ int device_add(struct device *dev)
 		set_dev_node(dev, dev_to_node(parent));

 	/* first, register with generic layer. */
-	error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev_name(dev));
+	/* we require the name to be set before, and pass NULL */
+	error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
 	if (error)
 		goto Error;

Index: linux-2.6/lib/kobject.c
===================================================================
--- linux-2.6.orig/lib/kobject.c
+++ linux-2.6/lib/kobject.c
@@ -218,6 +218,9 @@ int kobject_set_name_vargs(struct kobjec
 	const char *old_name = kobj->name;
 	char *s;

+	if (kobj->name && !fmt)
+		return 0;
+
 	kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
 	if (!kobj->name)
 		return -ENOMEM;

2009-04-18 22:22:27

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Yinghai Lu wrote:
> Ingo Molnar wrote:
>> * Linus Torvalds <[email protected]> wrote:
>>
>>> On Sat, 18 Apr 2009, Ingo Molnar wrote:
>>>> Am i missing something?
>>> We also try to avoid random motherboard resources etc that aren't
>>> reserved or documented by the BIOS. It's better to go into big
>>> holes. It's also better to try to keep as close to the old
>>> (tested) behavior.
>> Yeah - i'm not suggesting any change in behavior, nor am i
>> suggesting any risky behavior. The current code seems to work quite
>> well.
>>
>> I'm just suggesting (maybe foolishly) that instead of having any
>> gap-rounding logic at all, add artificial entries to the e820 map to
>> 'extend' and round up any odd ending entries.
>>
>> I.e. explicitly manage all the 'hole' space to be nicely rounded and
>> to be far away from any T-Seg or other sekrit motherboard resource
>> danger area.
>>
>> We'd do this after PCI static allocations (so we dont ever stomp on
>> real, known resources) but before PCI dynamic allocations.
>>
>> The e820 printout would look literally like this:
>>
>> BIOS-provided physical RAM map:
>> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM
>> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB
>> [ hole ] 0.250 MB
>> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB
>> BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM
>> BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB
>> BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM
>> BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB
>> BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM
>> BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data) 0.07 MB
>> BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM
>> BIOS-e820: 000000003ff00000 - 0000000040000000 (guard) 1.0 MB
>> [ hole ] 3072.0 MB
>>
>> The '(guard)' entry at the end i added above.
>>
>> This way we intentionally create a 'free physical address space'
>> hole space that is the same as the rounding logic. No rounding
>> needed anywhere - as all the remaining address space is well-rounded
>> already. Plus we'd also _see_ all our rounding logic by looking at
>> the '(guard)' entries.
>>
>> Or maybe there's some aspect of gap-rounding that cannot be
>> expressed in such a static way?
>>
>
> please check following patch.
>
> From: Linus Torvalds <[email protected]>
>
> [PATCH] x86: reserve range near the ram -v2
>
> some BIOS use ram near end, but don't state it, just try to reserve them
> as RAM buffer
>
> v2: make it in e820 table early instead of resource tree.
>
> [Impact: protect stolen RAM]
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/include/asm/e820.h | 2 +
> arch/x86/kernel/e820.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/setup.c | 6 +++++
> 3 files changed, 60 insertions(+)
>
> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -150,6 +150,9 @@ static void __init e820_print_type(u32 t
> case E820_UNUSABLE:
> printk(KERN_CONT "(unusable)");
> break;
> + case E820_RAM_BUFFER:
> + printk(KERN_CONT "(RAM buffer)");
> + break;
> default:
> printk(KERN_CONT "type %u", type);
> break;
> @@ -1314,6 +1317,54 @@ void __init finish_e820_parsing(void)
> }
> }
>
> +/* How much should we pad RAM ending depending on where it is? */
> +static unsigned long __init ram_alignment(resource_size_t pos)
> +{
> + unsigned long mb = pos >> 20;
> +
> + /* To 64kB in the first megabyte */
> + if (!mb)
> + return 64*1024;
> +
> + /* To 1MB in the first 16MB */
> + if (mb < 16)
> + return 1024*1024;
> +
> + /* To 32MB for anything above that */
> + return 32*1024*1024;
> +}
> +
> +void __init e820_reserve_stolen_ram(void)
> +{
> + int i;
> + int changed = 0;
> +
> + /*
> + * Try to bump up RAM regions to reasonable boundaries to
> + * avoid stolen RAM
> + */
> + for (i = 0; i < e820.nr_map; i++) {
> + struct e820entry *entry = &e820_saved.map[i];
> + resource_size_t start, end;
> +
> + if (entry->type != E820_RAM)
> + continue;
> + start = entry->addr + entry->size;
> + end = round_up(start, ram_alignment(start));
> + if (start == end)
> + continue;
> + e820_add_region(start, end - start, E820_RAM_BUFFER);
> + changed = 1;
> + }
> +
> + if (!changed)
> + return;
> +
> + sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
> + printk(KERN_INFO "fixed physical RAM map:\n");
> + e820_print_map("reserve_stolen_range");
> +}
> +
> static inline const char *e820_type_to_string(int e820_type)
> {
> switch (e820_type) {
> @@ -1322,6 +1373,7 @@ static inline const char *e820_type_to_s
> case E820_ACPI: return "ACPI Tables";
> case E820_NVS: return "ACPI Non-volatile Storage";
> case E820_UNUSABLE: return "Unusable memory";
> + case E820_RAM_BUFFER: return "RAM Buffer";
> default: return "reserved";
> }
> }
> Index: linux-2.6/arch/x86/include/asm/e820.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/e820.h
> +++ linux-2.6/arch/x86/include/asm/e820.h
> @@ -44,6 +44,7 @@
> #define E820_ACPI 3
> #define E820_NVS 4
> #define E820_UNUSABLE 5
> +#define E820_RAM_BUFFER 6
>
> /* reserved RAM used by kernel itself */
> #define E820_RESERVED_KERN 128
> @@ -78,6 +79,7 @@ extern u64 e820_update_range(u64 start,
> extern u64 e820_remove_range(u64 start, u64 size, unsigned old_type,
> int checktype);
> extern void update_e820(void);
> +extern void e820_reserve_stolen_ram(void);
> extern void e820_setup_gap(void);
> extern int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,
> unsigned long start_addr, unsigned long long end_addr);
> Index: linux-2.6/arch/x86/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup.c
> +++ linux-2.6/arch/x86/kernel/setup.c
> @@ -812,6 +812,12 @@ void __init setup_arch(char **cmdline_p)
> insert_resource(&iomem_resource, &data_resource);
> insert_resource(&iomem_resource, &bss_resource);
>
> + /*
> + * some systems use end of ram to for acpi or video ram
> + * but doesn't state that in reserved in e820
> + * try to round of ram etc and reserve them
> + */
> + e820_reserve_stolen_ram();
>
> #ifdef CONFIG_X86_32
> if (ppro_with_ram_bug()) {
>

It seems ram_alignment is too aggressive: it really eats some RAM
(the usable range b7fae000 - b7fb0000 is gone from the fixed map below).

[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 0000000000097400 (usable)
[ 0.000000] BIOS-e820: 0000000000097400 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000b7fa0000 (usable)
[ 0.000000] BIOS-e820: 00000000b7fae000 - 00000000b7fb0000 (usable)
[ 0.000000] BIOS-e820: 00000000b7fb0000 - 00000000b7fbe000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000b7fbe000 - 00000000b7ff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000b7ff0000 - 00000000b8000000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[ 0.000000] BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000002048000000 (usable)
[ 0.000000] Early serial console at I/O port 0x3f8 (options '115200n8')
[ 0.000000] console [uart0] enabled
[ 0.000000] DMI present.
[ 0.000000] fixed physical RAM map:
[ 0.000000] reserve_stolen_range: 0000000000000000 - 0000000000097400 (usable)
[ 0.000000] reserve_stolen_range: 0000000000097400 - 00000000000a0000 (RAM buffer)
[ 0.000000] reserve_stolen_range: 00000000000e0000 - 0000000000100000 (reserved)
[ 0.000000] reserve_stolen_range: 0000000000100000 - 00000000b7fa0000 (usable)
[ 0.000000] reserve_stolen_range: 00000000b7fa0000 - 00000000b8000000 (RAM buffer)
[ 0.000000] reserve_stolen_range: 00000000e0000000 - 00000000f0000000 (reserved)
[ 0.000000] reserve_stolen_range: 00000000fec00000 - 00000000fec01000 (reserved)
[ 0.000000] reserve_stolen_range: 00000000fee00000 - 00000000fef00000 (reserved)
[ 0.000000] reserve_stolen_range: 00000000ff700000 - 0000000100000000 (reserved)
[ 0.000000] reserve_stolen_range: 0000000100000000 - 0000002048000000 (usable)

2009-04-18 22:25:10

by Linus Torvalds

Subject: Re: [PATCH] pci: keep pci device resource name pointer right.



On Sat, 18 Apr 2009, Kay Sievers wrote:
>
> Only to see if we are on the right track, does that fix the problem? If
> yes, we will fix it properly without the NULL hack deep in the kobject
> call.

ACK, this patch looks much saner, and would seem to address the core
issue.

Linus

2009-04-18 22:38:00

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Yinghai Lu wrote:
>
> From: Linus Torvalds <[email protected]>

This is _not_ my patch, and I think this is wrong.

My patch was about adding entries to the resource region. I very much said
that I do _not_ like your approach of editing the e820 memory map itself.

I think this patch is horrible, and NAK it, and definitely don't want my
name on it.

Linus

2009-04-18 22:46:49

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Linus Torvalds wrote:
>
> On Sat, 18 Apr 2009, Yinghai Lu wrote:
>> except need to change
>>> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>> ==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
>
> Yes, I sent out a later email pointing that out.
>
>> it will make sure the dynamic allocation code will not use those ranges.
>>
>> and could make e820_setup_gap much simpler.
>
> ACK. In fact:
>
>> Index: linux-2.6/arch/x86/kernel/e820.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/e820.c
>> +++ linux-2.6/arch/x86/kernel/e820.c
>> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
>> #endif
>>
>> /*
>> - * See how much we want to round up: start off with
>> - * rounding to the next 1MB area.
>> + * e820_reserve_resources_late will protect stolen RAM
>> + * so just round it to 1M
>> */
>> round = 0x100000;
>> - while ((gapsize >> 4) > round)
>> - round += round;
>> - /* Fun with two's complement */
>> - pci_mem_start = (gapstart + round) & -round;
>> +
>> + pci_mem_start = roundup(gapstart, round);
>
> You can just remove "round" entirely. It's no longer a variable, it's just
> an odd way of saying 1M ;)
>
>> Ingo, can you put those two patches in tip?
>
> I would suggest that we first change "reserve_region_with_split()" to not
> recurse into the region.
>
> That function isn't used by anything else (we ended up using
> "expand_to_fit()" instead in the one place that might have used it), and
> now the one caller we do have would not want the recursion - if there
> already exists a resource at the top level, we want to just avoid it.
>
> This - again TOTALLY UNTESTED - patch removes the "recurse into conflicts"
> code. Comments? Testing?
>
> Linus
> ---
> kernel/resource.c | 46 ++++++++++++----------------------------------
> 1 files changed, 12 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/resource.c b/kernel/resource.c
> index fd5d7d5..ac5f3a3 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -533,43 +533,21 @@ static void __init __reserve_region_with_split(struct resource *root,
> res->end = end;
> res->flags = IORESOURCE_BUSY;
>
> - for (;;) {
> - conflict = __request_resource(parent, res);
> - if (!conflict)
> - break;
> - if (conflict != parent) {
> - parent = conflict;
> - if (!(conflict->flags & IORESOURCE_BUSY))
> - continue;
> - }
> -
> - /* Uhhuh, that didn't work out.. */
> - kfree(res);
> - res = NULL;
> - break;
> - }
> -
> - if (!res) {
> - /* failed, split and try again */
> -
> - /* conflict covered whole area */
> - if (conflict->start <= start && conflict->end >= end)
> - return;
> + conflict = __request_resource(parent, res);
> + if (!conflict)
> + return;
>
> - if (conflict->start > start)
> - __reserve_region_with_split(root, start, conflict->start-1, name);
> - if (!(conflict->flags & IORESOURCE_BUSY)) {
> - resource_size_t common_start, common_end;
> + /* failed, split and try again */
> + kfree(res);
>
> - common_start = max(conflict->start, start);
> - common_end = min(conflict->end, end);
> - if (common_start < common_end)
> - __reserve_region_with_split(root, common_start, common_end, name);
> - }
> - if (conflict->end < end)
> - __reserve_region_with_split(root, conflict->end+1, end, name);
> - }
> + /* conflict covered whole area */
> + if (conflict->start <= start && conflict->end >= end)
> + return;
>
> + if (conflict->start > start)
> + __reserve_region_with_split(root, start, conflict->start-1, name);
> + if (conflict->end < end)
> + __reserve_region_with_split(root, conflict->end+1, end, name);
> }
>
> void __init reserve_region_with_split(struct resource *root,

with
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000100 - 0000000000097400 (usable)
[ 0.000000] BIOS-e820: 0000000000097400 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000b7fa0000 (usable)
[ 0.000000] BIOS-e820: 00000000b7fae000 - 00000000b7fb0000 (usable)
[ 0.000000] BIOS-e820: 00000000b7fb0000 - 00000000b7fbe000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000b7fbe000 - 00000000b7ff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000b7ff0000 - 00000000b8000000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[ 0.000000] BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000002048000000 (usable)

got

00000100-000973ff : System RAM
00097400-0009ffff : reserved
000a0000-000bffff : PCI Bus #00
000c0000-000cffff : pnp 00:0c
000e0000-000fffff : pnp 00:0c
00100000-b7f9ffff : System RAM
00200000-00c68f6b : Kernel code
00c68f6c-01332f7f : Kernel data
015a6000-01fcaa57 : Kernel bss
20000000-23ffffff : GART
b7fa0000-b7fadfff : RAM buffer
b7fae000-b7faffff : System RAM
b7fb0000-b7fbdfff : ACPI Tables
b7fbe000-b7feffff : ACPI Non-volatile Storage
b7ff0000-b7ffffff : reserved
...

YH
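
A small user-space sketch of the rounding change quoted above, for anyone following along. The two helper names are made up for illustration, and the gap values in the note after the code (start 0xb8000000, size 0x28000000) are only read off the e820 map in this mail, on the assumption that the hole below 0xe0000000 is the one e820_setup_gap picks.

```c
#include <assert.h>
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

/*
 * Old policy (hypothetical helper name): the granularity starts at 1M
 * and doubles until it is at least 1/16th of the gap, and the start is
 * always moved up by one full step before being aligned.
 */
static uint64_t pci_mem_start_old(uint64_t gapstart, uint64_t gapsize)
{
	uint64_t round = MB(1);

	while ((gapsize >> 4) > round)
		round += round;
	/* round is a power of two, so -round is the matching mask */
	return (gapstart + round) & -round;
}

/*
 * New policy (hypothetical helper name): align the gap start up to
 * 1M and nothing more.
 */
static uint64_t pci_mem_start_new(uint64_t gapstart)
{
	return (gapstart + MB(1) - 1) & ~(MB(1) - 1);
}
```

With that gap, pci_mem_start_old(0xb8000000, 0x28000000) gives 0xbc000000 (the granularity grows to 64M and the start is always bumped by a full step), while pci_mem_start_new(0xb8000000) leaves it at 0xb8000000.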

2009-04-18 23:02:27

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Yinghai Lu wrote:
> Linus Torvalds wrote:
>> On Sat, 18 Apr 2009, Yinghai Lu wrote:
>>> except need to change
>>>> + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>>> ==> > + reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
>> Yes, I sent out a later email pointing that out.
>>
>>> it will make sure the dynamic allocation code will not use those ranges.
>>>
>>> and could make e820_setup_gap much simpler.
>> ACK. In fact:
>>
>>> Index: linux-2.6/arch/x86/kernel/e820.c
>>> ===================================================================
>>> --- linux-2.6.orig/arch/x86/kernel/e820.c
>>> +++ linux-2.6/arch/x86/kernel/e820.c
>>> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
>>> #endif
>>>
>>> /*
>>> - * See how much we want to round up: start off with
>>> - * rounding to the next 1MB area.
>>> + * e820_reserve_resources_late will protect stolen RAM
>>> + * so just round it to 1M
>>> */
>>> round = 0x100000;
>>> - while ((gapsize >> 4) > round)
>>> - round += round;
>>> - /* Fun with two's complement */
>>> - pci_mem_start = (gapstart + round) & -round;
>>> +
>>> + pci_mem_start = roundup(gapstart, round);
>> You can just remove "round" entirely. It's no longer a variable, it's just
>> an odd way of saying 1M ;)
>>
>>> Ingo, can you put those two patches in tip?
>> I would suggest that we first change "reserve_region_with_split()" to not
>> recurse into the region.
>>
>> That function isn't used by anything else (we ended up using
>> "expand_to_fit()" instead in the one place that might have used it), and
>> now the one caller we do have would not want the recursion - if there
>> already exists a resource at the top level, we want to just avoid it.
>>
>> This - again TOTALLY UNTESTED - patch removes the "recurse into conflicts"
>> code. Comments? Testing?
>>
>> Linus
>> ---
>> kernel/resource.c | 46 ++++++++++++----------------------------------
>> 1 files changed, 12 insertions(+), 34 deletions(-)
>>
>> diff --git a/kernel/resource.c b/kernel/resource.c
>> index fd5d7d5..ac5f3a3 100644
>> --- a/kernel/resource.c
>> +++ b/kernel/resource.c
>> @@ -533,43 +533,21 @@ static void __init __reserve_region_with_split(struct resource *root,
>> res->end = end;
>> res->flags = IORESOURCE_BUSY;
>>
>> - for (;;) {
>> - conflict = __request_resource(parent, res);
>> - if (!conflict)
>> - break;
>> - if (conflict != parent) {
>> - parent = conflict;
>> - if (!(conflict->flags & IORESOURCE_BUSY))
>> - continue;
>> - }
>> -
>> - /* Uhhuh, that didn't work out.. */
>> - kfree(res);
>> - res = NULL;
>> - break;
>> - }
>> -
>> - if (!res) {
>> - /* failed, split and try again */
>> -
>> - /* conflict covered whole area */
>> - if (conflict->start <= start && conflict->end >= end)
>> - return;
>> + conflict = __request_resource(parent, res);
>> + if (!conflict)
>> + return;
>>
>> - if (conflict->start > start)
>> - __reserve_region_with_split(root, start, conflict->start-1, name);
>> - if (!(conflict->flags & IORESOURCE_BUSY)) {
>> - resource_size_t common_start, common_end;
>> + /* failed, split and try again */
>> + kfree(res);
>>
>> - common_start = max(conflict->start, start);
>> - common_end = min(conflict->end, end);
>> - if (common_start < common_end)
>> - __reserve_region_with_split(root, common_start, common_end, name);
>> - }
>> - if (conflict->end < end)
>> - __reserve_region_with_split(root, conflict->end+1, end, name);
>> - }
>> + /* conflict covered whole area */
>> + if (conflict->start <= start && conflict->end >= end)
>> + return;
>>
>> + if (conflict->start > start)
>> + __reserve_region_with_split(root, start, conflict->start-1, name);
>> + if (conflict->end < end)
>> + __reserve_region_with_split(root, conflict->end+1, end, name);
>> }
>>
>> void __init reserve_region_with_split(struct resource *root,
>
> with
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000100 - 0000000000097400 (usable)
> [ 0.000000] BIOS-e820: 0000000000097400 - 00000000000a0000 (reserved)
> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000b7fa0000 (usable)
> [ 0.000000] BIOS-e820: 00000000b7fae000 - 00000000b7fb0000 (usable)
> [ 0.000000] BIOS-e820: 00000000b7fb0000 - 00000000b7fbe000 (ACPI data)
> [ 0.000000] BIOS-e820: 00000000b7fbe000 - 00000000b7ff0000 (ACPI NVS)
> [ 0.000000] BIOS-e820: 00000000b7ff0000 - 00000000b8000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
> [ 0.000000] BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000002048000000 (usable)
>
> got
>
> 00000100-000973ff : System RAM
> 00097400-0009ffff : reserved
> 000a0000-000bffff : PCI Bus #00
> 000c0000-000cffff : pnp 00:0c
> 000e0000-000fffff : pnp 00:0c
> 00100000-b7f9ffff : System RAM
> 00200000-00c68f6b : Kernel code
> 00c68f6c-01332f7f : Kernel data
> 015a6000-01fcaa57 : Kernel bss
> 20000000-23ffffff : GART
> b7fa0000-b7fadfff : RAM buffer
> b7fae000-b7faffff : System RAM
> b7fb0000-b7fbdfff : ACPI Tables
> b7fbe000-b7feffff : ACPI Non-volatile Storage
> b7ff0000-b7ffffff : reserved
> ...
>
without your patch got
00000100-000973ff : System RAM
00097400-0009ffff : reserved
000a0000-000bffff : PCI Bus #00
000c0000-000cffff : pnp 00:0c
000e0000-000fffff : pnp 00:0c
00100000-b7f9ffff : System RAM
00200000-00c68f6b : Kernel code
00c68f6c-01332f7f : Kernel data
015a6000-01fcaa57 : Kernel bss
20000000-23ffffff : GART
b7fa0000-b7fadfff : RAM buffer
b7fae000-b7faffff : System RAM
b7fb0000-b7fbdfff : ACPI Tables
b7fbe000-b7feffff : ACPI Non-volatile Storage
b7ff0000-b7ffffff : reserved
b7ff0000-b7ffffff : RAM buffer
b8000000-beffffff : PCI Bus #00
bf000000-bfffffff : PCI Bus #80

2009-04-18 23:10:51

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Yinghai Lu wrote:
>
> with
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000100 - 0000000000097400 (usable)
> [ 0.000000] BIOS-e820: 0000000000097400 - 00000000000a0000 (reserved)
> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000b7fa0000 (usable)
> [ 0.000000] BIOS-e820: 00000000b7fae000 - 00000000b7fb0000 (usable)
> [ 0.000000] BIOS-e820: 00000000b7fb0000 - 00000000b7fbe000 (ACPI data)
> [ 0.000000] BIOS-e820: 00000000b7fbe000 - 00000000b7ff0000 (ACPI NVS)
> [ 0.000000] BIOS-e820: 00000000b7ff0000 - 00000000b8000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
> [ 0.000000] BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000002048000000 (usable)
>
> got
>
> 00000100-000973ff : System RAM
> 00097400-0009ffff : reserved
> 000a0000-000bffff : PCI Bus #00
> 000c0000-000cffff : pnp 00:0c
> 000e0000-000fffff : pnp 00:0c
> 00100000-b7f9ffff : System RAM
> 00200000-00c68f6b : Kernel code
> 00c68f6c-01332f7f : Kernel data
> 015a6000-01fcaa57 : Kernel bss
> 20000000-23ffffff : GART
> b7fa0000-b7fadfff : RAM buffer
> b7fae000-b7faffff : System RAM
> b7fb0000-b7fbdfff : ACPI Tables
> b7fbe000-b7feffff : ACPI Non-volatile Storage
> b7ff0000-b7ffffff : reserved

Hmm. That looks correct to me. We filled in that odd area between
b7fa0000-b7fadfff that went unmentioned in the e820 tables.

And that _is_ a really odd hole. I wonder what it is all about. But the
approach does seem to have done the right thing.

Linus

2009-04-18 23:12:45

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Yinghai Lu wrote:
>
> without your patch got
>
> 00000100-000973ff : System RAM
> 00097400-0009ffff : reserved
> 000a0000-000bffff : PCI Bus #00
> 000c0000-000cffff : pnp 00:0c
> 000e0000-000fffff : pnp 00:0c
> 00100000-b7f9ffff : System RAM
> 00200000-00c68f6b : Kernel code
> 00c68f6c-01332f7f : Kernel data
> 015a6000-01fcaa57 : Kernel bss
> 20000000-23ffffff : GART
> b7fa0000-b7fadfff : RAM buffer
> b7fae000-b7faffff : System RAM
> b7fb0000-b7fbdfff : ACPI Tables
> b7fbe000-b7feffff : ACPI Non-volatile Storage
> b7ff0000-b7ffffff : reserved
> b7ff0000-b7ffffff : RAM buffer
> b8000000-beffffff : PCI Bus #00
> bf000000-bfffffff : PCI Bus #80

Yeah, that "RAM buffer" inside the b7ff0000-b7ffffff reserved area is
obviously crap.

Linus

2009-04-18 23:28:35

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Linus Torvalds wrote:
>
> On Sat, 18 Apr 2009, Yinghai Lu wrote:
>> without your patch got
>>
>> 00000100-000973ff : System RAM
>> 00097400-0009ffff : reserved
>> 000a0000-000bffff : PCI Bus #00
>> 000c0000-000cffff : pnp 00:0c
>> 000e0000-000fffff : pnp 00:0c
>> 00100000-b7f9ffff : System RAM
>> 00200000-00c68f6b : Kernel code
>> 00c68f6c-01332f7f : Kernel data
>> 015a6000-01fcaa57 : Kernel bss
>> 20000000-23ffffff : GART
>> b7fa0000-b7fadfff : RAM buffer
>> b7fae000-b7faffff : System RAM
>> b7fb0000-b7fbdfff : ACPI Tables
>> b7fbe000-b7feffff : ACPI Non-volatile Storage
>> b7ff0000-b7ffffff : reserved
>> b7ff0000-b7ffffff : RAM buffer
>> b8000000-beffffff : PCI Bus #00
>> bf000000-bfffffff : PCI Bus #80
>
> Yeah, that "RAM buffer" inside the b7ff0000-b7ffffff reserved area is
> obviously crap.

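The "reserved" ranges from 0xb7ff0000 to 0xffffffff are registered via
e820_reserve_resources_late() and do not have IORESOURCE_BUSY set,
whereas the RAM buffer area does have IORESOURCE_BUSY. That is why the
old code recursed into the reserved range and nested the RAM buffer inside it.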

YH

2009-04-18 23:33:01

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Yinghai Lu wrote:
> Linus Torvalds wrote:
>> On Sat, 18 Apr 2009, Yinghai Lu wrote:
>>> without your patch got
>>>
>>> 00000100-000973ff : System RAM
>>> 00097400-0009ffff : reserved
>>> 000a0000-000bffff : PCI Bus #00
>>> 000c0000-000cffff : pnp 00:0c
>>> 000e0000-000fffff : pnp 00:0c
>>> 00100000-b7f9ffff : System RAM
>>> 00200000-00c68f6b : Kernel code
>>> 00c68f6c-01332f7f : Kernel data
>>> 015a6000-01fcaa57 : Kernel bss
>>> 20000000-23ffffff : GART
>>> b7fa0000-b7fadfff : RAM buffer
>>> b7fae000-b7faffff : System RAM
>>> b7fb0000-b7fbdfff : ACPI Tables
>>> b7fbe000-b7feffff : ACPI Non-volatile Storage
>>> b7ff0000-b7ffffff : reserved
>>> b7ff0000-b7ffffff : RAM buffer
>>> b8000000-beffffff : PCI Bus #00
>>> bf000000-bfffffff : PCI Bus #80
>> Yeah, that "RAM buffer" inside the b7ff0000-b7ffffff reserved area is
>> obviously crap.
>
> reserved from 0xb7ff0000 - 0xffffffff ranges are registered via e820_reserve_resources_late
> didn't have IORESOURCE_BUSY
>
> but RAM buffer area has that IORESOURCE_BUSY
>
In other words, those "reserved" entries are not really reserved.

YH

2009-04-19 00:34:35

by H. Peter Anvin

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Linus Torvalds wrote:
>>
>> 00000100-000973ff : System RAM
>> 00097400-0009ffff : reserved
>> 000a0000-000bffff : PCI Bus #00
>> 000c0000-000cffff : pnp 00:0c
>> 000e0000-000fffff : pnp 00:0c
>> 00100000-b7f9ffff : System RAM
>> 00200000-00c68f6b : Kernel code
>> 00c68f6c-01332f7f : Kernel data
>> 015a6000-01fcaa57 : Kernel bss
>> 20000000-23ffffff : GART
>> b7fa0000-b7fadfff : RAM buffer
>> b7fae000-b7faffff : System RAM
>> b7fb0000-b7fbdfff : ACPI Tables
>> b7fbe000-b7feffff : ACPI Non-volatile Storage
>> b7ff0000-b7ffffff : reserved
>
> Hmm. That looks correct to me. We filled in that odd area between
> b7fa0000-b7fadfff that went unmentioned in the e820 tables.
>
> And that _is_ a really odd hole. I wonder what it is all about. But the
> approach does seem to have done the right thing.
>

Looks to me as though the BIOS rounded the end of available memory to a
64K boundary after subtracting the ACPI storage. That might have been
done to work around an OS loader bug somewhere.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
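
The arithmetic behind that guess can be checked with a few lines of C. This is only a sketch: round_down_pow2() and hole_size() are hypothetical helpers, and treating 0xb7fae000 as the unrounded end of memory is an assumption taken from where the next usable e820 entry starts.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a power-of-two boundary. */
static uint64_t round_down_pow2(uint64_t addr, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Size of the hole left when the end of a range is rounded down. */
static uint64_t hole_size(uint64_t end, uint64_t align)
{
	return end - round_down_pow2(end, align);
}
```

round_down_pow2(0xb7fae000, 0x10000) is 0xb7fa0000, and hole_size(0xb7fae000, 0x10000) is 0xe000, i.e. the 56K that shows up as "RAM buffer" in /proc/iomem.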

2009-04-19 04:57:54

by Linus Torvalds

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4



On Sat, 18 Apr 2009, Linus Torvalds wrote:
>
> And that _is_ a really odd hole. I wonder what it is all about. But the
> approach does seem to have done the right thing.

I'll commit the reserve_region_with_split() change. There are no actual
users of it now, so committing that change doesn't really do anything, but
I like removing code, and with the only current potential user actively
wanting just the simpler behavior, why keep the code around?

Linus
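
For readers who want to see the simpler behavior in isolation, here is a user-space model of the "split around the conflict, never recurse into it" logic. It is a sketch, not the kernel code: a flat array stands in for the top level of the resource tree, and struct range, find_conflict(), add_range() and reserve_with_split() are all names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start, end;	/* inclusive, like struct resource */
	const char *name;
};

#define MAX_RANGES 32
static struct range table[MAX_RANGES];
static int nr_ranges;

/* Return the first existing range that overlaps [start, end], if any. */
static struct range *find_conflict(uint64_t start, uint64_t end)
{
	int i;

	for (i = 0; i < nr_ranges; i++)
		if (table[i].start <= end && table[i].end >= start)
			return &table[i];
	return NULL;
}

static void add_range(uint64_t start, uint64_t end, const char *name)
{
	assert(start <= end && nr_ranges < MAX_RANGES);
	table[nr_ranges].start = start;
	table[nr_ranges].end = end;
	table[nr_ranges].name = name;
	nr_ranges++;
}

/* Reserve whatever part of [start, end] is not already taken. */
static void reserve_with_split(uint64_t start, uint64_t end, const char *name)
{
	struct range *conflict = find_conflict(start, end);
	uint64_t cstart, cend;

	if (!conflict) {
		add_range(start, end, name);
		return;
	}
	cstart = conflict->start;
	cend = conflict->end;

	/* conflict covers the whole area: nothing left to reserve */
	if (cstart <= start && cend >= end)
		return;

	/* retry on the pieces to the left and right of the conflict */
	if (cstart > start)
		reserve_with_split(start, cstart - 1, name);
	if (cend < end)
		reserve_with_split(cend + 1, end, name);
}
```

Seeding the table with the System RAM, ACPI and reserved entries from the /proc/iomem listing earlier in the thread and then calling reserve_with_split(0xb7fa0000, 0xb7ffffff, "RAM buffer") adds a single entry, b7fa0000-b7fadfff. The requested range is an assumption made for the example; the point is only that nothing gets nested inside the existing "reserved" entry.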

2009-04-19 05:29:16

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Linus Torvalds wrote:
>
> On Sat, 18 Apr 2009, Linus Torvalds wrote:
>> And that _is_ a really odd hole. I wonder what it is all about. But the
>> approach does seem to have done the right thing.
>
> I'll commit the reserve_region_with_split() change. There are no actual
> users of it now, so committing that change doesn't really do anything, but
> I like removing code, and with the only current potential user actively
> wanting just the simpler behavior, why keep the code around?
>
sure.

Yannick, can you test the three attached patches on top of Linus' tree?
You can set the tree up as described in
http://people.redhat.com/mingo/tip.git/readme.txt
and then run
git checkout -b linus_2009_04_18 linus/master

YH


Attachments:
pci_start_linus.patch (1.82 kB)
pci_start_x.patch (1.32 kB)
pref_mem_32bit_v2.patch (7.71 kB)

2009-04-19 09:03:32

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Linus Torvalds <[email protected]> wrote:

> On Sat, 18 Apr 2009, Linus Torvalds wrote:
> >
> > And that _is_ a really odd hole. I wonder what it is all about.
> > But the approach does seem to have done the right thing.
>
> I'll commit the reserve_region_with_split() change. There are no
> actual users of it now, so committing that change doesn't really
> do anything, but I like removing code, and with the only current
> potential user actively wanting just the simpler behavior, why
> keep the code around?

Cool! Yinghai, mind (re-)sending the latest version of the remaining
two patches, so that we can pick this up into the x86 tree and get
it tested? I'd say it's for v2.6.31. (unless someone can think of a
strong reason to do this sooner.)

Thanks,

Ingo

2009-04-19 09:07:32

by Ingo Molnar

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4


* Ingo Molnar <[email protected]> wrote:

> Cool! Yinghai, mind (re-)sending the latest version of the
> remaining two patches, so that we can pick this up into the x86
> tree and get it tested? I'd say it's for v2.6.31. (unless someone
> can think of a strong reason to do this sooner.)

Hm, there's one patch in that lot that does:

drivers/pci/bus.c | 8 +++++++-
drivers/pci/probe.c | 8 ++++++--
drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
3 files changed, 44 insertions(+), 12 deletions(-)

Which should go via the PCI tree.

I can set up an isolated x86/pci-gap topic that i'll send to Jesse
to pull (once it looks to be stable), as the other patches modify
the e820 code which we'd like to test in the x86 tree first.

Jesse, Linus, Yinghai, does that look like a good plan to you?

Ingo

2009-04-19 17:52:39

by Jesse Barnes

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sun, 19 Apr 2009 11:06:15 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > Cool! Yinghai, mind (re-)sending the latest version of the
> > remaining two patches, so that we can pick this up into the x86
> > tree and get it tested? I'd say it's for v2.6.31. (unless someone
> > can think of a strong reason to do this sooner.)
>
> Hm, there's one patch in that lot that does:
>
> drivers/pci/bus.c | 8 +++++++-
> drivers/pci/probe.c | 8 ++++++--
> drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
> 3 files changed, 44 insertions(+), 12 deletions(-)
>
> Which should go via the PCI tree.
>
> I can set up an isolated x86/pci-gap topic that i'll send to Jesse
> to pull (once it looks to be stable), as the other patches modify
> the e820 code which we'd like to test in the x86 tree first.
>
> Jesse, Linus, Yinghai, does that look like a good plan to you?

Yep, that's fine with me.

--
Jesse Barnes, Intel Open Source Technology Center

2009-04-19 19:36:18

by Yannick Roehlly

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sunday 19 April 2009 07:26:36, Yinghai Lu wrote:
> yannick, can you check attached three patches on linus tree?

Here are the results of the test: it works!

I attach the boot log (dmesg output).

Sincerely,

Yannick

PS: In case the patch is not committed to 2.6.30, will it be safe for me to
use it on my kernel? (Consider this long-term testing ;-))

--
Today is the tomorrow you worried about yesterday.


Attachments:
(No filename) (430.00 B)
boot.log (103.75 kB)

2009-04-19 19:59:58

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sun, Apr 19, 2009 at 12:35 PM, Yannick Roehlly
<[email protected]> wrote:
> On Sunday 19 April 2009 07:26:36, Yinghai Lu wrote:
>> yannick, can you check attached three patches on linus tree?
>
> Here comes the results of the test: it works!
>
> I attach the boot log (dmesg output).

Thanks. Can you post the output of cat /proc/iomem?

YH

2009-04-19 20:24:49

by Yannick Roehlly

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sunday 19 April 2009 21:59:47, Yinghai Lu wrote:
> On Sun, Apr 19, 2009 at 12:35 PM, Yannick Roehlly
> thanks. can you post cat /proc/iomem ?

Here it is.

Yannick



Attachments:
(No filename) (180.00 B)
iomem (2.03 kB)

2009-04-20 22:32:59

by Ivan Kokshaysky

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Sun, Apr 19, 2009 at 11:06:15AM +0200, Ingo Molnar wrote:
> Hm, there's one patch in that lot that does:
>
> drivers/pci/bus.c | 8 +++++++-
> drivers/pci/probe.c | 8 ++++++--
> drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
> 3 files changed, 44 insertions(+), 12 deletions(-)
>
> Which should go via the PCI tree.

Here is a replacement for that patch which doesn't touch
the generic code.

Ivan.

---
x86 pci: first cut on 64-bit resource allocation

I believe that we should consider PCI memory above 4G as yet another
type of address space. This actually makes sense, as even accesses to that
memory are physically different - Dual Address Cycle (DAC) vs. 32-bit
Single Address Cycle (SAC).

So, a platform that can deal with 64-bit allocations would set up an
additional root bus resource and mark it with the IORESOURCE_MEM64 flag.

The main problem here is how the kernel would detect that the hardware can
actually access DAC memory (I know for a fact that a lot of Intel chipsets
cannot access MMIO >4G, even though subordinate p2p bridges are 64-bit
capable).
On the other hand, there are PCI devices with 64-bit BARs that do not
work properly when placed above the 4G boundary. For example, some
radeon cards have a 64-bit BAR for video RAM, but allocating that BAR in
the DAC area doesn't work for various reasons, like video-BIOS
limitations or drivers not taking into account that the GPU is 32-bit.

So moving stuff into the MEM64 area should be considered a generally unsafe
operation, and the best default policy is to not enable the MEM64 resource
unless we find that the BIOS has allocated something there.
At the same time, MEM64 can be easily enabled/disabled based on host
bridge PCI IDs, kernel command line options and so on.

Here is a basic implementation of the above for x86. I think it's a
reasonably good starting point for PCI64 work - the next step would
be to teach pci_bus_alloc_resource() about IORESOURCE_MEM64: the logic is
similar to the prefetch vs non-prefetch case - MEM can hold a MEM64 resource,
but not vice versa. And eventually the bridge sizing code will be updated
for reasonable 64-bit allocations (it's a non-trivial task, though).

This patch alone should fix cardbus >4G allocations and similar
nonsense.

Signed-off-by: Ivan Kokshaysky <[email protected]>
---
arch/x86/include/asm/pci.h | 8 ++++++++
arch/x86/pci/Makefile | 2 ++
arch/x86/pci/dac_64bit.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/pci/i386.c | 10 ++++++++++
include/linux/ioport.h | 2 ++
5 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index b51a1e8..5a9c54e 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -86,6 +86,14 @@ static inline void early_quirks(void) { }

extern void pci_iommu_alloc(void);

+#ifdef CONFIG_ARCH_PHYS_ADDR_T_64BIT
+extern void pcibios_pci64_setup(void);
+extern void pcibios_pci64_verify(void);
+#else
+static inline void pcibios_pci64_setup(void) { }
+static inline void pcibios_pci64_verify(void) { }
+#endif
+
/* MSI arch hook */
#define arch_setup_msi_irqs arch_setup_msi_irqs

diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
index d49202e..1b6c576 100644
--- a/arch/x86/pci/Makefile
+++ b/arch/x86/pci/Makefile
@@ -13,5 +13,7 @@ obj-$(CONFIG_X86_VISWS) += visws.o

obj-$(CONFIG_X86_NUMAQ) += numaq_32.o

+obj-$(CONFIG_ARCH_PHYS_ADDR_T_64BIT) += dac_64bit.o
+
obj-y += common.o early.o
obj-y += amd_bus.o
diff --git a/arch/x86/pci/dac_64bit.c b/arch/x86/pci/dac_64bit.c
new file mode 100644
index 0000000..ee03c4a
--- /dev/null
+++ b/arch/x86/pci/dac_64bit.c
@@ -0,0 +1,44 @@
+/*
+ * Set up the 64-bit bus resource for allocations > 4G if the hardware
+ * is capable of generating Dual Address Cycle (DAC).
+ */
+
+#include <linux/pci.h>
+
+static struct resource mem64 = {
+ .name = "PCI mem64",
+ .start = (resource_size_t)1 << 32, /* 4Gb */
+ .end = -1,
+ .flags = IORESOURCE_MEM,
+};
+
+void pcibios_pci64_setup(void)
+{
+ struct resource *r64 = &mem64, *root = &iomem_resource;
+ struct pci_bus *b;
+
+ if (insert_resource(root, r64)) {
+ printk(KERN_WARNING "PCI: Failed to allocate PCI64 space\n");
+ return;
+ }
+
+ list_for_each_entry(b, &pci_root_buses, node) {
+ /* Is this a "standard" root bus created by pci_create_bus? */
+ if (b->resource[1] != root || b->resource[2])
+ continue;
+ b->resource[2] = r64; /* create DAC resource */
+ }
+}
+
+void pcibios_pci64_verify(void)
+{
+ struct pci_bus *b;
+
+ if (mem64.flags & IORESOURCE_MEM64)
+ return; /* presumably DAC works */
+ list_for_each_entry(b, &pci_root_buses, node) {
+ if (b->resource[2] == &mem64)
+ b->resource[2] = NULL;
+ }
+ printk(KERN_INFO "PCI: allocations above 4G disabled\n");
+}
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index f1817f7..bf8eb75 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -137,6 +137,10 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
* range.
*/
r->flags = 0;
+ } else {
+ /* Successful allocation */
+ if (upper_32_bits(r->start))
+ pr->flags |= IORESOURCE_MEM64;
}
}
}
@@ -174,6 +178,10 @@ static void __init pcibios_allocate_resources(int pass)
/* We'll assign a new address later */
r->end -= r->start;
r->start = 0;
+ } else {
+ /* Successful allocation */
+ if (upper_32_bits(r->start))
+ pr->flags |= IORESOURCE_MEM64;
}
}
}
@@ -225,9 +233,11 @@ static int __init pcibios_assign_resources(void)
void __init pcibios_resource_survey(void)
{
DBG("PCI: Allocating resources\n");
+ pcibios_pci64_setup();
pcibios_allocate_bus_resources(&pci_root_buses);
pcibios_allocate_resources(0);
pcibios_allocate_resources(1);
+ pcibios_pci64_verify();

e820_reserve_resources_late();
}
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 32e4b2f..30403b3 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -49,6 +49,8 @@ struct resource_list {
#define IORESOURCE_SIZEALIGN 0x00020000 /* size indicates alignment */
#define IORESOURCE_STARTALIGN 0x00040000 /* start field is alignment */

+#define IORESOURCE_MEM64 0x00080000 /* 64-bit addressing, >4G */
+
#define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */
#define IORESOURCE_DISABLED 0x10000000
#define IORESOURCE_UNSET 0x20000000
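
For illustration only, here is a minimal user-space sketch of the policy that the pcibios_pci64_setup()/pcibios_pci64_verify() pair above appears to implement: the mem64 root resource stays enabled only if some BIOS-assigned resource was successfully claimed above 4G. This is not kernel code; struct res_model, upper32() and keep_mem64() are hypothetical names, and the model ignores the parent/child resource tree that the real patch walks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct resource. */
struct res_model {
	uint64_t start;
	uint64_t end;
};

/* Same idea as the kernel's upper_32_bits() helper. */
static inline uint32_t upper32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/*
 * Return true if at least one successfully claimed resource starts
 * above 4G. In the patch this corresponds to IORESOURCE_MEM64 ending
 * up set on mem64, so that pcibios_pci64_verify() leaves it attached.
 */
static inline bool keep_mem64(const struct res_model *res, size_t n)
{
	assert(res != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (upper32(res[i].start))
			return true;
	return false;
}
```

With only sub-4G BIOS assignments keep_mem64() returns false, which models the "PCI: allocations above 4G disabled" path.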

2009-04-20 22:55:47

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ivan Kokshaysky wrote:
> On Sun, Apr 19, 2009 at 11:06:15AM +0200, Ingo Molnar wrote:
>> Hm, there's one patch in that lot that does:
>>
>> drivers/pci/bus.c | 8 +++++++-
>> drivers/pci/probe.c | 8 ++++++--
>> drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
>> 3 files changed, 44 insertions(+), 12 deletions(-)
>>
>> Which should go via the PCI tree.
>
> Here is a replacement for that patch which doesn't touch
> the generic code.
>
> Ivan.
>
> ---
> x86 pci: first cut on 64-bit resource allocation
>
> I believe that we should consider PCI memory above 4G as yet another
> type of address space. This actually makes sense, as even accesses to that
> memory are physically different - Dual Address Cycle (DAC) vs. 32-bit
> Single Address Cycle (SAC).
>
> So, platform that can deal with 64-bit allocations would set up an
> additional root bus resource and mark it with IORESOURCE_MEM64 flag.
>
> The main problem here is how the kernel would detect that hardware can
> actually access a DAC memory (I know for a fact that a lot of Intel chipsets
> cannot access MMIO >4G, even though subordinate p2p bridges are 64-bit
> capable).
> On the other hand, there are PCI devices with 64-bit BARs that do not
> work properly being placed above 4G boundary. For example, some
> radeon cards have 64-bit BAR for video RAM, but allocating that BAR in
> the DAC area doesn't work for various reasons, like video-BIOS
> limitations or drivers not taking into account that GPU is 32-bit.
>
> So moving stuff into MEM64 area should be considered as generally unsafe
> operation, and the best default policy is to not enable MEM64 resource
> unless we find that BIOS has allocated something there.
> At the same time, MEM64 can be easily enabled/disabled based on host
> bridge PCI IDs, kernel command line options and so on.
>
> Here is a basic implementation of the above for x86. I think it's
> reasonably good starting point for PCI64 work - the next step would
> be to teach pci_bus_alloc_resource() about IORESOURCE_MEM64: logic is
> similar to prefetch vs non-prefetch case - MEM can hold MEM64 resource,
> but not vice versa. And eventually bridge sizing code will be updated
> for reasonable 64-bit allocations (it's a non-trivial task, though).
>
> This patch alone should fix cardbus >4G allocations and similar
> nonsense.
>
> Signed-off-by: Ivan Kokshaysky <[email protected]>
> ---
> arch/x86/include/asm/pci.h | 8 ++++++++
> arch/x86/pci/Makefile | 2 ++
> arch/x86/pci/dac_64bit.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/pci/i386.c | 10 ++++++++++
> include/linux/ioport.h | 2 ++
> 5 files changed, 66 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> index b51a1e8..5a9c54e 100644
> --- a/arch/x86/include/asm/pci.h
> +++ b/arch/x86/include/asm/pci.h
> @@ -86,6 +86,14 @@ static inline void early_quirks(void) { }
>
> extern void pci_iommu_alloc(void);
>
> +#ifdef CONFIG_ARCH_PHYS_ADDR_T_64BIT
> +extern void pcibios_pci64_setup(void);
> +extern void pcibios_pci64_verify(void);
> +#else
> +static inline void pcibios_pci64_setup(void) { }
> +static inline void pcibios_pci64_verify(void) { }
> +#endif
> +
> /* MSI arch hook */
> #define arch_setup_msi_irqs arch_setup_msi_irqs
>
> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
> index d49202e..1b6c576 100644
> --- a/arch/x86/pci/Makefile
> +++ b/arch/x86/pci/Makefile
> @@ -13,5 +13,7 @@ obj-$(CONFIG_X86_VISWS) += visws.o
>
> obj-$(CONFIG_X86_NUMAQ) += numaq_32.o
>
> +obj-$(CONFIG_ARCH_PHYS_ADDR_T_64BIT) += dac_64bit.o
> +
> obj-y += common.o early.o
> obj-y += amd_bus.o
> diff --git a/arch/x86/pci/dac_64bit.c b/arch/x86/pci/dac_64bit.c
> new file mode 100644
> index 0000000..ee03c4a
> --- /dev/null
> +++ b/arch/x86/pci/dac_64bit.c
> @@ -0,0 +1,44 @@
> +/*
> + * Set up the 64-bit bus resource for allocations > 4G if the hardware
> + * is capable of generating Dual Address Cycle (DAC).
> + */
> +
> +#include <linux/pci.h>
> +
> +static struct resource mem64 = {
> + .name = "PCI mem64",
> + .start = (resource_size_t)1 << 32, /* 4Gb */
> + .end = -1,
> + .flags = IORESOURCE_MEM,
> +};
> +
> +void pcibios_pci64_setup(void)
> +{
> + struct resource *r64 = &mem64, *root = &iomem_resource;
> + struct pci_bus *b;
> +
> + if (insert_resource(root, r64)) {
> + printk(KERN_WARNING "PCI: Failed to allocate PCI64 space\n");
> + return;
> + }
> +
> + list_for_each_entry(b, &pci_root_buses, node) {
> + /* Is this a "standard" root bus created by pci_create_bus? */
> + if (b->resource[1] != root || b->resource[2])
> + continue;
> + b->resource[2] = r64; /* create DAC resource */
> + }
> +}
> +
> +void pcibios_pci64_verify(void)
> +{
> + struct pci_bus *b;
> +
> + if (mem64.flags & IORESOURCE_MEM64)
> + return; /* presumably DAC works */
> + list_for_each_entry(b, &pci_root_buses, node) {
> + if (b->resource[2] == &mem64)
> + b->resource[2] = NULL;
> + }
> + printk(KERN_INFO "PCI: allocations above 4G disabled\n");
> +}
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index f1817f7..bf8eb75 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -137,6 +137,10 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
> * range.
> */
> r->flags = 0;
> + } else {
> + /* Successful allocation */
> + if (upper_32_bits(r->start))
> + pr->flags |= IORESOURCE_MEM64;
> }
> }
> }
> @@ -174,6 +178,10 @@ static void __init pcibios_allocate_resources(int pass)
> /* We'll assign a new address later */
> r->end -= r->start;
> r->start = 0;
> + } else {
> + /* Successful allocation */
> + if (upper_32_bits(r->start))
> + pr->flags |= IORESOURCE_MEM64;
> }
> }
> }
> @@ -225,9 +233,11 @@ static int __init pcibios_assign_resources(void)
> void __init pcibios_resource_survey(void)
> {
> DBG("PCI: Allocating resources\n");
> + pcibios_pci64_setup();
> pcibios_allocate_bus_resources(&pci_root_buses);
> pcibios_allocate_resources(0);
> pcibios_allocate_resources(1);
> + pcibios_pci64_verify();
>
> e820_reserve_resources_late();
> }
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 32e4b2f..30403b3 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -49,6 +49,8 @@ struct resource_list {
> #define IORESOURCE_SIZEALIGN 0x00020000 /* size indicates alignment */
> #define IORESOURCE_STARTALIGN 0x00040000 /* start field is alignment */
>
> +#define IORESOURCE_MEM64 0x00080000 /* 64-bit addressing, >4G */
> +
> #define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */
> #define IORESOURCE_DISABLED 0x10000000
> #define IORESOURCE_UNSET 0x20000000

I don't think this is going to work on AMD systems with two or more HT chains.

It may also have problems on new Intel systems with two IOHs.

YH

2009-04-21 00:12:19

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ivan Kokshaysky wrote:
> On Sun, Apr 19, 2009 at 11:06:15AM +0200, Ingo Molnar wrote:
>> Hm, there's one patch in that lot that does:
>>
>> drivers/pci/bus.c | 8 +++++++-
>> drivers/pci/probe.c | 8 ++++++--
>> drivers/pci/setup-bus.c | 40 +++++++++++++++++++++++++++++++---------
>> 3 files changed, 44 insertions(+), 12 deletions(-)
>>
>> Which should go via the PCI tree.
>
> Here is a replacement for that patch which doesn't touch
> the generic code.
>
> Ivan.
>
> ---
> x86 pci: first cut on 64-bit resource allocation
>
> I believe that we should consider PCI memory above 4G as yet another
> type of address space. This actually makes sense, as even accesses to that
> memory are physically different - Dual Address Cycle (DAC) vs. 32-bit
> Single Address Cycle (SAC).
>
> So, platform that can deal with 64-bit allocations would set up an
> additional root bus resource and mark it with IORESOURCE_MEM64 flag.
>
> The main problem here is how the kernel would detect that hardware can
> actually access a DAC memory (I know for a fact that a lot of Intel chipsets
> cannot access MMIO >4G, even though subordinate p2p bridges are 64-bit
> capable).
> On the other hand, there are PCI devices with 64-bit BARs that do not
> work properly being placed above 4G boundary. For example, some
> radeon cards have 64-bit BAR for video RAM, but allocating that BAR in
> the DAC area doesn't work for various reasons, like video-BIOS
> limitations or drivers not taking into account that GPU is 32-bit.
>
> So moving stuff into MEM64 area should be considered as generally unsafe
> operation, and the best default policy is to not enable MEM64 resource
> unless we find that BIOS has allocated something there.
> At the same time, MEM64 can be easily enabled/disabled based on host
> bridge PCI IDs, kernel command line options and so on.
>
> Here is a basic implementation of the above for x86. I think it's
> reasonably good starting point for PCI64 work - the next step would
> be to teach pci_bus_alloc_resource() about IORESOURCE_MEM64: logic is
> similar to prefetch vs non-prefetch case - MEM can hold MEM64 resource,
> but not vice versa. And eventually bridge sizing code will be updated
> for reasonable 64-bit allocations (it's a non-trivial task, though).
>
> This patch alone should fix cardbus >4G allocations and similar
> nonsense.
>
> Signed-off-by: Ivan Kokshaysky <[email protected]>
> ---
> arch/x86/include/asm/pci.h | 8 ++++++++
> arch/x86/pci/Makefile | 2 ++
> arch/x86/pci/dac_64bit.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/pci/i386.c | 10 ++++++++++
> include/linux/ioport.h | 2 ++
> 5 files changed, 66 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> index b51a1e8..5a9c54e 100644
> --- a/arch/x86/include/asm/pci.h
> +++ b/arch/x86/include/asm/pci.h
> @@ -86,6 +86,14 @@ static inline void early_quirks(void) { }
>
> extern void pci_iommu_alloc(void);
>
> +#ifdef CONFIG_ARCH_PHYS_ADDR_T_64BIT
> +extern void pcibios_pci64_setup(void);
> +extern void pcibios_pci64_verify(void);
> +#else
> +static inline void pcibios_pci64_setup(void) { }
> +static inline void pcibios_pci64_verify(void) { }
> +#endif
> +
> /* MSI arch hook */
> #define arch_setup_msi_irqs arch_setup_msi_irqs
>
> diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
> index d49202e..1b6c576 100644
> --- a/arch/x86/pci/Makefile
> +++ b/arch/x86/pci/Makefile
> @@ -13,5 +13,7 @@ obj-$(CONFIG_X86_VISWS) += visws.o
>
> obj-$(CONFIG_X86_NUMAQ) += numaq_32.o
>
> +obj-$(CONFIG_ARCH_PHYS_ADDR_T_64BIT) += dac_64bit.o
> +
> obj-y += common.o early.o
> obj-y += amd_bus.o
> diff --git a/arch/x86/pci/dac_64bit.c b/arch/x86/pci/dac_64bit.c
> new file mode 100644
> index 0000000..ee03c4a
> --- /dev/null
> +++ b/arch/x86/pci/dac_64bit.c
> @@ -0,0 +1,44 @@
> +/*
> + * Set up the 64-bit bus resource for allocations > 4G if the hardware
> + * is capable of generating Dual Address Cycle (DAC).
> + */
> +
> +#include <linux/pci.h>
> +
> +static struct resource mem64 = {
> + .name = "PCI mem64",
> + .start = (resource_size_t)1 << 32, /* 4Gb */
> + .end = -1,
> + .flags = IORESOURCE_MEM,
> +};
> +
> +void pcibios_pci64_setup(void)
> +{
> + struct resource *r64 = &mem64, *root = &iomem_resource;
> + struct pci_bus *b;
> +
> + if (insert_resource(root, r64)) {
> + printk(KERN_WARNING "PCI: Failed to allocate PCI64 space\n");
> + return;
> + }
> +
> + list_for_each_entry(b, &pci_root_buses, node) {
> + /* Is this a "standard" root bus created by pci_create_bus? */
> + if (b->resource[1] != root || b->resource[2])
> + continue;
> + b->resource[2] = r64; /* create DAC resource */
> + }
> +}
> +
> +void pcibios_pci64_verify(void)
> +{
> + struct pci_bus *b;
> +
> + if (mem64.flags & IORESOURCE_MEM64)
> + return; /* presumably DAC works */
> + list_for_each_entry(b, &pci_root_buses, node) {
> + if (b->resource[2] == &mem64)
> + b->resource[2] = NULL;
> + }
> + printk(KERN_INFO "PCI: allocations above 4G disabled\n");
> +}
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index f1817f7..bf8eb75 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -137,6 +137,10 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
> * range.
> */
> r->flags = 0;
> + } else {
> + /* Successful allocation */
> + if (upper_32_bits(r->start))
> + pr->flags |= IORESOURCE_MEM64;
> }
> }
> }
> @@ -174,6 +178,10 @@ static void __init pcibios_allocate_resources(int pass)
> /* We'll assign a new address later */
> r->end -= r->start;
> r->start = 0;
> + } else {
> + /* Successful allocation */
> + if (upper_32_bits(r->start))
> + pr->flags |= IORESOURCE_MEM64;
> }
> }
> }
> @@ -225,9 +233,11 @@ static int __init pcibios_assign_resources(void)
> void __init pcibios_resource_survey(void)
> {
> DBG("PCI: Allocating resources\n");
> + pcibios_pci64_setup();
> pcibios_allocate_bus_resources(&pci_root_buses);
> pcibios_allocate_resources(0);
> pcibios_allocate_resources(1);
> + pcibios_pci64_verify();
>
> e820_reserve_resources_late();
> }
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 32e4b2f..30403b3 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -49,6 +49,8 @@ struct resource_list {
> #define IORESOURCE_SIZEALIGN 0x00020000 /* size indicates alignment */
> #define IORESOURCE_STARTALIGN 0x00040000 /* start field is alignment */
>
> +#define IORESOURCE_MEM64 0x00080000 /* 64-bit addressing, >4G */
> +
> #define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */
> #define IORESOURCE_DISABLED 0x10000000
> #define IORESOURCE_UNSET 0x20000000

Also, the logic seems wrong.

We should determine whether a PCI resource supports 64-bit from pci_read_bases() instead of
pcibios_allocate_resources().

Think about this case: a PCI bridge sits on bus 0 (aka the first peer root bus), and some device under
the bridge doesn't get a resource allocated by the BIOS, yet the bridge and device do support pref mem64.
Then your patch will not get IORESOURCE_MEM64 set before pcibios_pci64_verify()...

The correct logic should be:
record for every device whether it supports 64-bit pref mem (with pci_read_bases()), and make sure the bridge is
consistent with the devices under it.
That is what my patch does: pci: don't assume pref memio are 64bit -v2
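
As an illustration of the per-BAR information meant here: the low bits of a memory BAR encode both the prefetchable bit and the 32/64-bit type. Below is a minimal user-space sketch, assuming the bit layout from include/linux/pci_regs.h. The helper names and the BAR values in the example are made up, and this is not the code of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of a BAR, values as in include/linux/pci_regs.h. */
#define PCI_BASE_ADDRESS_SPACE_IO      0x01
#define PCI_BASE_ADDRESS_MEM_TYPE_MASK 0x06
#define PCI_BASE_ADDRESS_MEM_TYPE_64   0x04
#define PCI_BASE_ADDRESS_MEM_PREFETCH  0x08

/* Hypothetical helper: is this raw BAR value a 64-bit memory BAR? */
static inline bool bar_is_mem64(uint32_t bar)
{
	if (bar & PCI_BASE_ADDRESS_SPACE_IO)
		return false;	/* I/O BARs have no 64-bit type */
	return (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
	       PCI_BASE_ADDRESS_MEM_TYPE_64;
}

/* Hypothetical helper: 64-bit and prefetchable at the same time? */
static inline bool bar_is_pref_mem64(uint32_t bar)
{
	return bar_is_mem64(bar) && (bar & PCI_BASE_ADDRESS_MEM_PREFETCH);
}

/* Usage example with made-up BAR values. */
void bar_type_example(void)
{
	assert(bar_is_pref_mem64(0xc000000cu));	/* 64-bit, prefetchable */
	assert(!bar_is_mem64(0xc0000008u));	/* 32-bit, prefetchable */
	assert(!bar_is_mem64(0x0000e001u));	/* I/O BAR */
}
```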

YH

2009-04-21 10:54:55

by Ivan Kokshaysky

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Mon, Apr 20, 2009 at 03:52:26PM -0700, Yinghai Lu wrote:
> I don't think this is going to work on AMD systems with two or more HT chains.
>
> It may also have problems on new Intel systems with two IOHs.

There are no problems with these systems, my patch is just a no-op
for them, which is fine for now - please note this is only a first
step to get 64-bit allocations right.

Ivan.

2009-04-21 10:56:33

by Ivan Kokshaysky

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Mon, Apr 20, 2009 at 05:09:32PM -0700, Yinghai Lu wrote:
> Also, the logic seems wrong.
>
> We should determine whether a PCI resource supports 64-bit from pci_read_bases() instead of
> pcibios_allocate_resources().

pci_read_bases() is already providing all necessary information.

> Think about this case: a PCI bridge sits on bus 0 (aka the first peer root bus), and some device under
> the bridge doesn't get a resource allocated by the BIOS, yet the bridge and device do support pref mem64.
> Then your patch will not get IORESOURCE_MEM64 set before pcibios_pci64_verify()...

Yes, that's my point - even if host bridge, p2p bridge and device are 64-bit,
there is absolutely no guarantee that after moving the BARs above 4G
the device will work correctly.

> The correct logic should be:
> record for every device whether it supports 64-bit pref mem (with pci_read_bases()), and make sure the bridge is
> consistent with the devices under it.

Your view is very x86 centric, please don't forget that drivers/pci
code is used by other architectures as well:
- limiting 32-bit allocations to 0xffffffff simply breaks non-x86
architectures. Alpha doesn't even boot with your patch;
- there are lots of devices with 64-bit non-prefetchable memory BARs,
you don't seem to care about that.

And your patch doesn't work even on x86:

00:01.0 PCI bridge: Intel Corporation 82945G/GZ/P/PL PCI Express Root Port (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 16 bytes
Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
I/O behind bridge: 0000e000-0000efff
Memory behind bridge: cdf00000-cfffffff
Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [88] Subsystem: Intel Corporation Device 0000
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4159
Capabilities: [a0] Express (v1) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #2, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <256ns, L1 <4us
ClockPM- Suprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
Slot # 0, PowerLimit 75.000000; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd On, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
00: 86 80 71 27 07 05 10 00 02 00 04 06 04 00 01 00
10: 00 00 00 00 00 00 00 00 00 04 04 00 e0 e0 00 00
20: f0 cd f0 cf 01 d0 f1 df 00 00 00 00 00 00 00 00
30: 00 00 00 00 88 00 00 00 00 00 00 00 0b 01 0a 00

As one can see, the prefetchable base register (0x24) is 0xd001, the
bit 0 indicates 64-bitness. Which is not true, as i82945G/GZ/P/PL
only supports 32-bit addressing (please check the datasheet).
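
The decoding of those registers can be sketched in user space as follows. The register values are taken from the dump above (offsets 0x24 and 0x26, little-endian); the helper names and PREF_ADDR_BITS are hypothetical, while the type mask and the 64-bit type value follow include/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Low nibble of the prefetchable base/limit registers, as in pci_regs.h. */
#define PCI_PREF_RANGE_TYPE_MASK 0x0f
#define PCI_PREF_RANGE_TYPE_64   0x01

/* Hypothetical name: the upper 12 bits carry address bits 31:20. */
#define PREF_ADDR_BITS 0xfff0

/* Start of the prefetchable window (lower 32 bits). */
static inline uint32_t pref_base(uint16_t reg)
{
	return (uint32_t)(reg & PREF_ADDR_BITS) << 16;
}

/* End of the prefetchable window; the low 20 bits are all ones. */
static inline uint32_t pref_limit(uint16_t reg)
{
	return ((uint32_t)(reg & PREF_ADDR_BITS) << 16) | 0xfffffu;
}

/* Does the bridge advertise 64-bit prefetchable decoding? */
static inline int pref_claims_64bit(uint16_t reg)
{
	return (reg & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64;
}

void decode_dump_example(void)
{
	uint16_t base_reg = 0xd001;	/* bytes "01 d0" at 0x24 */
	uint16_t limit_reg = 0xdff1;	/* bytes "f1 df" at 0x26 */

	assert(pref_base(base_reg) == 0xd0000000u);
	assert(pref_limit(limit_reg) == 0xdfffffffu);
	assert(pref_claims_64bit(base_reg));
}
```

The decoded window matches the "Prefetchable memory behind bridge" line printed by lspci, and the type nibble is what makes the bridge look 64-bit capable to software.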

Also, your patch can't handle transparent bridges. And it doesn't
bode well for bus sizing code.

> That is what my patch does: pci: don't assume pref memio are 64bit -v2

Your patch is all wrong, sorry.

Ivan.

2009-04-21 15:41:22

by Jesse Barnes

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

On Tue, 21 Apr 2009 02:33:05 +0400
Ivan Kokshaysky <[email protected]> wrote:
> x86 pci: first cut on 64-bit resource allocation
>
> I believe that we should consider PCI memory above 4G as yet another
> type of address space. This actually makes sense, as even accesses to
> that memory are physically different - Dual Address Cycle (DAC) vs.
> 32-bit Single Address Cycle (SAC).
>
> So, platform that can deal with 64-bit allocations would set up an
> additional root bus resource and mark it with IORESOURCE_MEM64 flag.
>
> The main problem here is how the kernel would detect that hardware can
> actually access a DAC memory (I know for a fact that a lot of Intel
> chipsets cannot access MMIO >4G, even though subordinate p2p bridges
> are 64-bit capable).
> On the other hand, there are PCI devices with 64-bit BARs that do not
> work properly being placed above 4G boundary. For example, some
> radeon cards have 64-bit BAR for video RAM, but allocating that BAR in
> the DAC area doesn't work for various reasons, like video-BIOS
> limitations or drivers not taking into account that GPU is 32-bit.
>
> So moving stuff into MEM64 area should be considered as generally
> unsafe operation, and the best default policy is to not enable MEM64
> resource unless we find that BIOS has allocated something there.
> At the same time, MEM64 can be easily enabled/disabled based on host
> bridge PCI IDs, kernel command line options and so on.

This sounds like reasonable default behavior given the variety of
chipsets and device quirks out there. This does muddy up the arch vs.
generic code just a little bit more though; iirc mips & ia64 use full
64 bit ranges for their current IORESOURCE_MEM types (hm now that I've
checked it appears ia64 has changed a bit here, but still other
arches should probably get cleaned up to use the new 64 bit type at
some point).

> Here is a basic implementation of the above for x86. I think it's
> reasonably good starting point for PCI64 work - the next step would
> be to teach pci_bus_alloc_resource() about IORESOURCE_MEM64: logic is
> similar to prefetch vs non-prefetch case - MEM can hold MEM64
> resource, but not vice versa. And eventually bridge sizing code will
> be updated for reasonable 64-bit allocations (it's a non-trivial
> task, though).
>
> This patch alone should fix cardbus >4G allocations and similar
> nonsense.
>
> Signed-off-by: Ivan Kokshaysky <[email protected]>

Nice. Any cleanups to existing arch code could be done at the same
time as the updates to the bus allocation.

Anyone care to send me some tested-by lines?

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2009-04-21 15:59:26

by Yinghai Lu

Subject: Re: [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

Ivan Kokshaysky wrote:
> On Mon, Apr 20, 2009 at 05:09:32PM -0700, Yinghai Lu wrote:
>> Also, the logic seems wrong.
>>
>> We should determine whether a PCI resource supports 64-bit from pci_read_bases() instead of
>> pcibios_allocate_resources().
>
> pci_read_bases() is already providing all necessary information.

The current pci_read_bases() does not tell us whether the device supports 32-bit or 64-bit pref mem.
That is, the BAR carries a type for that, but the type is not recorded in res->flags.

>
>> Think about this case: a PCI bridge sits on bus 0 (aka the first peer root bus), and some device under
>> the bridge doesn't get a resource allocated by the BIOS, yet the bridge and device do support pref mem64.
>> Then your patch will not get IORESOURCE_MEM64 set before pcibios_pci64_verify()...
>
> Yes, that's my point - even if host bridge, p2p bridge and device are 64-bit,
> there is absolutely no guarantee that after moving the BARs above 4G
> the device will work correctly.
>
>> The correct logic should be:
>> record for every device whether it supports 64-bit pref mem (with pci_read_bases()), and make sure the bridge is
>> consistent with the devices under it.
>
> Your view is very x86 centric, please don't forget that drivers/pci
> code is used by other architectures as well:
> - limiting 32-bit allocations to 0xffffffff simply breaks non-x86
> architectures. Alpha doesn't even boot with your patch;

Will look at it; I may limit that to the x86 arch.

> - there are lots of devices with 64-bit non-prefetchable memory BARs,
> you don't seem to care about that.

The fact is that the current P2P bridge spec says the mem (non-pref) window is 32-bit,
and the pref mem window can be 64-bit or 32-bit.

For devices directly on root buses, we may need to extend the current
res->flags & IORESOURCE_PREFETCH check to support that.

>
> And your patch doesn't work even on x86:
>
> 00:01.0 PCI bridge: Intel Corporation 82945G/GZ/P/PL PCI Express Root Port (rev 02) (prog-if 00 [Normal decode])
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 16 bytes
> Bus: primary=00, secondary=04, subordinate=04, sec-latency=0
> I/O behind bridge: 0000e000-0000efff
> Memory behind bridge: cdf00000-cfffffff
> Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> Capabilities: [88] Subsystem: Intel Corporation Device 0000
> Capabilities: [80] Power Management version 2
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
> Address: fee0300c Data: 4159
> Capabilities: [a0] Express (v1) Root Port (Slot+), MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
> ExtTag- RBE- FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 128 bytes
> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #2, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <256ns, L1 <4us
> ClockPM- Suprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
> Slot # 0, PowerLimit 75.000000; Interlock- NoCompl-
> SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
> Control: AttnInd Off, PwrInd On, Power- Interlock-
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> Changed: MRL- PresDet+ LinkState-
> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> RootCap: CRSVisible-
> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> 00: 86 80 71 27 07 05 10 00 02 00 04 06 04 00 01 00
> 10: 00 00 00 00 00 00 00 00 00 04 04 00 e0 e0 00 00
> 20: f0 cd f0 cf 01 d0 f1 df 00 00 00 00 00 00 00 00
> 30: 00 00 00 00 88 00 00 00 00 00 00 00 0b 01 0a 00
>
> As one can see, the prefetchable base register (0x24) is 0xd001, the
> bit 0 indicates 64-bitness. Which is not true, as i82945G/GZ/P/PL
> only supports 32-bit addressing (please check the datasheet).

That should be handled with a quirk.

>
> Also, your patch can't handle transparent bridges. And it doesn't
> bode well for bus sizing code.

My patch only tries to handle one case:
don't assume all pref mem is 64-bit, so don't assign a resource above 4G to a PCI bridge
when a device under it only supports 32-bit pref mem.

For bus sizing, we could maybe use your mem64 res, but always set that flag,
and in pci_bus_size_bridges() try to expand

	mask = IORESOURCE_MEM;
	prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
	if (pbus_size_mem(bus, prefmask, prefmask))
		mask = prefmask; /* Success, size non-prefetch only. */
	pbus_size_mem(bus, mask, IORESOURCE_MEM);

to make more attempts that take that flag into account.

YH

2009-04-22 22:39:26

by Yinghai Lu

Subject: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3


This is one system with 4G of RAM installed (there is a 1G hole below 4G).

With 4G installed, the BIOS puts the ACPI regions etc. right below the hole:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
[ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
[ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
So in the kernel, resources will be reserved: 0xbffa0000 - 0xbfff0000 for ACPI,
0x100000 - 0xbffa0000 for RAM...

and the BIOS sets
[ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
[ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
That bridge window conflicts with the reserved resources, so the kernel cannot reserve it.
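
The conflict can be checked with a small user-space sketch. The ranges are taken from the e820 map and the PCI log lines above, converted to inclusive end addresses; struct range and ranges_overlap() are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address range with an inclusive end. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Two inclusive ranges overlap if each starts before the other ends. */
static inline bool ranges_overlap(struct range a, struct range b)
{
	assert(a.start <= a.end && b.start <= b.end);
	return a.start <= b.end && b.start <= a.end;
}

void bridge_conflict_example(void)
{
	struct range bridge = { 0xbdf00000ULL, 0xddefffffULL }; /* 00:01.0 pref window */
	struct range ram    = { 0x00100000ULL, 0xbff9ffffULL }; /* e820 usable */
	struct range acpi   = { 0xbffa0000ULL, 0xbffeffffULL }; /* ACPI data + NVS */

	/* The bridge window starts inside RAM and covers the ACPI ranges. */
	assert(ranges_overlap(bridge, ram));
	assert(ranges_overlap(bridge, acpi));
}
```

The device BAR at [0xc0000000-0xcfffffff] does not overlap any e820 entry by itself; it is the enclosing bridge window that collides.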

Then the kernel tries to get a range from 0x140000000 (5G, above the RAM and above 4G)
and lets the bridge and the ATI card use it.

But the problem is that the ATI card only supports 32-bit ...

We should not assign a 64-bit range to a PCI device that only takes 32-bit pref mem.

So set PCI_PREF_RANGE_TYPE_64 in the 64-bit resources of a pci_dev (in addition to those of a PCI bridge),
and make the bus resource have that bit set only when all devices under it support
64-bit pref mem.
Then use that flag to decide the max limit for find/request.
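
The rule described above can be modelled in user space roughly as follows. This is only a sketch of the idea, not the code of the patch: the MODEL_* flag values and both helper names are hypothetical, and treating a bridge with no prefetchable resources as 32-bit is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this model only, not the kernel's. */
#define MODEL_MEM      0x1ul
#define MODEL_PREFETCH 0x2ul
#define MODEL_MEM_64   0x4ul

/*
 * The prefetchable window of a bridge may be treated as 64-bit only if
 * every prefetchable resource behind it is 64-bit capable.
 */
static inline bool window_is_64bit(const unsigned long *flags, size_t n)
{
	bool seen_pref = false;

	assert(flags != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(flags[i] & MODEL_PREFETCH))
			continue;	/* goes into the non-pref window */
		if (!(flags[i] & MODEL_MEM_64))
			return false;	/* one 32-bit device forces a 32-bit window */
		seen_pref = true;
	}
	return seen_pref;
}

/* Highest address the allocator may hand out for such a window. */
static inline uint64_t window_max_addr(bool is_64bit)
{
	return is_64bit ? UINT64_MAX : 0xffffffffull;
}
```

With a 32-bit-only prefetchable BAR behind the bridge, as on the reported ATI card, window_is_64bit() returns false and the allocation limit stays below 4G.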

v2: fix b_res->flags, the logic, and passing of the result.
v3: split iomem into iomem32 and iomem64; iomem64 takes IORESOURCE_MEM_64

[Impact: don't assign a wrong range to a device that doesn't support it]


Reported-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/bus.c | 6 +++
drivers/pci/probe.c | 12 +++++--
drivers/pci/setup-bus.c | 41 ++++++++++++++++++-------
include/linux/ioport.h | 4 ++
init/main.c | 7 ++++
kernel/resource.c | 76 ++++++++++++++++++++++++++++++++++++++----------
6 files changed, 117 insertions(+), 29 deletions(-)

Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -53,6 +53,12 @@ pci_bus_alloc_resource(struct pci_bus *b
if ((res->flags ^ r->flags) & type_mask)
continue;

+ /* We cannot allocate a mem_32 resource
+ from a mem_64 area */
+ if ((r->flags & IORESOURCE_MEM_64) &&
+ !(res->flags & IORESOURCE_MEM_64))
+ continue;
+
/* We cannot allocate a non-prefetching resource
from a pre-fetching area */
if ((r->flags & IORESOURCE_PREFETCH) &&
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -193,7 +193,7 @@ int __pci_read_base(struct pci_dev *dev,
res->flags |= pci_calc_resource_flags(l) | IORESOURCE_SIZEALIGN;
if (type == pci_bar_io) {
l &= PCI_BASE_ADDRESS_IO_MASK;
- mask = PCI_BASE_ADDRESS_IO_MASK & 0xffff;
+ mask = PCI_BASE_ADDRESS_IO_MASK & IO_SPACE_LIMIT;
} else {
l &= PCI_BASE_ADDRESS_MEM_MASK;
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
@@ -237,6 +237,8 @@ int __pci_read_base(struct pci_dev *dev,
dev_printk(KERN_DEBUG, &dev->dev,
"reg %x 64bit mmio: %pR\n", pos, res);
}
+
+ res->flags |= IORESOURCE_MEM_64;
} else {
sz = pci_size(l, sz, mask);

@@ -362,7 +364,10 @@ void __devinit pci_read_bridge_bases(str
}
}
if (base <= limit) {
- res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ res->flags = (mem_base_lo & PCI_PREF_RANGE_TYPE_MASK) |
+ IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if (res->flags & PCI_PREF_RANGE_TYPE_64)
+ res->flags |= IORESOURCE_MEM_64;
res->start = base;
res->end = limit + 0xfffff;
dev_printk(KERN_DEBUG, &dev->dev, "bridge %sbit mmio pref: %pR\n",
@@ -1178,7 +1183,8 @@ struct pci_bus * pci_create_bus(struct d

b->number = b->secondary = bus;
b->resource[0] = &ioport_resource;
- b->resource[1] = &iomem_resource;
+ b->resource[1] = &iomem32_resource;
+ b->resource[2] = &iomem64_resource;

set_pci_bus_resources_arch_default(b);

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -143,6 +143,7 @@ static void pci_setup_bridge(struct pci_
struct pci_dev *bridge = bus->self;
struct pci_bus_region region;
u32 l, bu, lu, io_upper16;
+ int pref_mem64;

if (pci_is_enabled(bridge))
return;
@@ -198,16 +199,22 @@ static void pci_setup_bridge(struct pci_
pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, 0);

/* Set up PREF base/limit. */
+ pref_mem64 = 0;
bu = lu = 0;
pcibios_resource_to_bus(bridge, &region, bus->resource[2]);
if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
+ int width = 8;
l = (region.start >> 16) & 0xfff0;
l |= region.end & 0xfff00000;
- bu = upper_32_bits(region.start);
- lu = upper_32_bits(region.end);
- dev_info(&bridge->dev, " PREFETCH window: %#016llx-%#016llx\n",
- (unsigned long long)region.start,
- (unsigned long long)region.end);
+ if (bus->resource[2]->flags & IORESOURCE_MEM_64) {
+ pref_mem64 = 1;
+ bu = upper_32_bits(region.start);
+ lu = upper_32_bits(region.end);
+ width = 16;
+ }
+ dev_info(&bridge->dev, " PREFETCH window: %#0*llx-%#0*llx\n",
+ width, (unsigned long long)region.start,
+ width, (unsigned long long)region.end);
}
else {
l = 0x0000fff0;
@@ -215,9 +222,11 @@ static void pci_setup_bridge(struct pci_
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);

- /* Set the upper 32 bits of PREF base & limit. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
- pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ if (pref_mem64) {
+ /* Set the upper 32 bits of PREF base & limit. */
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
+ pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ }

pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
@@ -255,8 +264,11 @@ static void pci_bridge_check_ranges(stru
pci_read_config_dword(bridge, PCI_PREF_MEMORY_BASE, &pmem);
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, 0x0);
}
- if (pmem)
+ if (pmem) {
b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if ((pmem & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64)
+ b_res[2].flags |= IORESOURCE_MEM_64;
+ }
}

/* Helper function for sizing routines: find first available
@@ -272,7 +284,8 @@ static struct resource *find_free_bus_re

for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
r = bus->resource[i];
- if (r == &ioport_resource || r == &iomem_resource)
+ if (r == &ioport_resource || r == &iomem32_resource
+ || r == &iomem64_resource)
continue;
if (r && (r->flags & type_mask) == type && !r->parent)
return r;
@@ -336,6 +349,7 @@ static int pbus_size_mem(struct pci_bus
resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */
int order, max_order;
struct resource *b_res = find_free_bus_resource(bus, type);
+ unsigned int mem64_mask = 0;

if (!b_res)
return 0;
@@ -344,9 +358,12 @@ static int pbus_size_mem(struct pci_bus
max_order = 0;
size = 0;

+ mem64_mask = b_res->flags & IORESOURCE_MEM_64;
+ b_res->flags &= ~IORESOURCE_MEM_64;
+
list_for_each_entry(dev, &bus->devices, bus_list) {
int i;
-
+
for (i = 0; i < PCI_NUM_RESOURCES; i++) {
struct resource *r = &dev->resource[i];
resource_size_t r_size;
@@ -372,6 +389,7 @@ static int pbus_size_mem(struct pci_bus
aligns[order] += align;
if (order > max_order)
max_order = order;
+ mem64_mask &= r->flags & IORESOURCE_MEM_64;
}
}

@@ -396,6 +414,7 @@ static int pbus_size_mem(struct pci_bus
b_res->start = min_align;
b_res->end = size + min_align - 1;
b_res->flags |= IORESOURCE_STARTALIGN;
+ b_res->flags |= mem64_mask;
return 1;
}

Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h
+++ linux-2.6/include/linux/ioport.h
@@ -49,6 +49,8 @@ struct resource_list {
#define IORESOURCE_SIZEALIGN 0x00020000 /* size indicates alignment */
#define IORESOURCE_STARTALIGN 0x00040000 /* start field is alignment */

+#define IORESOURCE_MEM_64 0x00100000
+
#define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */
#define IORESOURCE_DISABLED 0x10000000
#define IORESOURCE_UNSET 0x20000000
@@ -107,6 +109,8 @@ struct resource_list {
/* PC/ISA/whatever - the normal PC address spaces: IO and memory */
extern struct resource ioport_resource;
extern struct resource iomem_resource;
+extern struct resource iomem32_resource;
+extern struct resource iomem64_resource;

extern int request_resource(struct resource *root, struct resource *new);
extern int release_resource(struct resource *new);
Index: linux-2.6/kernel/resource.c
===================================================================
--- linux-2.6.orig/kernel/resource.c
+++ linux-2.6/kernel/resource.c
@@ -37,6 +37,20 @@ struct resource iomem_resource = {
};
EXPORT_SYMBOL(iomem_resource);

+/* need to insert those two under iomem */
+struct resource iomem32_resource = {
+ .name = "PCI mem 32bit",
+ .start = 0,
+ .end = 0xffffffff,
+ .flags = IORESOURCE_MEM,
+};
+struct resource iomem64_resource = {
+ .name = "PCI mem 64bit",
+ .start = 1ULL<<32,
+ .end = -1,
+ .flags = IORESOURCE_MEM | IORESOURCE_MEM_64,
+};
+
static DEFINE_RWLOCK(resource_lock);

static void *r_next(struct seq_file *m, void *v, loff_t *pos)
@@ -200,7 +214,15 @@ int request_resource(struct resource *ro
struct resource *conflict;

write_lock(&resource_lock);
- conflict = __request_resource(root, new);
+ if (root != &iomem_resource) {
+ conflict = __request_resource(root, new);
+ } else {
+ /* assume no cross */
+ if (new->start >= (1ULL<<32))
+ conflict = __request_resource(&iomem64_resource, new);
+ else
+ conflict = __request_resource(&iomem32_resource, new);
+ }
write_unlock(&resource_lock);
return conflict ? -EBUSY : 0;
}
@@ -437,20 +459,9 @@ int insert_resource(struct resource *par
return conflict ? -EBUSY : 0;
}

-/**
- * insert_resource_expand_to_fit - Insert a resource into the resource tree
- * @root: root resource descriptor
- * @new: new resource to insert
- *
- * Insert a resource into the resource tree, possibly expanding it in order
- * to make it encompass any conflicting resources.
- */
-void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
+static void __insert_resource_expand_to_fit(struct resource *root,
+ struct resource *new)
{
- if (new->parent)
- return;
-
- write_lock(&resource_lock);
for (;;) {
struct resource *conflict;

@@ -468,6 +479,31 @@ void insert_resource_expand_to_fit(struc

printk("Expanded resource %s due to conflict with %s\n", new->name, conflict->name);
}
+}
+
+/**
+ * insert_resource_expand_to_fit - Insert a resource into the resource tree
+ * @root: root resource descriptor
+ * @new: new resource to insert
+ *
+ * Insert a resource into the resource tree, possibly expanding it in order
+ * to make it encompass any conflicting resources.
+ */
+void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
+{
+ if (new->parent)
+ return;
+
+ write_lock(&resource_lock);
+ if (root != &iomem_resource) {
+ __insert_resource_expand_to_fit(root, new);
+ } else {
+ /* assume no cross */
+ if (new->start >= (1ULL<<32))
+ __insert_resource_expand_to_fit(&iomem64_resource, new);
+ else
+ __insert_resource_expand_to_fit(&iomem32_resource, new);
+ }
write_unlock(&resource_lock);
}

@@ -555,7 +591,17 @@ void __init reserve_region_with_split(st
const char *name)
{
write_lock(&resource_lock);
- __reserve_region_with_split(root, start, end, name);
+ if (root != &iomem_resource) {
+ __reserve_region_with_split(root, start, end, name);
+ } else {
+ /* assume no cross */
+ if (start >= (1ULL<<32))
+ __reserve_region_with_split(&iomem64_resource, start,
+ end, name);
+ else
+ __reserve_region_with_split(&iomem32_resource, start,
+ end, name);
+ }
write_unlock(&resource_lock);
}

Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -534,6 +534,12 @@ void __init __weak thread_info_cache_ini
{
}

+static void __init resource_init(void)
+{
+ insert_resource(&iomem_resource, &iomem32_resource);
+ insert_resource(&iomem_resource, &iomem64_resource);
+}
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -569,6 +575,7 @@ asmlinkage void __init start_kernel(void
page_address_init();
printk(KERN_NOTICE);
printk(linux_banner);
+ resource_init();
setup_arch(&command_line);
mm_init_owner(&init_mm, &init_task);
setup_command_line(command_line);
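
Editor's sketch of the routing rule that the kernel/resource.c hunks above add: a request against the iomem root is redirected to one of two sub-roots according to its start address, and the "assume no cross" comments leave a range that straddles 4 GiB unhandled. The enum and both helper names below are hypothetical stand-ins, not kernel symbols; this only illustrates the decision, not the resource tree itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for iomem32_resource / iomem64_resource. */
enum mem_root { MEM_ROOT_32, MEM_ROOT_64 };

#define FOUR_GIB (1ULL << 32)

/* Choose the sub-root from the start address alone, as the patch does. */
static enum mem_root pick_root(uint64_t start)
{
	return start >= FOUR_GIB ? MEM_ROOT_64 : MEM_ROOT_32;
}

/*
 * Report whether the inclusive range [start, end] straddles 4 GiB.
 * This is the case the "assume no cross" comments do not handle.
 */
static int crosses_4g(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return start < FOUR_GIB && end >= FOUR_GIB;
}
```

With the addresses from this thread, pick_root(0xc0000000) selects the 32-bit root and pick_root(0x140000000) the 64-bit one. A caller that cannot rule out straddling ranges would presumably need to check crosses_4g() first.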

2009-04-22 22:40:25

by Yinghai Lu

Subject: [RFC PATCH 2/2] pci: try to assign res for device under transparent bridges


try to get resource on parent bus if bridge is transparent...

[ Impact: second try to get resource for unassigned resource with transparent bridges ]

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/setup-bus.c | 1
drivers/pci/setup-res.c | 49 +++++++++++++++++++++++++++++++++++-------------
2 files changed, 36 insertions(+), 14 deletions(-)

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -58,7 +58,6 @@ static void pbus_assign_resources_sorted
res = list->res;
idx = res - &list->dev->resource[0];
if (pci_assign_resource(list->dev, idx)) {
- /* FIXME: get rid of this */
res->start = 0;
res->end = 0;
res->flags = 0;
Index: linux-2.6/drivers/pci/setup-res.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-res.c
+++ linux-2.6/drivers/pci/setup-res.c
@@ -135,23 +135,16 @@ void pci_disable_bridge_window(struct pc
}
#endif /* CONFIG_PCI_QUIRKS */

-int pci_assign_resource(struct pci_dev *dev, int resno)
+static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
+ int resno)
{
- struct pci_bus *bus = dev->bus;
struct resource *res = dev->resource + resno;
resource_size_t size, min, align;
int ret;

size = resource_size(res);
min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM;
-
align = resource_alignment(res);
- if (!align) {
- dev_info(&dev->dev, "BAR %d: can't allocate resource (bogus "
- "alignment) %pR flags %#lx\n",
- resno, res, res->flags);
- return -EINVAL;
- }

/* First, try exact prefetching match.. */
ret = pci_bus_alloc_resource(bus, res, size, align, min,
@@ -169,10 +162,7 @@ int pci_assign_resource(struct pci_dev *
pcibios_align_resource, dev);
}

- if (ret) {
- dev_info(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
- resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
- } else {
+ if (!ret) {
res->flags &= ~IORESOURCE_STARTALIGN;
if (resno < PCI_BRIDGE_RESOURCES)
pci_update_resource(dev, resno);
@@ -180,6 +170,39 @@ int pci_assign_resource(struct pci_dev *

return ret;
}
+
+int pci_assign_resource(struct pci_dev *dev, int resno)
+{
+ struct resource *res = dev->resource + resno;
+ resource_size_t align;
+ struct pci_bus *bus;
+ int ret;
+
+ align = resource_alignment(res);
+ if (!align) {
+ dev_info(&dev->dev, "BAR %d: can't allocate resource (bogus "
+ "alignment) %pR flags %#lx\n",
+ resno, res, res->flags);
+ return -EINVAL;
+ }
+
+ bus = dev->bus;
+ while ((ret = __pci_assign_resource(bus, dev, resno))) {
+ if (bus->self->transparent)
+ bus = bus->parent;
+ else
+ bus = NULL;
+ if (bus)
+ continue;
+ break;
+ }
+
+ if (ret)
+ dev_info(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
+ resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
+
+ return ret;
+}

#if 0
int pci_assign_resource_fixed(struct pci_dev *dev, int resno)
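
Editor's sketch of the fallback walk that the setup-res.c hunk above introduces: try the device's own bus first, and climb to the parent only while the bridge leading to the current bus is transparent. The structs and try_assign() are simplified, hypothetical models, not the kernel's types. The sketch also checks that the bus has a bridge before dereferencing it, on the assumption that a root bus has no "self" bridge; the loop as posted dereferences bus->self unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical models of struct pci_dev / struct pci_bus. */
struct bridge {
	int transparent;
};

struct bus {
	struct bus *parent;	/* NULL for a root bus */
	struct bridge *self;	/* bridge leading to this bus; NULL for a root bus */
	int has_free_space;	/* stand-in for "allocation would succeed" */
};

/* Stand-in for __pci_assign_resource(): 0 on success, -1 on failure. */
static int try_assign(const struct bus *bus)
{
	return bus->has_free_space ? 0 : -1;
}

/* Walk upward through transparent bridges until an allocation succeeds. */
static int assign_with_fallback(struct bus *bus)
{
	int ret = -1;

	assert(bus != NULL);
	while (bus) {
		ret = try_assign(bus);
		if (ret == 0)
			break;
		/* Only a transparent bridge forwards its parent's ranges. */
		if (bus->self && bus->self->transparent)
			bus = bus->parent;
		else
			bus = NULL;
	}
	return ret;
}
```

A device behind a transparent bridge whose own bus is full can still succeed from the parent; behind a non-transparent bridge the walk stops after the first failure.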

2009-04-22 22:49:19

by Jesse Barnes

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, 22 Apr 2009 15:37:04 -0700
Yinghai Lu <[email protected]> wrote:

>
> one system with 4g installed ( there is 1g hole)
>
> when 4G installed.
> BIOS put ACPI etc need the hole
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00
> (usable) [ 0.000000] BIOS-e820: 000000000009bc00 -
> 00000000000a0000 (reserved) [ 0.000000] BIOS-e820:
> 00000000000e3000 - 0000000000100000 (reserved) [ 0.000000]
> BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI
> data) [ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000
> (ACPI NVS) [ 0.000000] BIOS-e820: 00000000bfff0000 -
> 00000000c0000000 (reserved) [ 0.000000] BIOS-e820:
> 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000]
> BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000
> (usable) so in kernel resource will be reserved for 0xbffa0000 -
> 0xbfff0000 for ACPI 0x100000 - 0xbffa0000 for RAM...
>
> and BIOS set
> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref:
> [0xbdf00000-0xddefffff] [ 0.237102] pci 0000:01:00.0: reg 10 32bit
> mmio: [0xc0000000-0xcfffffff] that is conflict with reserved res. so
> it can not be reserved Kernel.
>
> then Kernel try to get range from 0x140000000 ( above the RAM, 5G and
> above 4g) and set let the bridge to use it, and ATI cards to use it.
>
> but the problem is that ATI only support 32bit ...

So Ivan's patch didn't work for you for this problem? I was planning
on applying it, but it would be nice to get some test results first.

--
Jesse Barnes, Intel Open Source Technology Center
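
Editor's sketch of the conflict described in the quoted report, worked as a plain interval-overlap check. The range struct and ranges_overlap() are hypothetical helpers; the addresses are the ones from the quoted e820 map and PCI log, with e820 end addresses converted to inclusive ends.

```c
#include <assert.h>
#include <stdint.h>

/* Address range with inclusive start and end. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Two inclusive ranges overlap when each starts no later than the other ends. */
static int ranges_overlap(struct range a, struct range b)
{
	assert(a.start <= a.end && b.start <= b.end);
	return a.start <= b.end && b.start <= a.end;
}
```

Applied to the quoted numbers, the bridge window [0xbdf00000-0xddefffff] overlaps both the RAM range [0x100000-0xbff9ffff] and the ACPI data range [0xbffa0000-0xbffadfff], while the device BAR [0xc0000000-0xcfffffff] on its own does not touch the reserved range ending at 0xbfffffff. That is consistent with the report: it is the bridge window, not the BAR, that cannot be reserved.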

2009-04-23 00:51:21

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

Jesse Barnes wrote:
> On Wed, 22 Apr 2009 15:37:04 -0700
> Yinghai Lu <[email protected]> wrote:
>
>> one system with 4g installed ( there is 1g hole)
>>
>> when 4G installed.
>> BIOS put ACPI etc need the hole
>> [ 0.000000] BIOS-provided physical RAM map:
>> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00
>> (usable) [ 0.000000] BIOS-e820: 000000000009bc00 -
>> 00000000000a0000 (reserved) [ 0.000000] BIOS-e820:
>> 00000000000e3000 - 0000000000100000 (reserved) [ 0.000000]
>> BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
>> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI
>> data) [ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000
>> (ACPI NVS) [ 0.000000] BIOS-e820: 00000000bfff0000 -
>> 00000000c0000000 (reserved) [ 0.000000] BIOS-e820:
>> 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000]
>> BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
>> [ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000
>> (usable) so in kernel resource will be reserved for 0xbffa0000 -
>> 0xbfff0000 for ACPI 0x100000 - 0xbffa0000 for RAM...
>>
>> and BIOS set
>> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref:
>> [0xbdf00000-0xddefffff] [ 0.237102] pci 0000:01:00.0: reg 10 32bit
>> mmio: [0xc0000000-0xcfffffff] that is conflict with reserved res. so
>> it can not be reserved Kernel.
>>
>> then Kernel try to get range from 0x140000000 ( above the RAM, 5G and
>> above 4g) and set let the bridge to use it, and ATI cards to use it.
>>
>> but the problem is that ATI only support 32bit ...
>
> So Ivan's patch didn't work for you for this problem? I was planning
> on applying it, but it would be nice to get some test results first.

It looks like Ivan's patch still has some problems.

YH

2009-04-23 01:05:24

by Jesse Barnes

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, 22 Apr 2009 17:49:18 -0700
Yinghai Lu <[email protected]> wrote:

> Jesse Barnes wrote:
> > On Wed, 22 Apr 2009 15:37:04 -0700
> > Yinghai Lu <[email protected]> wrote:
> >
> >> one system with 4g installed ( there is 1g hole)
> >>
> >> when 4G installed.
> >> BIOS put ACPI etc need the hole
> >> [ 0.000000] BIOS-provided physical RAM map:
> >> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00
> >> (usable) [ 0.000000] BIOS-e820: 000000000009bc00 -
> >> 00000000000a0000 (reserved) [ 0.000000] BIOS-e820:
> >> 00000000000e3000 - 0000000000100000 (reserved) [ 0.000000]
> >> BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
> >> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000
> >> (ACPI data) [ 0.000000] BIOS-e820: 00000000bffae000 -
> >> 00000000bfff0000 (ACPI NVS) [ 0.000000] BIOS-e820:
> >> 00000000bfff0000 - 00000000c0000000 (reserved) [ 0.000000]
> >> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> >> [ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000
> >> (reserved) [ 0.000000] BIOS-e820: 0000000100000000 -
> >> 0000000140000000 (usable) so in kernel resource will be reserved
> >> for 0xbffa0000 - 0xbfff0000 for ACPI 0x100000 - 0xbffa0000 for
> >> RAM...
> >>
> >> and BIOS set
> >> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref:
> >> [0xbdf00000-0xddefffff] [ 0.237102] pci 0000:01:00.0: reg 10
> >> 32bit mmio: [0xc0000000-0xcfffffff] that is conflict with reserved
> >> res. so it can not be reserved Kernel.
> >>
> >> then Kernel try to get range from 0x140000000 ( above the RAM, 5G
> >> and above 4g) and set let the bridge to use it, and ATI cards to
> >> use it.
> >>
> >> but the problem is that ATI only support 32bit ...
> >
> > So Ivan's patch didn't work for you for this problem? I was
> > planning on applying it, but it would be nice to get some test
> > results first.
>
> > It looks like Ivan's patch still has some problems.

Can you be more specific? :) I'd like to get this resolved properly as
well, and I think the principles Ivan outlined are the right ones to
follow...

--
Jesse Barnes, Intel Open Source Technology Center

2009-04-23 02:03:49

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, Apr 22, 2009 at 6:05 PM, Jesse Barnes <[email protected]> wrote:
> On Wed, 22 Apr 2009 17:49:18 -0700
> Yinghai Lu <[email protected]> wrote:
>
>> Jesse Barnes wrote:
>> > On Wed, 22 Apr 2009 15:37:04 -0700
>> > Yinghai Lu <[email protected]> wrote:
>> >
>> >> one system with 4g installed ( there is 1g hole)
>> >>
>> >> when 4G installed.
>> >> BIOS put ACPI etc need the hole
>> >> [ 0.000000] BIOS-provided physical RAM map:
>> >> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00
>> >> (usable) [ 0.000000] BIOS-e820: 000000000009bc00 -
>> >> 00000000000a0000 (reserved) [ 0.000000] BIOS-e820:
>> >> 00000000000e3000 - 0000000000100000 (reserved) [ 0.000000]
>> >> BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
>> >> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000
>> >> (ACPI data) [ 0.000000] BIOS-e820: 00000000bffae000 -
>> >> 00000000bfff0000 (ACPI NVS) [ 0.000000] BIOS-e820:
>> >> 00000000bfff0000 - 00000000c0000000 (reserved) [ 0.000000]
>> >> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
>> >> [ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000
>> >> (reserved) [ 0.000000] BIOS-e820: 0000000100000000 -
>> >> 0000000140000000 (usable) so in kernel resource will be reserved
>> >> for 0xbffa0000 - 0xbfff0000 for ACPI 0x100000 - 0xbffa0000 for
>> >> RAM...
>> >>
>> >> and BIOS set
>> >> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref:
>> >> [0xbdf00000-0xddefffff] [ 0.237102] pci 0000:01:00.0: reg 10
>> >> 32bit mmio: [0xc0000000-0xcfffffff] that is conflict with reserved
>> >> res. so it can not be reserved Kernel.
>> >>
>> >> then Kernel try to get range from 0x140000000 ( above the RAM, 5G
>> >> and above 4g) and set let the bridge to use it, and ATI cards to
>> >> use it.
>> >>
>> >> but the problem is that ATI only support 32bit ...
>> >
>> > So Ivan's patch didn't work for you for this problem? I was
>> > planning on applying it, but it would be nice to get some test
>> > results first.
>>
>> It looks like Ivan's patch still has some problems.
>
> Can you be more specific? :) I'd like to get this resolved properly as
> well, and I think the principles Ivan outlined are the right ones to
> follow...
>

Whether a BAR supports 64-bit should be determined from what is read in
pci_read_bases and pci_bridge_read_bases...

In other words, the logic in Ivan's patch is somewhat wrong.

YH

2009-04-23 02:13:33

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

Jesse Barnes wrote:
> On Wed, 22 Apr 2009 17:49:18 -0700
> Yinghai Lu <[email protected]> wrote:
>
>> Jesse Barnes wrote:
>>> On Wed, 22 Apr 2009 15:37:04 -0700
>>> Yinghai Lu <[email protected]> wrote:
>>>
>>>> one system with 4g installed ( there is 1g hole)
>>>>
>>>> when 4G installed.
>>>> BIOS put ACPI etc need the hole
>>>> [ 0.000000] BIOS-provided physical RAM map:
>>>> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00
>>>> (usable) [ 0.000000] BIOS-e820: 000000000009bc00 -
>>>> 00000000000a0000 (reserved) [ 0.000000] BIOS-e820:
>>>> 00000000000e3000 - 0000000000100000 (reserved) [ 0.000000]
>>>> BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
>>>> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000
>>>> (ACPI data) [ 0.000000] BIOS-e820: 00000000bffae000 -
>>>> 00000000bfff0000 (ACPI NVS) [ 0.000000] BIOS-e820:
>>>> 00000000bfff0000 - 00000000c0000000 (reserved) [ 0.000000]
>>>> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
>>>> [ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000
>>>> (reserved) [ 0.000000] BIOS-e820: 0000000100000000 -
>>>> 0000000140000000 (usable) so in kernel resource will be reserved
>>>> for 0xbffa0000 - 0xbfff0000 for ACPI 0x100000 - 0xbffa0000 for
>>>> RAM...
>>>>
>>>> and BIOS set
>>>> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref:
>>>> [0xbdf00000-0xddefffff] [ 0.237102] pci 0000:01:00.0: reg 10
>>>> 32bit mmio: [0xc0000000-0xcfffffff] that is conflict with reserved
>>>> res. so it can not be reserved Kernel.
>>>>
>>>> then Kernel try to get range from 0x140000000 ( above the RAM, 5G
>>>> and above 4g) and set let the bridge to use it, and ATI cards to
>>>> use it.
>>>>
>>>> but the problem is that ATI only support 32bit ...
>>> So Ivan's patch didn't work for you for this problem? I was
>>> planning on applying it, but it would be nice to get some test
>>> results first.
>> It looks like Ivan's patch still has some problems.
>
> Can you be more specific? :) I'd like to get this resolved properly as
> well, and I think the principles Ivan outlined are the right ones to
> follow...

Also, an AMD system with two HT chains, or any other system that uses pci=use_crs to get the correct default root resources,
will get the annoying

PCI: allocations above 4G disabled

even though the system does support that.

It will also have problems with callers such as request_resource(&iomem_resource, ....)

YH

2009-04-23 12:36:23

by Ivan Kokshaysky

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, Apr 22, 2009 at 03:37:04PM -0700, Yinghai Lu wrote:
> one system with 4g installed ( there is 1g hole)
>
> when 4G installed.
> BIOS put ACPI etc need the hole
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
> [ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
> [ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
> [ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
> [ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> [ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
> so in kernel resource will be reserved for 0xbffa0000 - 0xbfff0000 for ACPI
> 0x100000 - 0xbffa0000 for RAM...
>
> and BIOS set
> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
> [ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
> that is conflict with reserved res. so it can not be reserved Kernel.
>
> then Kernel try to get range from 0x140000000 ( above the RAM, 5G and above 4g)
> and set let the bridge to use it, and ATI cards to use it.
>
> but the problem is that ATI only support 32bit ...

Yinghai, you are trying to get a quick fix for quite a complex problem
that cannot be solved with a quick fix. Even more, there is no rush on
a quick fix because it's not a critical bug at all. 32-bit stuff ends
up above 4G *only* when there is no free space left on the 32-bit
PCI bus, and it can be considered a very effective (though rather ugly)
way of disabling the BAR that we failed to allocate.
In this particular case it was simply a side effect of the "pci_mem_start"
issue (which was indeed critical, but hopefully fixed now).


> +/* need to insert those two under iomem */
> +struct resource iomem32_resource = {
> + .name = "PCI mem 32bit",
> + .start = 0,
> + .end = 0xffffffff,
> + .flags = IORESOURCE_MEM,
> +};
> +struct resource iomem64_resource = {
> + .name = "PCI mem 64bit",
> + .start = 1ULL<<32,
> + .end = -1,
> + .flags = IORESOURCE_MEM | IORESOURCE_MEM_64,
> +};
> +

This only works on x86 and similar systems with 1:1 CPU address to bus
address mapping. There are a lot of machines with multiple 32-bit PCI
bus spaces (4G per PCI domain).

Ivan.

2009-04-23 12:43:46

by Ingo Molnar

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3


* Ivan Kokshaysky <[email protected]> wrote:

> > +/* need to insert those two under iomem */
> > +struct resource iomem32_resource = {
> > + .name = "PCI mem 32bit",
> > + .start = 0,
> > + .end = 0xffffffff,
> > + .flags = IORESOURCE_MEM,
> > +};
> > +struct resource iomem64_resource = {
> > + .name = "PCI mem 64bit",
> > + .start = 1ULL<<32,
> > + .end = -1,
> > + .flags = IORESOURCE_MEM | IORESOURCE_MEM_64,
> > +};
> > +
>
> This only works on x86 and similar systems with 1:1 CPU address to
> > bus address mapping. There are a lot of machines with multiple
> 32-bit PCI bus spaces (4G per PCI domain).

If you mean this "only" works on 95% of the systems that test the
upstream kernel then yes.

Obviously other architectural needs have to be considered too, but
you are making it sound as if there was some vast, more important
space to consider that Yinghai did not consider in his foolishness
;-)

Ingo

2009-04-23 12:58:50

by Ivan Kokshaysky

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, Apr 22, 2009 at 07:03:33PM -0700, Yinghai Lu wrote:
> Whether a BAR supports 64-bit should be determined from what is read in
> pci_read_bases and pci_bridge_read_bases...

pci_read_bases: take a closer look at decode_bar() function and
PCI_BASE_ADDRESS_MEM_TYPE_64 flag.

pci_bridge_read_bases: we cannot rely on the bits 0-3 of
PCI_PREF_MEMORY_BASE providing correct information anyway.
More reliable check for 64-bitness would be a test for
PCI_PREF_BASE_UPPER32 register being r/w, which I think
belongs in pci_bridge_check_ranges().

Ivan.
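
Editor's sketch of the two type fields under discussion: the memory-type bits in a device BAR, and the low nibble of a bridge's prefetchable memory base. The register constants use the values from include/linux/pci_regs.h; the two helper functions are hypothetical. Note the second helper only reports what the bridge's type field claims, which is exactly the information Ivan says cannot be relied on; his suggested alternative, probing whether PCI_PREF_BASE_UPPER32 is writable, needs config-space access and is not modelled here.

```c
#include <stdint.h>

/* Values as defined in include/linux/pci_regs.h. */
#define PCI_BASE_ADDRESS_SPACE_IO	0x01
#define PCI_BASE_ADDRESS_MEM_TYPE_MASK	0x06
#define PCI_BASE_ADDRESS_MEM_TYPE_64	0x04
#define PCI_PREF_RANGE_TYPE_MASK	0x0f
#define PCI_PREF_RANGE_TYPE_64		0x01

/* Nonzero if the raw low BAR dword describes a 64-bit memory BAR. */
static int bar_is_mem64(uint32_t bar)
{
	if (bar & PCI_BASE_ADDRESS_SPACE_IO)
		return 0;	/* I/O BARs have no 64-bit type field */
	return (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
	       PCI_BASE_ADDRESS_MEM_TYPE_64;
}

/* Nonzero if the bridge's prefetchable base register claims 64-bit decode. */
static int pref_window_claims_64(uint16_t pref_base)
{
	return (pref_base & PCI_PREF_RANGE_TYPE_MASK) ==
	       PCI_PREF_RANGE_TYPE_64;
}
```

For example, a raw BAR value of 0xc0000008 (prefetchable, 32-bit) yields 0 from bar_is_mem64(), while 0xc000000c (prefetchable, 64-bit) yields 1.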

2009-04-23 13:09:36

by Ivan Kokshaysky

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Thu, Apr 23, 2009 at 02:41:54PM +0200, Ingo Molnar wrote:
> If you mean this "only" works on 95% of the systems that test the
> upstream kernel then yes.
>
> Obviously other architectural needs have to be considered too, but
> you are making it sound as if there was some vast, more important
> space to consider that Yinghai did not consider in his foolishness
> ;-)

Maybe. But considering the fact that precisely 75% of the machines
I'm using everyday are alphas, I think it's excusable ;-)

Ivan.

2009-04-23 13:22:13

by Ivan Kokshaysky

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Wed, Apr 22, 2009 at 07:10:29PM -0700, Yinghai Lu wrote:
> Also, an AMD system with two HT chains, or any other system that uses pci=use_crs to get the correct default root resources,
> will get the annoying
>
> PCI: allocations above 4G disabled
>
> even though the system does support that.

Yep, but it's easy to fix (patch applies on the top of the previous one).

> It will also have problems with callers such as request_resource(&iomem_resource, ....)

I don't think so. All critical resources are inserted much earlier than
mem64 one, and request_resource(&iomem_resource, ...) at later stages
would most likely fail regardless of mem64 thing.
Or am I missing something?

Ivan.

diff --git a/arch/x86/pci/dac_64bit.c b/arch/x86/pci/dac_64bit.c
index ee03c4a..35ffee3 100644
--- a/arch/x86/pci/dac_64bit.c
+++ b/arch/x86/pci/dac_64bit.c
@@ -33,12 +33,16 @@ void pcibios_pci64_setup(void)
void pcibios_pci64_verify(void)
{
struct pci_bus *b;
+ int disabled = 0;

if (mem64.flags & IORESOURCE_MEM64)
return; /* presumably DAC works */
list_for_each_entry(b, &pci_root_buses, node) {
- if (b->resource[2] == &mem64)
+ if (b->resource[2] == &mem64) {
b->resource[2] = NULL;
+ disabled = 1;
+ }
}
- printk(KERN_INFO "PCI: allocations above 4G disabled\n");
+ if (disabled)
+ printk(KERN_INFO "PCI: allocations above 4G disabled\n");
}

2009-04-23 15:08:13

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

Ivan Kokshaysky wrote:
> On Wed, Apr 22, 2009 at 03:37:04PM -0700, Yinghai Lu wrote:
>> one system with 4g installed ( there is 1g hole)
>>
>> when 4G installed.
>> BIOS put ACPI etc need the hole
>> [ 0.000000] BIOS-provided physical RAM map:
>> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
>> [ 0.000000] BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
>> [ 0.000000] BIOS-e820: 00000000000e3000 - 0000000000100000 (reserved)
>> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000bffa0000 (usable)
>> [ 0.000000] BIOS-e820: 00000000bffa0000 - 00000000bffae000 (ACPI data)
>> [ 0.000000] BIOS-e820: 00000000bffae000 - 00000000bfff0000 (ACPI NVS)
>> [ 0.000000] BIOS-e820: 00000000bfff0000 - 00000000c0000000 (reserved)
>> [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
>> [ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
>> [ 0.000000] BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
>> so in kernel resource will be reserved for 0xbffa0000 - 0xbfff0000 for ACPI
>> 0x100000 - 0xbffa0000 for RAM...
>>
>> and BIOS set
>> [ 0.240007] pci 0000:00:01.0: bridge 64bit mmio pref: [0xbdf00000-0xddefffff]
>> [ 0.237102] pci 0000:01:00.0: reg 10 32bit mmio: [0xc0000000-0xcfffffff]
>> that is conflict with reserved res. so it can not be reserved Kernel.
>>
>> then Kernel try to get range from 0x140000000 ( above the RAM, 5G and above 4g)
>> and set let the bridge to use it, and ATI cards to use it.
>>
>> but the problem is that ATI only support 32bit ...
>
> Yinghai, you are trying to get a quick fix for quite a complex problem
> that cannot be solved with a quick fix. Even more, there is no rush on
> a quick fix because it's not a critical bug at all. 32-bit stuff ends
> up above 4G *only* when there is no free space left on the 32-bit
> PCI bus, and it can be considered a very effective (though rather ugly)
> way of disabling the BAR that we failed to allocate.
> In this particular case it was simply a side effect of the "pci_mem_start"
> issue (which was indeed critical, but hopefully fixed now).
>

I agree that this is not a critical bug at all. The pci_mem_start patch alone should fix that allocation.
actually "pci: don't assume pref memio are 64bit" just make kernel give customer surprise.

>
>> +/* need to insert those two under iomem */
>> +struct resource iomem32_resource = {
>> + .name = "PCI mem 32bit",
>> + .start = 0,
>> + .end = 0xffffffff,
>> + .flags = IORESOURCE_MEM,
>> +};
>> +struct resource iomem64_resource = {
>> + .name = "PCI mem 64bit",
>> + .start = 1ULL<<32,
>> + .end = -1,
>> + .flags = IORESOURCE_MEM | IORESOURCE_MEM_64,
>> +};
>> +
>
> This only works on x86 and similar systems with 1:1 CPU address to bus
> address mapping. There is a lot of machines with multiple 32-bit PCI
> bus spaces (4G per PCI domain).

Do we need to move that code into the x86 arch code?

YH

2009-04-23 15:16:23

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

Ivan Kokshaysky wrote:
> On Wed, Apr 22, 2009 at 07:10:29PM -0700, Yinghai Lu wrote:
>> also on AMD system with two ht chain, or other system with pci=use_crs to get correct root default res,
>> will get anonying
>>
>> PCI: allocations above 4G disabled
>>
>> even the system does support that.
>
> Yep, but it's easy to fix (patch applies on the top of the previous one).

Another case: on a system with 8G of RAM installed, one device on bus 0 does get resources assigned, but one of its
resources supports 64-bit MMIO. In that case the assign_unassigned code could still assign a 64-bit resource to it,
but the kernel will still print that warning.

>
>> also will have problem with some calling like request_resource(&iomem_resource, ....)
>
> I don't think so. All critical resources are inserted much earlier than
> mem64 one, and request_resource(&iomem_resource, ...) at later stages
> would most likely fail regardless of mem64 thing.
> Or am I missing something?

at least the one in mm/memory_hotplug.c

/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
	struct resource *res;
	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
	BUG_ON(!res);

	res->name = "System RAM";
	res->start = start;
	res->end = start + size - 1;
	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
	if (request_resource(&iomem_resource, res) < 0) {
		printk("System RAM resource %llx - %llx cannot be added\n",
		(unsigned long long)res->start, (unsigned long long)res->end);
		kfree(res);
		res = NULL;
	}
	return res;
}



>
> Ivan.
>
> diff --git a/arch/x86/pci/dac_64bit.c b/arch/x86/pci/dac_64bit.c
> index ee03c4a..35ffee3 100644
> --- a/arch/x86/pci/dac_64bit.c
> +++ b/arch/x86/pci/dac_64bit.c
> @@ -33,12 +33,16 @@ void pcibios_pci64_setup(void)
> void pcibios_pci64_verify(void)
> {
> struct pci_bus *b;
> + int disabled = 0;
>
> if (mem64.flags & IORESOURCE_MEM64)
> return; /* presumably DAC works */
> list_for_each_entry(b, &pci_root_buses, node) {
> - if (b->resource[2] == &mem64)
> + if (b->resource[2] == &mem64) {
> b->resource[2] = NULL;
> + disabled = 1;
> + }
> }
> - printk(KERN_INFO "PCI: allocations above 4G disabled\n");
> + if (disabled)
> + printk(KERN_INFO "PCI: allocations above 4G disabled\n");
> }

2009-04-23 15:31:10

by Yinghai Lu

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Thu, Apr 23, 2009 at 5:58 AM, Ivan Kokshaysky
<[email protected]> wrote:
> On Wed, Apr 22, 2009 at 07:03:33PM -0700, Yinghai Lu wrote:
>> to check the BAR support 64bit or not should be read from
>> pci_read_bases and pci_bridge_read_bases...
>
> pci_read_bases: take a closer look at decode_bar() function and
> PCI_BASE_ADDRESS_MEM_TYPE_64 flag.
>
> pci_bridge_read_bases: we cannot rely on the bits 0-3 of
> PCI_PREF_MEMORY_BASE providing correct information anyway.
> More reliable check for 64-bitness would be a test for
> PCI_PREF_BASE_UPPER32 register being r/w, which I think
> belongs in pci_bridge_check_ranges().

So check bits 0-3, and also check whether the PCI_PREF_BASE_UPPER32 register is r/w?

YH

2009-04-23 22:19:31

by Ivan Kokshaysky

Subject: Re: [RFC PATCH 1/2] pci: don't assume pref memio are 64bit -v3

On Thu, Apr 23, 2009 at 08:13:04AM -0700, Yinghai Lu wrote:
> at least the one in mm/memory_hotplug.c
>
> /* add this memory to iomem resource */
> static struct resource *register_memory_resource(u64 start, u64 size)
> {
> struct resource *res;
> res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> BUG_ON(!res);
>
> res->name = "System RAM";
> res->start = start;
> res->end = start + size - 1;
> res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> if (request_resource(&iomem_resource, res) < 0) {
> printk("System RAM resource %llx - %llx cannot be added\n",
> (unsigned long long)res->start, (unsigned long long)res->end);
> kfree(res);
> res = NULL;
> }
> return res;
> }

Indeed. It's a really strong argument against mem64 resource approach.

On the other hand, it shows that getting 64-bit allocations right is
indeed a very complex issue - without well defined root bus ranges
there is a high risk of unexpectedly breaking something, like that memory
hotplug. Oh well...
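
The conflict can be illustrated with a toy model of the resource tree. This is only a sketch and not the kernel's code: toy_resource, toy_request_resource() and the 8G..9G hot-add range are made up for illustration, and the real request_resource() has more to it. It only shows why a request made directly under the root fails once a "PCI mem 64bit" child already spans everything above 4G.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct resource: a parent with a sorted child list. */
struct toy_resource {
	const char *name;
	uint64_t start, end;		/* inclusive range */
	struct toy_resource *child;	/* first child, sorted by start */
	struct toy_resource *sibling;
};

/*
 * Insert "res" as a direct child of "root". Returns 0 on success and -1
 * if it falls outside the root or overlaps an existing child.
 */
static int toy_request_resource(struct toy_resource *root,
				struct toy_resource *res)
{
	struct toy_resource **p = &root->child;

	if (res->start > res->end || res->start < root->start ||
	    res->end > root->end)
		return -1;

	/* skip children that end before the new range starts */
	while (*p && (*p)->end < res->start)
		p = &(*p)->sibling;
	if (*p && (*p)->start <= res->end)
		return -1;	/* overlaps an existing child */

	res->sibling = *p;
	*p = res;
	return 0;
}

/* Result of hot-adding RAM at 8G..9G, with or without a mem64 child. */
static int toy_hotplug_demo(int with_mem64)
{
	struct toy_resource iomem = { "PCI mem", 0, UINT64_MAX, NULL, NULL };
	struct toy_resource mem64 = { "PCI mem 64bit", 1ULL << 32, UINT64_MAX,
				      NULL, NULL };
	struct toy_resource ram = { "System RAM", 8ULL << 30,
				    (9ULL << 30) - 1, NULL, NULL };

	if (with_mem64 && toy_request_resource(&iomem, &mem64))
		return -2;
	return toy_request_resource(&iomem, &ram);
}
```

In this model toy_hotplug_demo(0) succeeds, while toy_hotplug_demo(1) is rejected because the hot-added RAM overlaps the mem64 child.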

So if the main purpose is to prevent 32-bit allocations in the DAC
area, some mix of your v2 and v3 patches seems to be the way to go.
That is

- keep IORESOURCE_MEM_64 bits from -v3, but drop iomem32_resource
and iomem64_resource things;

- pass "max" argument to allocate_resource() like you did in -v2,
but *only* when allocating from iomem_resource (r == &iomem_resource).
Also, instead of max = 0xffffffff use something like max = PCIBIOS_MAX_MEM_32.

include/linux/pci.h:
#ifndef PCIBIOS_MAX_MEM_32
#define PCIBIOS_MAX_MEM_32 (-1)
#endif

This should preserve the current behaviour of pci_bus_alloc_resource()
on non-x86 arches; overridden in arch/x86/include/asm/pci.h:
#define PCIBIOS_MAX_MEM_32 0xffffffff

> check bits 0-3 and check PCI_PREF_BASE_UPPER32 register being r/w ?

Yes. If these bits are zero, no further checks are needed -
bridge is 32-bit. If they aren't, do additional check for
PCI_PREF_BASE_UPPER32 being non-zero or writable, but *only*
if this prefetch resource is not already allocated (res->parent == NULL),
just for safety reasons - we don't want to disconnect the allocated
range from the bus even for a short time.

Ivan.
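
The check described in the reply above could look roughly like the following. It is a sketch against a simulated bridge, not the code that was merged: toy_bridge and its rw mask are hypothetical, and returning "64-bit" for an already allocated window with zero upper bits is an assumption of this sketch, since the mail only says the probe must be skipped in that case. The mask and type values mirror PCI_PREF_RANGE_TYPE_MASK and PCI_PREF_RANGE_TYPE_64.

```c
#include <stdint.h>

#define PREF_RANGE_TYPE_MASK	0x0fu	/* bits 0-3 of PCI_PREF_MEMORY_BASE */
#define PREF_RANGE_TYPE_64	0x01u

/* Toy bridge: only the registers the check touches. */
struct toy_bridge {
	uint16_t pref_base;		/* low bits advertise the window type */
	uint32_t upper32;		/* models PCI_PREF_BASE_UPPER32 */
	uint32_t upper32_rw_mask;	/* 0 models a hardwired register */
};

static void toy_write_upper32(struct toy_bridge *b, uint32_t val)
{
	b->upper32 = (b->upper32 & ~b->upper32_rw_mask) |
		     (val & b->upper32_rw_mask);
}

/*
 * Returns 1 if the prefetchable window looks 64-bit capable.
 * "allocated" models res->parent != NULL: a live window is never probed.
 */
static int toy_bridge_pref_is_64bit(struct toy_bridge *b, int allocated)
{
	uint32_t saved, probe;

	if ((b->pref_base & PREF_RANGE_TYPE_MASK) != PREF_RANGE_TYPE_64)
		return 0;	/* type bits say 32-bit: nothing more to do */
	if (b->upper32)
		return 1;	/* non-zero upper half already proves it */
	if (allocated)
		return 1;	/* assumption: trust the type bits */

	saved = b->upper32;
	toy_write_upper32(b, 0xffffffffu);
	probe = b->upper32;
	toy_write_upper32(b, saved);	/* restore the original value */
	return probe != 0;
}
```

A bridge that advertises type 64 but has a read-only upper register is reported as 32-bit, and the register is left with its original value in every case.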

2009-04-24 03:51:01

by Yinghai Lu

Subject: [PATCH 1/4] pci/x86: don't assume pref memio are 64bit -v4


We should not assign a 64-bit range to a PCI device that only takes 32-bit pref mem.

Set IORESOURCE_MEM_64 in the 64-bit resources of pci_device/pci_bridge,
and make the bus resource carry that bit only when all devices under it support
64-bit pref mem. Then use that flag to allocate resources in the wanted area.

v2: fix b_res->flags and logic and passing result.
v3: split iomem into iomem32 and iomem64; iomem64 takes IORESOURCE_MEM_64
v4: according to Ivan:
make it x86 only, via PCIBIOS_MAX_MEM_32
double-check that the bridge supports pref mem64 by writing/reading UPPER32

[Impact: don't assign the wrong range to a device that doesn't support it]

Reported-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/include/asm/pci.h | 1
drivers/pci/bus.c | 7 +++++-
drivers/pci/probe.c | 9 ++++++-
drivers/pci/setup-bus.c | 52 ++++++++++++++++++++++++++++++++++++---------
include/linux/ioport.h | 2 +
include/linux/pci.h | 4 +++
6 files changed, 62 insertions(+), 13 deletions(-)

Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -193,7 +193,7 @@ int __pci_read_base(struct pci_dev *dev,
res->flags |= pci_calc_resource_flags(l) | IORESOURCE_SIZEALIGN;
if (type == pci_bar_io) {
l &= PCI_BASE_ADDRESS_IO_MASK;
- mask = PCI_BASE_ADDRESS_IO_MASK & 0xffff;
+ mask = PCI_BASE_ADDRESS_IO_MASK & IO_SPACE_LIMIT;
} else {
l &= PCI_BASE_ADDRESS_MEM_MASK;
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
@@ -237,6 +237,8 @@ int __pci_read_base(struct pci_dev *dev,
dev_printk(KERN_DEBUG, &dev->dev,
"reg %x 64bit mmio: %pR\n", pos, res);
}
+
+ res->flags |= IORESOURCE_MEM_64;
} else {
sz = pci_size(l, sz, mask);

@@ -362,7 +364,10 @@ void __devinit pci_read_bridge_bases(str
}
}
if (base <= limit) {
- res->flags = (mem_base_lo & PCI_MEMORY_RANGE_TYPE_MASK) | IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ res->flags = (mem_base_lo & PCI_PREF_RANGE_TYPE_MASK) |
+ IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if (res->flags & PCI_PREF_RANGE_TYPE_64)
+ res->flags |= IORESOURCE_MEM_64;
res->start = base;
res->end = limit + 0xfffff;
dev_printk(KERN_DEBUG, &dev->dev, "bridge %sbit mmio pref: %pR\n",
Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -143,6 +143,7 @@ static void pci_setup_bridge(struct pci_
struct pci_dev *bridge = bus->self;
struct pci_bus_region region;
u32 l, bu, lu, io_upper16;
+ int pref_mem64;

if (pci_is_enabled(bridge))
return;
@@ -198,16 +199,22 @@ static void pci_setup_bridge(struct pci_
pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, 0);

/* Set up PREF base/limit. */
+ pref_mem64 = 0;
bu = lu = 0;
pcibios_resource_to_bus(bridge, &region, bus->resource[2]);
if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
+ int width = 8;
l = (region.start >> 16) & 0xfff0;
l |= region.end & 0xfff00000;
- bu = upper_32_bits(region.start);
- lu = upper_32_bits(region.end);
- dev_info(&bridge->dev, " PREFETCH window: %#016llx-%#016llx\n",
- (unsigned long long)region.start,
- (unsigned long long)region.end);
+ if (bus->resource[2]->flags & IORESOURCE_MEM_64) {
+ pref_mem64 = 1;
+ bu = upper_32_bits(region.start);
+ lu = upper_32_bits(region.end);
+ width = 16;
+ }
+ dev_info(&bridge->dev, " PREFETCH window: %#0*llx-%#0*llx\n",
+ width, (unsigned long long)region.start,
+ width, (unsigned long long)region.end);
}
else {
l = 0x0000fff0;
@@ -215,9 +222,11 @@ static void pci_setup_bridge(struct pci_
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);

- /* Set the upper 32 bits of PREF base & limit. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
- pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ if (pref_mem64) {
+ /* Set the upper 32 bits of PREF base & limit. */
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, bu);
+ pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
+ }

pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
@@ -255,8 +264,25 @@ static void pci_bridge_check_ranges(stru
pci_read_config_dword(bridge, PCI_PREF_MEMORY_BASE, &pmem);
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, 0x0);
}
- if (pmem)
+ if (pmem) {
b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH;
+ if ((pmem & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64)
+ b_res[2].flags |= IORESOURCE_MEM_64;
+ }
+
+ /* double check if bridge does support 64 bit pref */
+ if (b_res[2].flags & IORESOURCE_MEM_64) {
+ u32 mem_base_hi, tmp;
+ pci_read_config_dword(bridge, PCI_PREF_BASE_UPPER32,
+ &mem_base_hi);
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32,
+ 0xffffffff);
+ pci_read_config_dword(bridge, PCI_PREF_BASE_UPPER32, &tmp);
+ if (!tmp)
+ b_res[2].flags &= ~IORESOURCE_MEM_64;
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32,
+ mem_base_hi);
+ }
}

/* Helper function for sizing routines: find first available
@@ -336,6 +362,7 @@ static int pbus_size_mem(struct pci_bus
resource_size_t aligns[12]; /* Alignments from 1Mb to 2Gb */
int order, max_order;
struct resource *b_res = find_free_bus_resource(bus, type);
+ unsigned int mem64_mask = 0;

if (!b_res)
return 0;
@@ -344,9 +371,12 @@ static int pbus_size_mem(struct pci_bus
max_order = 0;
size = 0;

+ mem64_mask = b_res->flags & IORESOURCE_MEM_64;
+ b_res->flags &= ~IORESOURCE_MEM_64;
+
list_for_each_entry(dev, &bus->devices, bus_list) {
int i;
-
+
for (i = 0; i < PCI_NUM_RESOURCES; i++) {
struct resource *r = &dev->resource[i];
resource_size_t r_size;
@@ -372,6 +402,7 @@ static int pbus_size_mem(struct pci_bus
aligns[order] += align;
if (order > max_order)
max_order = order;
+ mem64_mask &= r->flags & IORESOURCE_MEM_64;
}
}

@@ -396,6 +427,7 @@ static int pbus_size_mem(struct pci_bus
b_res->start = min_align;
b_res->end = size + min_align - 1;
b_res->flags |= IORESOURCE_STARTALIGN;
+ b_res->flags |= mem64_mask;
return 1;
}

Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h
+++ linux-2.6/include/linux/ioport.h
@@ -49,6 +49,8 @@ struct resource_list {
#define IORESOURCE_SIZEALIGN 0x00020000 /* size indicates alignment */
#define IORESOURCE_STARTALIGN 0x00040000 /* start field is alignment */

+#define IORESOURCE_MEM_64 0x00100000
+
#define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */
#define IORESOURCE_DISABLED 0x10000000
#define IORESOURCE_UNSET 0x20000000
Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -41,9 +41,14 @@ pci_bus_alloc_resource(struct pci_bus *b
void *alignf_data)
{
int i, ret = -ENOMEM;
+ resource_size_t max = -1;

type_mask |= IORESOURCE_IO | IORESOURCE_MEM;

+ /* don't allocate too high if the pref mem doesn't support 64bit*/
+ if (!(res->flags & IORESOURCE_MEM_64))
+ max = PCIBIOS_MAX_MEM_32;
+
for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
struct resource *r = bus->resource[i];
if (!r)
@@ -62,7 +67,7 @@ pci_bus_alloc_resource(struct pci_bus *b
/* Ok, try it out.. */
ret = allocate_resource(r, res, size,
r->start ? : min,
- -1, align,
+ max, align,
alignf, alignf_data);
if (ret == 0)
break;
Index: linux-2.6/arch/x86/include/asm/pci.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pci.h
+++ linux-2.6/arch/x86/include/asm/pci.h
@@ -130,6 +130,7 @@ extern void pci_iommu_alloc(void);

/* generic pci stuff */
#include <asm-generic/pci.h>
+#define PCIBIOS_MAX_MEM_32 0xffffffff

#ifdef CONFIG_NUMA
/* Returns the node based on pci bus */
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -1100,6 +1100,10 @@ static inline struct pci_dev *pci_get_bu

#include <asm/pci.h>

+#ifndef PCIBIOS_MAX_MEM_32
+#define PCIBIOS_MAX_MEM_32 (-1)
+#endif
+
/* these helpers provide future and backwards compatibility
* for accessing popular PCI BAR info */
#define pci_resource_start(dev, bar) ((dev)->resource[(bar)].start)
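
Two ideas from the patch above can be shown in isolation: the bus window keeps the 64-bit flag only if every resource behind it has it, and the flag then picks the upper bound for allocation. The sketch below is a simplified model under those assumptions; the toy_ names are hypothetical, and only the flag value and the x86 limit are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MEM_64	0x00100000u	/* value the patch gives IORESOURCE_MEM_64 */
#define TOY_MAX_MEM_32	0xffffffffull	/* x86 value of PCIBIOS_MAX_MEM_32 */

/*
 * A bridge window may only be treated as 64-bit if the bridge itself is
 * 64-bit capable and every resource placed behind it is too.
 */
static unsigned int toy_window_mem64(unsigned int bridge_flags,
				     const unsigned int *child_flags, size_t n)
{
	unsigned int mask = bridge_flags & TOY_MEM_64;
	size_t i;

	for (i = 0; i < n; i++)
		mask &= child_flags[i] & TOY_MEM_64;
	return mask;
}

/* Upper bound handed to the allocator for a resource with these flags. */
static uint64_t toy_alloc_max(unsigned int res_flags)
{
	return (res_flags & TOY_MEM_64) ? UINT64_MAX : TOY_MAX_MEM_32;
}
```

A single 32-bit-only resource behind the bridge is enough to clear the flag, which in turn keeps the whole window below 4G.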

2009-04-24 03:51:47

by Yinghai Lu

Subject: [PATCH 2/4] pci: try to assign res for device under transparent bridges -v2


We could run out of space under 4G while a device under a transparent bridge
could still use a 64-bit resource, so retry on each parent bus in turn.

[ Impact: better support for assigning unassigned resources ]

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/setup-bus.c | 1
drivers/pci/setup-res.c | 49 +++++++++++++++++++++++++++++++++++-------------
2 files changed, 36 insertions(+), 14 deletions(-)

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -58,7 +58,6 @@ static void pbus_assign_resources_sorted
res = list->res;
idx = res - &list->dev->resource[0];
if (pci_assign_resource(list->dev, idx)) {
- /* FIXME: get rid of this */
res->start = 0;
res->end = 0;
res->flags = 0;
Index: linux-2.6/drivers/pci/setup-res.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-res.c
+++ linux-2.6/drivers/pci/setup-res.c
@@ -135,23 +135,16 @@ void pci_disable_bridge_window(struct pc
}
#endif /* CONFIG_PCI_QUIRKS */

-int pci_assign_resource(struct pci_dev *dev, int resno)
+static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
+ int resno)
{
- struct pci_bus *bus = dev->bus;
struct resource *res = dev->resource + resno;
resource_size_t size, min, align;
int ret;

size = resource_size(res);
min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM;
-
align = resource_alignment(res);
- if (!align) {
- dev_info(&dev->dev, "BAR %d: can't allocate resource (bogus "
- "alignment) %pR flags %#lx\n",
- resno, res, res->flags);
- return -EINVAL;
- }

/* First, try exact prefetching match.. */
ret = pci_bus_alloc_resource(bus, res, size, align, min,
@@ -169,10 +162,7 @@ int pci_assign_resource(struct pci_dev *
pcibios_align_resource, dev);
}

- if (ret) {
- dev_info(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
- resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
- } else {
+ if (!ret) {
res->flags &= ~IORESOURCE_STARTALIGN;
if (resno < PCI_BRIDGE_RESOURCES)
pci_update_resource(dev, resno);
@@ -180,6 +170,39 @@ int pci_assign_resource(struct pci_dev *

return ret;
}
+
+int pci_assign_resource(struct pci_dev *dev, int resno)
+{
+ struct resource *res = dev->resource + resno;
+ resource_size_t align;
+ struct pci_bus *bus;
+ int ret;
+
+ align = resource_alignment(res);
+ if (!align) {
+ dev_info(&dev->dev, "BAR %d: can't allocate resource (bogus "
+ "alignment) %pR flags %#lx\n",
+ resno, res, res->flags);
+ return -EINVAL;
+ }
+
+ bus = dev->bus;
+ while ((ret = __pci_assign_resource(bus, dev, resno))) {
+ if (bus->parent && bus->self->transparent)
+ bus = bus->parent;
+ else
+ bus = NULL;
+ if (bus)
+ continue;
+ break;
+ }
+
+ if (ret)
+ dev_info(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
+ resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
+
+ return ret;
+}

#if 0
int pci_assign_resource_fixed(struct pci_dev *dev, int resno)
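
The retry loop in the new pci_assign_resource() boils down to climbing the bus hierarchy for as long as the bridge above is transparent. A stripped-down sketch follows; toy_bus and its has_space field are hypothetical stand-ins for struct pci_bus and for the outcome of __pci_assign_resource().

```c
#include <stddef.h>

/* Toy bus: "transparent" models bus->self->transparent. */
struct toy_bus {
	struct toy_bus *parent;
	int transparent;	/* bridge leading to this bus is transparent */
	int has_space;		/* hypothetical: would allocation succeed here? */
};

/*
 * Try the device's own bus first; on failure climb to the parent, but only
 * through transparent bridges. Returns the bus that satisfied the request,
 * or NULL if none did.
 */
static struct toy_bus *toy_assign(struct toy_bus *bus)
{
	while (bus) {
		if (bus->has_space)
			return bus;
		if (!bus->parent || !bus->transparent)
			return NULL;
		bus = bus->parent;
	}
	return NULL;
}
```

With a chain leaf -> mid -> root where only the root has space, the request succeeds on the root if both bridges are transparent and fails as soon as one of them is not.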

2009-04-24 03:52:28

by Yinghai Lu

Subject: [PATCH 3/4] x86: reserve range near the ram

From: Linus Torvalds <[email protected]>

The point is to take all RAM resources we have, and
_after_ we've added all the resources we've seen in the E820 tree, we then
_also_ try to add fake reserved entries for any "round up to X" at the end
of the RAM resources.

[ Impact: protect stolen RAM ]

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/e820.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
}
}

+/* How much should we pad RAM ending depending on where it is? */
+static unsigned long ram_alignment(resource_size_t pos)
+{
+ unsigned long mb = pos >> 20;
+
+ /* To 64kB in the first megabyte */
+ if (!mb)
+ return 64*1024;
+
+ /* To 1MB in the first 16MB */
+ if (mb < 16)
+ return 1024*1024;
+
+ /* To 32MB for anything above that */
+ return 32*1024*1024;
+}
+
void __init e820_reserve_resources_late(void)
{
int i;
@@ -1381,6 +1398,24 @@ void __init e820_reserve_resources_late(
insert_resource_expand_to_fit(&iomem_resource, res);
res++;
}
+
+ /*
+ * Try to bump up RAM regions to reasonable boundaries to
+ * avoid stolen RAM
+ */
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *entry = &e820_saved.map[i];
+ resource_size_t start, end;
+
+ if (entry->type != E820_RAM)
+ continue;
+ start = entry->addr + entry->size;
+ end = round_up(start, ram_alignment(start));
+ if (start == end)
+ continue;
+ reserve_region_with_split(&iomem_resource, start,
+ end - 1, "RAM buffer");
+ }
}

char *__init default_machine_specific_memory_setup(void)
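
As a worked example of the padding rule, the sketch below reimplements ram_alignment() and the round-up in user space and applies them to the RAM ranges from the e820 map quoted earlier in the thread. The toy_ names are hypothetical, and uint64_t is used here instead of the kernel types.

```c
#include <stdint.h>

/* Same thresholds as ram_alignment() in the patch. */
static uint64_t toy_ram_alignment(uint64_t pos)
{
	uint64_t mb = pos >> 20;

	if (!mb)
		return 64 * 1024;		/* first megabyte: 64kB */
	if (mb < 16)
		return 1024 * 1024;		/* first 16MB: 1MB */
	return 32 * 1024 * 1024;		/* above that: 32MB */
}

/* Round x up to a power-of-two alignment. */
static uint64_t toy_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Exclusive end of the "RAM buffer" following RAM that ends at ram_end. */
static uint64_t toy_ram_buffer_end(uint64_t ram_end)
{
	return toy_round_up(ram_end, toy_ram_alignment(ram_end));
}
```

For that map, RAM ending at 0x9bc00 is padded to 0xa0000, RAM ending at 0xbffa0000 is padded to 0xc0000000, and RAM ending at 0x140000000 is already aligned, so the loop would skip it.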

2009-04-24 03:53:19

by Yinghai Lu

Subject: [PATCH 4/4] x86/pci: make pci_mem_start to be aligned only -v5


We don't need to reserve one extra round after gapstart,

because the stolen-RAM reservation already rounds up.

[Impact: leave more space below 4G for assigning to unassigned PCI devices]

Reported-by: Yannick <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/kernel/e820.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -617,7 +617,7 @@ __init int e820_search_gap(unsigned long
*/
__init void e820_setup_gap(void)
{
- unsigned long gapstart, gapsize, round;
+ unsigned long gapstart, gapsize;
int found;

gapstart = 0x10000000;
@@ -635,14 +635,9 @@ __init void e820_setup_gap(void)
#endif

/*
- * See how much we want to round up: start off with
- * rounding to the next 1MB area.
+ * e820_reserve_resources_late protect stolen RAM already
*/
- round = 0x100000;
- while ((gapsize >> 4) > round)
- round += round;
- /* Fun with two's complement */
- pci_mem_start = (gapstart + round) & -round;
+ pci_mem_start = gapstart;

printk(KERN_INFO
"Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",

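To see what the removed rounding costs, here is the old computation next to the new one. This is a user-space sketch with hypothetical toy_ names, using uint64_t where the kernel uses unsigned long. The example values assume the gap is the hole between 0xc0000000 and 0xfee00000 from the e820 map quoted earlier in the thread.

```c
#include <stdint.h>

/* Rounding that the patch removes from e820_setup_gap(). */
static uint64_t toy_old_pci_mem_start(uint64_t gapstart, uint64_t gapsize)
{
	uint64_t round = 0x100000;	/* start at 1MB */

	while ((gapsize >> 4) > round)
		round += round;		/* double until >= gapsize / 16 */
	/* two's complement trick: -round is a mask of the high bits */
	return (gapstart + round) & -round;
}

/* After the patch the gap start is used as-is. */
static uint64_t toy_new_pci_mem_start(uint64_t gapstart)
{
	return gapstart;
}
```

For gapstart 0xc0000000 and gapsize 0x3ee00000 the old code settles on a 64MB round and yields 0xc4000000, while the new code yields 0xc0000000, so 64MB more of the space below 4G stays available in that case.
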
2009-04-24 13:16:54

by Ivan Kokshaysky

Subject: Re: [PATCH 1/4] pci/x86: don't assume pref memio are 64bit -v4

On Thu, Apr 23, 2009 at 08:48:32PM -0700, Yinghai Lu wrote:
>
> we should not assign 64bit range to pci device that only take 32bit pref
>
> try to set IORESOURCE_MEM_64 in 64bit resource of pci_device/pci_bridge
> and make the bus resource only have that bit set when all device under that do support
> 64bit pref mem then use that flag to allocate resource in wanted area
>
> v2: fix b_res->flags and logic and passing result.
> v3: split iomem to iomem32, iomem64, and iomem64 will take IORESOURCE_MEM_64
> V4: according to Ivan
> make it support x86 only, by PCIBIOS_MAX_MEM_32
> double check if the bridge does support pref mem64 with write/read UPPER32

Thanks, Yinghai.

> [Impact: do assign wrong range to device that doesn't support it]
>
> Reported-by: Yannick <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>

Reviewed-by: Ivan Kokshaysky <[email protected]>

Ivan.

2009-04-28 07:39:18

by Yinghai Lu

Subject: [PATCH] driver: make dev_set_name(, NULL) work


While looking at dev_set_name() callers, I found one
dev_set_name(&dev->dev, NULL)
that is used to try to free the device name before kfree()ing that device.

We need to move the check for the device_add path, added in
| commit 8a577ffc75d9194fe8cdb7479236f2081c26ca1f
| Author: Kay Sievers <[email protected]>
| Date: Sat Apr 18 15:05:45 2009 -0700
|
| driver: dont update dev_name via device_add path
from kobject_set_name_vargs to kobject_add_varg instead.

kobject_set_name_vargs will then check whether fmt is NULL.

Actually we need to use dev_set_name(, NULL) later on, in failure paths
and release functions, to prevent leaking the name.

[ Impact: make dev_set_name(, NULL) kfree the old name ]

Signed-off-by: Yinghai Lu <[email protected]>

---
lib/kobject.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6/lib/kobject.c
===================================================================
--- linux-2.6.orig/lib/kobject.c
+++ linux-2.6/lib/kobject.c
@@ -218,8 +218,8 @@ int kobject_set_name_vargs(struct kobjec
const char *old_name = kobj->name;
char *s;

- if (kobj->name && !fmt)
- return 0;
+ if (!fmt)
+ goto out;

kobj->name = kvasprintf(GFP_KERNEL, fmt, vargs);
if (!kobj->name)
@@ -229,6 +229,7 @@ int kobject_set_name_vargs(struct kobjec
while ((s = strchr(kobj->name, '/')))
s[0] = '!';

+out:
kfree(old_name);
return 0;
}
@@ -301,11 +302,16 @@ static int kobject_add_varg(struct kobje
{
int retval;

+ if (kobj->name && !fmt)
+ goto add_with_name;
+
retval = kobject_set_name_vargs(kobj, fmt, vargs);
if (retval) {
printk(KERN_ERR "kobject: can not set name properly!\n");
return retval;
}
+
+add_with_name:
kobj->parent = parent;
return kobject_add_internal(kobj);
}
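
The intended semantics, "a NULL format frees the current name", can be modelled in user space as follows. This is a sketch with hypothetical names (toy_obj, toy_set_name) built on malloc/vsnprintf rather than kvasprintf. Note one deliberate difference: the sketch also resets the name pointer to NULL, which the hunk as posted does not show for kobj->name.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_obj {
	char *name;
};

/*
 * fmt == NULL frees the current name; otherwise the formatted string
 * replaces it. Returns 0 on success, -1 on failure.
 */
static int toy_set_name(struct toy_obj *obj, const char *fmt, ...)
{
	char *old = obj->name;
	char *new_name = NULL;
	va_list ap;
	int len;

	if (fmt) {
		/* first pass: measure the formatted length */
		va_start(ap, fmt);
		len = vsnprintf(NULL, 0, fmt, ap);
		va_end(ap);
		if (len < 0)
			return -1;
		new_name = malloc((size_t)len + 1);
		if (!new_name)
			return -1;
		/* second pass: format into the new buffer */
		va_start(ap, fmt);
		vsnprintf(new_name, (size_t)len + 1, fmt, ap);
		va_end(ap);
	}

	obj->name = new_name;	/* NULL when fmt is NULL: no dangling pointer */
	free(old);
	return 0;
}
```

Calling toy_set_name(&o, "dev%d", 3) sets the name to "dev3", and a later toy_set_name(&o, NULL) releases it and leaves o.name as NULL.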

2009-04-28 07:44:01

by Yinghai Lu

Subject: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking


This covers about 1/3 of the dev_set_name() users.

I wonder if there is a better way to do that.


Signed-off-by: Yinghai Lu <[email protected]>

---
arch/arm/common/locomo.c | 1 +
arch/arm/common/sa1111.c | 1 +
arch/arm/kernel/ecard.c | 1 +
arch/arm/mach-integrator/impd1.c | 1 +
arch/arm/mach-integrator/lm.c | 4 ++++
arch/ia64/sn/kernel/tiocx.c | 1 +
arch/mips/kernel/vpe.c | 1 +
arch/parisc/kernel/drivers.c | 1 +
arch/powerpc/kernel/vio.c | 1 +
arch/powerpc/platforms/ps3/system-bus.c | 4 ++++
arch/sparc/kernel/of_device_32.c | 1 +
arch/sparc/kernel/of_device_64.c | 1 +
arch/sparc/kernel/vio.c | 1 +
drivers/acpi/scan.c | 1 +
drivers/base/firmware_class.c | 1 +
drivers/base/platform.c | 1 +
drivers/dio/dio.c | 3 +++
drivers/dma/dmaengine.c | 1 +
drivers/eisa/eisa-bus.c | 1 +
drivers/firewire/fw-device.c | 3 +++
drivers/firmware/dmi-id.c | 8 +++++++-
21 files changed, 37 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/arm/common/locomo.c
===================================================================
--- linux-2.6.orig/arch/arm/common/locomo.c
+++ linux-2.6/arch/arm/common/locomo.c
@@ -559,6 +559,7 @@ locomo_init_one_child(struct locomo *lch

ret = device_register(&dev->dev);
if (ret) {
+ dev_set_name(&dev->dev, NULL);
out:
kfree(dev);
}
Index: linux-2.6/arch/arm/common/sa1111.c
===================================================================
--- linux-2.6.orig/arch/arm/common/sa1111.c
+++ linux-2.6/arch/arm/common/sa1111.c
@@ -577,6 +577,7 @@ sa1111_init_one_child(struct sa1111 *sac
ret = device_register(&dev->dev);
if (ret) {
release_resource(&dev->res);
+ dev_set_name(&dev->dev, NULL);
kfree(dev);
goto out;
}
Index: linux-2.6/arch/arm/kernel/ecard.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/ecard.c
+++ linux-2.6/arch/arm/kernel/ecard.c
@@ -795,6 +795,7 @@ static void __init ecard_free_card(struc
if (ec->resource[i].flags)
release_resource(&ec->resource[i]);

+ dev_set_name(&ec->dev, NULL);
kfree(ec);
}

Index: linux-2.6/arch/arm/mach-integrator/impd1.c
===================================================================
--- linux-2.6.orig/arch/arm/mach-integrator/impd1.c
+++ linux-2.6/arch/arm/mach-integrator/impd1.c
@@ -412,6 +412,7 @@ static int impd1_probe(struct lm_device
ret = amba_device_register(d, &dev->resource);
if (ret) {
dev_err(&d->dev, "unable to register device: %d\n", ret);
+ dev_set_name(&d->dev, NULL);
kfree(d);
}
}
Index: linux-2.6/arch/arm/mach-integrator/lm.c
===================================================================
--- linux-2.6.orig/arch/arm/mach-integrator/lm.c
+++ linux-2.6/arch/arm/mach-integrator/lm.c
@@ -71,6 +71,7 @@ static void lm_device_release(struct dev
{
struct lm_device *d = to_lm_device(dev);

+ dev_set_name(dev, NULL);
kfree(d);
}

@@ -92,6 +93,9 @@ int lm_device_register(struct lm_device
if (ret)
release_resource(&dev->resource);
}
+ if (ret)
+ dev_set_name(&dev->dev, NULL);
+
return ret;
}

Index: linux-2.6/arch/ia64/sn/kernel/tiocx.c
===================================================================
--- linux-2.6.orig/arch/ia64/sn/kernel/tiocx.c
+++ linux-2.6/arch/ia64/sn/kernel/tiocx.c
@@ -73,6 +73,7 @@ static int tiocx_uevent(struct device *d

static void tiocx_bus_release(struct device *dev)
{
+ dev_set_name(dev, NULL);
kfree(to_cx_dev(dev));
}

Index: linux-2.6/arch/mips/kernel/vpe.c
===================================================================
--- linux-2.6.orig/arch/mips/kernel/vpe.c
+++ linux-2.6/arch/mips/kernel/vpe.c
@@ -1585,6 +1585,7 @@ out_reenable:
return 0;

out_class:
+ dev_set_name(&vpe_device, NULL);
class_unregister(&vpe_class);
out_chrdev:
unregister_chrdev(major, module_name);
Index: linux-2.6/arch/parisc/kernel/drivers.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/drivers.c
+++ linux-2.6/arch/parisc/kernel/drivers.c
@@ -427,6 +427,7 @@ struct parisc_device * create_tree_node(
dev->dev.dma_mask = &dev->dma_mask;
dev->dev.coherent_dma_mask = dev->dma_mask;
if (device_register(&dev->dev)) {
+ dev_set_name(&dev->dev, NULL);
kfree(dev);
return NULL;
}
Index: linux-2.6/arch/powerpc/kernel/vio.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/vio.c
+++ linux-2.6/arch/powerpc/kernel/vio.c
@@ -1246,6 +1246,7 @@ struct vio_dev *vio_register_device_node
printk(KERN_ERR "%s: failed to register device %s\n",
__func__, dev_name(&viodev->dev));
/* XXX free TCE table */
+ dev_set_name(&viodev->dev, NULL);
kfree(viodev);
return NULL;
}
Index: linux-2.6/arch/powerpc/platforms/ps3/system-bus.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/ps3/system-bus.c
+++ linux-2.6/arch/powerpc/platforms/ps3/system-bus.c
@@ -769,6 +769,10 @@ int ps3_system_bus_device_register(struc
pr_debug("%s:%d add %s\n", __func__, __LINE__, dev_name(&dev->core));

result = device_register(&dev->core);
+
+ if (result)
+ dev_set_name(&dev->core, NULL);
+
return result;
}

Index: linux-2.6/arch/sparc/kernel/of_device_32.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/of_device_32.c
+++ linux-2.6/arch/sparc/kernel/of_device_32.c
@@ -587,6 +587,7 @@ build_resources:
if (of_device_register(op)) {
printk("%s: Could not register of device.\n",
dp->full_name);
+ dev_set_name(&op->dev, NULL);
kfree(op);
op = NULL;
}
Index: linux-2.6/arch/sparc/kernel/of_device_64.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/of_device_64.c
+++ linux-2.6/arch/sparc/kernel/of_device_64.c
@@ -855,6 +855,7 @@ static struct of_device * __init scan_on
if (of_device_register(op)) {
printk("%s: Could not register of device.\n",
dp->full_name);
+ dev_set_name(&op->dev, NULL);
kfree(op);
op = NULL;
}
Index: linux-2.6/arch/sparc/kernel/vio.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/vio.c
+++ linux-2.6/arch/sparc/kernel/vio.c
@@ -296,6 +296,7 @@ static struct vio_dev *vio_create_one(st
if (err) {
printk(KERN_ERR "VIO: Could not register device %s, err=%d\n",
dev_name(&vdev->dev), err);
+ dev_set_name(&vdev->dev, NULL);
kfree(vdev);
return NULL;
}
Index: linux-2.6/drivers/acpi/scan.c
===================================================================
--- linux-2.6.orig/drivers/acpi/scan.c
+++ linux-2.6/drivers/acpi/scan.c
@@ -1325,6 +1325,7 @@ acpi_add_single_object(struct acpi_devic
*child = device;
else {
kfree(device->pnp.cid_list);
+ dev_set_name(&device->dev, NULL);
kfree(device);
}

Index: linux-2.6/drivers/base/firmware_class.c
===================================================================
--- linux-2.6.orig/drivers/base/firmware_class.c
+++ linux-2.6/drivers/base/firmware_class.c
@@ -330,6 +330,7 @@ static int fw_register_device(struct dev

error_kfree:
kfree(fw_priv);
+ dev_set_name(f_dev, NULL);
kfree(f_dev);
return retval;
}
Index: linux-2.6/drivers/base/platform.c
===================================================================
--- linux-2.6.orig/drivers/base/platform.c
+++ linux-2.6/drivers/base/platform.c
@@ -293,6 +293,7 @@ int platform_device_add(struct platform_
return ret;

failed:
+ dev_set_name(&pdev->dev, NULL);
while (--i >= 0) {
struct resource *r = &pdev->resource[i];
unsigned long type = resource_type(r);
Index: linux-2.6/drivers/dio/dio.c
===================================================================
--- linux-2.6.orig/drivers/dio/dio.c
+++ linux-2.6/drivers/dio/dio.c
@@ -186,6 +186,7 @@ static int __init dio_init(void)
error = device_register(&dio_bus.dev);
if (error) {
pr_err("DIO: Error registering dio_bus\n");
+ dev_set_name(&dio_bus.dev, NULL);
return error;
}

@@ -261,6 +262,8 @@ static int __init dio_init(void)
if (error) {
pr_err("DIO: Error registering device %s\n",
dev->name);
+ dev_set_name(&dev->dev, NULL);
+ kfree(dev);
continue;
}
error = dio_create_sysfs_dev_files(dev);
Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c
+++ linux-2.6/drivers/dma/dmaengine.c
@@ -699,6 +699,7 @@ int dma_async_device_register(struct dma
if (rc) {
free_percpu(chan->local);
chan->local = NULL;
+ dev_set_name(&chan->dev->device, NULL);
kfree(chan->dev);
atomic_dec(idr_ref);
goto err_out;
Index: linux-2.6/drivers/eisa/eisa-bus.c
===================================================================
--- linux-2.6.orig/drivers/eisa/eisa-bus.c
+++ linux-2.6/drivers/eisa/eisa-bus.c
@@ -322,6 +322,7 @@ static int __init eisa_probe (struct eis

if (eisa_init_device (root, edev, 0)) {
eisa_release_resources (edev);
+ dev_set_name(&edev->dev, NULL);
kfree (edev);
if (!root->force_probe)
return -ENODEV;
Index: linux-2.6/drivers/firewire/fw-device.c
===================================================================
--- linux-2.6.orig/drivers/firewire/fw-device.c
+++ linux-2.6/drivers/firewire/fw-device.c
@@ -529,6 +529,7 @@ static void fw_unit_release(struct devic
{
struct fw_unit *unit = fw_unit(dev);

+ dev_set_name(dev, NULL);
kfree(unit);
}

@@ -579,6 +580,7 @@ static void create_units(struct fw_devic
continue;

skip_unit:
+ dev_set_name(&unit->device, NULL);
kfree(unit);
}
}
@@ -675,6 +677,7 @@ static void fw_device_release(struct dev

fw_node_put(device->node);
kfree(device->config_rom);
+ dev_set_name(dev, NULL);
kfree(device);
fw_card_put(card);
}
Index: linux-2.6/drivers/firmware/dmi-id.c
===================================================================
--- linux-2.6.orig/drivers/firmware/dmi-id.c
+++ linux-2.6/drivers/firmware/dmi-id.c
@@ -158,9 +158,15 @@ static int dmi_dev_uevent(struct device
return 0;
}

+static void dmi_dev_release(struct device *dev)
+{
+ dev_set_name(dev, NULL);
+ kfree(dev);
+}
+
static struct class dmi_class = {
.name = "dmi",
- .dev_release = (void(*)(struct device *)) kfree,
+ .dev_release = dmi_dev_release,
.dev_uevent = dmi_dev_uevent,
};

2009-04-28 14:57:51

by Greg KH

Subject: Re: [PATCH] driver: make dev_set_name(, NULL) work

On Tue, Apr 28, 2009 at 12:36:06AM -0700, Yinghai Lu wrote:
>
> while looking dev_set_name() calling, there is one
> dev_set_name(&dev->dev, NULL)
> to used to try to free the device name, before kfree that device.

What's wrong with that?

> need to move the check for device_add in
> | commit 8a577ffc75d9194fe8cdb7479236f2081c26ca1f
> | Author: Kay Sievers <[email protected]>
> | Date: Sat Apr 18 15:05:45 2009 -0700
> |
> | driver: dont update dev_name via device_add path
> from kobject_set_name_vargs to kobject_add_vargs instead.
>
> in kobject_set_name_vargs will check if fmt is NULL.
>
> actually we need to use dev_set_name(,NULL) later on failing path
> and release to prevent leaking

Are you sure?

confused,

greg k-h

2009-04-28 14:58:12

by Greg KH

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

On Tue, Apr 28, 2009 at 12:42:08AM -0700, Yinghai Lu wrote:
>
> those about 1/3 dev_set_name() etc.
>
> wonder if there is better way to do that

I don't see why this is needed, the name will be cleaned up when the
device goes away, automatically, right?

still confused,

greg k-h

2009-04-28 15:16:57

by Yinghai Lu

Subject: Re: [PATCH] driver: make dev_set_name(, NULL) work

Greg KH wrote:
> On Tue, Apr 28, 2009 at 12:36:06AM -0700, Yinghai Lu wrote:
>> while looking dev_set_name() calling, there is one
>> dev_set_name(&dev->dev, NULL)
>> to used to try to free the device name, before kfree that device.
>
> What's wrong with that?
>
>> need to move the check for device_add in
>> | commit 8a577ffc75d9194fe8cdb7479236f2081c26ca1f
>> | Author: Kay Sievers <[email protected]>
>> | Date: Sat Apr 18 15:05:45 2009 -0700
>> |
>> | driver: dont update dev_name via device_add path
>> from kobject_set_name_vargs to kobject_add_vargs instead.
>>
>> in kobject_set_name_vargs will check if fmt is NULL.
>>
>> actually we need to use dev_set_name(,NULL) later on failing path
>> and release to prevent leaking
>
> Are you sure?
>
> confused,
>

in arch/arm/common/sa1111.c


static int
sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
struct sa1111_dev_info *info)
{
struct sa1111_dev *dev;
int ret;

dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
if (!dev) {
ret = -ENOMEM;
goto out;
}

dev_set_name(&dev->dev, "%4.4lx", info->offset);
dev->devid = info->devid;
dev->dev.parent = sachip->dev;
dev->dev.bus = &sa1111_bus_type;
dev->dev.release = sa1111_dev_release;
dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
dev->res.start = sachip->phys + info->offset;
dev->res.end = dev->res.start + 511;
dev->res.name = dev_name(&dev->dev);
dev->res.flags = IORESOURCE_MEM;
dev->mapbase = sachip->base + info->offset;
dev->skpcr_mask = info->skpcr_mask;
memmove(dev->irq, info->irq, sizeof(dev->irq));

ret = request_resource(parent, &dev->res);
if (ret) {
printk("SA1111: failed to allocate resource for %s\n",
dev->res.name);
dev_set_name(&dev->dev, NULL);
kfree(dev);
goto out;
}


When dev_set_name() is first called, dev->dev.kobj.name is initialized from a kmalloc() allocation.
So before kfree(dev), do we need to kfree() that name?

YH
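
The question above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: model_device, model_dev_set_name(), model_alloc() and the live_allocs counter are hypothetical stand-ins, and the only thing assumed about the real code is that the name is a separate heap allocation hanging off the embedded kobject. It shows that freeing the containing structure alone leaves the name allocated.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct kobject / struct device. */
struct model_kobject {
	char *name;	/* separately allocated, like kobj->name */
};

struct model_device {
	struct model_kobject kobj;
};

static int live_allocs;	/* allocations not yet freed in this model */

static void *model_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

static void model_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Format a name into a fresh allocation and attach it to the device. */
static int model_dev_set_name(struct model_device *dev, const char *fmt, ...)
{
	va_list ap;
	char *name;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (len < 0)
		return -1;

	name = model_alloc((size_t)len + 1);
	if (!name)
		return -1;

	va_start(ap, fmt);
	vsnprintf(name, (size_t)len + 1, fmt, ap);
	va_end(ap);

	model_free(dev->kobj.name);	/* drop any previous name */
	dev->kobj.name = name;
	return 0;
}

/* Mimic the failing path: set a name, then free only the structure. */
static int allocs_left_after_plain_free(void)
{
	struct model_device *dev = model_alloc(sizeof(*dev));

	if (!dev)
		return -1;
	if (model_dev_set_name(dev, "%4.4lx", 0x400UL)) {
		model_free(dev);
		return -1;
	}
	model_free(dev);	/* the name is still allocated */
	return live_allocs;
}
```

In this model, allocs_left_after_plain_free() reports one outstanding allocation, which corresponds to the name string that nothing freed.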

2009-04-28 15:26:38

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Kay Sievers wrote:
> On Tue, Apr 28, 2009 at 09:42, Yinghai Lu <[email protected]> wrote:
>> those about 1/3 dev_set_name() etc.
>
> put_device()?
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/base/core.c;h=4aa527b8a91381289eb175b33f46e3e418d10374;hb=HEAD#l848
>
OK, the normal release path seems right: put_device() will free the name.

What about the other failure paths, where no put_device() is involved?

YH

2009-04-28 15:36:29

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Yinghai Lu wrote:
> Kay Sievers wrote:
>> On Tue, Apr 28, 2009 at 09:42, Yinghai Lu <[email protected]> wrote:
>>> those about 1/3 dev_set_name() etc.
>> put_device()?
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/base/core.c;h=4aa527b8a91381289eb175b33f46e3e418d10374;hb=HEAD#l848
>>
> ok, normal release path seems right, put_device will free the name.
>
> how about other fail path, that there is not put_device involved?
>

looks like need to follow this pattern

static int
sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
struct sa1111_dev_info *info)
{
struct sa1111_dev *dev;
int ret;

dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
if (!dev) {
ret = -ENOMEM;
goto out;
}

dev_set_name(&dev->dev, "%4.4lx", info->offset);
dev->devid = info->devid;
dev->dev.parent = sachip->dev;
dev->dev.bus = &sa1111_bus_type;
dev->dev.release = sa1111_dev_release;
dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
dev->res.start = sachip->phys + info->offset;
dev->res.end = dev->res.start + 511;
dev->res.name = dev_name(&dev->dev);
dev->res.flags = IORESOURCE_MEM;
dev->mapbase = sachip->base + info->offset;
dev->skpcr_mask = info->skpcr_mask;
memmove(dev->irq, info->irq, sizeof(dev->irq));

ret = request_resource(parent, &dev->res);
if (ret) {
printk("SA1111: failed to allocate resource for %s\n",
dev->res.name);
dev_set_name(&dev->dev, NULL); ============> clear the name
kfree(dev);
goto out;
}


ret = device_register(&dev->dev);
if (ret) {
release_resource(&dev->res);
put_device(&dev->dev); ==================> put the device...
kfree(dev);
goto out;
}


YH

2009-04-28 15:44:50

by Greg KH

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

On Tue, Apr 28, 2009 at 08:34:29AM -0700, Yinghai Lu wrote:
> Yinghai Lu wrote:
> > Kay Sievers wrote:
> >> On Tue, Apr 28, 2009 at 09:42, Yinghai Lu <[email protected]> wrote:
> >>> those about 1/3 dev_set_name() etc.
> >> put_device()?
> >>
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/base/core.c;h=4aa527b8a91381289eb175b33f46e3e418d10374;hb=HEAD#l848
> >>
> > ok, normal release path seems right, put_device will free the name.
> >
> > how about other fail path, that there is not put_device involved?
> >
>
> looks like need to follow this pattern
>
> static int
> sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
> struct sa1111_dev_info *info)
> {
> struct sa1111_dev *dev;
> int ret;
>
> dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
> if (!dev) {
> ret = -ENOMEM;
> goto out;
> }
>
> dev_set_name(&dev->dev, "%4.4lx", info->offset);
> dev->devid = info->devid;
> dev->dev.parent = sachip->dev;
> dev->dev.bus = &sa1111_bus_type;
> dev->dev.release = sa1111_dev_release;
> dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
> dev->res.start = sachip->phys + info->offset;
> dev->res.end = dev->res.start + 511;
> dev->res.name = dev_name(&dev->dev);
> dev->res.flags = IORESOURCE_MEM;
> dev->mapbase = sachip->base + info->offset;
> dev->skpcr_mask = info->skpcr_mask;
> memmove(dev->irq, info->irq, sizeof(dev->irq));
>
> ret = request_resource(parent, &dev->res);
> if (ret) {
> printk("SA1111: failed to allocate resource for %s\n",
> dev->res.name);
> dev_set_name(&dev->dev, NULL); ============> clear the name
> kfree(dev);
> goto out;
> }
>
>
> ret = device_register(&dev->dev);
> if (ret) {
> release_resource(&dev->res);
> put_device(&dev->dev); ==================> put the device...
> kfree(dev);
> goto out;
> }

You can just do a "put_device()" in both places, and it should be fine.

thanks,

greg k-h

2009-04-28 15:45:20

by Greg KH

Subject: Re: [PATCH] driver: make dev_set_name(, NULL) work

On Tue, Apr 28, 2009 at 08:14:13AM -0700, Yinghai Lu wrote:
> Greg KH wrote:
> > On Tue, Apr 28, 2009 at 12:36:06AM -0700, Yinghai Lu wrote:
> >> while looking dev_set_name() calling, there is one
> >> dev_set_name(&dev->dev, NULL)
> >> to used to try to free the device name, before kfree that device.
> >
> > What's wrong with that?
> >
> >> need to move the check for device_add in
> >> | commit 8a577ffc75d9194fe8cdb7479236f2081c26ca1f
> >> | Author: Kay Sievers <[email protected]>
> >> | Date: Sat Apr 18 15:05:45 2009 -0700
> >> |
> >> | driver: dont update dev_name via device_add path
> >> from kobject_set_name_vargs to kobject_add_vargs instead.
> >>
> >> in kobject_set_name_vargs will check if fmt is NULL.
> >>
> >> actually we need to use dev_set_name(,NULL) later on failing path
> >> and release to prevent leaking
> >
> > Are you sure?
> >
> > confused,
> >
>
> in arch/arm/common/sa1111.c
>
>
> static int
> sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
> struct sa1111_dev_info *info)
> {
> struct sa1111_dev *dev;
> int ret;
>
> dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
> if (!dev) {
> ret = -ENOMEM;
> goto out;
> }
>
> dev_set_name(&dev->dev, "%4.4lx", info->offset);
> dev->devid = info->devid;
> dev->dev.parent = sachip->dev;
> dev->dev.bus = &sa1111_bus_type;
> dev->dev.release = sa1111_dev_release;
> dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
> dev->res.start = sachip->phys + info->offset;
> dev->res.end = dev->res.start + 511;
> dev->res.name = dev_name(&dev->dev);
> dev->res.flags = IORESOURCE_MEM;
> dev->mapbase = sachip->base + info->offset;
> dev->skpcr_mask = info->skpcr_mask;
> memmove(dev->irq, info->irq, sizeof(dev->irq));
>
> ret = request_resource(parent, &dev->res);
> if (ret) {
> printk("SA1111: failed to allocate resource for %s\n",
> dev->res.name);
> dev_set_name(&dev->dev, NULL);
> kfree(dev);
> goto out;
> }
>
>
> when first dev_set_name is called, dev->dev.kobj.name will initialized from kmalloc.
> so before kfree(dev), do we need to kfree that name?

Do a "put_device()" instead, and everything will be freed.

thanks,

greg k-h

2009-04-28 15:53:49

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Greg KH wrote:
> On Tue, Apr 28, 2009 at 08:34:29AM -0700, Yinghai Lu wrote:
>> Yinghai Lu wrote:
>>> Kay Sievers wrote:
>>>> On Tue, Apr 28, 2009 at 09:42, Yinghai Lu <[email protected]> wrote:
>>>>> those about 1/3 dev_set_name() etc.
>>>> put_device()?
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/base/core.c;h=4aa527b8a91381289eb175b33f46e3e418d10374;hb=HEAD#l848
>>>>
>>> ok, normal release path seems right, put_device will free the name.
>>>
>>> how about other fail path, that there is not put_device involved?
>>>
>>
>> looks like need to follow this pattern
>>
>> static int
>> sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
>> struct sa1111_dev_info *info)
>> {
>> struct sa1111_dev *dev;
>> int ret;
>>
>> dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
>> if (!dev) {
>> ret = -ENOMEM;
>> goto out;
>> }
>>
>> dev_set_name(&dev->dev, "%4.4lx", info->offset);
>> dev->devid = info->devid;
>> dev->dev.parent = sachip->dev;
>> dev->dev.bus = &sa1111_bus_type;
>> dev->dev.release = sa1111_dev_release;
>> dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
>> dev->res.start = sachip->phys + info->offset;
>> dev->res.end = dev->res.start + 511;
>> dev->res.name = dev_name(&dev->dev);
>> dev->res.flags = IORESOURCE_MEM;
>> dev->mapbase = sachip->base + info->offset;
>> dev->skpcr_mask = info->skpcr_mask;
>> memmove(dev->irq, info->irq, sizeof(dev->irq));
>>
>> ret = request_resource(parent, &dev->res);
>> if (ret) {
>> printk("SA1111: failed to allocate resource for %s\n",
>> dev->res.name);
>> dev_set_name(&dev->dev, NULL); ============> clear the name
>> kfree(dev);
>> goto out;
>> }
>>
>>
>> ret = device_register(&dev->dev);
>> if (ret) {
>> release_resource(&dev->res);
>> put_device(&dev->dev); ==================> put the device...
>> kfree(dev);
>> goto out;
>> }
>
> You can just do a "put_device()" in both places, and it should be fine.
>

before device_register==>device_initialize is called, kobj->ref is still 0.

will get warn from
if (!kobj->state_initialized)
WARN(1, KERN_WARNING "kobject: '%s' (%p): is not "
"initialized, yet kobject_put() is being "
"called.\n", kobject_name(kobj), kobj);

also wonder
int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
WARN_ON(release == NULL);
WARN_ON(release == (void (*)(struct kref *))kfree);

if (atomic_dec_and_test(&kref->refcount)) {
release(kref);
return 1;
}
return 0;
}

What will atomic_dec_and_test() return in that case?

YH
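
The last question above can be explored with a userspace sketch. It assumes that atomic_dec_and_test() decrements the counter and returns true only when the new value is zero. Under that assumption, a counter that is still 0 drops to -1 and the release callback is not invoked. The names model_atomic_dec_and_test() and put_would_release() are hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of atomic_dec_and_test(): decrement, then report
 * whether the new value is zero.
 */
static bool model_atomic_dec_and_test(atomic_int *v)
{
	/* atomic_fetch_sub() returns the value held before the subtraction */
	return atomic_fetch_sub(v, 1) == 1;
}

/*
 * Returns true if a "put" on a counter starting at `start` would call
 * the release function; stores the resulting counter value in *after.
 */
static bool put_would_release(int start, int *after)
{
	atomic_int refcount;
	bool released;

	atomic_init(&refcount, start);
	released = model_atomic_dec_and_test(&refcount);
	*after = atomic_load(&refcount);
	return released;
}
```

Calling put_would_release(0, &after) models a put on a device that was never initialized, and put_would_release(1, &after) models a put on one whose reference count was set up first.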

2009-04-28 15:56:37

by Kay Sievers

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:

> before device_register==>device_initialize is called, kobj->ref is still 0.
>
> will get warn from
>                if (!kobj->state_initialized)

Initialize the device before you do anything with it. And call _put()
any time to get rid of resources, which might have been allocated
before registering.

Thanks,
Kay

2009-04-28 16:10:16

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Kay Sievers wrote:
> On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:
>
>> before device_register==>device_initialize is called, kobj->ref is still 0.
>>
>> will get warn from
>> if (!kobj->state_initialized)
>
> Initialize the device before you do anything with it. And call _put()
> any time to get rid of resources, which might have been allocated
> before registering.
>

struct sa1111_dev *dev;
int ret;

dev = kzalloc(sizeof(struct sa1111_dev), GFP_KERNEL);
if (!dev) {
ret = -ENOMEM;
goto out;
}

dev_set_name(&dev->dev, "%4.4lx", info->offset);
dev->devid = info->devid;
dev->dev.parent = sachip->dev;
dev->dev.bus = &sa1111_bus_type;
dev->dev.release = sa1111_dev_release;
dev->dev.coherent_dma_mask = sachip->dev->coherent_dma_mask;
dev->res.start = sachip->phys + info->offset;
dev->res.end = dev->res.start + 511;
dev->res.name = dev_name(&dev->dev);
dev->res.flags = IORESOURCE_MEM;
dev->mapbase = sachip->base + info->offset;
dev->skpcr_mask = info->skpcr_mask;
memmove(dev->irq, info->irq, sizeof(dev->irq));

ret = request_resource(parent, &dev->res);
if (ret) {
printk("SA1111: failed to allocate resource for %s\n",
dev->res.name);
dev_set_name(&dev->dev, NULL);
kfree(dev);
goto out;
}


ret = device_register(&dev->dev);
if (ret) {
release_resource(&dev->res);
put_device(&dev->dev);
kfree(dev);
goto out;
}


So you mean: don't call dev_set_name() before device_register(), and let device_register() or device_add() take a name parameter?

YH

2009-04-28 16:15:50

by Kay Sievers

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

On Tue, Apr 28, 2009 at 18:08, Yinghai Lu <[email protected]> wrote:
> Kay Sievers wrote:
>> On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:
>>
>>> before device_register==>device_initialize is called, kobj->ref is still 0.
>>>
>>> will get warn from
>>>                if (!kobj->state_initialized)
>>
>> Initialize the device before you do anything with it. And call _put()
>> any time to get rid of resources, which might have been allocated
>> before registering.

> so you mean don't call dev_set_name before device_register and let device_register or device_add take name param?

No. I meant:

device_initialize()
call put_device() any time you want to get rid of resources of the device
dev_set_name()
call put_device() any time you want to get rid of resources of the device
device_register()
...

Kay
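
The ordering listed above can be sketched in userspace as follows. This is a simplified model, not the driver core: model_device, model_device_initialize(), model_dev_set_name(), model_put_device() and model_release() are hypothetical names. It assumes only that initialization sets the reference count to one and that dropping the last reference frees both the name and, via the release callback, the structure itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a refcounted struct device. */
struct model_device {
	int refcount;
	char *name;
	void (*release)(struct model_device *dev);
};

static int release_calls;	/* how often the release callback ran */

/* After initialization the caller holds exactly one reference. */
static void model_device_initialize(struct model_device *dev)
{
	dev->refcount = 1;
}

static int model_dev_set_name(struct model_device *dev, const char *name)
{
	char *copy = malloc(strlen(name) + 1);

	if (!copy)
		return -1;
	strcpy(copy, name);
	free(dev->name);
	dev->name = copy;
	return 0;
}

/* Drop one reference; the last put frees the device and its name. */
static void model_put_device(struct model_device *dev)
{
	if (--dev->refcount == 0) {
		char *name = dev->name;	/* saved: release() frees dev itself */

		dev->release(dev);
		free(name);
	}
}

static void model_release(struct model_device *dev)
{
	release_calls++;
	free(dev);
}

/* Bail out between naming and registration, as in the failing path. */
static int setup_then_bail(void)
{
	struct model_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return -1;

	model_device_initialize(dev);
	dev->release = model_release;
	if (model_dev_set_name(dev, "0400")) {
		model_put_device(dev);
		return -1;
	}

	/* pretend request_resource() failed here */
	model_put_device(dev);	/* frees the name and dev; no extra free(dev) */
	return release_calls;
}
```

In this model a free(dev) after model_put_device() would be a double free, because the release callback is the only place that frees the structure.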

2009-04-28 16:38:30

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Kay Sievers wrote:
> On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:
>
>> before device_register==>device_initialize is called, kobj->ref is still 0.
>>
>> will get warn from
>> if (!kobj->state_initialized)
>
> Initialize the device before you do anything with it. And call _put()
> any time to get rid of resources, which might have been allocated
> before registering.

So we need to replace device_register() with device_add() and call device_initialize() before dev_set_name()?

YH

2009-04-28 16:51:28

by Kay Sievers

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

On Tue, Apr 28, 2009 at 18:36, Yinghai Lu <[email protected]> wrote:
> Kay Sievers wrote:
>> On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:
>>
>>> before device_register==>device_initialize is called, kobj->ref is still 0.
>>>
>>> will get warn from
>>>                if (!kobj->state_initialized)
>>
>> Initialize the device before you do anything with it. And call _put()
>> any time to get rid of resources, which might have been allocated
>> before registering.
>
> need to replace device_register with device_add and call device_initialize before dev_set_name?

Sounds right in the case you want to jump out between set_name() and
_register() -- means you have an uninitialized device, where you can
not call put_device(). Otherwise after a failing _register() it should
work fine.

Kay

2009-04-28 19:06:51

by Yinghai Lu

Subject: Re: [RFC PATCH] use dev_set_name(,NULL) to prevent leaking

Kay Sievers wrote:
> On Tue, Apr 28, 2009 at 18:08, Yinghai Lu <[email protected]> wrote:
>> Kay Sievers wrote:
>>> On Tue, Apr 28, 2009 at 17:51, Yinghai Lu <[email protected]> wrote:
>>>
>>>> before device_register==>device_initialize is called, kobj->ref is still 0.
>>>>
>>>> will get warn from
>>>> if (!kobj->state_initialized)
>>> Initialize the device before you do anything with it. And call _put()
>>> any time to get rid of resources, which might have been allocated
>>> before registering.
>
>> so you mean don't call dev_set_name before device_register and let device_register or device_add take name param?
>
> No. I meant:
>
> device_initialize()
> call put_device() any time you want to get rid of resources of the device
> dev_set_name()
> call put_device() any time you want to get rid of resources of the device
> device_register()
> ...
>

That looks like overkill.

I still hope to make dev_set_name(,NULL) work, i.e. if device_initialize()/device_register() is called before, one could only use dev_set_name(,NULL) to clear the name that was set.

When device_register() fails, I could not find the corresponding kfree(); calling put_device() in that case looks scary.

diff --git a/arch/arm/common/locomo.c b/arch/arm/common/locomo.c
index 2293f0c..2cd7e16 100644
--- a/arch/arm/common/locomo.c
+++ b/arch/arm/common/locomo.c
@@ -560,6 +560,7 @@ locomo_init_one_child(struct locomo *lchip, struct locomo_dev_info *info)
ret = device_register(&dev->dev);
if (ret) {
out:
+ put_device(&dev->dev);
kfree(dev);
}
return ret;
diff --git a/arch/arm/common/sa1111.c b/arch/arm/common/sa1111.c
index ef12794..c52534c 100644
--- a/arch/arm/common/sa1111.c
+++ b/arch/arm/common/sa1111.c
@@ -550,6 +550,7 @@ sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
goto out;
}

+ device_initialize(&dev->dev);
dev_set_name(&dev->dev, "%4.4lx", info->offset);
dev->devid = info->devid;
dev->dev.parent = sachip->dev;
@@ -568,15 +569,16 @@ sa1111_init_one_child(struct sa1111 *sachip, struct resource *parent,
if (ret) {
printk("SA1111: failed to allocate resource for %s\n",
dev->res.name);
- dev_set_name(&dev->dev, NULL);
+ put_device(&dev->dev);
kfree(dev);
goto out;
}


- ret = device_register(&dev->dev);
+ ret = device_add(&dev->dev);
if (ret) {
release_resource(&dev->res);
+ put_device(&dev->dev);
kfree(dev);
goto out;
}
diff --git a/arch/arm/kernel/ecard.c b/arch/arm/kernel/ecard.c
index eed2f79..05789c5 100644
--- a/arch/arm/kernel/ecard.c
+++ b/arch/arm/kernel/ecard.c
@@ -795,6 +795,7 @@ static void __init ecard_free_card(struct expansion_card *ec)
if (ec->resource[i].flags)
release_resource(&ec->resource[i]);

+ put_device(&ec->dev);
kfree(ec);
}

@@ -817,6 +818,7 @@ static struct expansion_card *__init ecard_alloc_card(int type, int slot)
ec->dma = NO_DMA;
ec->ops = &ecard_default_ops;

+ device_initialize(&ec->dev);
dev_set_name(&ec->dev, "ecard%d", slot);
ec->dev.parent = NULL;
ec->dev.bus = &ecard_bus_type;
@@ -1063,7 +1065,7 @@ ecard_probe(int slot, card_type_t type)
*ecp = ec;
slot_to_expcard[slot] = ec;

- device_register(&ec->dev);
+ device_add(&ec->dev);

return 0;

diff --git a/arch/arm/mach-integrator/impd1.c b/arch/arm/mach-integrator/impd1.c
index 0058c93..919390f 100644
--- a/arch/arm/mach-integrator/impd1.c
+++ b/arch/arm/mach-integrator/impd1.c
@@ -399,6 +399,7 @@ static int impd1_probe(struct lm_device *dev)
if (!d)
continue;

+ device_initialize(&d->dev);
dev_set_name(&d->dev, "lm%x:%5.5lx", dev->id, idev->offset >> 12);
d->dev.parent = &dev->dev;
d->res.start = dev->resource.start + idev->offset;
diff --git a/arch/arm/mach-integrator/lm.c b/arch/arm/mach-integrator/lm.c
index f52c7af..af0363c 100644
--- a/arch/arm/mach-integrator/lm.c
+++ b/arch/arm/mach-integrator/lm.c
@@ -78,6 +78,7 @@ int lm_device_register(struct lm_device *dev)
{
int ret;

+ device_initialize(&dev->dev);
dev->dev.release = lm_device_release;
dev->dev.bus = &lm_bustype;

@@ -88,10 +89,13 @@ int lm_device_register(struct lm_device *dev)

ret = request_resource(&iomem_resource, &dev->resource);
if (ret == 0) {
- ret = device_register(&dev->dev);
+ ret = device_add(&dev->dev);
if (ret)
release_resource(&dev->resource);
}
+ if (ret)
+ put_device(&dev->dev);
+
return ret;
}

diff --git a/arch/parisc/kernel/drivers.c b/arch/parisc/kernel/drivers.c
index 994bcd9..6eb2f00 100644
--- a/arch/parisc/kernel/drivers.c
+++ b/arch/parisc/kernel/drivers.c
@@ -427,6 +427,7 @@ struct parisc_device * create_tree_node(char id, struct device *parent)
dev->dev.dma_mask = &dev->dma_mask;
dev->dev.coherent_dma_mask = dev->dma_mask;
if (device_register(&dev->dev)) {
+ put_device(&dev->dev);
kfree(dev);
return NULL;
}
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 819e59f..d15e238 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1246,6 +1246,7 @@ struct vio_dev *vio_register_device_node(struct device_node *of_node)
printk(KERN_ERR "%s: failed to register device %s\n",
__func__, dev_name(&viodev->dev));
/* XXX free TCE table */
+ put_device(&viodev->dev);
kfree(viodev);
return NULL;
}
diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index 9a73d02..8cfdd39 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -738,6 +738,7 @@ int ps3_system_bus_device_register(struct ps3_system_bus_device *dev)
static unsigned int dev_vuart_count;
static unsigned int dev_lpm_count;

+ device_initialize(&dev->core);
if (!dev->core.parent)
dev->core.parent = &ps3_system_bus;
dev->core.bus = &ps3_system_bus_type;
@@ -768,7 +769,7 @@ int ps3_system_bus_device_register(struct ps3_system_bus_device *dev)

pr_debug("%s:%d add %s\n", __func__, __LINE__, dev_name(&dev->core));

- result = device_register(&dev->core);
+ result = device_add(&dev->core);
return result;
}

diff --git a/arch/sparc/kernel/of_device_32.c b/arch/sparc/kernel/of_device_32.c
index c8f14c1..50a50fb 100644
--- a/arch/sparc/kernel/of_device_32.c
+++ b/arch/sparc/kernel/of_device_32.c
@@ -587,6 +587,7 @@ build_resources:
if (of_device_register(op)) {
printk("%s: Could not register of device.\n",
dp->full_name);
+ put_device(&op->dev);
kfree(op);
op = NULL;
}
diff --git a/arch/sparc/kernel/of_device_64.c b/arch/sparc/kernel/of_device_64.c
index 5ac287a..8ebce84 100644
--- a/arch/sparc/kernel/of_device_64.c
+++ b/arch/sparc/kernel/of_device_64.c
@@ -855,6 +855,7 @@ static struct of_device * __init scan_one_device(struct device_node *dp,
if (of_device_register(op)) {
printk("%s: Could not register of device.\n",
dp->full_name);
+ put_device(&op->dev);
kfree(op);
op = NULL;
}
diff --git a/arch/sparc/kernel/vio.c b/arch/sparc/kernel/vio.c
index 753d128..5bb557b 100644
--- a/arch/sparc/kernel/vio.c
+++ b/arch/sparc/kernel/vio.c
@@ -296,6 +296,7 @@ static struct vio_dev *vio_create_one(struct mdesc_handle *hp, u64 mp,
if (err) {
printk(KERN_ERR "VIO: Could not register device %s, err=%d\n",
dev_name(&vdev->dev), err);
+ put_device(&vdev->dev);
kfree(vdev);
return NULL;
}
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 8ff510b..bd28e8b 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -535,6 +535,7 @@ static int acpi_device_register(struct acpi_device *device,
result = device_add(&device->dev);
if(result) {
dev_err(&device->dev, "Error adding device\n");
+ put_device(&device->dev);
goto end;
}

diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 3d763fd..cabb645 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -207,6 +207,7 @@ int amba_device_register(struct amba_device *dev, struct resource *parent)
void __iomem *tmp;
int i, ret;

+ /* we already called device_initialize and dev_set_name */
dev->dev.release = amba_device_release;
dev->dev.bus = &amba_bustype;
dev->dev.dma_mask = &dev->dma_mask;
@@ -240,7 +241,7 @@ int amba_device_register(struct amba_device *dev, struct resource *parent)
goto err_release;
}

- ret = device_register(&dev->dev);
+ ret = device_add(&dev->dev);
if (ret)
goto err_release;

@@ -251,11 +252,11 @@ int amba_device_register(struct amba_device *dev, struct resource *parent)
if (ret == 0)
return ret;

- device_unregister(&dev->dev);
-
+ device_del(&dev->dev);
err_release:
release_resource(&dev->res);
err_out:
+ put_device(&dev->dev);
return ret;
}

diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index d3a59c6..980e80f 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -330,6 +330,7 @@ static int fw_register_device(struct device **dev_p, const char *fw_name,

error_kfree:
kfree(fw_priv);
+ put_device(f_dev);
kfree(f_dev);
return retval;
}
diff --git a/drivers/dio/dio.c b/drivers/dio/dio.c
index 55dd88d..d2ede72 100644
--- a/drivers/dio/dio.c
+++ b/drivers/dio/dio.c
@@ -186,6 +186,7 @@ static int __init dio_init(void)
error = device_register(&dio_bus.dev);
if (error) {
pr_err("DIO: Error registering dio_bus\n");
+ put_device(&dio_bus.dev);
return error;
}

@@ -261,6 +262,7 @@ static int __init dio_init(void)
if (error) {
pr_err("DIO: Error registering device %s\n",
dev->name);
+ put_device(&dev->dev);
continue;
}
error = dio_create_sysfs_dev_files(dev);
diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 92438e9..e67dab0 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -699,6 +699,7 @@ int dma_async_device_register(struct dma_device *device)
if (rc) {
free_percpu(chan->local);
chan->local = NULL;
+ put_device(&chan->dev->device);
kfree(chan->dev);
atomic_dec(idr_ref);
goto err_out;
diff --git a/drivers/firewire/fw-device.c b/drivers/firewire/fw-device.c
index a47e212..3613654 100644
--- a/drivers/firewire/fw-device.c
+++ b/drivers/firewire/fw-device.c
@@ -579,6 +579,7 @@ static void create_units(struct fw_device *device)
continue;

skip_unit:
+ put_device(&unit->device);
kfree(unit);
}
}
diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 022876a..47fd51a 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -401,9 +401,10 @@ err_out_files:
for (j = 0; j < i; j++)
device_remove_file(&connector->kdev,
&connector_attrs[i]);
- device_unregister(&connector->kdev);
+ device_del(&connector->kdev);

out:
+ put_device(&connector->kdev);
return ret;
}
EXPORT_SYMBOL(drm_sysfs_connector_add);
@@ -489,8 +490,8 @@ int drm_sysfs_device_add(struct drm_minor *minor)

return 0;

- device_unregister(&minor->kdev);
err_out:
+ put_device(&minor->kdev);

return err;
}
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 3d4e099..836a48e 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1847,6 +1847,7 @@ static int ide_cd_probe(ide_drive_t *drive)
out_free_disk:
put_disk(g);
out_free_cd:
+ put_device(&info->dev);
kfree(info);
failed:
return -ENODEV;
diff --git a/drivers/ide/ide-gd.c b/drivers/ide/ide-gd.c
index 4b6b71e..f94a563 100644
--- a/drivers/ide/ide-gd.c
+++ b/drivers/ide/ide-gd.c
@@ -391,6 +391,7 @@ static int ide_gd_probe(ide_drive_t *drive)
out_free_disk:
put_disk(g);
out_free_idkp:
+ put_device(&idkp->dev);
kfree(idkp);
failed:
return -ENODEV;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 7f264ed..755b27c 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -566,9 +566,10 @@ static int ide_register_port(ide_hwif_t *hwif)
MKDEV(0, 0), hwif, hwif->name);
if (IS_ERR(hwif->portdev)) {
ret = PTR_ERR(hwif->portdev);
- device_unregister(&hwif->gendev);
+ device_del(&hwif->gendev);
}
out:
+ put_device(&hwif->gendev);
return ret;
}

diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index cb942a9..5ec390f 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -2419,6 +2419,7 @@ static int ide_tape_probe(ide_drive_t *drive)
out_free_disk:
put_disk(g);
out_free_tape:
+ put_device(&tape->dev);
kfree(tape);
failed:
return -ENODEV;
diff --git a/drivers/ieee1394/hosts.c b/drivers/ieee1394/hosts.c
index e947d8f..f44acaa 100644
--- a/drivers/ieee1394/hosts.c
+++ b/drivers/ieee1394/hosts.c
@@ -157,14 +157,16 @@ struct hpsb_host *hpsb_alloc_host(struct hpsb_host_driver *drv, size_t extra,
set_dev_node(&h->device, dev_to_node(dev));
dev_set_name(&h->device, "fw-host%d", h->id);

+ device_initialize(&h->host_dev);
h->host_dev.parent = &h->device;
h->host_dev.class = &hpsb_host_class;
dev_set_name(&h->host_dev, "fw-host%d", h->id);

if (device_register(&h->device))
goto fail;
- if (device_register(&h->host_dev)) {
- device_unregister(&h->device);
+
+ if (device_add(&h->host_dev)) {
+ device_del(&h->device);
goto fail;
}
get_device(&h->device);
@@ -172,6 +174,8 @@ struct hpsb_host *hpsb_alloc_host(struct hpsb_host_driver *drv, size_t extra,
return h;

fail:
+ put_device(&h->device);
+ put_device(&h->host_dev);
kfree(h);
return NULL;
}
diff --git a/drivers/ieee1394/nodemgr.c b/drivers/ieee1394/nodemgr.c
index 065f249..86fe12c 100644
--- a/drivers/ieee1394/nodemgr.c
+++ b/drivers/ieee1394/nodemgr.c
@@ -826,13 +826,14 @@ static struct node_entry *nodemgr_create_node(octlet_t guid,
ne->device.parent = &host->device;
dev_set_name(&ne->device, "%016Lx", (unsigned long long)(ne->guid));

+ device_initialize(&ne->node_dev);
ne->node_dev.parent = &ne->device;
ne->node_dev.class = &nodemgr_ne_class;
dev_set_name(&ne->node_dev, "%016Lx", (unsigned long long)(ne->guid));

if (device_register(&ne->device))
goto fail_devreg;
- if (device_register(&ne->node_dev))
+ if (device_add(&ne->node_dev))
goto fail_classdevreg;
get_device(&ne->device);

@@ -847,8 +848,10 @@ static struct node_entry *nodemgr_create_node(octlet_t guid,
return ne;

fail_classdevreg:
- device_unregister(&ne->device);
+ device_del(&ne->device);
fail_devreg:
+ put_device(&ne->device);
+ put_device(&ne->node_dev);
kfree(ne);
fail_alloc:
HPSB_ERR("Failed to create node ID:BUS[" NODE_BUS_FMT "] GUID[%016Lx]",
@@ -930,13 +933,14 @@ static void nodemgr_register_device(struct node_entry *ne,

dev_set_name(&ud->device, "%s-%u", dev_name(&ne->device), ud->id);

+ device_initialize(&ud->unit_dev);
ud->unit_dev.parent = &ud->device;
ud->unit_dev.class = &nodemgr_ud_class;
dev_set_name(&ud->unit_dev, "%s-%u", dev_name(&ne->device), ud->id);

if (device_register(&ud->device))
goto fail_devreg;
- if (device_register(&ud->unit_dev))
+ if (device_add(&ud->unit_dev))
goto fail_classdevreg;
get_device(&ud->device);

@@ -945,9 +949,11 @@ static void nodemgr_register_device(struct node_entry *ne,
return;

fail_classdevreg:
- device_unregister(&ud->device);
+ device_del(&ud->device);
fail_devreg:
HPSB_ERR("Failed to create unit %s", dev_name(&ud->device));
+ put_device(&ud->device);
+ put_device(&ud->unit_dev);
}


diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 5c04cfb..e4cc434 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -821,9 +821,10 @@ err_put:
kobject_put(&class_dev->kobj);

err_unregister:
- device_unregister(class_dev);
+ device_del(class_dev);

err:
+ put_device(class_dev);
return ret;
}

diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 51bd966..0ce368a 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1277,8 +1277,9 @@ static void ib_ucm_add_one(struct ib_device *device)
return;

err_dev:
- device_unregister(&ucm_dev->dev);
+ device_del(&ucm_dev->dev);
err_cdev:
+ put_device(&ucm_dev->dev);
cdev_del(&ucm_dev->cdev);
clear_bit(ucm_dev->devnum, dev_map);
err:
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 54c8fe2..bcb4c05 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1963,9 +1963,10 @@ static struct srp_host *srp_add_port(struct srp_device *device, u8 port)
return host;

err_class:
- device_unregister(&host->dev);
+ device_del(&host->dev);

free_host:
+ put_device(&host->dev);
kfree(host);

return NULL;
diff --git a/drivers/input/input.c b/drivers/input/input.c
index 935a183..5c13a1a 100644
--- a/drivers/input/input.c
+++ b/drivers/input/input.c
@@ -1398,8 +1398,10 @@ int input_register_device(struct input_dev *dev)
(unsigned long) atomic_inc_return(&input_no) - 1);

error = device_add(&dev->dev);
- if (error)
+ if (error) {
+ put_device(&dev->dev);
return error;
+ }

path = kobject_get_path(&dev->dev.kobj, GFP_KERNEL);
printk(KERN_INFO "input: %s as %s\n",
@@ -1409,6 +1411,7 @@ int input_register_device(struct input_dev *dev)
error = mutex_lock_interruptible(&input_mutex);
if (error) {
device_del(&dev->dev);
+ put_device(&dev->dev);
return error;
}

diff --git a/drivers/isdn/mISDN/core.c b/drivers/isdn/mISDN/core.c
index 9426c98..9ee0b2e 100644
--- a/drivers/isdn/mISDN/core.c
+++ b/drivers/isdn/mISDN/core.c
@@ -253,6 +253,7 @@ mISDN_register_device(struct mISDNdevice *dev,

error3:
delete_stack(dev);
+ put_device(&dev->dev);
return err;
error1:
return err;
diff --git a/drivers/isdn/mISDN/dsp_pipeline.c b/drivers/isdn/mISDN/dsp_pipeline.c
index 18cf87c..28ee4e2 100644
--- a/drivers/isdn/mISDN/dsp_pipeline.c
+++ b/drivers/isdn/mISDN/dsp_pipeline.c
@@ -127,9 +127,9 @@ int mISDN_dsp_element_register(struct mISDN_dsp_element *elem)
return 0;

err2:
- device_unregister(&entry->dev);
- return ret;
+ device_del(&entry->dev);
err1:
+ put_device(&entry->dev);
kfree(entry);
return ret;
}
diff --git a/drivers/macintosh/macio_asic.c b/drivers/macintosh/macio_asic.c
index 6e149f4..8b5b79c 100644
--- a/drivers/macintosh/macio_asic.c
+++ b/drivers/macintosh/macio_asic.c
@@ -409,6 +409,7 @@ static struct macio_dev * macio_add_one_device(struct macio_chip *chip,
if (of_device_register(&dev->ofdev) != 0) {
printk(KERN_DEBUG"macio: device registration error for %s!\n",
dev_name(&dev->ofdev.dev));
+ put_device(&dev->ofdev.dev);
kfree(dev);
return NULL;
}
diff --git a/drivers/mca/mca-bus.c b/drivers/mca/mca-bus.c
index ada5ebb..70c8de9 100644
--- a/drivers/mca/mca-bus.c
+++ b/drivers/mca/mca-bus.c
@@ -129,8 +129,9 @@ int __init mca_register_device(int bus, struct mca_device *mca_dev)
err_out_id:
device_remove_file(&mca_dev->dev, &dev_attr_id);
err_out_devreg:
- device_unregister(&mca_dev->dev);
+ device_del(&mca_dev->dev);
err_out:
+ put_device(&mca_dev->dev);
return 0;
}

@@ -154,6 +155,7 @@ struct mca_bus * __devinit mca_attach_bus(int bus)
dev_set_name(&mca_bus->dev, "mca%d", bus);
sprintf(mca_bus->name,"Host %s MCA Bridge", bus ? "Secondary" : "Primary");
if (device_register(&mca_bus->dev)) {
+ put_device(&mca_bus->dev);
kfree(mca_bus);
return NULL;
}
diff --git a/drivers/media/video/bt8xx/bttv-gpio.c b/drivers/media/video/bt8xx/bttv-gpio.c
index 74c325e..a886d80 100644
--- a/drivers/media/video/bt8xx/bttv-gpio.c
+++ b/drivers/media/video/bt8xx/bttv-gpio.c
@@ -95,6 +95,7 @@ int bttv_sub_add_device(struct bttv_core *core, char *name)

err = device_register(&sub->dev);
if (0 != err) {
+ put_device(&sub->dev);
kfree(sub);
return err;
}
diff --git a/drivers/media/video/pvrusb2/pvrusb2-sysfs.c b/drivers/media/video/pvrusb2/pvrusb2-sysfs.c
index 299c1cb..4764e24 100644
--- a/drivers/media/video/pvrusb2/pvrusb2-sysfs.c
+++ b/drivers/media/video/pvrusb2/pvrusb2-sysfs.c
@@ -640,6 +640,7 @@ static void class_dev_create(struct pvr2_sysfs *sfp,
if (ret) {
pvr2_trace(PVR2_TRACE_ERROR_LEGS,
"device_register failed");
+ put_device(class_dev);
kfree(class_dev);
return;
}
diff --git a/drivers/media/video/v4l2-dev.c b/drivers/media/video/v4l2-dev.c
index 31eac66..d6d1132 100644
--- a/drivers/media/video/v4l2-dev.c
+++ b/drivers/media/video/v4l2-dev.c
@@ -522,6 +522,7 @@ int video_register_device_index(struct video_device *vdev, int type, int nr,
ret = device_register(&vdev->dev);
if (ret < 0) {
printk(KERN_ERR "%s: device_register failed\n", __func__);
+ put_device(&vdev->dev);
goto cleanup;
}
/* Register the release callback that will be called when the last
diff --git a/drivers/memstick/core/memstick.c b/drivers/memstick/core/memstick.c
index a5b448e..7400d2c 100644
--- a/drivers/memstick/core/memstick.c
+++ b/drivers/memstick/core/memstick.c
@@ -413,6 +413,7 @@ static struct memstick_dev *memstick_alloc_card(struct memstick_host *host)
return card;
err_out:
host->card = old_card;
+ put_device(&card->dev);
kfree(card);
return NULL;
}
diff --git a/drivers/message/i2o/device.c b/drivers/message/i2o/device.c
index 0ee4264..b22214c 100644
--- a/drivers/message/i2o/device.c
+++ b/drivers/message/i2o/device.c
@@ -300,8 +300,9 @@ rmlink1:
sysfs_remove_link(&i2o_dev->device.kobj, "user");
unreg_dev:
list_del(&i2o_dev->list);
- device_unregister(&i2o_dev->device);
+ device_del(&i2o_dev->device);
err:
+ put_device(&i2o_dev->device);
kfree(i2o_dev);
return rc;
}
diff --git a/drivers/mfd/mcp-core.c b/drivers/mfd/mcp-core.c
index 57271cb..1944ccf 100644
--- a/drivers/mfd/mcp-core.c
+++ b/drivers/mfd/mcp-core.c
@@ -214,8 +214,14 @@ EXPORT_SYMBOL(mcp_host_alloc);

int mcp_host_register(struct mcp *mcp)
{
+ int ret;
+
dev_set_name(&mcp->attached_device, "mcp0");
- return device_register(&mcp->attached_device);
+ ret = device_register(&mcp->attached_device);
+ if (ret)
+ put_device(&mcp->attached_device);
+
+ return ret;
}
EXPORT_SYMBOL(mcp_host_register);

diff --git a/drivers/mfd/ucb1x00-core.c b/drivers/mfd/ucb1x00-core.c
index fea9085..e9acaa6 100644
--- a/drivers/mfd/ucb1x00-core.c
+++ b/drivers/mfd/ucb1x00-core.c
@@ -532,6 +532,7 @@ static int ucb1x00_probe(struct mcp *mcp)

err_irq:
free_irq(ucb->irq, ucb);
+ put_device(&ucb->dev);
err_free:
kfree(ucb);
err_disable:
diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 3cf61ec..ac579aa 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -135,6 +135,7 @@ enclosure_register(struct device *dev, const char *name, int components,

err:
put_device(edev->edev.parent);
+ put_device(&edev->edev);
kfree(edev);
return ERR_PTR(err);
}
@@ -264,8 +265,10 @@ enclosure_component_register(struct enclosure_device *edev,
cdev->groups = enclosure_groups;

err = device_register(cdev);
- if (err)
+ if (err) {
ERR_PTR(err);
+ put_device(cdev);
+ }

return ecomp;
}