(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Thu, 2 Jul 2009 01:22:24 GMT [email protected] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13690
>
> Summary: nodes_clear cause hugepage unusable on non-NUMA
> machine
> Product: Platform Specific/Hardware
> Version: 2.5
> Kernel Version: 2.6.31-rc1
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: i386
> AssignedTo: [email protected]
> ReportedBy: [email protected]
> CC: [email protected]
> Regression: Yes
>
>
> Commit 73d60b7f747176dbdff826c4127d22e1fd3f9f74 introduced a nodes_clear()
> call for NUMA machines, but it seems the commit overlooks non-NUMA machines.
> If find_zone_movable_pfns_for_nodes/early_calculate_totalpages has no
> chance to run, nodes_clear() will block HUGEPAGE use in my specjbb2005
> testing on my Stoakley (i386/x86_64), waybridge (i386) and IBM T61 (i386).
>
> + /*
> + * find_zone_movable_pfns_for_nodes/early_calculate_totalpages init
> + * that node_mask, clear it at first
> + */
> + nodes_clear(node_states[N_HIGH_MEMORY]);
Thanks.
fyi, with recently-occurring bugs and regressions of this nature, it is (I
think) best to deal with them via email rather than bugzilla. Bugzilla is
better-suited to longer-lived bugs where we have a need to track them,
generate statistics, etc.
That looks strange...
The config is 32-bit.
The second patch only does a save and restore, and should be right.
Please check the following patch on today's Linus tree, and send out /proc/iomem.
Thanks
Yinghai
[PATCH] x86: add boundary check for 32bit res before expand e820 resource to alignment
Fix hang with HIGHMEM_64G and 32-bit resources.
According to hpa and Linus, use (resource_size_t)-1 to fend off big ranges.
Analyzed by hpa.
Reported-and-tested-by: Mikael Pettersson <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/proto.h | 3 ---
arch/x86/kernel/e820.c | 20 ++++++++++++--------
include/linux/kernel.h | 5 +++++
3 files changed, 17 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1367,9 +1367,9 @@ void __init e820_reserve_resources(void)
}
/* How much should we pad RAM ending depending on where it is? */
-static unsigned long ram_alignment(resource_size_t pos)
+static u64 ram_alignment(u64 pos)
{
- unsigned long mb = pos >> 20;
+ u64 mb = pos >> 20;
/* To 64kB in the first megabyte */
if (!mb)
@@ -1383,6 +1383,8 @@ static unsigned long ram_alignment(resou
return 32*1024*1024;
}
+#define MAX_RESOURCE_SIZE ((resource_size_t)-1)
+
void __init e820_reserve_resources_late(void)
{
int i;
@@ -1400,17 +1402,19 @@ void __init e820_reserve_resources_late(
* avoid stolen RAM:
*/
for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *entry = &e820_saved.map[i];
- resource_size_t start, end;
+ struct e820entry *entry = &e820.map[i];
+ u64 start, end;
if (entry->type != E820_RAM)
continue;
start = entry->addr + entry->size;
- end = round_up(start, ram_alignment(start));
- if (start == end)
+ end = round_up(start, ram_alignment(start)) - 1;
+ if (end > MAX_RESOURCE_SIZE)
+ end = MAX_RESOURCE_SIZE;
+ if (start > end)
continue;
- reserve_region_with_split(&iomem_resource, start,
- end - 1, "RAM buffer");
+ reserve_region_with_split(&iomem_resource, start, end,
+ "RAM buffer");
}
}
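For readers following the arithmetic, here is a minimal userspace sketch of the clamping logic in the patch above. It assumes a 32-bit resource_size_t (modelled with uint32_t) and power-of-two alignments. round_up_pow2() and ram_buffer_range() are hypothetical helper names, and the 1MB middle tier of ram_alignment() is not visible in the diff, so it is filled in here as an assumption.

```c
#include <stdint.h>

/* Assumption: resource_size_t is 32 bits wide, as on a 32-bit-resource config. */
typedef uint32_t resource_size_t;
#define MAX_RESOURCE_SIZE ((resource_size_t)-1)

/* Padding policy modelled on ram_alignment(); the 1MB tier is an assumption. */
static uint64_t ram_alignment(uint64_t pos)
{
	uint64_t mb = pos >> 20;

	if (!mb)
		return 64 * 1024;		/* 64kB in the first megabyte */
	if (mb < 16)
		return 1024 * 1024;		/* assumed: 1MB below 16MB */
	return 32 * 1024 * 1024;		/* 32MB above that */
}

/* Hypothetical stand-in for the kernel's round_up(); align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Compute the inclusive "RAM buffer" range that follows a RAM entry.
 * Returns 1 if there is something to reserve, 0 if the range is empty
 * or lies entirely above what a 32-bit resource can describe.
 */
static int ram_buffer_range(uint64_t addr, uint64_t size,
			    uint64_t *start, uint64_t *end)
{
	*start = addr + size;
	*end = round_up_pow2(*start, ram_alignment(*start)) - 1;
	if (*end > MAX_RESOURCE_SIZE)
		*end = MAX_RESOURCE_SIZE;	/* clamp before any truncation */
	if (*start > *end)
		return 0;
	return 1;
}
```

Doing the arithmetic in u64 and clamping afterwards is what keeps a start address above 4GB from being silently truncated into a bogus low range.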
The new patch works on my Stoakley i386 machine. But on the x86_64 machine
specjbb2005 still cannot run with hugepages. specjbb2005 uses the
same Java settings as on the i386 system. After applying your patch, the iomem of
x86_64 is:
00000000-0000ffff : reserved
00010000-0009cbff : System RAM
0009cc00-0009ffff : reserved
000cc000-000cffff : reserved
000e0000-000fffff : reserved
00100000-cfefffff : System RAM
01000000-014eb53e : Kernel code
014eb53f-0177390f : Kernel data
01830000-018f583f : Kernel bss
cff00000-cff0afff : ACPI Tables
cff0b000-cff0bfff : ACPI Non-volatile Storage
cff0c000-cfffffff : reserved
d0000000-d7ffffff : PCI Bus 0000:08
d0000000-d7ffffff : 0000:08:01.0
d8000000-d81fffff : PCI Bus 0000:03
d8000000-d81fffff : PCI Bus 0000:06
d8000000-d80fffff : 0000:06:02.0
d8100000-d810ffff : 0000:06:01.0
d8200000-d84fffff : PCI Bus 0000:03
d8200000-d83fffff : PCI Bus 0000:06
d8200000-d82fffff : 0000:06:02.0
d8200000-d82fffff : e100
d8300000-d831ffff : 0000:06:01.0
d8300000-d831ffff : e1000
d8320000-d832ffff : 0000:06:01.0
d8320000-d832ffff : e1000
d8330000-d8330fff : 0000:06:02.0
d8330000-d8330fff : e100
d8500000-d87fffff : PCI Bus 0000:07
d8500000-d8503fff : 0000:07:00.0
d8504000-d8507fff : 0000:07:00.1
d8520000-d853ffff : 0000:07:00.0
d8540000-d855ffff : 0000:07:00.1
d8600000-d86fffff : 0000:07:00.0
d8700000-d87fffff : 0000:07:00.1
d8800000-d88fffff : PCI Bus 0000:08
d8800000-d880ffff : 0000:08:01.0
d8810000-d8813fff : 0000:08:08.0
d8814000-d88147ff : 0000:08:08.0
d8820000-d883ffff : 0000:08:01.0
d8904000-d8907fff : 0000:00:1b.0
d8908000-d89083ff : 0000:00:1d.7
d8908000-d89083ff : ehci_hcd
d8908400-d89087ff : 0000:00:1f.2
d8908400-d89087ff : ahci
d8a00000-d8bfffff : PCI Bus 0000:07
d8a00000-d8afffff : 0000:07:00.0
d8b00000-d8bfffff : 0000:07:00.1
e0000000-efffffff : reserved
e0000000-efffffff : pnp 00:01
e0000000-e07fffff : PCI MMCONFIG 0 [00-07]
fe000000-fe01ffff : pnp 00:01
fe000000-fe01ffff : i5k_amb
fe600000-fe6fffff : pnp 00:01
fe700000-fe703fff : 0000:00:0f.0
fec00000-fec0ffff : reserved
fec00000-fec00fff : IOAPIC 0
fec88000-fec88fff : IOAPIC 1
fec88000-fec88fff : pnp 00:01
fec89000-fec89fff : IOAPIC 2
fec89000-fec89fff : pnp 00:01
fed00000-fed003ff : HPET 0
fed1c000-fed1ffff : pnp 00:01
fed20000-fed44fff : pnp 00:01
fed45000-fed8ffff : pnp 00:01
fee00000-fee00fff : Local APIC
fee00000-fee00fff : reserved
ff000000-ffffffff : reserved
100000000-12fffffff : System RAM
====================
The iomem of the i386 Stoakley, shown as a diff against the x86_64 one, is:
--- stoakley.iomem.x86_64 2009-07-02 13:53:35.000000000 +0800
+++ stoakley.iomem.i386 2009-07-02 14:19:59.000000000 +0800
@@ -1,12 +1,15 @@
00000000-0000ffff : reserved
00010000-0009cbff : System RAM
0009cc00-0009ffff : reserved
+000a0000-000bffff : Video RAM area
+000c0000-000cafff : Video ROM
000cc000-000cffff : reserved
000e0000-000fffff : reserved
+ 000f0000-000fffff : System ROM
00100000-cfefffff : System RAM
- 01000000-014eb53e : Kernel code
- 014eb53f-0177390f : Kernel data
- 01830000-018f583f : Kernel bss
+ 00100000-00602876 : Kernel code
+ 00602877-008e49db : Kernel data
+ 00954000-009fe433 : Kernel bss
cff00000-cff0afff : ACPI Tables
cff0b000-cff0bfff : ACPI Non-volatile Storage
cff0c000-cfffffff : reserved
@@ -50,7 +53,6 @@
e0000000-efffffff : pnp 00:01
e0000000-e07fffff : PCI MMCONFIG 0 [00-07]
fe000000-fe01ffff : pnp 00:01
- fe000000-fe01ffff : i5k_amb
fe600000-fe6fffff : pnp 00:01
fe700000-fe703fff : 0000:00:0f.0
fec00000-fec0ffff : reserved
@@ -66,4 +68,3 @@
fee00000-fee00fff : Local APIC
fee00000-fee00fff : reserved
ff000000-ffffffff : reserved
-100000000-12fffffff : System RAM
Alex
On Thu, 2009-07-02 at 10:14 +0800, Yinghai Lu wrote:
> That looks strange...
>
> The config is 32-bit.
>
> The second patch only does a save and restore, and should be right.
>
> Please check the following patch on today's Linus tree, and send out /proc/iomem.
>
> Thanks
>
> Yinghai
>
> [PATCH] x86: add boundary check for 32bit res before expand e820 resource to alignment
>
> Fix hang with HIGHMEM_64G and 32-bit resources.
> According to hpa and Linus, use (resource_size_t)-1 to fend off big ranges.
>
> Analyzed by hpa.
>
> Reported-and-tested-by: Mikael Pettersson <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/include/asm/proto.h | 3 ---
> arch/x86/kernel/e820.c | 20 ++++++++++++--------
> include/linux/kernel.h | 5 +++++
> 3 files changed, 17 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -1367,9 +1367,9 @@ void __init e820_reserve_resources(void)
> }
>
> /* How much should we pad RAM ending depending on where it is? */
> -static unsigned long ram_alignment(resource_size_t pos)
> +static u64 ram_alignment(u64 pos)
> {
> - unsigned long mb = pos >> 20;
> + u64 mb = pos >> 20;
>
> /* To 64kB in the first megabyte */
> if (!mb)
> @@ -1383,6 +1383,8 @@ static unsigned long ram_alignment(resou
> return 32*1024*1024;
> }
>
> +#define MAX_RESOURCE_SIZE ((resource_size_t)-1)
> +
> void __init e820_reserve_resources_late(void)
> {
> int i;
> @@ -1400,17 +1402,19 @@ void __init e820_reserve_resources_late(
> * avoid stolen RAM:
> */
> for (i = 0; i < e820.nr_map; i++) {
> - struct e820entry *entry = &e820_saved.map[i];
> - resource_size_t start, end;
> + struct e820entry *entry = &e820.map[i];
> + u64 start, end;
>
> if (entry->type != E820_RAM)
> continue;
> start = entry->addr + entry->size;
> - end = round_up(start, ram_alignment(start));
> - if (start == end)
> + end = round_up(start, ram_alignment(start)) - 1;
> + if (end > MAX_RESOURCE_SIZE)
> + end = MAX_RESOURCE_SIZE;
> + if (start > end)
> continue;
> - reserve_region_with_split(&iomem_resource, start,
> - end - 1, "RAM buffer");
> + reserve_region_with_split(&iomem_resource, start, end,
> + "RAM buffer");
> }
> }
>
Alex Shi wrote:
> The new patch works on my Stoakley i386 machine. But on the x86_64 machine
> specjbb2005 still cannot run with hugepages. specjbb2005 uses the
> same Java settings as on the i386 system. After applying your patch, the iomem of
> x86_64 is:
Please check:
[PATCH] x86: don't clear node_states[N_NORMAL_MEMORY] when numa is not compiled in
Alex found:
on an x86_64 machine, specjbb2005 still cannot run with hugepages.
This only happens when NUMA is not compiled in.
The root cause: node_set_state() will not set it back for us in that case,
so don't clear it when NUMA is not selected in the config.
Reported-by: Alex Shi <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -598,8 +598,14 @@ void __init paging_init(void)
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
- /* clear the default setting with node 0 */
+#if MAX_NUMNODES > 1
+ /*
+ * clear the default setting with node 0
+ * note: don't clear it, node_set_state will do nothing
+ * (aka set it back) when numa support is not compiled in
+ */
nodes_clear(node_states[N_NORMAL_MEMORY]);
+#endif
free_area_init_nodes(max_zone_pfns);
}
On Thu, 2 Jul 2009, Yinghai Lu wrote:
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -598,8 +598,14 @@ void __init paging_init(void)
>
> sparse_memory_present_with_active_regions(MAX_NUMNODES);
> sparse_init();
> - /* clear the default setting with node 0 */
> +#if MAX_NUMNODES > 1
> + /*
> + * clear the default setting with node 0
> + * note: don't clear it, node_set_state will do nothing
> + * (aka set it back) when numa support is not compiled in
> + */
> nodes_clear(node_states[N_NORMAL_MEMORY]);
The problem was that nodes_clear() does not fall back to a noop on !NUMA.
The node_set_state()/node_clear_state() operations do become noops.
Could we make it more consistent by using only operations of the same
type? For example, add a node_clearall_states() in include/linux/nodemask.h that
falls back to a noop on !NUMA like the node_*_state operations?
Another option is to restore node_states[N_NORMAL_MEMORY] to its
initial condition. See the definition of node_states in page_alloc.c.
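The asymmetry described above can be modelled in a few lines of plain C. This is only a sketch: every model_* name is hypothetical, MODEL_NUMA stands in for MAX_NUMNODES > 1, and a single unsigned long stands in for a nodemask_t.

```c
#include <stdbool.h>

/* Assumption: 0 models a kernel built without NUMA support. */
#ifndef MODEL_NUMA
#define MODEL_NUMA 0
#endif

enum model_state { M_NORMAL_MEMORY, M_NR_STATES };

/* One bit per node; node 0 is set by default, as in the static initialiser. */
static unsigned long model_node_states[M_NR_STATES] = { 1UL };

/* Like nodes_clear(): a raw mask operation that always clears. */
static void model_nodes_clear(unsigned long *mask)
{
	*mask = 0;
}

#if MODEL_NUMA
static void model_node_set_state(int node, enum model_state s)
{
	model_node_states[s] |= 1UL << node;
}

static void model_node_clear_state(int node, enum model_state s)
{
	model_node_states[s] &= ~(1UL << node);
}
#else
/* Without NUMA both helpers compile to nothing. */
static void model_node_set_state(int node, enum model_state s)
{
	(void)node;
	(void)s;
}

static void model_node_clear_state(int node, enum model_state s)
{
	(void)node;
	(void)s;
}
#endif

/* Mixed operation types: raw clear, then a state helper to set it back. */
static bool boot_with_nodes_clear(void)
{
	model_node_states[M_NORMAL_MEMORY] = 1UL;	/* default setting */
	model_nodes_clear(&model_node_states[M_NORMAL_MEMORY]);
	model_node_set_state(0, M_NORMAL_MEMORY);	/* later init step */
	return (model_node_states[M_NORMAL_MEMORY] & 1UL) != 0;
}

/* Same-type operations only: both become noops together on !NUMA. */
static bool boot_with_same_type_ops(void)
{
	model_node_states[M_NORMAL_MEMORY] = 1UL;	/* default setting */
	model_node_clear_state(0, M_NORMAL_MEMORY);
	model_node_set_state(0, M_NORMAL_MEMORY);
	return (model_node_states[M_NORMAL_MEMORY] & 1UL) != 0;
}
```

With MODEL_NUMA left at 0, the first sequence ends with node 0 missing from the mask and the second leaves it set, which is the inconsistency being pointed out.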
Yes, the patch fixes this bug!
Alex
>-----Original Message-----
>From: Yinghai Lu [mailto:[email protected]]
>Sent: 2009-07-02 16:51
>To: Shi, Alex; Andrew Morton; Ingo Molnar
>Cc: [email protected]; [email protected];
>Christoph Lameter; Mel Gorman; [email protected]; Zhang, Yanmin;
>Chen, Tim C
>Subject: Re: [Bugme-new] [Bug 13690] New: nodes_clear cause hugepage unusable
>on non-NUMA machine
>
>Alex Shi wrote:
>> The new patch works on my Stoakley i386 machine. But on the x86_64 machine
>> specjbb2005 still cannot run with hugepages. specjbb2005 uses the
>> same Java settings as on the i386 system. After applying your patch, the iomem of
>> x86_64 is:
>
>Please check:
>
>[PATCH] x86: don't clear node_states[N_NORMAL_MEMORY] when numa is not
>compiled in
>
>Alex found:
>on an x86_64 machine, specjbb2005 still cannot run with hugepages.
>
>This only happens when NUMA is not compiled in.
>
>The root cause: node_set_state() will not set it back for us in that case,
>
>so don't clear it when NUMA is not selected in the config.
>
>Reported-by: Alex Shi <[email protected]>
>Signed-off-by: Yinghai Lu <[email protected]>
>
>---
> arch/x86/mm/init_64.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
>Index: linux-2.6/arch/x86/mm/init_64.c
>===================================================================
>--- linux-2.6.orig/arch/x86/mm/init_64.c
>+++ linux-2.6/arch/x86/mm/init_64.c
>@@ -598,8 +598,14 @@ void __init paging_init(void)
>
> sparse_memory_present_with_active_regions(MAX_NUMNODES);
> sparse_init();
>- /* clear the default setting with node 0 */
>+#if MAX_NUMNODES > 1
>+ /*
>+ * clear the default setting with node 0
>+ * note: don't clear it, node_set_state will do nothing
>+ * (aka set it back) when numa support is not compiled in
>+ */
> nodes_clear(node_states[N_NORMAL_MEMORY]);
>+#endif
> free_area_init_nodes(max_zone_pfns);
> }
>
Alex found:
on an x86_64 machine, specjbb2005 still cannot run with hugepages.
This only happens when NUMA is not compiled in.
The root cause: node_set_state() will not set it back for us in that case,
so don't clear it when NUMA is not selected in the config.
v2: use node_clear_state() instead
Reported-and-Tested-by: Alex Shi <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -598,8 +598,15 @@ void __init paging_init(void)
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
- /* clear the default setting with node 0 */
- nodes_clear(node_states[N_NORMAL_MEMORY]);
+
+ /*
+ * clear the default setting with node 0
+ * note: don't use nodes_clear here, that is really clearing when
+ * numa support is not compiled in, and later node_set_state
+ * will not set it back.
+ */
+ node_clear_state(0, N_NORMAL_MEMORY);
+
free_area_init_nodes(max_zone_pfns);
}
Christoph Lameter wrote:
> On Thu, 2 Jul 2009, Yinghai Lu wrote:
>
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -598,8 +598,14 @@ void __init paging_init(void)
>>
>> sparse_memory_present_with_active_regions(MAX_NUMNODES);
>> sparse_init();
>> - /* clear the default setting with node 0 */
>> +#if MAX_NUMNODES > 1
>> + /*
>> + * clear the default setting with node 0
>> + * note: don't clear it, node_set_state will do nothing
>> + * (aka set it back) when numa support is not compiled in
>> + */
>> nodes_clear(node_states[N_NORMAL_MEMORY]);
>
> The problem was that nodes_clear() does not fall back to a noop on !NUMA.
> The node_set_state()/node_clear_state() operations do become noops.
>
> Could we make it more consistent by using only operations of the same
> type? For example, add a node_clearall_states() in include/linux/nodemask.h that
> falls back to a noop on !NUMA like the node_*_state operations?
>
> Another option is to restore node_states[N_NORMAL_MEMORY] to its
> initial condition. See the definition of node_states in page_alloc.c.
We could use node_clear_state(0, N_NORMAL_MEMORY) instead, because the default one only has node 0 set in that mask.
YH
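To illustrate why clearing only node 0 is enough when NUMA is compiled in, here is a small sketch. It assumes the default mask has only bit 0 set, as stated above; model_mask_t and the model_* helpers are hypothetical names, not kernel API.

```c
#include <stdint.h>

/* Hypothetical stand-in for a nodemask_t: bit n represents node n. */
typedef uint64_t model_mask_t;

/* Assumption from the thread: by default only node 0 is set. */
#define MODEL_DEFAULT_MASK ((model_mask_t)1)

/* Models nodes_clear(): drop every node from the mask. */
static model_mask_t model_clear_all(model_mask_t m)
{
	(void)m;
	return 0;
}

/* Models node_clear_state() with NUMA compiled in: drop a single node. */
static model_mask_t model_clear_node(model_mask_t m, int node)
{
	return m & ~((model_mask_t)1 << node);
}

/* Models node_set_state() with NUMA compiled in: add a single node. */
static model_mask_t model_set_node(model_mask_t m, int node)
{
	return m | ((model_mask_t)1 << node);
}
```

Starting from the default mask, model_clear_node(m, 0) and model_clear_all(m) give the same empty result, so later set operations repopulate the mask identically in both cases.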
Yinghai:
The 2.6.31-rc2 kernel still cannot use hugepages on a non-NUMA machine, and
this patch did not appear in the rc2 kernel. Are there any concerns about
it?
BRG
Alex
On Thu, 2009-07-02 at 16:50 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > The new patch works on my Stoakley i386 machine. But on the x86_64 machine
> > specjbb2005 still cannot run with hugepages. specjbb2005 uses the
> > same Java settings as on the i386 system. After applying your patch, the iomem of
> > x86_64 is:
>
> Please check:
>
> [PATCH] x86: don't clear node_states[N_NORMAL_MEMORY] when numa is not compiled in
>
> Alex found:
> on an x86_64 machine, specjbb2005 still cannot run with hugepages.
>
> This only happens when NUMA is not compiled in.
>
> The root cause: node_set_state() will not set it back for us in that case,
>
> so don't clear it when NUMA is not selected in the config.
>
> Reported-by: Alex Shi <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/mm/init_64.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -598,8 +598,14 @@ void __init paging_init(void)
>
> sparse_memory_present_with_active_regions(MAX_NUMNODES);
> sparse_init();
> - /* clear the default setting with node 0 */
> +#if MAX_NUMNODES > 1
> + /*
> + * clear the default setting with node 0
> + * note: don't clear it, node_set_state will do nothing
> + * (aka set it back) when numa support is not compiled in
> + */
> nodes_clear(node_states[N_NORMAL_MEMORY]);
> +#endif
> free_area_init_nodes(max_zone_pfns);
> }
>
Alex Shi wrote:
> Yinghai:
>
> The 2.6.31-rc2 kernel still cannot use hugepages on a non-NUMA machine, and
> this patch did not appear in the rc2 kernel. Are there any concerns about
> it?
>
can you check
http://lkml.org/lkml/2009/7/2/326
YH
On Tue, 2009-07-07 at 08:07 +0800, Yinghai Lu wrote:
> Alex Shi wrote:
> > Yinghai:
> >
> > The 2.6.31-rc2 kernel still cannot use hugepages on a non-NUMA machine, and
> > this patch did not appear in the rc2 kernel. Are there any concerns about
> > it?
> >
>
> can you check
> http://lkml.org/lkml/2009/7/2/326
>
> YH
It works on my Stoakley i386 and x86_64 machines with the latest Linus kernel tree.
Alex
Alex found:
on an x86_64 machine, specjbb2005 still cannot run with hugepages.
This only happens when NUMA is not compiled in.
The root cause: node_set_state() will not set it back for us in that case,
so don't clear it when NUMA is not selected in the config.
v2: use node_clear_state() instead
Reported-and-Tested-by: Alex Shi <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -598,8 +598,15 @@ void __init paging_init(void)
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
- /* clear the default setting with node 0 */
- nodes_clear(node_states[N_NORMAL_MEMORY]);
+
+ /*
+ * clear the default setting with node 0
+ * note: don't use nodes_clear here, that is really clearing when
+ * numa support is not compiled in, and later node_set_state
+ * will not set it back.
+ */
+ node_clear_state(0, N_NORMAL_MEMORY);
+
free_area_init_nodes(max_zone_pfns);
}
Reviewed-by: Christoph Lameter <[email protected]>