2011-06-21 15:42:11

by Conny Seidel

Subject: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hi,

the commit 797390d8554b1e07aabea37d0140933b0412dba0 breaks 32bit on AMD
with native NUMA and fakeNUMA.

Native NUMA still boots when the kernel parameter numa=off is added to
the cmdline.

[ 0.000000] BUG: unable to handle kernel paging request at 000012b0
[ 0.000000] IP: [<c1aa13ce>] memmap_init_zone+0x6c/0xf2
[ 0.000000] *pdpt = 0000000000000000 *pde = f000eef3f000ee00
[ 0.000000] Oops: 0000 [#1] SMP
[ 0.000000] last sysfs file:
[ 0.000000] Modules linked in:
[ 0.000000]
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00164-g797390d #1 To Be Filled By O.E.M. To Be Filled By O.E.M./E350M1
[ 0.000000] EIP: 0060:[<c1aa13ce>] EFLAGS: 00010012 CPU: 0
[ 0.000000] EIP is at memmap_init_zone+0x6c/0xf2
[ 0.000000] EAX: 00000000 EBX: 000a8000 ECX: 000a7fff EDX: f2c00b80
[ 0.000000] ESI: 000a8000 EDI: f2c00800 EBP: c19ffe54 ESP: c19ffe34
[ 0.000000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 0.000000] Process swapper (pid: 0, ti=c19fe000 task=c1a07f60 task.ti=c19fe000)
[ 0.000000] Stack:
[ 0.000000] 00000002 00000000 0023f000 00000000 10000000 00000a00 f2c00000 f2c00b58
[ 0.000000] c19ffeb0 c1a80f24 000375fe 00000000 f2c00800 00000800 00000100 00000030
[ 0.000000] c1abb768 0000003c 00000000 00000000 00000004 00207a02 f2c00800 000375fe
[ 0.000000] Call Trace:
[ 0.000000] [<c1a80f24>] free_area_init_node+0x358/0x385
[ 0.000000] [<c1a81384>] free_area_init_nodes+0x420/0x487
[ 0.000000] [<c1637323>] ? printk+0x14/0x16
[ 0.000000] [<c102489e>] ? memory_present+0x66/0x6f
[ 0.000000] [<c1a79326>] paging_init+0x114/0x11b
[ 0.000000] [<c101742f>] ? native_apic_mem_read+0x8/0x19
[ 0.000000] [<c1a6cb13>] setup_arch+0xb37/0xc0a
[ 0.000000] [<c1638f6d>] ? _raw_spin_unlock_irqrestore+0x19/0x25
[ 0.000000] [<c1638f6d>] ? _raw_spin_unlock_irqrestore+0x19/0x25
[ 0.000000] [<c1637323>] ? printk+0x14/0x16
[ 0.000000] [<c1a69554>] start_kernel+0x76/0x316
[ 0.000000] [<c1a690a8>] i386_start_kernel+0xa8/0xb0
[ 0.000000] Code: 0a c1 e0 1d 89 45 ec 8b 45 e4 03 3c 85 e8 5b a6 c1 e9 8a 00 00 00 89 f0 89 f3 c1 e8 0e 0f be 80 a8 57 a6 c1 8b 04 85 e8 5b a6 c1 <2b> 98 b0 12 00 00 c1 e3 05 03 98 ac 12 00 00 8b 03 25 ff ff ff
[ 0.000000] EIP: [<c1aa13ce>] memmap_init_zone+0x6c/0xf2 SS:ESP 0068:c19ffe34
[ 0.000000] CR2: 00000000000012b0
[ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.000000] Pid: 0, comm: swapper Tainted: G D 2.6.39-rc5-00164-g797390d #1
[ 0.000000] Call Trace:
[ 0.000000] [<c1637213>] panic+0x55/0x151
[ 0.000000] [<c10507c9>] ? blocking_notifier_call_chain+0x11/0x13
[ 0.000000] [<c1038340>] do_exit+0x99/0x6fa
[ 0.000000] [<c1638f6d>] ? _raw_spin_unlock_irqrestore+0x19/0x25
[ 0.000000] [<c10356de>] ? kmsg_dump+0x3c/0xbe
[ 0.000000] [<c163a569>] oops_end+0x97/0x9f
[ 0.000000] [<c101e9a4>] no_context+0x144/0x14e
[ 0.000000] [<c101eada>] __bad_area_nosemaphore+0x12c/0x134
[ 0.000000] [<c1a83a75>] ? memblock_add_region+0xbf/0x4af
[ 0.000000] [<c101eaf4>] bad_area_nosemaphore+0x12/0x15
[ 0.000000] [<c163beb0>] do_page_fault+0x1e8/0x3c8
[ 0.000000] [<c1a82c5e>] ? __alloc_memory_core_early+0x86/0x94
[ 0.000000] [<c163bcc8>] ? spurious_fault+0xf2/0xf2
[ 0.000000] [<c1639c6b>] error_code+0x5f/0x64
[ 0.000000] [<c163bcc8>] ? spurious_fault+0xf2/0xf2
[ 0.000000] [<c1aa13ce>] ? memmap_init_zone+0x6c/0xf2
[ 0.000000] [<c1a80f24>] free_area_init_node+0x358/0x385
[ 0.000000] [<c1a81384>] free_area_init_nodes+0x420/0x487
[ 0.000000] [<c1637323>] ? printk+0x14/0x16
[ 0.000000] [<c102489e>] ? memory_present+0x66/0x6f
[ 0.000000] [<c1a79326>] paging_init+0x114/0x11b
[ 0.000000] [<c101742f>] ? native_apic_mem_read+0x8/0x19
[ 0.000000] [<c1a6cb13>] setup_arch+0xb37/0xc0a
[ 0.000000] [<c1638f6d>] ? _raw_spin_unlock_irqrestore+0x19/0x25
[ 0.000000] [<c1638f6d>] ? _raw_spin_unlock_irqrestore+0x19/0x25
[ 0.000000] [<c1637323>] ? printk+0x14/0x16
[ 0.000000] [<c1a69554>] start_kernel+0x76/0x316
[ 0.000000] [<c1a690a8>] i386_start_kernel+0xa8/0xb0



commit 797390d8554b1e07aabea37d0140933b0412dba0
Author: Tejun Heo <[email protected]>
Date: Mon May 2 14:18:52 2011 +0200

x86-32, NUMA: use sparse_memory_present_with_active_regions()

Instead of calling memory_present() for each region from NUMA init,
call sparse_memory_present_with_active_regions() from paging_init()
similarly to x86-64.

For flat and numaq, this results in exactly the same memory_present()
calls. For srat, if there are multiple memory chunks for a node,
after this change, memory_present() will be called separately for each
chunk instead of being called once to encompass the whole range, which
doesn't cause any harm and actually is the better behavior.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>


##
##################################################################
# Email : [email protected] GnuPG-Key : 0xA6AB055D #
# Fingerprint: 17C4 5DB2 7C4C C1C7 1452 8148 F139 7C09 A6AB 055D #
##################################################################
# Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach #
# General Managers: Alberto Bozzoi #
# Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen #
# HRB Nr. 43632 #
##################################################################


Attachments:
signature.asc (198.00 B)

2011-06-26 10:22:46

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hello,

On Tue, Jun 21, 2011 at 05:41:31PM +0200, Conny Seidel wrote:
> the commit 797390d8554b1e07aabea37d0140933b0412dba0 breaks 32bit on AMD
> with native NUMA and fakeNUMA.
>
> Native NUMA still boots when the kernel parameter numa=off is added to
> the cmdline.

I've been looking at it without much success yet. Can you please
attach full kernel boot log and .config?

Thanks.

--
tejun

2011-06-29 09:44:58

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

(cc'ing x86 and lkml. Please keep them cc'd on x86 related issues).

Hello,

On Tue, Jun 28, 2011 at 07:46:14PM +0200, Hans Rosenfeld wrote:
> We found another related but different panic on a 4-socket 8-node system,
> caused by this commit:
>
> commit 2706a0bf7b02693ed88752df877f10c2206292ff
> Author: Tejun Heo <[email protected]>
> Date: Mon May 2 17:24:48 2011 +0200
>
> x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too
>
> Now that NUMA init path is unified, amdtopology can be enabled on
> 32bit. Make amdtopology.c safe on 32bit by explicitly using u64 and
> drop X86_64 dependency from Kconfig.
>
> Inclusion of bootmem.h is added for max_pfn declaration.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
>
>
> The fix for the other panic does not fix this one.
> Full bootlog and config are attached.

Hmmm, interesting.

> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000000 - 0000000000087800 (usable)
> [ 0.000000] BIOS-e820: 0000000000087800 - 00000000000a0000 (reserved)
> [ 0.000000] BIOS-e820: 00000000000cc000 - 0000000000100000 (reserved)
> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000c7e70000 (usable)
> [ 0.000000] BIOS-e820: 00000000c7e70000 - 00000000c7e8c000 (ACPI data)
> [ 0.000000] BIOS-e820: 00000000c7e8c000 - 00000000c7e8e000 (ACPI NVS)
> [ 0.000000] BIOS-e820: 00000000c7e8e000 - 00000000c8000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000001838000000 (usable)

Okay, a fairly large machine. Memory goes over the PAE limit.

> [ 0.000000] Scanning NUMA topology in Northbridge 24
> [ 0.000000] Number of physical nodes 8
> [ 0.000000] Node 0 MemBase 0000000000000000 Limit 0000000238000000
> [ 0.000000] Node 1 MemBase 0000000238000000 Limit 0000000638000000
> [ 0.000000] Node 2 MemBase 0000000638000000 Limit 0000000838000000
> [ 0.000000] Node 3 MemBase 0000000838000000 Limit 0000000c38000000
> [ 0.000000] Node 4 MemBase 0000000c38000000 Limit 0000000e38000000
> [ 0.000000] Node 5 MemBase 0000000e38000000 Limit 0000001000000000
> [ 0.000000] Node 6 bogus settings 1238000000-1000000000.
> [ 0.000000] Node 7 bogus settings 1438000000-1000000000.

The amdtopology code behaved correctly. It trimmed node 5, which spans
the PAE limit, and squashed the nodes above that.
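
To make the trimming concrete, here is a minimal userspace sketch, assuming a 36-bit PAE ceiling (64GiB). It is not the amdtopology code: struct node_range and clamp_node() are made-up names for illustration, and the log only shows limits after clamping, so any pre-clamp limit fed into it is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Highest physical address reachable with 36-bit PAE: 64GiB. */
#define PAE_LIMIT (1ULL << 36)

/* Hypothetical stand-in for one node's MemBase/Limit pair. */
struct node_range {
	uint64_t base;
	uint64_t limit;
};

/*
 * Clamp a node's limit to the PAE ceiling. Returns true if the node
 * still covers some memory afterwards, false if it became "bogus"
 * (base at or above limit).
 */
static bool clamp_node(struct node_range *n)
{
	assert(n != NULL);
	if (n->limit > PAE_LIMIT)
		n->limit = PAE_LIMIT;
	return n->base < n->limit;
}
```

Node 5 starts at 0xe38000000, below the ceiling, so it survives with its limit cut to 0x1000000000. Node 6 starts at 0x1238000000, above the ceiling, so after clamping its base exceeds its limit, which is consistent with the "bogus settings 1238000000-1000000000" line.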

> [ 0.000000] BUG: Int 6: CR2 (null)
> [ 0.000000] EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
> [ 0.000000] EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
> [ 0.000000] err (null) EIP c16209aa CS 00000060 flg 00010002
> [ 0.000000] Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
> [ 0.000000] (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
> [ 0.000000] f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
> [ 0.000000] Call Trace:
> [ 0.000000] [<c136b1e5>] ? early_fault+0x2e/0x2e
> [ 0.000000] [<c16209aa>] ? mminit_verify_page_links+0x12/0x42
> [ 0.000000] [<c1620613>] ? memmap_init_zone+0xaf/0x10c
> [ 0.000000] [<c1620929>] ? free_area_init_node+0x2b9/0x2e3
> [ 0.000000] [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
> [ 0.000000] [<c1601d80>] ? paging_init+0x112/0x118
> [ 0.000000] [<c15f578d>] ? setup_arch+0x791/0x82f
> [ 0.000000] [<c15f43d9>] ? start_kernel+0x6a/0x257

But it later tripped in mminit_verify_page_links(). Maybe
page_to_nid() doesn't match?

Hmmm... I can't see how it would have worked before. amdtopology used
ulong for @end, which would simply have been zero. Maybe NUMA config
failed and it booted as flatmem instead? Can you please post the boot
log from before the patch?

Also, can you please apply the following patch, reproduce the boot
failure and post the log? Thank you.


diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4e0e265..cb230bf 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -124,6 +124,12 @@ void __init mminit_verify_pageflags_layout(void)
 void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
 			unsigned long nid, unsigned long pfn)
 {
+	if (page_to_nid(page) != nid || page_zonenum(page) != zone ||
+	    page_to_pfn(page) != pfn)
+		printk(KERN_CRIT "mminit_verify_page_links: nid=%lu/%lu zone=%d/%d pfn=0x%lx/0x%lx\n",
+		       page_to_nid(page), nid, page_zonenum(page), zone,
+		       page_to_pfn(page), pfn);
+
 	BUG_ON(page_to_nid(page) != nid);
 	BUG_ON(page_zonenum(page) != zone);
 	BUG_ON(page_to_pfn(page) != pfn);

2011-06-29 10:51:08

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Wed, Jun 29, 2011 at 11:44:51AM +0200, Tejun Heo wrote:
> Hmmm... I can't see how it would have worked before. amdtopology used
> ulong for @end, which would simply have been zero. Maybe NUMA config
> failed and it booted as flatmem instead? Can you please post the boot
> log from before the patch?

Ooh, please forget about this one. I got confused and thought that
amdtopology was for 32bit only and then converted to apply to both 32
and 64. It was the other way around, so the machine didn't use to get
NUMA configuration at all.

--
tejun

2011-06-29 12:34:16

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hello, again.

I think I found what went wrong.

> > [ 0.000000] Node 0 MemBase 0000000000000000 Limit 0000000238000000
> > [ 0.000000] Node 1 MemBase 0000000238000000 Limit 0000000638000000
> > [ 0.000000] Node 2 MemBase 0000000638000000 Limit 0000000838000000
> > [ 0.000000] Node 3 MemBase 0000000838000000 Limit 0000000c38000000
> > [ 0.000000] Node 4 MemBase 0000000c38000000 Limit 0000000e38000000
> > [ 0.000000] Node 5 MemBase 0000000e38000000 Limit 0000001000000000
> > [ 0.000000] Node 6 bogus settings 1238000000-1000000000.
> > [ 0.000000] Node 7 bogus settings 1438000000-1000000000.

NUMA nodes are aligned to 27 bits (128MiB). SPARSEMEM is enabled, but on
x86-32 w/ PAE SECTION_SIZE_BITS is 29 (512MiB), which means that pages
living near a node boundary will have the wrong nid assigned to them.
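
The mismatch can be checked with a few lines of userspace C. This is only a sketch under the numbers quoted above (SECTION_SIZE_BITS of 29); addr_to_section() and boundary_splits_section() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE_BITS 29	/* x86-32 w/ PAE, as stated above: 512MiB */

/* Index of the SPARSEMEM section containing physical address addr. */
static uint64_t addr_to_section(uint64_t addr)
{
	return addr >> SECTION_SIZE_BITS;
}

/*
 * Non-zero if the last byte below a node boundary and the first byte
 * at the boundary land in the same section, i.e. one section would
 * have to hold memory from two different nodes.
 */
static int boundary_splits_section(uint64_t boundary)
{
	assert(boundary > 0);
	return addr_to_section(boundary - 1) == addr_to_section(boundary);
}
```

For the node 0/1 boundary at 0x238000000 both sides fall into section 17 (0x220000000-0x240000000), so that section contains pages of node 0 and node 1. A boundary at a 512MiB multiple such as 0x240000000 would not split a section.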

> > [ 0.000000] BUG: Int 6: CR2 (null)
> > [ 0.000000] EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
> > [ 0.000000] EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
> > [ 0.000000] err (null) EIP c16209aa CS 00000060 flg 00010002
> > [ 0.000000] Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
> > [ 0.000000] (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
> > [ 0.000000] f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
> > [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
> > [ 0.000000] Call Trace:
> > [ 0.000000] [<c136b1e5>] ? early_fault+0x2e/0x2e
> > [ 0.000000] [<c16209aa>] ? mminit_verify_page_links+0x12/0x42

So, mminit_verify_page_links() detects it while the last 512MiB
highmem chunk of node 0 is being initialized and freaks out.

We definitely need a safeguard to check NUMA node alignment and
disable NUMA if it requires finer granularity than supported by the
memory model. If you use DISCONTIGMEM, which has 64MiB granularity,
instead, it works, right?
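
As a rough illustration of where the 64MiB figure comes from: the constants below are the physnode_map ones from arch/x86/include/asm/mmzone_32.h, while element_bytes() and boundary_fits() are hypothetical helpers written only for this sketch.

```c
#include <assert.h>

#define PAGE_SHIFT		12
#define MAX_NR_PAGES		16777216UL	/* 64GiB / 4096 bytes per page */
#define MAX_ELEMENTS		1024
#define PAGES_PER_ELEMENT	(MAX_NR_PAGES / MAX_ELEMENTS)

/* Bytes covered by one physnode_map entry. */
static unsigned long long element_bytes(void)
{
	return (unsigned long long)PAGES_PER_ELEMENT << PAGE_SHIFT;
}

/*
 * Hypothetical check: does a node boundary, given as a pfn, sit on a
 * multiple of the memory model's unit (also given in pages)?
 */
static int boundary_fits(unsigned long boundary_pfn,
			 unsigned long pages_per_unit)
{
	assert(pages_per_unit != 0);
	return boundary_pfn % pages_per_unit == 0;
}
```

One entry covers 16384 pages, i.e. 64MiB. The node 0/1 boundary at pfn 0x238000 is a multiple of 16384 pages but not of the 131072 pages (512MiB) a SPARSEMEM section spans here, so it fits the DISCONTIGMEM map and not the SPARSEMEM one.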

--
tejun

2011-06-29 12:55:24

by Hans Rosenfeld

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Wed, Jun 29, 2011 at 08:34:09AM -0400, Tejun Heo wrote:
> So, mminit_verify_page_links() detects it while the last 512MiB
> highmem chunk of node 0 is being initialized and freaks out.
>
> We definitely need a safeguard to check NUMA node alignment and
> disable NUMA if it requires finer granularity than supported by the
> memory model. If you use DISCONTIGMEM, which has 64MiB granularity,
> instead, it works, right?

I had DISCONTIGMEM enabled in the kernel config; it does not work.

Hans


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

2011-06-29 13:03:56

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Wed, Jun 29, 2011 at 02:55:08PM +0200, Hans Rosenfeld wrote:
> On Wed, Jun 29, 2011 at 08:34:09AM -0400, Tejun Heo wrote:
> > So, mminit_verify_page_links() detects it while the last 512MiB
> > highmem chunk of node 0 is being initialized and freaks out.
> >
> > We definitely need a safeguard to check NUMA node alignment and
> > disable NUMA if it requires finer granularity than supported by the
> > memory model. If you use DISCONTIGMEM, which has 64MiB granularity,
> > instead, it works, right?
>
> I had DISCONTIGMEM enabled in the kernel config; it does not work.

Hmmm? The following is the relevant part from your .config.

CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y

And it selects SPARSEMEM via SPARSEMEM_MANUAL. You need to choose
DISCONTIGMEM_MANUAL in "Processor type and features" -> "Memory
Model".

Thanks.

--
tejun

2011-06-29 16:15:24

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hans, can you please apply the following patch and post the boot log
from both SPARSEMEM and DISCONTIGMEM kernels? On SPARSEMEM, it should
reject NUMA config and boot w/ flatmem.

Thanks.

diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
index 224e8c5..0b6c75b 100644
--- a/arch/x86/include/asm/mmzone_32.h
+++ b/arch/x86/include/asm/mmzone_32.h
@@ -34,15 +34,15 @@ static inline void resume_map_numa_kva(pgd_t *pgd) {}
  * 64Gb / 4096bytes/page = 16777216 pages
  */
 #define MAX_NR_PAGES 16777216
-#define MAX_ELEMENTS 1024
-#define PAGES_PER_ELEMENT (MAX_NR_PAGES/MAX_ELEMENTS)
+#define MAX_SECTIONS 1024
+#define PAGES_PER_SECTION (MAX_NR_PAGES/MAX_SECTIONS)

 extern s8 physnode_map[];

 static inline int pfn_to_nid(unsigned long pfn)
 {
 #ifdef CONFIG_NUMA
-	return((int) physnode_map[(pfn) / PAGES_PER_ELEMENT]);
+	return((int) physnode_map[(pfn) / PAGES_PER_SECTIONS]);
 #else
 	return 0;
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f5510d8..9d643e2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -496,6 +496,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)

 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
+	unsigned long pfn_align;
 	int i, nid;

 	/* Account for nodes with cpus and no memory */
@@ -511,6 +512,15 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)

 	/* for out of order entries */
 	sort_node_map();
+
+	pfn_align = node_map_pfn_alignment();
+	if (pfn_align && pfn_align < PAGES_PER_SECTION) {
+		printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
+		       (u64)pfn_align << PAGE_SHIFT >> 20,
+		       (u64)PAGES_PER_SECTION << PAGE_SHIFT >> 20);
+		return -EINVAL;
+	}
+
 	if (!numa_meminfo_cover_memory(mi))
 		return -EINVAL;

diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 849a975..3adebe7 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -41,7 +41,7 @@
  * physnode_map[16-31] = 1;
  * physnode_map[32- ] = -1;
  */
-s8 physnode_map[MAX_ELEMENTS] __read_mostly = { [0 ... (MAX_ELEMENTS - 1)] = -1};
+s8 physnode_map[MAX_SECTIONS] __read_mostly = { [0 ... (MAX_SECTIONS - 1)] = -1};
 EXPORT_SYMBOL(physnode_map);

 void memory_present(int nid, unsigned long start, unsigned long end)
@@ -52,8 +52,8 @@ void memory_present(int nid, unsigned long start, unsigned long end)
 			nid, start, end);
 	printk(KERN_DEBUG " Setting physnode_map array to node %d for pfns:\n", nid);
 	printk(KERN_DEBUG " ");
-	for (pfn = start; pfn < end; pfn += PAGES_PER_ELEMENT) {
-		physnode_map[pfn / PAGES_PER_ELEMENT] = nid;
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+		physnode_map[pfn / PAGES_PER_SECTION] = nid;
 		printk(KERN_CONT "%lx ", pfn);
 	}
 	printk(KERN_CONT "\n");
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..c70a326 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1313,6 +1313,7 @@ extern void remove_active_range(unsigned int nid, unsigned long start_pfn,
 					unsigned long end_pfn);
 extern void remove_all_active_ranges(void);
 void sort_node_map(void);
+unsigned long node_map_pfn_alignment(void);
 unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
 						unsigned long end_pfn);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..2ae7dbc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4585,6 +4585,34 @@ void __init sort_node_map(void)
 			cmp_node_active_region, NULL);
 }

+unsigned long __init node_map_pfn_alignment(void)
+{
+	unsigned long accl_mask = 0, last_end = 0;
+	int last_nid = -1;
+	int i;
+
+	for_each_active_range_index_in_nid(i, MAX_NUMNODES) {
+		int nid = early_node_map[i].nid;
+		unsigned long start = early_node_map[i].start_pfn;
+		unsigned long end = early_node_map[i].end_pfn;
+		unsigned long mask;
+
+		if (!start || last_nid < 0 || last_nid == nid) {
+			last_nid = nid;
+			last_end = end;
+			continue;
+		}
+
+		mask = ~((1 << __ffs(start)) - 1);
+		while (mask && last_end <= (start & (mask << 1)))
+			mask <<= 1;
+
+		accl_mask |= mask;
+	}
+
+	return ~accl_mask + 1;
+}
+
 /* Find the lowest pfn for a node */
 static unsigned long __init find_min_pfn_for_node(int nid)
 {
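
For anyone who wants to try the alignment logic outside the kernel, the following is a userspace port of node_map_pfn_alignment() as posted, offered as a sketch only. Assumptions: struct pfn_range and the explicit array stand in for early_node_map and must already be sorted, __ffs() is replaced by the GCC/Clang builtin __builtin_ctzl(), 1UL is used for the shift, and numa_config_ok() is a made-up helper that mirrors the check added to numa_register_memblks().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one early_node_map[] entry. */
struct pfn_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Returns the node alignment in pages, or 0 if no boundary between
 * two different nodes was seen. map[] must be sorted by start_pfn.
 */
static unsigned long pfn_alignment(const struct pfn_range *map, size_t n)
{
	unsigned long accl_mask = 0, last_end = 0;
	int last_nid = -1;
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long start = map[i].start_pfn;
		unsigned long mask;

		/* Same node as before (or first entry): just remember it. */
		if (!start || last_nid < 0 || last_nid == map[i].nid) {
			last_nid = map[i].nid;
			last_end = map[i].end_pfn;
			continue;
		}

		/* Finest mask that still pin-points the start pfn ... */
		mask = ~((1UL << __builtin_ctzl(start)) - 1);
		/* ... coarsened while it still separates the two nodes. */
		while (mask && last_end <= (start & (mask << 1)))
			mask <<= 1;

		/* Accumulate the masks of all node boundaries. */
		accl_mask |= mask;
	}

	/* Convert the accumulated mask to a number of pages. */
	return ~accl_mask + 1;
}

/* Mirrors the new check: reject if alignment is finer than the unit. */
static int numa_config_ok(unsigned long pfn_align, unsigned long pages_per_unit)
{
	return !(pfn_align && pfn_align < pages_per_unit);
}
```

Fed with nodes 0-2 from the log as single ranges (a simplification, since the real map has e820 holes), this returns 0x8000 pages, i.e. 128MiB. That is below a 512MiB SPARSEMEM section (0x20000 pages), so the config is rejected, and above the 64MiB physnode_map unit (0x4000 pages), so it is accepted there.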

2011-06-30 13:13:58

by Hans Rosenfeld

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Wed, Jun 29, 2011 at 12:15:17PM -0400, Tejun Heo wrote:
> Hans, can you please apply the following patch and post the boot log
> from both SPARSEMEM and DISCONTIGMEM kernels? On SPARSEMEM, it should
> reject NUMA config and boot w/ flatmem.

Bootlogs are attached. Now DISCONTIGMEM panics.

> diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
> index 224e8c5..0b6c75b 100644
> --- a/arch/x86/include/asm/mmzone_32.h
> +++ b/arch/x86/include/asm/mmzone_32.h
> @@ -34,15 +34,15 @@ static inline void resume_map_numa_kva(pgd_t *pgd) {}
> * 64Gb / 4096bytes/page = 16777216 pages
> */
> #define MAX_NR_PAGES 16777216
> -#define MAX_ELEMENTS 1024
> -#define PAGES_PER_ELEMENT (MAX_NR_PAGES/MAX_ELEMENTS)
> +#define MAX_SECTIONS 1024
> +#define PAGES_PER_SECTION (MAX_NR_PAGES/MAX_SECTIONS)
>
> extern s8 physnode_map[];
>
> static inline int pfn_to_nid(unsigned long pfn)
> {
> #ifdef CONFIG_NUMA
> - return((int) physnode_map[(pfn) / PAGES_PER_ELEMENT]);
> + return((int) physnode_map[(pfn) / PAGES_PER_SECTIONS]);

This probably should be PAGES_PER_SECTION.

> #else
> return 0;
> #endif

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


Attachments:
bootlog_discontigmem (19.00 kB)
bootlog_sparsemem (6.60 kB)

2011-06-30 15:56:06

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hello,

On Thu, Jun 30, 2011 at 03:13:38PM +0200, Hans Rosenfeld wrote:
> On Wed, Jun 29, 2011 at 12:15:17PM -0400, Tejun Heo wrote:
> > Hans, can you please apply the following patch and post the boot log
> > from both SPARSEMEM and DISCONTIGMEM kernels? On SPARSEMEM, it should
> > reject NUMA config and boot w/ flatmem.
>
> Bootlogs are attached. Now DISCONTIGMEM panics.
...
> [ 0.000000] Linux version 2.6.39-rc5-00181-g2706a0b-dirty (root@worms) (gcc version 4.5.2 (Gentoo 4.5.2 p1.1, pie-0.4.5) ) #26 SMP Thu Jun 30 14:39:05 CEST 2011

Hmmm... it looks like the kernel is crashing from the other bug in
this thread. Can you please apply both patches on top of 3.0-rc5 and
re-test?

Thank you.

--
tejun

2011-06-30 16:32:51

by Hans Rosenfeld

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Thu, Jun 30, 2011 at 11:55:57AM -0400, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 30, 2011 at 03:13:38PM +0200, Hans Rosenfeld wrote:
> > On Wed, Jun 29, 2011 at 12:15:17PM -0400, Tejun Heo wrote:
> > > Hans, can you please apply the following patch and post the boot log
> > > from both SPARSEMEM and DISCONTIGMEM kernels? On SPARSEMEM, it should
> > > reject NUMA config and boot w/ flatmem.
> >
> > Bootlogs are attached. Now DISCONTIGMEM panics.
> ...
> > [ 0.000000] Linux version 2.6.39-rc5-00181-g2706a0b-dirty (root@worms) (gcc version 4.5.2 (Gentoo 4.5.2 p1.1, pie-0.4.5) ) #26 SMP Thu Jun 30 14:39:05 CEST 2011
>
> Hmmm... it looks like the kernel is crashing from the other bug in
> this thread. Can you please apply both patches on top of 3.0-rc5 and
> re-test?

Oh, that's why it looked so familiar :)

I wasn't able to reproduce this panic on this machine earlier without
DISCONTIGMEM.

It works now with both patches, bootlog is attached.


Hans


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


Attachments:
bootlog_discontigmem (17.19 kB)

2011-06-30 16:42:21

by Tejun Heo

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

Hello,

On Thu, Jun 30, 2011 at 06:32:28PM +0200, Hans Rosenfeld wrote:
> On Thu, Jun 30, 2011 at 11:55:57AM -0400, Tejun Heo wrote:
> > Hmmm... it looks like the kernel is crashing from the other bug in
> > this thread. Can you please apply both patches on top of 3.0-rc5 and
> > re-test?
>
> Oh, that's why it looked so familiar :)
>
> I wasn't able to reproduce this panic on this machine earlier without
> DISCONTIGMEM.
>
> It works now with both patches, bootlog is attached.

Can you please attach boot log w/ SPARSEMEM? Let's see whether NUMA
config is being rejected correctly.

Thanks.

--
tejun

2011-06-30 17:05:10

by Hans Rosenfeld

Subject: Re: 32bit NUMA and fakeNUMA broken for AMD CPUs

On Thu, Jun 30, 2011 at 12:42:16PM -0400, Tejun Heo wrote:
> Can you please attach boot log w/ SPARSEMEM? Let's see whether NUMA
> config is being rejected correctly.

I already sent it in the earlier mail, but here it is again. NUMA is
rejected.


Hans


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


Attachments:
bootlog_sparsemem (6.60 kB)