The setup_node_data() function allocates a pg_data_t object, inserts it
into the node_data[] array and initializes the following fields:
node_id, node_start_pfn and node_spanned_pages.
However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
different values. This means that the initialization done by
setup_node_data() is not used.
This causes a small glitch when running Linux as a hyperv numa guest:
[ 0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[ 0.000000] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[ 0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[ 0.000000] SRAT: PXM 1 -> APIC 0x03 -> Node 1
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
[ 0.000000] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] NODE_DATA [mem 0x7ffec000-0x7ffeffff]
[ 0.000000] Initmem setup node 1 [mem 0x80800000-0x1081fffff]
[ 0.000000] NODE_DATA [mem 0x1081fa000-0x1081fdfff]
[ 0.000000] crashkernel: memory value expected
[ 0.000000] [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ 0.000000] [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x1081fffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009efff]
[ 0.000000] node 0: [mem 0x00100000-0x7ffeffff]
[ 0.000000] node 1: [mem 0x80200000-0xf7ffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1081fffff]
[ 0.000000] On node 0 totalpages: 524174
[ 0.000000] DMA zone: 64 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3998 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 8128 pages used for memmap
[ 0.000000] DMA32 zone: 520176 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 524288
[ 0.000000] DMA32 zone: 7672 pages used for memmap
[ 0.000000] DMA32 zone: 491008 pages, LIFO batch:31
[ 0.000000] Normal zone: 520 pages used for memmap
[ 0.000000] Normal zone: 33280 pages, LIFO batch:7
In this dmesg, the SRAT table reports that the memory range for node 1
starts at 0x80200000. However, the line starting with "Initmem" reports
that node 1 memory range starts at 0x80800000. The "Initmem" line is
reported by setup_node_data() and is wrong, because the kernel ends up
using the range as reported in the SRAT table.
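The 0x80800000 value matches what you get by rounding 0x80200000 up to ZONE_ALIGN. A minimal sketch of that arithmetic, assuming PAGE_SHIFT is 12 and MAX_ORDER is 11 (common x86-64 values at the time; neither is stated in this thread), with round_up_to() and initmem_start() as hypothetical helper names:

```c
#include <stdint.h>

/* Assumed values, not taken from this thread. */
#define PAGE_SHIFT 12
#define MAX_ORDER  11
#define ZONE_ALIGN (1ULL << (MAX_ORDER + PAGE_SHIFT))	/* 0x800000 under these assumptions */

/* Round x up to the next multiple of align. */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align) * align;
}

/* Start address that the "Initmem" line would show for a given SRAT start. */
static uint64_t initmem_start(uint64_t srat_start)
{
	return round_up_to(srat_start, ZONE_ALIGN);
}
```

Under these assumptions ZONE_ALIGN is 8 MiB, so initmem_start(0x80200000) gives 0x80800000, which is the start address shown on the "Initmem" line for node 1.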
This commit drops all that dead code from setup_node_data(), renames it
to alloc_node_data() and adds a printk() to free_area_init_node() so
that we report a node's memory range accurately.
Here's the same dmesg section with this patch applied:
[ 0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[ 0.000000] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[ 0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[ 0.000000] SRAT: PXM 1 -> APIC 0x03 -> Node 1
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
[ 0.000000] NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x7ffec000-0x7ffeffff]
[ 0.000000] NODE_DATA(1) allocated [mem 0x1081fa000-0x1081fdfff]
[ 0.000000] crashkernel: memory value expected
[ 0.000000] [ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ 0.000000] [ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x1081fffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009efff]
[ 0.000000] node 0: [mem 0x00100000-0x7ffeffff]
[ 0.000000] node 1: [mem 0x80200000-0xf7ffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1081fffff]
[ 0.000000] Node 0 memory range 0x00001000-0x7ffeffff
[ 0.000000] On node 0 totalpages: 524174
[ 0.000000] DMA zone: 64 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3998 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 8128 pages used for memmap
[ 0.000000] DMA32 zone: 520176 pages, LIFO batch:31
[ 0.000000] Node 1 memory range 0x80200000-0x1081fffff
[ 0.000000] On node 1 totalpages: 524288
[ 0.000000] DMA32 zone: 7672 pages used for memmap
[ 0.000000] DMA32 zone: 491008 pages, LIFO batch:31
[ 0.000000] Normal zone: 520 pages used for memmap
[ 0.000000] Normal zone: 33280 pages, LIFO batch:7
This commit was tested on a two node bare-metal NUMA machine and Linux
as a numa guest on hyperv and qemu/kvm.
PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.
Signed-off-by: Luiz Capitulino <[email protected]>
---
arch/x86/include/asm/numa.h | 1 -
arch/x86/mm/numa.c | 34 ++++++++++++++--------------------
mm/page_alloc.c | 2 ++
3 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 4064aca..01b493e 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -9,7 +9,6 @@
#ifdef CONFIG_NUMA
#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
/*
* Too small node sizes may confuse the VM badly. Usually they
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a32b706..d221374 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -185,8 +185,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
return numa_add_memblk_to(nid, start, end, &numa_meminfo);
}
-/* Initialize NODE_DATA for a node on the local memory */
-static void __init setup_node_data(int nid, u64 start, u64 end)
+/* Allocate NODE_DATA for a node on the local memory */
+static void __init alloc_node_data(int nid)
{
const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
u64 nd_pa;
@@ -194,18 +194,6 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
int tnid;
/*
- * Don't confuse VM with a node that doesn't have the
- * minimum amount of memory:
- */
- if (end && (end - start) < NODE_MIN_SIZE)
- return;
-
- start = roundup(start, ZONE_ALIGN);
-
- printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
- nid, start, end - 1);
-
- /*
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
@@ -222,7 +210,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
nd = __va(nd_pa);
/* report and initialize */
- printk(KERN_INFO " NODE_DATA [mem %#010Lx-%#010Lx]\n",
+ printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid,
nd_pa, nd_pa + nd_size - 1);
tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
if (tnid != nid)
@@ -230,9 +218,6 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
node_data[nid] = nd;
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
- NODE_DATA(nid)->node_id = nid;
- NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
- NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
node_set_online(nid);
}
@@ -523,8 +508,17 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
end = max(mi->blk[i].end, end);
}
- if (start < end)
- setup_node_data(nid, start, end);
+ if (start >= end)
+ continue;
+
+ /*
+ * Don't confuse VM with a node that doesn't have the
+ * minimum amount of memory:
+ */
+ if (end && (end - start) < NODE_MIN_SIZE)
+ continue;
+
+ alloc_node_data(nid);
}
/* Dump memblock with node info and return. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..e57b7d3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4956,6 +4956,8 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
pgdat->node_start_pfn = node_start_pfn;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+ printk(KERN_INFO "Node %d memory range %#010Lx-%#010Lx\n", nid,
+ (u64) start_pfn << PAGE_SHIFT, (u64) (end_pfn << PAGE_SHIFT) - 1);
#endif
calculate_node_totalpages(pgdat, start_pfn, end_pfn,
zones_size, zholes_size);
--
1.9.3
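The printk() added to free_area_init_node() turns the PFN range from get_pfn_range_for_nid() into byte addresses. The same conversion as a standalone sketch, assuming 4 KiB pages and a nonzero end_pfn; struct mem_range and pfn_range_to_bytes() are hypothetical names, and this sketch widens both PFNs before shifting:

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed: 4 KiB pages */

struct mem_range {
	uint64_t start;	/* first byte of the range */
	uint64_t end;	/* last byte of the range, inclusive */
};

/*
 * Convert a [start_pfn, end_pfn) page-frame range into an inclusive
 * byte range. The cast to a 64-bit type happens before the shift, so
 * the result is not truncated where unsigned long is 32 bits wide.
 */
static struct mem_range pfn_range_to_bytes(unsigned long start_pfn,
					   unsigned long end_pfn)
{
	struct mem_range r;

	r.start = (uint64_t)start_pfn << PAGE_SHIFT;
	r.end = ((uint64_t)end_pfn << PAGE_SHIFT) - 1;
	return r;
}
```

With these assumptions, PFNs 0x80200 and 0x108200 correspond to the range 0x80200000-0x1081fffff reported for node 1.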
On 06/19/2014 10:20 PM, Luiz Capitulino wrote:
> @@ -523,8 +508,17 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> end = max(mi->blk[i].end, end);
> }
>
> - if (start < end)
> - setup_node_data(nid, start, end);
> + if (start >= end)
> + continue;
> +
> + /*
> + * Don't confuse VM with a node that doesn't have the
> + * minimum amount of memory:
> + */
> + if (end && (end - start) < NODE_MIN_SIZE)
> + continue;
> +
> + alloc_node_data(nid);
> }
Minor nit. If we skip a too-small node, should we remember that we
did so, and add its memory to another node, assuming it is physically
contiguous memory?
Other than that...
Acked-by: Rik van Riel <[email protected]>
- --
All rights reversed
On Thu, 26 Jun 2014 10:51:11 -0400
Rik van Riel <[email protected]> wrote:
> On 06/19/2014 10:20 PM, Luiz Capitulino wrote:
>
> > @@ -523,8 +508,17 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
> > end = max(mi->blk[i].end, end);
> > }
> >
> > - if (start < end)
> > - setup_node_data(nid, start, end);
> > + if (start >= end)
> > + continue;
> > +
> > + /*
> > + * Don't confuse VM with a node that doesn't have the
> > + * minimum amount of memory:
> > + */
> > + if (end && (end - start) < NODE_MIN_SIZE)
> > + continue;
> > +
> > + alloc_node_data(nid);
> > }
>
> Minor nit. If we skip a too-small node, should we remember that we
> did so, and add its memory to another node, assuming it is physically
> contiguous memory?
Interesting point. Honest question, please disregard it if it doesn't
make sense: won't this affect automatic NUMA performance? The kernel
won't know that the extra memory actually belongs to another node, so
that memory will be at a different distance from the node that's
making use of it.
If my thinking is wrong, or if you still believe this is a good feature,
I can work on it in a separate patch, as this check is not being introduced
by this patch. Although I also wonder how many NUMA machines have such small
nodes...
> Other than that...
>
> Acked-by: Rik van Riel <[email protected]>
Thanks!
On 06/26/2014 11:05 AM, Luiz Capitulino wrote:
> On Thu, 26 Jun 2014 10:51:11 -0400
> Rik van Riel <[email protected]> wrote:
>
>> On 06/19/2014 10:20 PM, Luiz Capitulino wrote:
>>
>>> @@ -523,8 +508,17 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
>>> end = max(mi->blk[i].end, end);
>>> }
>>>
>>> - if (start < end)
>>> - setup_node_data(nid, start, end);
>>> + if (start >= end)
>>> + continue;
>>> +
>>> + /*
>>> + * Don't confuse VM with a node that doesn't have the
>>> + * minimum amount of memory:
>>> + */
>>> + if (end && (end - start) < NODE_MIN_SIZE)
>>> + continue;
>>> +
>>> + alloc_node_data(nid);
>>> }
>>
>> Minor nit. If we skip a too-small node, should we remember that we
>> did so, and add its memory to another node, assuming it is physically
>> contiguous memory?
>
> Interesting point. Honest question, please disregard it if it doesn't
> make sense: won't this affect automatic NUMA performance? The kernel
> won't know that the extra memory actually belongs to another node, so
> that memory will be at a different distance from the node that's
> making use of it.
If there is so little memory the kernel is unwilling to turn
it into its own zone or node, it should not be enough to
affect placement policy at all.
Whether or not we use that last little bit of memory is probably
not very important, either :)
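For reference, the per-node decision under discussion can be modelled roughly as below. This is a simplified sketch, not the kernel code: struct memblk and node_gets_node_data() are hypothetical names, and NODE_MIN_SIZE is assumed to be 4 MiB, a value that is not shown in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_MIN_SIZE (4ULL * 1024 * 1024)	/* assumed value */

/* Hypothetical stand-in for one memory block entry with its node id. */
struct memblk {
	int nid;
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/*
 * Return true if node nid would get a NODE_DATA allocation: its blocks
 * must span a non-empty range of at least NODE_MIN_SIZE bytes.
 */
static bool node_gets_node_data(const struct memblk *blk, size_t n, int nid)
{
	uint64_t start = UINT64_MAX, end = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (blk[i].nid != nid)
			continue;
		if (blk[i].start < start)
			start = blk[i].start;
		if (blk[i].end > end)
			end = blk[i].end;
	}

	if (start >= end)
		return false;	/* node has no memory blocks */
	if (end - start < NODE_MIN_SIZE)
		return false;	/* too small: skipped, memory left unused */
	return true;
}
```

In this model a skipped node's memory is simply not attached to any node, which is the behaviour the nit above asks about.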
On Thu, 19 Jun 2014, Luiz Capitulino wrote:
> This commit drops all that dead code from setup_node_data(), renames it
> to alloc_node_data() and adds a printk() to free_area_init_node() so
> that we report a node's memory range accurately.
>
> Signed-off-by: Luiz Capitulino <[email protected]>
With this patch, the dmesg changes break one of my scripts that we use to
determine the start and end address of a node (doubly bad because there's
no sysfs interface to determine this otherwise and we have to do this at
boot to acquire the system topology).
Specifically, the removal of the
"Initmem setup node X [mem 0xstart-0xend]"
lines that are replaced when each node is onlined to
"Node 0 memory range 0xstart-0xend"
And if I just noticed this breakage when booting the latest -mm kernel,
I'm assuming I'm not the only person who is going to run into it. Is it
possible to not change the dmesg output?
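A script of the kind described here could be made to accept both message formats. The following is only a sketch of such a parser, not the script referred to above; struct node_range and parse_node_range() are hypothetical names, and the two patterns are taken from the dmesg excerpts in this thread:

```c
#include <stdio.h>
#include <string.h>

struct node_range {
	int nid;
	unsigned long long start;	/* first byte, as printed */
	unsigned long long end;		/* last byte, as printed */
};

/*
 * Try to extract a node memory range from one dmesg line.
 * Returns 1 and fills *out on success, 0 if the line matches neither
 * the old "Initmem setup node" format nor the new "Node N memory range"
 * format. *out is left untouched on failure.
 */
static int parse_node_range(const char *line, struct node_range *out)
{
	struct node_range tmp;
	const char *p;

	p = strstr(line, "Initmem setup node ");
	if (p && sscanf(p, "Initmem setup node %d [mem %llx-%llx]",
			&tmp.nid, &tmp.start, &tmp.end) == 3) {
		*out = tmp;
		return 1;
	}

	p = strstr(line, "Node ");
	if (p && sscanf(p, "Node %d memory range %llx-%llx",
			&tmp.nid, &tmp.start, &tmp.end) == 3) {
		*out = tmp;
		return 1;
	}

	return 0;
}
```

Searching for the message text with strstr() skips the timestamp prefix, and lines such as "SRAT: Node 1 PXM 1 [mem ...]" are rejected because the remainder of the pattern does not match.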
On Mon, 30 Jun 2014 16:42:39 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> On Thu, 19 Jun 2014, Luiz Capitulino wrote:
>
> > This commit drops all that dead code from setup_node_data(), renames it
> > to alloc_node_data() and adds a printk() to free_area_init_node() so
> > that we report a node's memory range accurately.
> >
> > Signed-off-by: Luiz Capitulino <[email protected]>
>
> With this patch, the dmesg changes break one of my scripts that we use to
> determine the start and end address of a node (doubly bad because there's
> no sysfs interface to determine this otherwise and we have to do this at
> boot to acquire the system topology).
>
> Specifically, the removal of the
>
> "Initmem setup node X [mem 0xstart-0xend]"
>
> lines that are replaced when each node is onlined to
>
> "Node 0 memory range 0xstart-0xend"
>
> And if I just noticed this breakage when booting the latest -mm kernel,
> I'm assuming I'm not the only person who is going to run into it. Is it
> possible to not change the dmesg output?
Sure. I can add back the original text. The only detail is that with this
patch that line is now printed a little bit later during boot and the
NODE_DATA lines also changed. Are you OK with that?
What are the guidelines on changing what's printed in dmesg?
On Wed, 2 Jul 2014, Luiz Capitulino wrote:
> > With this patch, the dmesg changes break one of my scripts that we use to
> > determine the start and end address of a node (doubly bad because there's
> > no sysfs interface to determine this otherwise and we have to do this at
> > boot to acquire the system topology).
> >
> > Specifically, the removal of the
> >
> > "Initmem setup node X [mem 0xstart-0xend]"
> >
> > lines that are replaced when each node is onlined to
> >
> > "Node 0 memory range 0xstart-0xend"
> >
> > And if I just noticed this breakage when booting the latest -mm kernel,
> > I'm assuming I'm not the only person who is going to run into it. Is it
> > possible to not change the dmesg output?
>
> Sure. I can add back the original text. The only detail is that with this
> patch that line is now printed a little bit later during boot and the
> NODE_DATA lines also changed. Are you OK with that?
>
Yes, please. I think it should be incremental on your patch since it's
already in -mm with " fix" appended so the title of the patch would be
"x86: numa: setup_node_data(): drop dead code and rename function fix" and
then Andrew can fold it into the original when sending it to the x86
maintainers.
> What are the guidelines on changing what's printed in dmesg?
>
That's the scary part, there doesn't seem to be any. It's especially
crucial for things that only get printed once and aren't available
anywhere else at runtime; there was talk of adding a sysfs interface that
defines the start and end addresses of nodes but it's complicated because
nodes can overlap each other. If that had been available years ago then I
don't think anybody would raise their hand about this issue.
These lines underwent a smaller change a few years ago for
s/Bootmem/Initmem/. I don't even have to look at the git history to know
that because it broke our scripts back then as well. You just happened to
touch lines that I really care about, and that breaks my topology information :)
I wouldn't complain if it was just my userspace, but I have no doubt
others have parsed their dmesg in a similar way because people have
provided me with data that they retrieved by scraping the kernel log.
On Wed, 2 Jul 2014 16:20:47 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> On Wed, 2 Jul 2014, Luiz Capitulino wrote:
>
> > > With this patch, the dmesg changes break one of my scripts that we use to
> > > determine the start and end address of a node (doubly bad because there's
> > > no sysfs interface to determine this otherwise and we have to do this at
> > > boot to acquire the system topology).
> > >
> > > Specifically, the removal of the
> > >
> > > "Initmem setup node X [mem 0xstart-0xend]"
> > >
> > > lines that are replaced when each node is onlined to
> > >
> > > "Node 0 memory range 0xstart-0xend"
> > >
> > > And if I just noticed this breakage when booting the latest -mm kernel,
> > > I'm assuming I'm not the only person who is going to run into it. Is it
> > > possible to not change the dmesg output?
> >
> > Sure. I can add back the original text. The only detail is that with this
> > patch that line is now printed a little bit later during boot and the
> > NODE_DATA lines also changed. Are you OK with that?
> >
>
> Yes, please. I think it should be incremental on your patch since it's
> already in -mm with " fix" appended so the title of the patch would be
> "x86: numa: setup_node_data(): drop dead code and rename function fix" and
> then Andrew can fold it into the original when sending it to the x86
> maintainers.
>
> > What are the guidelines on changing what's printed in dmesg?
> >
>
> That's the scary part, there doesn't seem to be any. It's especially
> crucial for things that only get printed once and aren't available
> anywhere else at runtime; there was talk of adding a sysfs interface that
> defines the start and end addresses of nodes but it's complicated because
> nodes can overlap each other. If that had been available years ago then I
> don't think anybody would raise their hand about this issue.
>
> These lines underwent a smaller change a few years ago for
> s/Bootmem/Initmem/. I don't even have to look at the git history to know
> that because it broke our scripts back then as well. You just happened to
> touch lines that I really care about, and that breaks my topology information :)
> I wouldn't complain if it was just my userspace, but I have no doubt
> others have parsed their dmesg in a similar way because people have
> provided me with data that they retrieved by scraping the kernel log.
No problem. I'll send a patch shortly as you suggested.