2007-08-24 16:28:25

by mel

Subject: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

NUMA kernels currently do not boot on non-NUMA machines in some
situations. This patch addresses one such boot problem: x86 machines
running a NUMA kernel with the DISCONTIG memory model.

On 32-bit NUMA, the memmap representing struct pages on each node is allocated
from node-local memory. As only node-0 has memory in ZONE_NORMAL, the memmap
must be mapped into low memory. This is done by reserving space in the Kernel
Virtual Area (KVA) for the memmaps belonging to the other nodes: pages are
taken from the end of ZONE_NORMAL and the other nodes' memmaps are remapped
into those virtual addresses. The node boundaries are then adjusted so that
the region of pages is not used, and it is marked reserved in the bootmem
allocator.
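
For reference, the remap itself is done by remap_numa_kva() in
arch/i386/mm/discontig.c, which walks each node's reserved area in
PMD-sized steps and points the KVA at the node-local pages using
large-page entries. Roughly (paraphrased from the current code, so
treat the details as approximate):

	void __init remap_numa_kva(void)
	{
		void *vaddr;
		unsigned long pfn;
		int node;

		for_each_online_node(node) {
			/* one large-page PMD entry per PTRS_PER_PTE pages */
			for (pfn = 0; pfn < node_remap_size[node]; pfn += PTRS_PER_PTE) {
				vaddr = node_remap_start_vaddr[node] + (pfn << PAGE_SHIFT);
				set_pmd_pfn((unsigned long) vaddr,
					    node_remap_start_pfn[node] + pfn,
					    PAGE_KERNEL_LARGEPAGE); /* large-page kernel prot */
			}
		}
	}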

This reserved portion of the KVA must be PMD aligned. The problem is that,
once aligned, there may be a portion at the end of ZONE_NORMAL that is not
used for memmap, has no initialised memmap, and is not marked reserved in
the bootmem allocator. Later in the boot process these pages are freed, and
a storm of Bad page state messages results.
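
To make the numbers concrete (illustrative values only): with PAE,
PTRS_PER_PTE is 512, so one PMD entry maps 2MB. If a node happens to
end at pfn 0x3f340, then

	0x3f340 & (PTRS_PER_PTE - 1) = 0x3f340 & 0x1ff = 0x140

i.e. 320 pages (1.25MB) are trimmed from the end of the node for
alignment and, before this patch, ended up neither covered by an
initialised memmap nor reserved in bootmem.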

This patch marks the pages that are wasted due to alignment as reserved in
the bootmem allocator so they are not accidentally freed. It is worth noting
that memory from node-0 is wasted here that could otherwise have gone into
ZONE_HIGHMEM on NUMA machines. Worse, the KVA is always reserved at the
location of real memory even when there is plenty of spare virtual address
space. This is probably not worth fixing up, as SPARSEMEM will hopefully
replace DISCONTIG some time in the future.

This patch also makes sure that reserve_bootmem() is not called with a size
of zero in numa_kva_reserve(). When that happens, it usually means that a
kernel built for Summit is being booted on a normal machine. The BUG_ON()
that results is misleading, so the case is caught here.

This patch allows the following NUMA configuration to boot on normal
hardware and in qemu:

Processor type: Generic architecture (Summit, bigsmp, ES7000, default)
Memory model: DISCONTIG
High memory: 64GB
NUMA support: on

The SPARSEMEM memory model is already working in this configuration.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andy Whitcroft <[email protected]>

---
discontig.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-mm1-clean/arch/i386/mm/discontig.c linux-2.6.23-rc3-mm1-numaboot/arch/i386/mm/discontig.c
--- linux-2.6.23-rc3-mm1-clean/arch/i386/mm/discontig.c 2007-08-22 11:32:10.000000000 +0100
+++ linux-2.6.23-rc3-mm1-numaboot/arch/i386/mm/discontig.c 2007-08-24 17:00:54.000000000 +0100
@@ -198,11 +198,12 @@ void __init remap_numa_kva(void)
}
}

-static unsigned long calculate_numa_remap_pages(void)
+static unsigned long calculate_numa_remap_pages(unsigned long *wasted_pages)
{
int nid;
unsigned long size, reserve_pages = 0;
unsigned long pfn;
+ *wasted_pages = 0;

for_each_online_node(nid) {
unsigned old_end_pfn = node_end_pfn[nid];
@@ -252,6 +253,15 @@ static unsigned long calculate_numa_rema
printk("Shrinking node %d further by %ld pages for proper alignment\n",
nid, node_end_pfn[nid] & (PTRS_PER_PTE-1));
size += node_end_pfn[nid] & (PTRS_PER_PTE-1);
+
+ /*
+ * We are going to end up wasting pages past
+ * the KVA for no good reason other than how
+ * the KVA is located. This is bad.
+ */
+ if (nid == 0)
+ *wasted_pages = node_end_pfn[nid] &
+ (PTRS_PER_PTE - 1);
}

node_end_pfn[nid] -= size;
@@ -268,6 +278,7 @@ unsigned long __init setup_memory(void)
{
int nid;
unsigned long system_start_pfn, system_max_low_pfn;
+ unsigned long wasted_pages;

/*
* When mapping a NUMA machine we allocate the node_mem_map arrays
@@ -279,7 +290,7 @@ unsigned long __init setup_memory(void)
find_max_pfn();
get_memcfg_numa();

- kva_pages = calculate_numa_remap_pages();
+ kva_pages = calculate_numa_remap_pages(&wasted_pages);

/* partially used pages are not usable - thus round upwards */
system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
@@ -340,12 +351,18 @@ unsigned long __init setup_memory(void)
memset(NODE_DATA(0), 0, sizeof(struct pglist_data));
NODE_DATA(0)->bdata = &node0_bdata;
setup_bootmem_allocator();
+
+ if (wasted_pages)
+ reserve_bootmem(
+ PFN_PHYS(node_remap_start_pfn[0] + node_remap_size[0]),
+ PFN_PHYS(wasted_pages));
return max_low_pfn;
}

void __init numa_kva_reserve(void)
{
- reserve_bootmem(PFN_PHYS(kva_start_pfn),PFN_PHYS(kva_pages));
+ if (kva_pages)
+ reserve_bootmem(PFN_PHYS(kva_start_pfn), PFN_PHYS(kva_pages));
}

void __init zone_sizes_init(void)
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


2007-08-24 16:35:31

by Andi Kleen

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

> This reserved portion of the KVA must be PMD aligned.

Why do they need to be PMD aligned?

-Andi

2007-08-24 16:53:18

by Andy Whitcroft

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

Andi Kleen wrote:
>> This reserved portion of the KVA must be PMD aligned.
>
> Why do they need to be PMD aligned?

That comes from the fact that the KVA in x86 has traditionally been
mapped with huge pages where at all possible, for performance reasons.
The purpose of the remap itself has always been performance: we are
remapping node-local memory into the KVA to hold the memmap, in part to
exploit the locality of a process to its memory, and in part to
distribute the load on the NUMA memory infrastructure by "striping" the
storage. As a result it makes sense to map these remapped areas with
huge pages as well. As evidenced by the fact that this bug is only
coming to light now, it is somewhat rare for the end of a node to be
misaligned below the huge page size (2/4MB).

-apw

2007-08-24 17:07:55

by Andi Kleen

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

On Fri, Aug 24, 2007 at 05:52:31PM +0100, Andy Whitcroft wrote:
> Andi Kleen wrote:
> >> This reserved portion of the KVA must be PMD aligned.
> >
> > Why do they need to be PMD aligned?
>
> That comes from the fact that the KVA in x86 has traditionally been

Where does this KVA acronym come from? In Linux this is traditionally
called direct or linear mapping. KVA sounds foreign.

> mapped with huge pages where at all possible, for performance reasons.

It was partly a rhetorical question.

My point is that we don't make any effort to PMD align end_pfn,
so there is also no reason to PMD align any of the other boundaries.

The only reason in theory is to avoid virtual aliases with
uncached areas, but there are no uncached areas in highmem
so this shouldn't be a concern.

There might be overlap into the PCI hole though which is uncached
and needs care regarding virtual aliases, but that could be handled
by teaching change_page_attr() to handle the overlap too.

I think that would be a better fix -- do that and then
drop that PMD align requirement. Essentially you need a
end_pfn_map like x86_64 has and use that in change_page_attr().
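
A rough sketch of that direction (hypothetical and untested;
end_pfn_map here stands in for an i386 analogue of the x86_64
variable):

	/*
	 * Hypothetical sketch, not a real patch: record how far the
	 * kernel mapping extends, including the remapped memmap, the
	 * way x86_64 records it in end_pfn_map.  change_page_attr()
	 * could then split any large page in that range on demand,
	 * instead of the remap having to be PMD aligned up front.
	 */
	unsigned long end_pfn_map;	/* highest pfn with a kernel mapping */

	static inline int pfn_in_kernel_mapping(unsigned long pfn)
	{
		return pfn < end_pfn_map;
	}

change_page_attr() would consult this when deciding whether the page
being changed may share a large-page mapping that needs splitting.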

-Andi

2007-08-24 17:26:35

by mel

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

On (24/08/07 19:07), Andi Kleen didst pronounce:
> On Fri, Aug 24, 2007 at 05:52:31PM +0100, Andy Whitcroft wrote:
> > Andi Kleen wrote:
> > >> This reserved portion of the KVA must be PMD aligned.
> > >
> > > Why do they need to be PMD aligned?
> >
> > That comes from the fact that the KVA in x86 has traditionally been
>
> Where does this KVA acronym come from? In Linux this is traditionally
> called direct or linear mapping. KVA sounds foreign.
>

The KVA acronym is the one used in the x86 discontig code. The terms direct
or linear mapping are not perfectly accurate here either, because the direct
mapping is being altered so that pages that would normally be in highmem are
directly mapped for the lifetime of the system.

> > mapped with huge pages where at all possible, for performance reasons.
>
> It was partly a rhetorical question.
>
> My point is that we don't make any effort to PMD align end_pfn,
> so there is also no reason to PMD align any of the other boundaries.
>

Other than the fact that the memmap must be PMD aligned to use hugepage
entries for the memmap. It could be mapped with small pages in corner cases
but is the complexity worth it?

> The only reason in theory is to avoid virtual aliases with
> uncached areas, but there are no uncached areas in highmem
> so this shouldn't be a concern.
>
> There might be overlap into the PCI hole though which is uncached
> and needs care regarding virtual aliases, but that could be handled
> by teaching change_page_attr() to handle the overlap too.
>
> I think that would be a better fix -- do that and then
> drop that PMD align requirement. Essentially you need a
> end_pfn_map like x86_64 has and use that in change_page_attr().
>

I can't see this type of lifting being done any time soon. As SPARSEMEM works
and there is hope with the vmemmap work that DISCONTIG will finally go away,
it may not be the best investment of time.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-08-24 17:38:28

by Andi Kleen

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

> Other than the fact that the memmap must be PMD aligned to use hugepage
> entries for the memmap.

Why is that so? mem_map should be just part of lowmem anyways.

> It could be mapped with small pages in corner cases
> but is the complexity worth it?

You don't need to map it with small pages in the normal case,
the only requirement is that c_p_a() is aware of it so it can
split it if needed.

> I can't see this type of lifting being done any time soon. As SPARSEMEM works
> and there is hope with the vmemmap work that DISCONTIG will finally go away,
> it may not be the best investment of time.

It's a trivial change, probably less code than your original patch.

-Andi

2007-08-24 17:44:49

by mel

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

On (24/08/07 19:38), Andi Kleen didst pronounce:
> > Other than the fact that the memmap must be PMD aligned to use hugepage
> > entries for the memmap.
>
> Why is that so? mem_map should be just part of lowmem anyways.
>

Not in this case. memmap is allocated node local and mapped in the virtual
memory area normally occupied by the end of low memory. The objective was
to have memmap for the struct pages node-local. Hence, portions of
memmap are really in highmem.
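
As a rough picture (illustrative addresses only):

	virtual (top of lowmem)		backing physical pages
	0xf7800000 - 0xf79fffff  ->	node 1 local memory (node 1 memmap)
	0xf7a00000 - 0xf7bfffff  ->	node 2 local memory (node 2 memmap)

The virtual range is carved out of what would otherwise be the direct
mapping at the end of ZONE_NORMAL; the physical pages behind it are
node-local and would otherwise be highmem.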

> > It could be mapped with small pages in corner cases
> > but is the complexity worth it?
>
> You don't need to map it with small pages in the normal case,
> the only requirement is that c_p_a() is aware of it so it can
> split it if needed.
>
> > I can't see this type of lifting being done any time soon. As SPARSEMEM works
> > and there is hope with the vmemmap work that DISCONTIG will finally go away,
> > it may not be the best investment of time.
>
> It's a trivial change, probably less code than your original patch.
>

I'll have to take your word for it because I haven't looked closely
enough. I'll try and find time to look at it but the earliest I'll get around
to it is post kernel-summit. In the meantime, SPARSEMEM works.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-08-24 17:53:47

by Andi Kleen

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

On Fri, Aug 24, 2007 at 06:44:38PM +0100, Mel Gorman wrote:
> On (24/08/07 19:38), Andi Kleen didst pronounce:
> > > Other than the fact that the memmap must be PMD aligned to use hugepage
> > > entries for the memmap.
> >
> > Why is that so? mem_map should be just part of lowmem anyways.
> >
>
> Not in this case. memmap is allocated node local and mapped in the virtual
> memory area normally occupied by the end of low memory. The objective was
> to have memmap for the struct pages node-local. Hence, portions of
> memmap are really in highmem.

Ok, but that still doesn't mean it has to be PMD aligned,
as long as illegal virtual aliases are prevented in the overlap
(which is not very hard).

> > > It could be mapped with small pages in corner cases
> > > but is the complexity worth it?
> >
> > You don't need to map it with small pages in the normal case,
> > the only requirement is that c_p_a() is aware of it so it can
> > split it if needed.
> >
> > > I can't see this type of lifting being done any time soon. As SPARSEMEM works
> > > and there is hope with the vmemmap work that DISCONTIG will finally go away,
> > > it may not be the best investment of time.
> >
> > It's a trivial change, probably less code than your original patch.
> >
>
> I'll have to take your word for it because I haven't looked closely
> enough. I'll try and find time to look at it but the earliest I'll get around
> to it is post kernel-summit. In the meantime, SPARSEMEM works.

Ok, so we disable DISCONTIG i386 NUMA because there's nobody willing
to maintain it?

I'll take your word SPARSEMEM works, although I was told DISCONTIG NUMA
works too and then my testing told a quite different story.

-Andi

2007-08-24 18:02:37

by mel

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

On (24/08/07 19:53), Andi Kleen didst pronounce:
> On Fri, Aug 24, 2007 at 06:44:38PM +0100, Mel Gorman wrote:
> > On (24/08/07 19:38), Andi Kleen didst pronounce:
> > > > Other than the fact that the memmap must be PMD aligned to use hugepage
> > > > entries for the memmap.
> > >
> > > Why is that so? mem_map should be just part of lowmem anyways.
> > >
> >
> > Not in this case. memmap is allocated node local and mapped in the virtual
> > memory area normally occupied by the end of low memory. The objective was
> > to have memmap for the struct pages node-local. Hence, portions of
> > memmap are really in highmem.
>
> Ok, but that still doesn't mean it has to be PMD aligned,

Indeed, only the huge mappings require that.

> as long as illegal virtual aliases are prevented in the overlap
> (which is not very hard).
>
> > > > It could be mapped with small pages in corner cases
> > > > but is the complexity worth it?
> > >
> > > You don't need to map it with small pages in the normal case,
> > > the only requirement is that c_p_a() is aware of it so it can
> > > split it if needed.
> > >
> > > > I can't see this type of lifting being done any time soon. As SPARSEMEM works
> > > > and there is hope with the vmemmap work that DISCONTIG will finally go away,
> > > > it may not be the best investment of time.
> > >
> > > It's a trivial change, probably less code than your original patch.
> > >
> >
> > I'll have to take your word for it because I haven't looked closely
> > enough. I'll try and find time to look at it but the earliest I'll get around
> > to it is post kernel-summit. In the meantime, SPARSEMEM works.
>
> Ok, so we disable DISCONTIG i386 NUMA because there's nobody willing
> to maintain it?
>

That is a bit of an over-reaction. A problem was reported and a fix was
suggested. I'm simply stating that it will be post kernel-summit before I
can revisit this issue, as there are more pressing bugs right now.

Disabling i386 DISCONTIG on NUMA would be drastic overkill because it works
on NUMA machines where the node ends are PMD aligned; otherwise this would
have shown up on test.kernel.org a long time ago. Maybe it would fail on a
real NUMA machine with less than 1GB of RAM; I don't know, but it's a
possibility.

> I'll take your word SPARSEMEM works, although I was told DISCONTIG NUMA
> works too and then my testing told a quite different story.
>

SPARSEMEM booted on a plain old laptop and looked ok in qemu. I didn't
test it extensively, just a plain boot.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-08-25 12:34:37

by Andy Whitcroft

Subject: Re: [PATCH] x86 Boot NUMA kernels on non-NUMA hardware with DISCONTIG memory model

Andi Kleen wrote:
> On Fri, Aug 24, 2007 at 06:44:38PM +0100, Mel Gorman wrote:
>> On (24/08/07 19:38), Andi Kleen didst pronounce:
>>>> Other than the fact that the memmap must be PMD aligned to use hugepage
>>>> entries for the memmap.
>>> Why is that so? mem_map should be just part of lowmem anyways.
>>>
>> Not in this case. memmap is allocated node local and mapped in the virtual
>> memory area normally occupied by the end of low memory. The objective was
>> to have memmap for the struct pages node-local. Hence, portions of
>> memmap are really in highmem.
>
> Ok, but that still doesn't mean it has to be PMD aligned,
> as long as illegal virtual aliases are prevented in the overlap
> (which is not very hard).
>
>>>> It could be mapped with small pages in corner cases
>>>> but is the complexity worth it?
>>> You don't need to map it with small pages in the normal case,
>>> the only requirement is that c_p_a() is aware of it so it can
>>> split it if needed.
>>>
>>>> I can't see this type of lifting being done any time soon. As SPARSEMEM works
>>>> and there is hope with the vmemmap work that DISCONTIG will finally go away,
>>>> it may not be the best investment of time.
>>> It's a trivial change, probably less code than your original patch.
>>>
>> I'll have to take your word for it because I haven't looked closely
>> enough. I'll try and find time to look at it but the earliest I'll get around
>> to it is post kernel-summit. In the meantime, SPARSEMEM works.
>
> Ok, so we disable DISCONTIG i386 NUMA because there's nobody willing
> to maintain it?
>
> I'll take your word SPARSEMEM works, although I was told DISCONTIG NUMA
> works too and then my testing told a quite different story.

That sounds like overkill to me. Unfixed, the code works for all actual
NUMA systems I am aware of, else we would have had reports of this
problem in the years the code has been in the kernel. The fix Mel sent
up makes the code work on systems with unaligned node ends (which is
what triggers the issue). It does mean that a little memory is wasted
when this kernel is used on non-NUMA systems with unaligned node ends
(and only then), but it works as designed at that point. To be honest it
looks very much like only very small memory systems are going to trip
this, and we have traditionally used non-NUMA kernels on non-NUMA
systems, so there is almost zero exposure in our install base.

Does this sudden interest in this combination indicate a distro-driven
change to using NUMA kernels on non-NUMA systems?

Having been involved in the development of the code originally, I think
Mel's fix is a good compromise for fixing the immediate problem. Clearly
there are bigger problems in this code that need clearing up if we are
to use it as it stands on small memory non-NUMA systems. For one, the
change merged to fix the "memmap overlapping initrd allocation" problem
severely wastes memory by pushing the memmap into ZONE_NORMAL even when
there is spare Kernel Virtual Address space available, and it also loses
the memory under the memmap where it used to be shifted to HIGHMEM.

I think most of this can become moot if we simply pull node-0 out of
this remap scheme, as node-0's memory is already local and the problem
only occurs on node-0. I have a todo item to look over this, but as Mel
has indicated it's probably not going to be immediate.

I think it makes sense to take Mel's fix as the smallest repair and
we'll spend some time sorting it out cleanly soon.

-apw