2006-03-17 08:21:28

by Yasunori Goto

[permalink] [raw]
Subject: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

This patch changes name of old add_memory() to arch_add_memory.
and use node id to get pgdat for the node at NODE_DATA().

Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup.
I suppose it may be redundant. But, I'm not sure about powerpc.
So, I keep it. (But, __meminit is better than __devinit at least.)

Signed-off-by: Yasunori Goto <[email protected]>

arch/i386/mm/init.c | 2 +-
arch/ia64/mm/init.c | 4 ++--
arch/powerpc/mm/mem.c | 4 +---
arch/x86_64/mm/init.c | 4 ++--
4 files changed, 6 insertions(+), 8 deletions(-)

Index: pgdat8/arch/i386/mm/init.c
===================================================================
--- pgdat8.orig/arch/i386/mm/init.c 2006-03-17 12:15:27.574691472 +0900
+++ pgdat8/arch/i386/mm/init.c 2006-03-17 12:16:06.189821080 +0900
@@ -652,7 +652,7 @@ void __init mem_init(void)
* memory to the highmem for now.
*/
#ifndef CONFIG_NEED_MULTIPLE_NODES
-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdata = &contig_page_data;
struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
Index: pgdat8/arch/ia64/mm/init.c
===================================================================
--- pgdat8.orig/arch/ia64/mm/init.c 2006-03-17 12:15:27.574691472 +0900
+++ pgdat8/arch/ia64/mm/init.c 2006-03-17 12:16:06.190820928 +0900
@@ -646,7 +646,7 @@ void online_page(struct page *page)
num_physpages++;
}

-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
pg_data_t *pgdat;
struct zone *zone;
@@ -654,7 +654,7 @@ int add_memory(u64 start, u64 size)
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- pgdat = NODE_DATA(0);
+ pgdat = NODE_DATA(nid);

zone = pgdat->node_zones + ZONE_NORMAL;
ret = __add_pages(zone, start_pfn, nr_pages);
Index: pgdat8/arch/powerpc/mm/mem.c
===================================================================
--- pgdat8.orig/arch/powerpc/mm/mem.c 2006-03-17 12:15:27.575691320 +0900
+++ pgdat8/arch/powerpc/mm/mem.c 2006-03-17 12:16:06.190820928 +0900
@@ -114,15 +114,13 @@ void online_page(struct page *page)
num_physpages++;
}

-int __devinit add_memory(u64 start, u64 size)
+int __meminit arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdata;
struct zone *zone;
- int nid;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

- nid = hot_add_scn_to_nid(start);
pgdata = NODE_DATA(nid);

start = __va(start);
Index: pgdat8/arch/x86_64/mm/init.c
===================================================================
--- pgdat8.orig/arch/x86_64/mm/init.c 2006-03-17 12:15:27.575691320 +0900
+++ pgdat8/arch/x86_64/mm/init.c 2006-03-17 12:16:06.191820776 +0900
@@ -493,9 +493,9 @@ void online_page(struct page *page)
num_physpages++;
}

-int add_memory(u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size)
{
- struct pglist_data *pgdat = NODE_DATA(0);
+ struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + MAX_NR_ZONES-2;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

--
Yasunori Goto



2006-03-17 17:13:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

On Fri, 2006-03-17 at 17:20 +0900, Yasunori Goto wrote:
> This patch changes name of old add_memory() to arch_add_memory.
> and use node id to get pgdat for the node at NODE_DATA().
>
> Note: Powerpc's old add_memory() is defined as __devinit. However,
> add_memory() is usually called only after bootup.
> I suppose it may be redundant. But, I'm not sure about powerpc.
> So, I keep it. (But, __meminit is better than __devinit at least.)

My thoughts when originally designing the API were that the architecture
may be the only bit that actually knows where the memory _is_. So, we
shouldn't involve the generic code in figuring this out.

You can see the result of this in the next patch because there is a new
function introduced to hide the arch-specific node lookup. If that was
simply done in the already arch-specific add_memory() function, then you
wouldn't need arch_nid_probe() and its related #ifdefs at all.

-- Dave

2006-03-18 01:26:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

Hi,
On Fri, 17 Mar 2006 09:12:18 -0800
Dave Hansen <[email protected]> wrote:

> On Fri, 2006-03-17 at 17:20 +0900, Yasunori Goto wrote:
> > This patch changes name of old add_memory() to arch_add_memory.
> > and use node id to get pgdat for the node at NODE_DATA().
> >
> > Note: Powerpc's old add_memory() is defined as __devinit. However,
> > add_memory() is usually called only after bootup.
> > I suppose it may be redundant. But, I'm not sure about powerpc.
> > So, I keep it. (But, __meminit is better than __devinit at least.)
>
> My thoughts when originally designing the API were that the architecture
> may be the only bit that actually knows where the memory _is_. So, we
> shouldn't involve the generic code in figuring this out.
>
> You can see the result of this in the next patch because there is a new
> function introduced to hide the arch-specific node lookup. If that was
> simply done in the already arch-specific add_memory() function, then you
> wouldn't need arch_nid_probe() and its related #ifdefs at all.
>
If *determine node* function is moved to arch specific parts,
memory hot add need more and more codes to determine paddr -> nid in arch
specific codes. Then, we have to add new paddr->nid function even if new nid is
passed by firmware. We *lose* useful information of nid from firmware if
add_memory() has just 2 args, (start, end).

Considering ACPI, ia64/x86_64 can share codes and it depends on ACPI , not on arch.


It's trade-off between add special tiny #ifdef for probe
(not-firmware-supported memory hotplug) and add many codes for firmware-supported
memory hotplug function. Again, we *lose* information of nid from firmware
if add_memory() has just 2 args (start, end). This is bad thing.

-- Kame

2006-03-21 18:02:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

On Sat, 2006-03-18 at 10:26 +0900, KAMEZAWA Hiroyuki wrote:
> If *determine node* function is moved to arch specific parts,
> memory hot add need more and more codes to determine paddr -> nid in arch
> specific codes. Then, we have to add new paddr->nid function even if new nid is
> passed by firmware. We *lose* useful information of nid from firmware if
> add_memory() has just 2 args, (start, end).

What I'm saying is that I'd like add_memory() to be just that, for
adding memory.

At some point in the process, you need to export the NUMA node layout to
the rest of the system, to say which pages go in which node. I'm just
saying that you should do that _before_ add_memory().

add_memory() should support adding memory to more than one node. If any
hypervisor or hardware happens to have memory added in one contiguous
chunk, it can not simply call add_memory(). _That_ firmware would be
forced to do the NUMA parsing and figure out how many times to call
add_memory().

Let me reiterate: the process of telling the system which pages are in
which node should be separate from telling the system that there *are*
currently pages there now.

-- Dave

2006-03-22 00:05:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

On Tue, 21 Mar 2006 10:00:12 -0800
Dave Hansen <[email protected]> wrote:

> On Sat, 2006-03-18 at 10:26 +0900, KAMEZAWA Hiroyuki wrote:
> > If *determine node* function is moved to arch specific parts,
> > memory hot add need more and more codes to determine paddr -> nid in arch
> > specific codes. Then, we have to add new paddr->nid function even if new nid is
> > passed by firmware. We *lose* useful information of nid from firmware if
> > add_memory() has just 2 args, (start, end).
>
> What I'm saying is that I'd like add_memory() to be just that, for
> adding memory.
>
> At some point in the process, you need to export the NUMA node layout to
> the rest of the system, to say which pages go in which node. I'm just
> saying that you should do that _before_ add_memory().
>

To do so, we have to maintain new pfn_to_nid() function.
We have to maintain a new table/list and have to consider name of it :).
And, add_memory() has to check whether a node which belongs exists ot not, again.
I don't want these kind of things.

With current kernel, we have to add new *pgdat* to node when adding a new node.
(If we don't, the kernel goes panic()) And we have to allocate a pgdat/zones
in a local node in future. So adding a node before adding memory is not good.
(current code uses kmalloc() just for reducing complexity.)

> add_memory() should support adding memory to more than one node. If any
> hypervisor or hardware happens to have memory added in one contiguous
> chunk, it can not simply call add_memory(). _That_ firmware would be
> forced to do the NUMA parsing and figure out how many times to call
> add_memory().
I don't think the firmware adds memory of multiple nodes at once.
It's crazy.

>
> Let me reiterate: the process of telling the system which pages are in
> which node should be separate from telling the system that there *are*
> currently pages there now.

Considering "cpu only node', "check and add new node" function can be separated,
like add_memory_less_node().(But pgdat/zone etc.. will be allocated in out of node.)

Bye.
-- Kame

2006-03-22 01:10:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

On Wed, 2006-03-22 at 09:05 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 21 Mar 2006 10:00:12 -0800
> Dave Hansen <[email protected]> wrote:
> > At some point in the process, you need to export the NUMA node layout to
> > the rest of the system, to say which pages go in which node. I'm just
> > saying that you should do that _before_ add_memory().
> >
> To do so, we have to maintain new pfn_to_nid() function.
> We have to maintain a new table/list and have to consider name of it :).

I completely spaced out, and forgot that we use sparsemem and 'struct
pages' for pfn_to_nid() now. I've been buried too deep in the i386
discontigmem physnode_map[]. Sorry.

If I missed it before, please refresh my memory. But, if we're
providing arch_nid_probe(addr), then why don't we just call it inside of
add_memory() on the start address, instead of in the generic code?

I'm also getting a bit confused in your patches whether add_memory() is
the _original_ add_memory(), or the new one. It tends to get lost in 17
patches. :(

I don't really like the arch_nid_probe() name. We need to make it very
apparent that it is to be used _only_ for memory hotplug operations. It
has no meaning for anything else.

hotplug_physaddr_to_nid()?

Maybe with a "memory_" in front. Maybe even
memory_add_physaddr_to_nid()?

It was probably to keep from changing as little code as possible, but
please convert the u64 values to pfns as soon as possible. I noticed
that hotadd_new_pgdat() still deals with them, and does the shift as
well. Is that really necessary.

The u64s should not be kept for more than one level of calls. That
level of calls should be the firmware. So, let the firmware call into
the VM code with u64s, then have all of the plain VM code deal in pfns.

-- Dave

2006-03-22 01:38:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

On Tue, 21 Mar 2006 17:08:18 -0800
Dave Hansen <[email protected]> wrote:
> If I missed it before, please refresh my memory. But, if we're
> providing arch_nid_probe(addr), then why don't we just call it inside of
> add_memory() on the start address, instead of in the generic code?
>
I think just *probe* needs it. The firmware which supports memory-hotplug, ACPI,
i386/x86_64/ia64, can tell node number as pxm.(proximity domain)

add_memory() can pass address of paddr as args, but can't pass *pxm*.

We already maintain pxm <-> nid map. But paddr <-> nid map isn't now.
If add_memory() doesn't has nid as args, we have to maintain
(1) pxm <-> nid map.
(2) paddr <-> nid map.
Becasue pfn_to_nid() is maintained by SPARSEMEM itself now, *new* paddr<->nid map
is redundant, I think. And I think the firmware already has the map before calling
add_memory().

> I'm also getting a bit confused in your patches whether add_memory() is
> the _original_ add_memory(), or the new one. It tends to get lost in 17
> patches. :(
>
maybe a big patch :(, We (I and Goto-san) are discussing to split them to
a bit small chunks.

> I don't really like the arch_nid_probe() name. We need to make it very
> apparent that it is to be used _only_ for memory hotplug operations. It
> has no meaning for anything else.
>
> hotplug_physaddr_to_nid()?
>
> Maybe with a "memory_" in front. Maybe even
> memory_add_physaddr_to_nid()?
>
Okay, we'll rename it.

> It was probably to keep from changing as little code as possible, but
> please convert the u64 values to pfns as soon as possible. I noticed
> that hotadd_new_pgdat() still deals with them, and does the shift as
> well. Is that really necessary.
>
> The u64s should not be kept for more than one level of calls. That
> level of calls should be the firmware. So, let the firmware call into
> the VM code with u64s, then have all of the plain VM code deal in pfns.
>
Okay.

-- Kame

2006-03-22 06:07:58

by Yasunori Goto

[permalink] [raw]
Subject: Re: [PATCH: 002/017]Memory hotplug for new nodes v.4.(change name old add_memory() to arch_add_memory())

> On Tue, 21 Mar 2006 17:08:18 -0800
> Dave Hansen <[email protected]> wrote:
> > If I missed it before, please refresh my memory. But, if we're
> > providing arch_nid_probe(addr), then why don't we just call it inside of
> > add_memory() on the start address, instead of in the generic code?
> >
> I think just *probe* needs it. The firmware which supports memory-hotplug, ACPI,
> i386/x86_64/ia64, can tell node number as pxm.(proximity domain)
>
> add_memory() can pass address of paddr as args, but can't pass *pxm*.
>
> We already maintain pxm <-> nid map. But paddr <-> nid map isn't now.
> If add_memory() doesn't has nid as args, we have to maintain
> (1) pxm <-> nid map.
> (2) paddr <-> nid map.
> Becasue pfn_to_nid() is maintained by SPARSEMEM itself now, *new* paddr<->nid map
> is redundant, I think. And I think the firmware already has the map before calling
> add_memory().

In ACPI's case, kernel has to do 3 steps to get node id from physicall
address.
1) parse DSDT to get device handle of ACPI for new memory.
This step can be described 3 detail steps.
1-1) search memory device in DSDT.
1-2) get _CRS of its memory device to see physicall address.
1-3) compare 1-2)'s address with required paddr.
(its memory device might be for other memory.)
If it is for new memory, then goto step 2),
else goto 1-1) again.
2) get pxm against its device handle.
It is just calling acpi_get_pxm().
3) get node from pxm. If pxm is new one, new node id is assigned.

Step 1) is a bit complicated.
But, when notify of memory hot-add event reaches via ACPI,
its handle is already obtained. So, step 2) and 3) are enough to
get node id. If node id can be passed at add_memory(), that is all.
If not, kernel losts memory device handle information or node id once
at calling add_memory(), and search memory device handle again by step 1).



In addition, if paddr to node id translation is called in add_memory(),
then its code for -each arch- will be like followings.
This will be close from ugly "copy and paste" again....... :-(

add_memory(paddr)
{
nid = paddr_to_node_id(paddr)
if (!node_online(nid)) {
pgdat = hotadd_new_pgdat(nid, start);
if (!pgdat)
return -ENOMEM;
}
:

Bye.

--
Yasunori Goto