2009-06-30 00:47:41

by Shaohua Li

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Tue, Jun 30, 2009 at 06:07:16AM +0800, Christoph Lameter wrote:
> On Mon, 29 Jun 2009, [email protected] wrote:
>
> > To initialize hotadded node, some pages are allocated. At that time, the
> > node hasn't memory, this makes the allocation always fail. In such case,
> > let's allocate pages from other nodes.
>
> Thats bad. Could you populate the buddy list with some large pages from
> the beginning of the node instead of doing this special casing? The
> vmemmap and other stuff really should come from the node that is added.
> Otherwise off node memory accesses will occur constantly for processors on
> that node.
OK, this is preferred. But the node has no memory present at that time;
let me check how we could do it.

Thanks,
Shaohua


2009-07-01 02:56:08

by Shaohua Li

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Tue, Jun 30, 2009 at 08:47:35AM +0800, Shaohua Li wrote:
> On Tue, Jun 30, 2009 at 06:07:16AM +0800, Christoph Lameter wrote:
> > On Mon, 29 Jun 2009, [email protected] wrote:
> >
> > > To initialize hotadded node, some pages are allocated. At that time, the
> > > node hasn't memory, this makes the allocation always fail. In such case,
> > > let's allocate pages from other nodes.
> >
> > Thats bad. Could you populate the buddy list with some large pages from
> > the beginning of the node instead of doing this special casing? The
> > vmemmap and other stuff really should come from the node that is added.
> > Otherwise off node memory accesses will occur constantly for processors on
> > that node.
> Ok, this is preferred. But the node hasn't any memory present at that time,
> let me check how could we do it.
Hi Christoph,
This looks quite hard. The node's memory hasn't been added to the buddy
allocator. At that time (sparse-vmemmap init) the buddy allocator for the node
isn't initialized, and even the page structs for the hot-added memory aren't
prepared yet. We need something like the bootmem allocator to get memory ...
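
A minimal sketch of what "something like bootmem allocator" could mean here: a
bump allocator that hands out memory directly from the hot-added physical range
before any page structs or buddy lists exist for it. All names (early_range,
early_range_alloc, early_range_used) are hypothetical, and this is a user-space
illustration under those assumptions, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical early allocator for a hot-added range; not a kernel API. */
struct early_range {
    uintptr_t start; /* first byte of the hot-added range */
    uintptr_t end;   /* one past the last byte of the range */
    uintptr_t next;  /* next free byte; only ever grows */
};

static void early_range_init(struct early_range *r, uintptr_t start, size_t size)
{
    r->start = start;
    r->end = start + size;
    r->next = start;
}

/*
 * Return the address of `size` bytes aligned to `align` (a power of two),
 * or 0 when the range cannot satisfy the request. Nothing is ever freed.
 */
static uintptr_t early_range_alloc(struct early_range *r, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t p = (r->next + align - 1) & ~((uintptr_t)align - 1);
    if (p < r->next || p > r->end || size > r->end - p)
        return 0;
    r->next = p + size;
    return p;
}

/* Bytes consumed so far, including alignment padding. */
static size_t early_range_used(const struct early_range *r)
{
    return (size_t)(r->next - r->start);
}
```

The bytes counted by early_range_used() could not simply be handed to the buddy
allocator once the node is online; they would have to stay reserved, which is
where freeing and unplugging become complicated.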

2009-07-01 03:38:18

by Zhao, Yakui

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Wed, 2009-07-01 at 10:55 +0800, Li, Shaohua wrote:
> On Tue, Jun 30, 2009 at 08:47:35AM +0800, Shaohua Li wrote:
> > On Tue, Jun 30, 2009 at 06:07:16AM +0800, Christoph Lameter wrote:
> > > On Mon, 29 Jun 2009, [email protected] wrote:
> > >
> > > > To initialize hotadded node, some pages are allocated. At that time, the
> > > > node hasn't memory, this makes the allocation always fail. In such case,
> > > > let's allocate pages from other nodes.
> > >
> > > Thats bad. Could you populate the buddy list with some large pages from
> > > the beginning of the node instead of doing this special casing? The
> > > vmemmap and other stuff really should come from the node that is added.
> > > Otherwise off node memory accesses will occur constantly for processors on
> > > that node.
> > Ok, this is preferred. But the node hasn't any memory present at that time,
> > let me check how could we do it.
> Hi Christoph,
> Looks this is quite hard. Memory of the node isn't added into buddy. At that
> time (sparse-vmmem init) buddy for the node isn't initialized and even page struct
> for the hotadded memory isn't prepared too. We need something like bootmem
> allocator to get memory ...
I agree with what Shaohua said.
If we can't allocate memory from another node when there is no memory on
this node, we will have to do something like the bootmem allocator.
After the memory pages are added to system memory, we will have to
free the memory space used by that allocator. At the same time we
will have to ensure that the hot-plugged memory exists physically.

thanks.
Yakui

2009-07-01 17:22:59

by Christoph Lameter

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Wed, 1 Jul 2009, yakui wrote:

> If we can't allocate memory from other node when there is no memory on
> this node, we will have to do something like the bootmem allocator.
> After the memory page is added to the system memory, we will have to
> free the memory space used by the memory allocator. At the same time we
> will have to assure that the hot-plugged memory exists physically.

The bootmem allocator must stick around, it seems. It's more like a node
bootstrap allocator then.

Maybe we can generalize that. The bootstrap allocator may only need to be
able to boot one node (which simplifies the design). During system bringup only
the boot node is brought up.

Then the other nodes are hotplugged later all in turn using the bootstrap
allocator for their node setup?

There are a couple of things where one would want to spread out memory
across the nodes at boot time. How would node hotplugging handle that
situation?

2009-07-02 01:10:26

by Zhao, Yakui

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Thu, 2009-07-02 at 01:22 +0800, Christoph Lameter wrote:
> On Wed, 1 Jul 2009, yakui wrote:
>
> > If we can't allocate memory from other node when there is no memory on
> > this node, we will have to do something like the bootmem allocator.
> > After the memory page is added to the system memory, we will have to
> > free the memory space used by the memory allocator. At the same time we
> > will have to assure that the hot-plugged memory exists physically.
>
> The bootmem allocator must stick around it seems. Its more like a node
> bootstrap allocator then.
>
> Maybe we can generalize that. The bootstrap allocator may only need to be
> able boot one node (which simplifies design). During system bringup only
> the boot node is brought up.
>
> Then the other nodes are hotplugged later all in turn using the bootstrap
> allocator for their node setup?
Your idea looks attractive, but it seems difficult to realize.
In the boot phase the bootmem allocator is initialized. After the
page buddy mechanism is enabled, the memory space used by the bootmem
allocator is freed.

If we do a similar thing for the hotplugged node, how and when do we
free the memory space used by the bootstrap allocator? It seems that we
will have to wait until all the memory sections are onlined for this
hotplugged node. And until all the memory sections are onlined, the
bootstrap allocator and the buddy page allocator will co-exist.

thanks.
>
> There are a couple of things where one would want to spread out memory
> across the nodes at boot time. How would node hotplugging handle that
> situation?

2009-07-02 01:24:04

by Kamezawa Hiroyuki

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Thu, 02 Jul 2009 09:11:13 +0800
yakui <[email protected]> wrote:

> On Thu, 2009-07-02 at 01:22 +0800, Christoph Lameter wrote:
> > On Wed, 1 Jul 2009, yakui wrote:
> >
> > > If we can't allocate memory from other node when there is no memory on
> > > this node, we will have to do something like the bootmem allocator.
> > > After the memory page is added to the system memory, we will have to
> > > free the memory space used by the memory allocator. At the same time we
> > > will have to assure that the hot-plugged memory exists physically.
> >
> > The bootmem allocator must stick around it seems. Its more like a node
> > bootstrap allocator then.
> >
> > Maybe we can generalize that. The bootstrap allocator may only need to be
> > able boot one node (which simplifies design). During system bringup only
> > the boot node is brought up.
> >
> > Then the other nodes are hotplugged later all in turn using the bootstrap
> > allocator for their node setup?
> Your idea looks fragrant. But it seems that it is difficult to realize.
> In the boot phase the bootmem allocator is initialized. And after the
> page buddy mechanism is enabled, the memory space used by bootmem
> allocator will be freed.
>
> If we also do the similar thing for the hotplugged node, how and when to
> free the memory space used by the bootstrap allocator? It seems that we
> will have to wait before all the memory sections are onlined for this
> hotplugged node. And before all the memory sections are onlined, the
> bootstrap allocator and buddy page allocator will co-exist.
>

When I was an eager developer of memory hotplug, I planned that:
a special page allocator which works from allocating pgdat until memmap setup.
But there were problems.
example)
1. We wanted to reuse bootmem.c but it was difficult.
2. IBM guys use 16MB sections. So they cannot allocate local pgdat/memmap
   the way other platforms with larger section sizes can.
3. At memory hotplug, "the memory section which includes the pgdat for a node
   should be removed after all other sections on the node are removed".
   The same problem applies to memmap.

Because current memory hotplug works sanely and the problems above were too
complicated for me, I stopped. But there are more NUMA machines now than when
we first implemented memory hotplug.
I hope someone fixes this mis-allocation problem.

IIUC, "3" is the worst problem. It creates dependency among memory.

Thanks,
-Kame


> thanks.
> >
> > There are a couple of things where one would want to spread out memory
> > across the nodes at boot time. How would node hotplugging handle that
> > situation?

2009-07-02 05:59:52

by Yasunori Goto

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

> On Thu, 02 Jul 2009 09:11:13 +0800
> yakui <[email protected]> wrote:
>
> > On Thu, 2009-07-02 at 01:22 +0800, Christoph Lameter wrote:
> > > On Wed, 1 Jul 2009, yakui wrote:
> > >
> > > > If we can't allocate memory from other node when there is no memory on
> > > > this node, we will have to do something like the bootmem allocator.
> > > > After the memory page is added to the system memory, we will have to
> > > > free the memory space used by the memory allocator. At the same time we
> > > > will have to assure that the hot-plugged memory exists physically.
> > >
> > > The bootmem allocator must stick around it seems. Its more like a node
> > > bootstrap allocator then.
> > >
> > > Maybe we can generalize that. The bootstrap allocator may only need to be
> > > able boot one node (which simplifies design). During system bringup only
> > > the boot node is brought up.
> > >
> > > Then the other nodes are hotplugged later all in turn using the bootstrap
> > > allocator for their node setup?
> > Your idea looks fragrant. But it seems that it is difficult to realize.
> > In the boot phase the bootmem allocator is initialized. And after the
> > page buddy mechanism is enabled, the memory space used by bootmem
> > allocator will be freed.
> >
> > If we also do the similar thing for the hotplugged node, how and when to
> > free the memory space used by the bootstrap allocator? It seems that we
> > will have to wait before all the memory sections are onlined for this
> > hotplugged node. And before all the memory sections are onlined, the
> > bootstrap allocator and buddy page allocator will co-exist.
> >
>
> When I was an eager developper of memory hotplug, I planned that.
> A special page allocater which works from allocating pgdat until memmap setup.
> But there were problems.
> example)
> 1. We wanted to reuse bootmem.c but it was difficult.
> 2. IBM guys uses 16MB section. Then, they cannot allocate local pgdat/memmap
> as other platform which have larger section size.
> 3. At memory hotplug, "memory section which includes pgdat for a node should be
> removed after all other sections on the node are removed"
> There is the same problem to memmap.
>
> Because current memory hotplug works sane and above problem was too complicated for
> me, I stopped. But there are more NUMAs than we implemented memory hotplug initially.
> I hope someone fixes this mis-allocation problem.
>
> IIUC, "3" is the worst problem. It creates dependency among memory.

I made some tiny basic functions for this 1 or 2 years ago.
get_page_bootmem() records the section/node id, or counts up
how many other pages use the page. It would be used for dependency
checking when removing memory.
I was going to make a new allocator with that information.
(put_page_bootmem() is to free them.)
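
The dependency check described here can be sketched as a per-page use count: a
page that holds metadata for some section records which section it serves and
how many users it has, and the section containing that page is removable only
once every count has dropped to zero. The names meta_get(), meta_put() and
section_removable(), and the tiny section size, are hypothetical stand-ins for
illustration; they are not the kernel signatures of get_page_bootmem() or
put_page_bootmem().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_SECTION 8 /* tiny on purpose; real sections are far larger */
#define NR_SECTIONS 4
#define NR_PAGES (NR_SECTIONS * PAGES_PER_SECTION)

/* Hypothetical bookkeeping, not the kernel's struct page. */
struct meta_page {
    int owner_section;  /* section whose metadata lives here, -1 if none */
    unsigned int users; /* live references to this page */
};

static struct meta_page pages[NR_PAGES];

static void meta_init(void)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        pages[i].owner_section = -1;
        pages[i].users = 0;
    }
}

/* Record that page `pfn` holds metadata for `owner_section`. */
static void meta_get(size_t pfn, int owner_section)
{
    assert(pfn < NR_PAGES);
    pages[pfn].owner_section = owner_section;
    pages[pfn].users++;
}

/* Drop one reference; the page is free of metadata once the count is zero. */
static void meta_put(size_t pfn)
{
    assert(pfn < NR_PAGES && pages[pfn].users > 0);
    if (--pages[pfn].users == 0)
        pages[pfn].owner_section = -1;
}

/* A section may be removed only if none of its pages still hold metadata. */
static bool section_removable(int section)
{
    size_t first = (size_t)section * PAGES_PER_SECTION;

    for (size_t i = 0; i < PAGES_PER_SECTION; i++)
        if (pages[first + i].users != 0)
            return false;
    return true;
}
```

In this model, a page in section 0 that stores metadata for section 2 pins
section 0 until section 2 has released it, which is the ordering dependency
raised earlier in the thread.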

However, I don't have enough time for memory hotplug now,
and they are just redundant functions for now.
If someone creates a new allocator (unifying it with the bootmem allocator),
I would be very glad. :-)


Bye.


--
Yasunori Goto

2009-07-02 13:31:24

by Christoph Lameter

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Thu, 2 Jul 2009, Yasunori Goto wrote:

> However, I don't enough time for memory hotplug now,
> and they are just redundant functions now.
> If someone create new allocator (and unifying bootmem allocator),
> I'm very glad. :-)

"Senior"ities all around.... A move like that would require serious
commitment of time. None of us older developers can take that on it
seems.

Do we need to accept that the zone and page metadata are living on another
node?

2009-07-02 23:57:55

by Kamezawa Hiroyuki

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Thu, 2 Jul 2009 09:31:04 -0400 (EDT)
Christoph Lameter <[email protected]> wrote:

> On Thu, 2 Jul 2009, Yasunori Goto wrote:
>
> > However, I don't enough time for memory hotplug now,
> > and they are just redundant functions now.
> > If someone create new allocator (and unifying bootmem allocator),
> > I'm very glad. :-)
>
> "Senior"ities all around.... A move like that would require serious
> commitment of time. None of us older developers can take that on it
> seems.
>
> Do we need to accept that the zone and page metadata are living on another
> node?
>
I don't think so. Someone should do it. I just think I can't do it _now_
(because I have more things to do for cgroup...).

And outside of node hotplug, memmap is allocated from local memory if possible.
Is the question whether we should _never_ allow fallback to other nodes?
I think we should allow fallback.
As for pgdat and zones, I hope they will stay in cache...

Maybe the following are necessary for allocating pgdat/zones from the local node
at node hotplug.

a) Add new tiny functions to allocate memory from a not-yet-initialized area.
   Allocate pgdat/memmap from there if necessary.
b) Leave the memory allocated in (a) as PG_reserved at onlining.
c) There will be "not unpluggable" sections after (b). We should show this to
   users.
d) For removal, we have to keep precise track of PG_reserved pages.
e) vmemmap removal, where vmemmap uses large pages, is a problem.
   The edges of a section's memmap are not aligned to large pages, so we need
   some clever trick to handle this.
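
The alignment problem in (e) can be made concrete with a little arithmetic. The
sketch below computes how many bytes of memmap one section needs and whether
that amount is a whole number of large pages. Only the 16MB section size comes
from this thread; the 64-byte struct page, 4KB base page, 2MB large page and
128MB comparison section are assumptions picked for illustration, and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of memmap (one struct page per base page) needed for one section. */
static uint64_t memmap_bytes_per_section(uint64_t section_bytes,
                                         uint64_t page_bytes,
                                         uint64_t struct_page_bytes)
{
    assert(page_bytes != 0 && section_bytes % page_bytes == 0);
    return section_bytes / page_bytes * struct_page_bytes;
}

/*
 * True when each section's memmap covers a whole number of large pages,
 * assuming memmaps are laid out back to back from an aligned base.
 */
static bool memmap_is_large_page_aligned(uint64_t memmap_bytes,
                                         uint64_t large_page_bytes)
{
    assert(large_page_bytes != 0);
    return memmap_bytes % large_page_bytes == 0;
}

/*
 * Number of sections whose memmap shares a single large page. Meaningful
 * when memmap_bytes is no larger than large_page_bytes; otherwise returns 1.
 */
static uint64_t sections_per_large_page(uint64_t memmap_bytes,
                                        uint64_t large_page_bytes)
{
    assert(memmap_bytes != 0);
    return (large_page_bytes + memmap_bytes - 1) / memmap_bytes;
}
```

Under these assumptions a 128MB section needs 2MB of memmap, exactly one large
page, while a 16MB section needs only 256KB, so eight sections share one large
page and none of them can release its part of the mapping independently.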

Allocating memmap from its own section was an idea (I love this) but
IBM's 16MB memory sections don't allow this.

Thanks,
-Kame

2009-07-03 09:12:18

by Shaohua Li

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Fri, Jul 03, 2009 at 07:55:56AM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 2 Jul 2009 09:31:04 -0400 (EDT)
> Christoph Lameter <[email protected]> wrote:
>
> > On Thu, 2 Jul 2009, Yasunori Goto wrote:
> >
> > > However, I don't enough time for memory hotplug now,
> > > and they are just redundant functions now.
> > > If someone create new allocator (and unifying bootmem allocator),
> > > I'm very glad. :-)
> >
> > "Senior"ities all around.... A move like that would require serious
> > commitment of time. None of us older developers can take that on it
> > seems.
> >
> > Do we need to accept that the zone and page metadata are living on another
> > node?
> >
> I don't think so. Someone should do. I just think I can't do it _now_.
> (because I have more things to do for cgroup..)
>
> And, if not node-hotplug, memmap is allocated from local memory if possible.
> "We should _never_ allow fallback to other nodes or not" is problem ?
> I think we should allow fallback.
> About pgdat, zones, I hope they will be on-cache...
>
> Maybe followings are necessary for allocating pgdat/zones from local node
> at node-hotplug.
>
> a) Add new tiny functions to alloacate memory from not-initialized area.
> allocate pgdat/memmap from here if necessary.
> b) leave allocated memory from (a) as PG_reserved at onlining.
> c) There will be "not unpluggable" section after (b). We should show this to
> users.
> d) For removal, we have to keep precise trace of PG_reserved pages.
> e) vmemmap removal, which uses large page for vmemmap, is a problem.
> edges of section memmap is not aligned to large pages. Then we need
> some clever trick to handle this.
>
> Allocationg memmap from its own section was an idea (I love this) but
> IBM's 16MB memory section doesn't allow this.
Adding code for allocation should not be hard, but it is hard to keep the memory
unpluggable. For example, the vmemmap page table pages can map several
sections and even several nodes (a pgd page). This will make some sections
completely non-unpluggable if they hold page table pages.
Is it possible to merge the workaround temporarily? Without it, hotplug
fails immediately on our side.

Thanks,
Shaohua

2009-07-05 23:49:18

by Kamezawa Hiroyuki

Subject: Re: + memory-hotplug-alloc-page-from-other-node-in-memory-online.patch added to -mm tree

On Fri, 3 Jul 2009 17:12:06 +0800
Shaohua Li <[email protected]> wrote:

> On Fri, Jul 03, 2009 at 07:55:56AM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 2 Jul 2009 09:31:04 -0400 (EDT)
> > Christoph Lameter <[email protected]> wrote:
> >
> > > On Thu, 2 Jul 2009, Yasunori Goto wrote:
> > >
> > > > However, I don't enough time for memory hotplug now,
> > > > and they are just redundant functions now.
> > > > If someone create new allocator (and unifying bootmem allocator),
> > > > I'm very glad. :-)
> > >
> > > "Senior"ities all around.... A move like that would require serious
> > > commitment of time. None of us older developers can take that on it
> > > seems.
> > >
> > > Do we need to accept that the zone and page metadata are living on another
> > > node?
> > >
> > I don't think so. Someone should do. I just think I can't do it _now_.
> > (because I have more things to do for cgroup..)
> >
> > And, if not node-hotplug, memmap is allocated from local memory if possible.
> > "We should _never_ allow fallback to other nodes or not" is problem ?
> > I think we should allow fallback.
> > About pgdat, zones, I hope they will be on-cache...
> >
> > Maybe followings are necessary for allocating pgdat/zones from local node
> > at node-hotplug.
> >
> > a) Add new tiny functions to alloacate memory from not-initialized area.
> > allocate pgdat/memmap from here if necessary.
> > b) leave allocated memory from (a) as PG_reserved at onlining.
> > c) There will be "not unpluggable" section after (b). We should show this to
> > users.
> > d) For removal, we have to keep precise trace of PG_reserved pages.
> > e) vmemmap removal, which uses large page for vmemmap, is a problem.
> > edges of section memmap is not aligned to large pages. Then we need
> > some clever trick to handle this.
> >
> > Allocationg memmap from its own section was an idea (I love this) but
> > IBM's 16MB memory section doesn't allow this.
> Adding code for allocation should not be hard, but hard to make the memory
> unpluggable. For example, the vmemmap page table pages can map several
> sections and even several nodes (a pgd page). This will make some sections
> completely not unpluggable if the sections have page table pages.
> Is it possible we can merge the workaround temporarily? Without it, the hotplug
> fails immediately in our side.
>
ZONE_MOVABLE is for that. I wonder whether the current ZONE_MOVABLE interface
is not enough. If a section should be removable later, it should be onlined as
ZONE_MOVABLE, as follows.

example)
echo removable_online > /sys/devices/system/memory/memoryXXX/online
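
For tooling that cannot shell out, the same write can be sketched in C. The
helper names are hypothetical, and "removable_online" is simply the value
proposed in this message; whether a given kernel accepts it is an assumption to
check against that kernel's memory hotplug documentation. Unlike echo, this
sketch writes no trailing newline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write `value` to the control file at `path`. Returns 0 on success, -1 on error. */
static int write_control_file(const char *path, const char *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0) /* a rejected sysfs write may only surface at flush time */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Build the path used in the example above for one memory block, e.g.
 * block = "memoryXXX". Returns 0 on success, -1 if `buf` is too small.
 */
static int memory_block_path(char *buf, size_t len, const char *block)
{
    int n = snprintf(buf, len, "/sys/devices/system/memory/%s/online", block);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would build the path with memory_block_path() and then pass it, with
the desired value, to write_control_file(), treating -1 as "the kernel or the
filesystem refused the request".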


thx,
-Kame

> Thanks,
> Shaohua