2008-07-22 16:55:33

by Gerald Schaefer

Subject: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

I've been testing memory hotplug on s390, on a system that starts w/o
memory in ZONE_MOVABLE at first, and then some memory chunks will be
added to ZONE_MOVABLE via memory hot-add. Now I observe the following
problem:

Memory hot-remove of the lowest memory chunk in ZONE_MOVABLE will fail
because of some reserved pages at the beginning of each zone
(MIGRATE_RESERVE).

During memory hot-add, setup_per_zone_pages_min() will be called from
online_pages() to redistribute/recalculate the reserved page blocks.
This will mark some page blocks at the beginning of each zone as
MIGRATE_RESERVE. Now, the memory chunk containing these blocks cannot
be set offline again, because only MIGRATE_MOVABLE pages can be isolated
(offline_pages -> start_isolate_page_range).
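
The failure mode can be illustrated with a small user-space model. This is only a sketch, not kernel code: the enum and the `model_*` functions are hypothetical stand-ins for the pageblock migrate types and for the check done on the offline_pages -> start_isolate_page_range path, assuming that path refuses any pageblock that is not MIGRATE_MOVABLE.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's pageblock migrate types. */
enum migratetype_model {
	MT_UNMOVABLE,
	MT_RECLAIMABLE,
	MT_MOVABLE,
	MT_RESERVE,
	MT_ISOLATE,
};

/*
 * Model of the per-pageblock check: only a block that is currently
 * marked movable may be switched to "isolate".
 */
bool model_isolate_block(enum migratetype_model *block)
{
	if (*block != MT_MOVABLE)
		return false;
	*block = MT_ISOLATE;
	return true;
}

/*
 * Model of isolating a whole memory chunk: every block must be
 * isolated; on the first refusal the blocks isolated so far are
 * restored and the operation fails.
 */
bool model_isolate_chunk(enum migratetype_model *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!model_isolate_block(&blocks[i])) {
			/* roll back the blocks isolated so far */
			while (i-- > 0)
				blocks[i] = MT_MOVABLE;
			return false;
		}
	}
	return true;
}
```

In this model a chunk whose first block is MT_RESERVE can never be isolated, while a chunk consisting only of MT_MOVABLE blocks can.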

So you cannot remove all the memory chunks that have been added via
memory hotplug. I'm not sure if I am missing something here, or if this
really is a bug. Any thoughts?

Thanks,
Gerald


2008-07-23 02:51:07

by Yasunori Goto

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

Hi.


> I've been testing memory hotplug on s390, on a system that starts w/o
> memory in ZONE_MOVABLE at first, and then some memory chunks will be
> added to ZONE_MOVABLE via memory hot-add. Now I observe the following
> problem:
>
> Memory hot-remove of the lowest memory chunk in ZONE_MOVABLE will fail
> because of some reserved pages at the beginning of each zone
> (MIGRATE_RESERVED).
>
> During memory hot-add, setup_per_zone_pages_min() will be called from
> online_pages() to redistribute/recalculate the reserved page blocks.
> This will mark some page blocks at the beginning of each zone as
> MIGRATE_RESERVE. Now, the memory chunk containing these blocks cannot
> be set offline again, because only MIGRATE_MOVABLE pages can be isolated
> (offline_pages -> start_isolate_page_range).
>
> So you cannot remove all the memory chunks that have been added via
> memory hotplug. I'm not sure if I am missing something here, or if this
> really is a bug. Any thoughts?

I believe you are right. The current hot-remove code is NOT perfect.
You can remove some sections but not others, because some pages
cannot be removed for various reasons (not only MIGRATE_RESERVE).

I think MIGRATE_RESERVE pages should be moved to MIGRATE_MOVABLE when
they have to be removed, and the MIGRATE_RESERVE pages should then be
recalculated.

Bye.

--
Yasunori Goto

2008-07-29 16:08:51

by Gerald Schaefer

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

On Wed, 2008-07-23 at 11:48 +0900, Yasunori Goto wrote:
> > Memory hot-remove of the lowest memory chunk in ZONE_MOVABLE will fail
> > because of some reserved pages at the beginning of each zone
> > (MIGRATE_RESERVED).
> >
> I believe you are right. Current hot-remove code is NOT perfect.
> You may remove some sections, but may not other sections,
> because there are some un-removable pages by some reasons
> (not only MIGRATE_RESERVED).
>
> I think MIGRATE_RESERVED pages should be move to MIGRATE_MOVABLE when
> those pages must be removed, and should recalculate MIGRATE_RESERVED pages.

Hi,

Would it be an option to set pages_min to 0 for ZONE_MOVABLE in
setup_per_zone_pages_min()? This would avoid the MIGRATE_RESERVED vs.
MIGRATE_MOVABLE conflict on memory hot-remove. If I understand it
correctly, the kernel wouldn't be able to use the reserved pages in
ZONE_MOVABLE for __GFP_HIGH and PF_MEMALLOC allocations anyway, right?

At the moment, ZONE_MOVABLE pages will also account for the lowmem_pages
calculation in setup_per_zone_pages_min(). The recalculation will then
redistribute and reduce the amount of reserved pages for the other zones.
Won't this effectively reduce the amount of reserved min_free_kbytes memory
that is available to the kernel, even getting worse the more memory is
added to ZONE_MOVABLE?

With the following patch, ZONE_MOVABLE will be skipped for the
lowmem_pages calculation, just like it is already done for highmem.
It will also set pages_min to 0 for ZONE_MOVABLE. But I have an uneasy
feeling about this, because I may be missing side effects from this.
Any opinions?

Thanks,
Gerald

---
include/linux/mmzone.h | 5 +++++
mm/page_alloc.c | 4 ++--
2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -660,6 +660,11 @@ static inline int is_dma(struct zone *zo
 #endif
 }
 
+static inline int is_movable(struct zone *zone)
+{
+	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
+}
+
 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
 struct file;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -4210,7 +4210,7 @@ void setup_per_zone_pages_min(void)

 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!is_highmem(zone) && !is_movable(zone))
 			lowmem_pages += zone->present_pages;
 	}
 
@@ -4243,7 +4243,7 @@ void setup_per_zone_pages_min(void)
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = tmp;
+			zone->pages_min = is_movable(zone) ? 0 : tmp;
 		}
 
 		zone->pages_low = zone->pages_min + (tmp >> 2);

2008-07-30 03:26:40

by Yasunori Goto

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

> On Wed, 2008-07-23 at 11:48 +0900, Yasunori Goto wrote:
> > > Memory hot-remove of the lowest memory chunk in ZONE_MOVABLE will fail
> > > because of some reserved pages at the beginning of each zone
> > > (MIGRATE_RESERVED).
> > >
> > I believe you are right. Current hot-remove code is NOT perfect.
> > You may remove some sections, but may not other sections,
> > because there are some un-removable pages by some reasons
> > (not only MIGRATE_RESERVED).
> >
> > I think MIGRATE_RESERVED pages should be move to MIGRATE_MOVABLE when
> > those pages must be removed, and should recalculate MIGRATE_RESERVED pages.
>
> Hi,
>
> Would it be an option to set pages_min to 0 for ZONE_MOVABLE in
> setup_per_zone_pages_min()? This would avoid the MIGRATE_RESERVED vs.
> MIGRATE_MOVABLE conflict on memory hot-remove. If I understand it
> correctly, the kernel wouldn't be able to use the reserved pages in
> ZONE_MOVABLE for __GFP_HIGH and PF_MEMALLOC allocations anyway, right?
>
> At the moment, ZONE_MOVABLE pages will also account for the lowmem_pages
> calculation in setup_per_zone_pages_min(). The recalculation will then
> redistribute and reduce the amount of reserved pages for the other zones.
> Won't this effectively reduce the amount of reserved min_free_kbytes memory
> that is available to the kernel, even getting worse the more memory is
> added to ZONE_MOVABLE?
>
> With the following patch, ZONE_MOVABLE will be skipped for the
> lowmem_pages calculation, just like it is already done for highmem.
> It will also set pages_min to 0 for ZONE_MOVABLE. But I have an uneasy
> feeling about this, because I may be missing side effects from this.
> Any opinions?

Well, I didn't mean changing the pages_min value. That may have side effects,
as you say.
I meant that if some pages have the MIGRATE_RESERVE attribute while hot-remove
is -executing-, their attribute should be changed.

For example, how about the following dummy code? Would that be impossible?
(Not only here; some other places would have to be modified, too.)

Thanks.

---
mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-29 22:17:54.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-30 12:04:03.000000000 +0900
@@ -4828,7 +4828,9 @@ int set_migratetype_isolate(struct page
 	/*
 	 * In future, more migrate types will be able to be isolation target.
 	 */
-	if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
+	if ((get_pageblock_migratetype(page) != MIGRATE_MOVABLE) &&
+	    !((removing section is the last section on the zone) &&
+	      get_pageblock_migratetype(page) == MIGRATE_RESERVE))
 		goto out;
 	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	move_freepages_block(zone, page, MIGRATE_ISOLATE);
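
The intended condition of the dummy code can be written out as a small user-space predicate. This is only a sketch: the enum, the function name and the `boundary_section` flag are hypothetical, and the flag stands for the pseudo-condition "removing section is the last section on the zone", which the dummy code leaves unimplemented.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the pageblock migrate types. */
enum mt_model { MT_MOVABLE_M, MT_RESERVE_M, MT_OTHER_M };

/*
 * A pageblock may be isolated if it is movable, or if it is a
 * reserve block and the section being removed is the boundary
 * section of the zone.
 */
bool model_may_isolate(enum mt_model mt, bool boundary_section)
{
	if (mt == MT_MOVABLE_M)
		return true;
	return boundary_section && mt == MT_RESERVE_M;
}
```

Movable blocks stay isolatable as before, and reserve blocks become isolatable only in the boundary case; everything else is still refused.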


--
Yasunori Goto

2008-07-30 12:16:25

by Gerald Schaefer

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

On Wed, 2008-07-30 at 12:16 +0900, Yasunori Goto wrote:
> Well, I didn't mean changing pages_min value. There may be side effect as
> you are saying.
> I meant if some pages were MIGRATE_RESERVE attribute when hot-remove are
> -executing-, their attribute should be changed.
>
> For example, how is like following dummy code? Is it impossible?
> (Not only here, some places will have to be modified..)

Right, this should be possible. I was somewhat wandering from the subject,
because I noticed that there may be a bigger problem with MIGRATE_RESERVE
pages in ZONE_MOVABLE, and that we may not want to have them in the first
place.

The more memory we add to ZONE_MOVABLE, the less reserved pages will
remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
will be redistributed to a zone where the kernel cannot make any use of
it, effectively reducing the available min_free_kbytes. This just doesn't
sound right. I believe that a similar situation is the reason why highmem
pages are skipped in the calculation and I think that we need that for
ZONE_MOVABLE too. Any thoughts on that problem?

Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
and pages_high, could be an option. I don't have a sufficient memory
management overview to tell if that has negative side effects, maybe
someone with a deeper insight could comment on that.

Thanks,
Gerald

2008-07-31 05:22:18

by Yasunori Goto

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

> On Wed, 2008-07-30 at 12:16 +0900, Yasunori Goto wrote:
> > Well, I didn't mean changing pages_min value. There may be side effect as
> > you are saying.
> > I meant if some pages were MIGRATE_RESERVE attribute when hot-remove are
> > -executing-, their attribute should be changed.
> >
> > For example, how is like following dummy code? Is it impossible?
> > (Not only here, some places will have to be modified..)
>
> Right, this should be possible. I was somewhat wandering from the subject,
> because I noticed that there may be a bigger problem with MIGRATE_RESERVE
> pages in ZONE_MOVABLE, and that we may not want to have them in the first
> place.
>
> The more memory we add to ZONE_MOVABLE, the less reserved pages will
> remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
> will be redistributed to a zone where the kernel cannot make any use of
> it, effectively reducing the available min_free_kbytes. This just doesn't
> sound right. I believe that a similar situation is the reason why highmem
> pages are skipped in the calculation and I think that we need that for
> ZONE_MOVABLE too. Any thoughts on that problem?
>
> Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
> and pages_high, could be an option. I don't have a sufficient memory
> managment overview to tell if that has negative side effects, maybe
> someone with a deeper insight could comment on that.

At least, pages_min should not be 0. It is used as a watermark in
memory shortage situations. If it is 0, the kernel will misjudge
a shortage situation. Certainly, the pages_min value may not be appropriate
for ZONE_MOVABLE. But that is not a memory-hotplug issue.

Your real question is why ZONE_MOVABLE has MIGRATE_RESERVE pages, right?
I think they are intended as an emergency pool for memory shortage situations
in ZONE_MOVABLE via fallback[]. Otherwise these MIGRATE_RESERVE pages would not
have been created in the first place.
That is why I wrote my previous mail.

Mel Gorman-san knows this area very well. He may be able to explain it in
more detail.

Bye.

--
Yasunori Goto

2008-07-31 13:22:31

by Mel Gorman

Subject: Re: memory hotplug: hot-remove fails on lowest chunk in ZONE_MOVABLE

On (30/07/08 14:16), Gerald Schaefer didst pronounce:
> On Wed, 2008-07-30 at 12:16 +0900, Yasunori Goto wrote:
> > Well, I didn't mean changing pages_min value. There may be side effect as
> > you are saying.
> > I meant if some pages were MIGRATE_RESERVE attribute when hot-remove are
> > -executing-, their attribute should be changed.
> >
> > For example, how is like following dummy code? Is it impossible?
> > (Not only here, some places will have to be modified..)
>
> Right, this should be possible. I was somewhat wandering from the subject,
> because I noticed that there may be a bigger problem with MIGRATE_RESERVE
> pages in ZONE_MOVABLE, and that we may not want to have them in the first
> place.
>

MIGRATE_RESERVE is of large importance to ZONE_DMA32 and ZONE_NORMAL, to
a much lesser extent to ZONE_HIGHMEM and almost irrelevant to
ZONE_MOVABLE. However, nothing about MIGRATE_RESERVE should prevent the
hot-remove of the section. If the section is totally free, it is
considered removable according to is_mem_section_removable(). If other
parts of memory hot-remove are deliberately ignoring the RESERVE
sections, they should stop that.

I haven't read the whole thread, but in your original mail, you say that
ZONE_MOVABLE is populated by memory hot-add. Are there really PageReserved()
pages there? If so, is there any chance that management structures
are being allocated within the section you are hot-adding? If so and they
are not getting freed, that might be why hot-remove is failing. If they
are not PageReserved() pages and this is an -mm kernel, I would enable
CONFIG_PAGE_OWNER and see who really reallocated those problem pages that
are not freeing.

> The more memory we add to ZONE_MOVABLE, the less reserved pages will
> remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
> will be redistributed to a zone where the kernel cannot make any use of
> it, effectively reducing the available min_free_kbytes.

I'm not sure what you mean by "available min_free_kbytes". The overall value
for min_free_kbytes should be approximately the same whether the zone exists
or not. However, you're right in that the distribution of minimum free pages
changes with ZONE_MOVABLE because the zones are different sizes now. This
affects reclaim, not memory hot-remove.

> This just doesn't
> sound right. I believe that a similar situation is the reason why highmem
> pages are skipped in the calculation and I think that we need that for
> ZONE_MOVABLE too. Any thoughts on that problem?
>

is_highmem(ZONE_MOVABLE) should be returning true if the zone is really
part of highmem.

> Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
> and pages_high, could be an option. I don't have a sufficient memory
> managment overview to tell if that has negative side effects, maybe
> someone with a deeper insight could comment on that.
>

pages_min of 0 means the other values would be 0 as well. This means that
kswapd may never be woken up to free pages within that zone and lead to
poor utilisation of the zone as allocators fall back to other zones to
avoid direct reclaim. I don't think that is your intention nor will it
help memory hot-remove.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2008-07-31 17:45:59

by Gerald Schaefer

Subject: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes

On Thu, 2008-07-31 at 14:22 +0100, Mel Gorman wrote:
> > The more memory we add to ZONE_MOVABLE, the less reserved pages will
> > remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
> > will be redistributed to a zone where the kernel cannot make any use of
> > it, effectively reducing the available min_free_kbytes.
>
> I'm not sure what you mean by "available min_free_kbytes". The overall value
> for min_free_kbytes should be approximately the same whether the zone exists
> or not. However, you're right in that the distribution of minimum free pages
> changes with ZONE_MOVABLE because the zones are different sizes now. This
> affects reclaim, not memory hot-remove.

Sorry for mixing things up in this thread, the min_free_kbytes issue is
not related to memory hot-remove, but rather to hot-add and the things that
happen in setup_per_zone_pages_min(), which is called from online_pages().
It may well be that my assumptions are wrong, but I'd like to explain my
concerns again:

If we have a system with 1 GB of memory, min_free_kbytes will be calculated
to 4 MB for ZONE_NORMAL, for example. Now, if we add 3 GB of hotplug memory
to ZONE_MOVABLE, the total min_free_kbytes will still remain 4 MB but it
will be distributed differently: ZONE_NORMAL will now have only 1 MB of
MIGRATE_RESERVE memory left, while ZONE_MOVABLE will have 3 MB, e.g.

My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
ZONE_MOVABLE now. This is what I mean with "effectively reducing the
available min_free_kbytes". The system would now behave in the same way
as a system which only had 1 MB of min_free_kbytes, although
/proc/sys/vm/min_free_kbytes would still say 4 MB. After all, this tunable
can have a rather negative impact on a system, especially if it is too
low, hence my concerns.
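
The arithmetic behind this example can be sketched as follows. This is a user-space model, not the kernel function: `zone_min_share()` is a hypothetical helper, a page size of 4 KiB is assumed (so 4 MB of min_free_kbytes is 1024 pages, 1 GB is 262144 pages and 3 GB is 786432 pages), and each zone is assumed to get a share proportional to its size relative to all zones counted in lowmem_pages.

```c
#include <stdint.h>

/*
 * Share of the global minimum reserve (in pages) that one zone
 * receives: proportional to the zone's size relative to the total
 * number of pages counted as lowmem.
 */
uint64_t zone_min_share(uint64_t pages_min_total, uint64_t zone_pages,
			uint64_t lowmem_pages)
{
	return pages_min_total * zone_pages / lowmem_pages;
}
```

With only the 1 GB ZONE_NORMAL present, `zone_min_share(1024, 262144, 262144)` gives all 1024 pages (4 MB) to ZONE_NORMAL. After adding 3 GB to ZONE_MOVABLE, lowmem_pages becomes 1048576, so ZONE_NORMAL gets 256 pages (1 MB) and ZONE_MOVABLE gets 768 pages (3 MB).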


> > This just doesn't
> > sound right. I believe that a similar situation is the reason why highmem
> > pages are skipped in the calculation and I think that we need that for
> > ZONE_MOVABLE too. Any thoughts on that problem?
> >
>
> is_highmem(ZONE_MOVABLE) should be returning true if the zone is really
> part of himem.

We don't have highmem on s390, I was just trying to give an example: I
noticed that there is special treatment for highmem pages in
setup_per_zone_pages_min(), and thought that we may also need to handle
ZONE_MOVABLE in a special way.


> > Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
> > and pages_high, could be an option. I don't have a sufficient memory
> > managment overview to tell if that has negative side effects, maybe
> > someone with a deeper insight could comment on that.
> >
>
> pages_min of 0 means the other values would be 0 as well. This means that
> kswapd may never be woken up to free pages within that zone and lead to
> poor utilisation of the zone as allocators fallback to other zones to
> avoid direct reclaim. I don't think that is your intention nor will it
> help memory hot-remove.

Do you mean pages_low and pages_high? In setup_per_zone_pages_min(),
those would not be set to 0, even if we set pages_min to 0. Again, a
similar strategy is being used for highmem in that function, only that
pages_min is set to a small value instead of 0 in that case. So it should
not affect kswapd but only __GFP_HIGH and PF_MEMALLOC allocations, which
won't be allocated from ZONE_MOVABLE anyway if I understood that right.
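
A sketch of the watermark setup under discussion, as a user-space model. The `pages_low` formula is taken from the patch context earlier in this thread; the `pages_high` formula (`pages_min + (tmp >> 1)`) is an assumption that is not shown in this thread, and `struct zone_marks` and `model_marks()` are hypothetical names.

```c
/* Hypothetical container for the three per-zone watermarks. */
struct zone_marks {
	unsigned long pages_min;
	unsigned long pages_low;
	unsigned long pages_high;
};

/*
 * Compute the watermarks from the zone's proportional share "tmp".
 * If force_min_zero is set, pages_min is forced to 0 as in the
 * proposed patch, while low/high are still derived from tmp.
 */
struct zone_marks model_marks(unsigned long tmp, int force_min_zero)
{
	struct zone_marks m;

	m.pages_min = force_min_zero ? 0 : tmp;
	m.pages_low = m.pages_min + (tmp >> 2);
	/* assumption: pages_high is derived the same way with tmp >> 1 */
	m.pages_high = m.pages_min + (tmp >> 1);
	return m;
}
```

For a share of 768 pages this gives min/low/high of 768/960/1152 normally, and 0/192/384 with pages_min forced to 0, so in this model pages_low and pages_high stay non-zero.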


Thanks,
Gerald

2008-08-01 11:29:19

by Yasunori Goto

Subject: Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes


> Sorry for mixing things up in this thread, the min_free_kbytes issue is
> not related to memory hot-remove, but rather to hot-add and the things that
> happen in setup_per_zone_pages_min(), which is called from online_pages().
> It may well be that my assumptions are wrong, but I'd like to explain my
> concerns again:
>
> If we have a system with 1 GB of memory, min_free_kbytes will be calculated
> to 4 MB for ZONE_NORMAL, for example. Now, if we add 3 GB of hotplug memory
> to ZONE_MOVABLE, the total min_free_kbytes will still remain 4 MB but it
> will be distributed differently: ZONE_NORMAL will now have only 1 MB of
> MIGRATE_RESERVE memory left, while ZONE_MOVABLE will have 3 MB, e.g.
>

Right.

> My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
> usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
> ZONE_MOVABLE now.

This doesn't make sense to me. I don't think there is any relationship between
ZONE_MOVABLE, PF_MEMALLOC and MIGRATE_RESERVE pages.
Could you tell me more?


> This is what I mean with "effectively reducing the
> available min_free_kbytes". The system would now behave in the same way
> as a system which only had 1 MB of min_free_kbytes, although
> /proc/sys/vm/min_free_kbytes would still say 4 MB. After all, this tunable
> can have a rather negative impact on a system, especially if it is too
> low, hence my concerns.
>
> > > Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
> > > and pages_high, could be an option. I don't have a sufficient memory
> > > managment overview to tell if that has negative side effects, maybe
> > > someone with a deeper insight could comment on that.
> > >
> >
> > pages_min of 0 means the other values would be 0 as well. This means that
> > kswapd may never be woken up to free pages within that zone and lead to
> > poor utilisation of the zone as allocators fallback to other zones to
> > avoid direct reclaim. I don't think that is your intention nor will it
> > help memory hot-remove.
>
> Do you mean pages_low and pages_high? In setup_per_zone_pages_min(),
> those would not be set to 0, even if we set pages_min to 0. Again, a
> similar strategy is being used for highmem in that function, only that
> pages_min is set to a small value instead of 0 in that case. So it should
> not affect kswapd but only __GFP_HIGH and PF_MEMALLOC allocations, which
> won't be allocated from ZONE_MOVABLE anyway if I understood that right.


pages_min seems to be used in get_page_from_freelist().
Do you mean the following is not executed?


	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
		unsigned long mark;
		if (alloc_flags & ALLOC_WMARK_MIN)
			mark = zone->pages_min;		/* <------!!! */
		else if (alloc_flags & ALLOC_WMARK_LOW)
			mark = zone->pages_low;
		else
			mark = zone->pages_high;
		if (!zone_watermark_ok(zone, order, mark,	/* <-----!!! */
			    classzone_idx, alloc_flags)) {
			if (!zone_reclaim_mode ||
			    !zone_reclaim(zone, gfp_mask, order))
				goto this_zone_full;
		}
	}

But even if pages_min is not used as you said, I suppose that would be
an accident of how the source code has changed.
It should work as a watermark, to keep its meaning.
Otherwise it could cause bugs in the future through misunderstanding.
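
What a pages_min of 0 would mean for this check can be sketched with a much simplified user-space model. `model_watermark_ok()` is a hypothetical helper; it assumes the check boils down to "free pages left after the allocation must exceed the mark" and ignores everything else the real zone_watermark_ok() takes into account.

```c
#include <stdbool.h>

/*
 * Simplified watermark check: an allocation of 2^order pages is
 * allowed only if the zone would still have more than "mark" free
 * pages, counting the last page of the request as still free.
 */
bool model_watermark_ok(long free_pages, unsigned int order, long mark)
{
	long after = free_pages - (1L << order) + 1;

	return after > mark;
}
```

With a mark of 256, a zone with 100 free pages fails the check. With a mark of 0, the same zone passes, and the check keeps passing until no free page is left.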


Bye.

--
Yasunori Goto

2008-08-01 16:05:22

by Gerald Schaefer

Subject: Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes

On Fri, 2008-08-01 at 20:16 +0900, Yasunori Goto wrote:
> > My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
> > usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
> > ZONE_MOVABLE now.
>
> I don't make sense here. I suppose there is no relationship between
> ZONE_MOVABLE, PF_MEMALLOC and MIGRATE_RESERVE pages.
> Could you tell me more?

Ok, I thought that PF_MEMALLOC allocations work on the MIGRATE_RESERVE
pageblocks, and that only kernel allocations can use PF_MEMALLOC. I also
thought that kernel allocations cannot use ZONE_MOVABLE, e.g. for page
cache memory, because such pages would not be migratable. So I assumed
that MIGRATE_RESERVE pageblocks in ZONE_MOVABLE would not be available
for PF_MEMALLOC allocations.

With this assumption, which can be totally wrong, the redistribution
of MIGRATE_RESERVE pageblocks in setup_per_zone_pages_min() looks like
it will take away reserved pageblocks that should be available to the
kernel in emergency situations.

Maybe I should have explained this assumption earlier, because my whole
min_free_kbytes issue depends on it. If I'm wrong, I apologize for
confusing you all with this "issue", and I will go back to the original
problem with removing the lowest memory chunk in ZONE_MOVABLE...

Thanks,
Gerald

2008-08-01 16:27:17

by Mel Gorman

Subject: Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes

On (31/07/08 19:45), Gerald Schaefer didst pronounce:
> On Thu, 2008-07-31 at 14:22 +0100, Mel Gorman wrote:
> > > The more memory we add to ZONE_MOVABLE, the less reserved pages will
> > > remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
> > > will be redistributed to a zone where the kernel cannot make any use of
> > > it, effectively reducing the available min_free_kbytes.
> >
> > I'm not sure what you mean by "available min_free_kbytes". The overall value
> > for min_free_kbytes should be approximately the same whether the zone exists
> > or not. However, you're right in that the distribution of minimum free pages
> > changes with ZONE_MOVABLE because the zones are different sizes now. This
> > affects reclaim, not memory hot-remove.
>
> Sorry for mixing things up in this thread, the min_free_kbytes issue is
> not related to memory hot-remove, but rather to hot-add and the things that
> happen in setup_per_zone_pages_min(), which is called from online_pages().
> It may well be that my assumptions are wrong, but I'd like to explain my
> concerns again:
>
> If we have a system with 1 GB of memory, min_free_kbytes will be calculated
> to 4 MB for ZONE_NORMAL, for example. Now, if we add 3 GB of hotplug memory
> to ZONE_MOVABLE, the total min_free_kbytes will still remain 4 MB but it
> will be distributed differently: ZONE_NORMAL will now have only 1 MB of
> MIGRATE_RESERVE memory left, while ZONE_MOVABLE will have 3 MB, e.g.
>

Ok, I haven't double checked your figures but let's go with the assumption -
adding memory means min_free_kbytes will be distributed differently.

> My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
> usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
> ZONE_MOVABLE now.

Nothing stops PF_MEMALLOC being used and the only thing that stops 3MB
being used in ZONE_MOVABLE is min_free_kbytes, not the fact there is a
MIGRATE_RESERVE there. PF_MEMALLOC and MIGRATE_RESERVE are not related.

I think you are confusing what MIGRATE_RESERVE is for. A number of pageblocks
at the start of a zone are marked MIGRATE_RESERVE depending on the size of
min_free_kbytes for that zone. The kernel will try to avoid allocating from
there so that high-order atomic allocations have a chance of succeeding from
there. It's not kept aside for emergency allocations.
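
How the number of MIGRATE_RESERVE pageblocks could follow from the zone's minimum can be sketched like this. It is a user-space model under the assumption that the zone's pages_min is rounded up to whole pageblocks; `model_reserve_blocks()` is a hypothetical helper and the pageblock order is passed in as a parameter because it depends on the configuration.

```c
/*
 * Number of pageblocks needed to cover pages_min pages, where one
 * pageblock holds 2^pageblock_order pages (rounding up).
 */
unsigned long model_reserve_blocks(unsigned long pages_min,
				   unsigned int pageblock_order)
{
	unsigned long pages_per_block = 1UL << pageblock_order;

	return (pages_min + pages_per_block - 1) >> pageblock_order;
}
```

Assuming a pageblock order of 10 (1024 pages per block), both 256 and 768 pages round up to one reserved pageblock, and a pages_min of 0 yields no reserved pageblock at all.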

> This is what I mean with "effectively reducing the
> available min_free_kbytes". The system would now behave in the same way
> as a system which only had 1 MB of min_free_kbytes, although
> /proc/sys/vm/min_free_kbytes would still say 4 MB. After all, this tunable
> can have a rather negative impact on a system, especially if it is too
> low, hence my concerns.
>

Increase min_free_kbytes on memory hot-add?

>
> > > This just doesn't
> > > sound right. I believe that a similar situation is the reason why highmem
> > > pages are skipped in the calculation and I think that we need that for
> > > ZONE_MOVABLE too. Any thoughts on that problem?
> > >
> >
> > is_highmem(ZONE_MOVABLE) should be returning true if the zone is really
> > part of himem.
>
> We don't have highmem on s390, I was just trying to give an example: I
> noticed that there is special treatment for highmem pages in
> setup_per_zone_pages_min(), and thought that we may also need to handle
> ZONE_MOVABLE in a special way.
>

ZONE_MOVABLE should be treated the same as highmem would be in terms of
tuning.

>
> > > Setting pages_min to 0 for ZONE_MOVABLE, while not capping pages_low
> > > and pages_high, could be an option. I don't have a sufficient memory
> > > managment overview to tell if that has negative side effects, maybe
> > > someone with a deeper insight could comment on that.
> > >
> >
> > pages_min of 0 means the other values would be 0 as well. This means that
> > kswapd may never be woken up to free pages within that zone and lead to
> > poor utilisation of the zone as allocators fallback to other zones to
> > avoid direct reclaim. I don't think that is your intention nor will it
> > help memory hot-remove.
>
> Do you mean pages_low and pages_high? In setup_per_zone_pages_min(),
> those would not be set to 0, even if we set pages_min to 0. Again, a
> similar strategy is being used for highmem in that function, only that
> pages_min is set to a small value instead of 0 in that case. So it should
> not affect kswapd but only __GFP_HIGH and PF_MEMALLOC allocations, which
> won't be allocated from ZONE_MOVABLE anyway if I understood that right.
>

Ok, I'm losing track here, maybe it's just too late on a Friday. Right now,
ZONE_MOVABLE should be setup similar to what HIGHMEM would have been. It
shouldn't get its pages_min value set to 0 and even if it did, it would not
help memory hot-remove.

Also, nothing stops __GFP_HIGH or PF_MEMALLOC using ZONE_MOVABLE as long as the
caller is using __GFP_MOVABLE. However, as it is unlikely that combination
of flags would occur I'd be open to examining how min_free_kbytes gets
distributed. It is an independent topic to why the beginning of the zone is
not removable though. I suspect MIGRATE_RESERVE is a red herring.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab