From: Michal Hocko <[email protected]>
movable_node kernel parameter allows to make hotplugable NUMA
nodes to put all the hotplugable memory into movable zone which
allows more or less reliable memory hotremove. At least this
is the case for the NUMA nodes present during the boot (see
find_zone_movable_pfns_for_nodes).
This is not the case for the memory hotplug, though.
echo online > /sys/devices/system/memory/memoryXYZ/status
will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range which is not
the case normally when onlining the memory from the udev rule context
for a freshly hotadded NUMA node. The only option currently is to have a
special udev rule to echo online_movable to all memblocks belonging to
such a node which is rather clumsy. Not the mention this is inconsistent
as well because what ended up in the movable zone during the boot will
end up in a kernel zone after hotremove & hotadd without special care.
It would be nice to reuse memblock_is_hotpluggable but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared and it would be really non trivial to
make them use the same code path because the runtime hotplug doesn't
play with the memblock allocator at all.
Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone. This should provide a reasonable default onlining strategy.
Strictly speaking the semantic is not identical with the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotplugable range as described by the BIOS/FW. From my experience this
is usually a full node though (except for Node0 which is special and
never goes away completely). If this turns out to be a problem in the
real life we can tweak the code to store hotplug flag into memblocks
but let's keep this simple now.
Signed-off-by: Michal Hocko <[email protected]>
---
Hi Andrew,
I've posted this as an RFC previously [1] and there haven't been any
objections to the approach so I've dropped the RFC and sending it for
inclusion. The only change since the last time is the update of the
documentation to clarify the semantic as suggested by Reza Arbab.
[1] http://lkml.kernel.org/r/[email protected]
Documentation/memory-hotplug.txt | 12 +++++++++---
mm/memory_hotplug.c | 19 ++++++++++++++++---
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 670f3ded0802..5c628e19d6cd 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -282,20 +282,26 @@ offlined it is possible to change the individual block's state by writing to the
% echo online > /sys/devices/system/memory/memoryXXX/state
This onlining will not change the ZONE type of the target memory block,
-If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+If the memory block doesn't belong to any zone an appropriate kernel zone
+(usually ZONE_NORMAL) will be used unless movable_node kernel command line
+option is specified when ZONE_MOVABLE will be used.
+
+You can explicitly request to associate it with ZONE_MOVABLE by
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)
-And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL)
+An explicit zone onlining can fail (e.g. when the range is already within
+and existing and incompatible zone already).
+
After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.
-Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
This may be changed in future.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b98fb0b3ae11..74d75583736c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -943,6 +943,19 @@ struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
return &pgdat->node_zones[ZONE_NORMAL];
}
+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
+ unsigned long start_pfn, unsigned long nr_pages)
+{
+ if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
+ MMOP_ONLINE_KERNEL))
+ return true;
+
+ if (!movable_node_is_enabled())
+ return false;
+
+ return !zone_intersects(default_zone, start_pfn, nr_pages);
+}
+
/*
* Associates the given pfn range with the given node and the zone appropriate
* for the given online type.
@@ -958,10 +971,10 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
/*
* MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use
* movable zone if that is not possible (e.g. we are within
- * or past the existing movable zone)
+ * or past the existing movable zone). movable_node overrides
+ * this default and defaults to movable zone
*/
- if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
- MMOP_ONLINE_KERNEL))
+ if (movable_pfn_range(nid, zone, start_pfn, nr_pages))
zone = movable_zone;
} else if (online_type == MMOP_ONLINE_MOVABLE) {
zone = &pgdat->node_zones[ZONE_MOVABLE];
--
2.11.0
On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>From: Michal Hocko <[email protected]>
>
>movable_node kernel parameter allows to make hotplugable NUMA
>nodes to put all the hotplugable memory into movable zone which
>allows more or less reliable memory hotremove. At least this
>is the case for the NUMA nodes present during the boot (see
>find_zone_movable_pfns_for_nodes).
>
>This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
^^^
Hmm, one typo I think
s/status/state/
--
Wei Yang
Help you, Help me
On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>From: Michal Hocko <[email protected]>
>
>movable_node kernel parameter allows to make hotplugable NUMA
>nodes to put all the hotplugable memory into movable zone which
>allows more or less reliable memory hotremove. At least this
>is the case for the NUMA nodes present during the boot (see
>find_zone_movable_pfns_for_nodes).
>
>This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
>will default to a kernel zone (usually ZONE_NORMAL) unless the
>particular memblock is already in the movable zone range which is not
>the case normally when onlining the memory from the udev rule context
>for a freshly hotadded NUMA node. The only option currently is to have a
>special udev rule to echo online_movable to all memblocks belonging to
>such a node which is rather clumsy. Not the mention this is inconsistent
>as well because what ended up in the movable zone during the boot will
>end up in a kernel zone after hotremove & hotadd without special care.
>
A kernel zone here means? Which is the counterpart in zone_type? or a
combination of several zone_type?
--
Wei Yang
Help you, Help me
On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>From: Michal Hocko <[email protected]>
>
>movable_node kernel parameter allows to make hotplugable NUMA
>nodes to put all the hotplugable memory into movable zone which
>allows more or less reliable memory hotremove. At least this
>is the case for the NUMA nodes present during the boot (see
>find_zone_movable_pfns_for_nodes).
>
When movable_node is enabled, we would have overlapped zones, right?
To be specific, only ZONE_MOVABLE could have memory ranges belongs to other
zones.
This looks a little different in the whole ZONE design.
>This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
>will default to a kernel zone (usually ZONE_NORMAL) unless the
>particular memblock is already in the movable zone range which is not
^^^
Here is memblock or a memory_block?
>the case normally when onlining the memory from the udev rule context
>for a freshly hotadded NUMA node. The only option currently is to have a
So the semantic you want to change here is to make the memory_block in
ZONE_MOVABLE when movable_node is enabled.
Besides this, movable_node is enabled, what other requirements? Like, this
memory_block should next to current ZONE_MOVABLE ? or something else?
>special udev rule to echo online_movable to all memblocks belonging to
>such a node which is rather clumsy. Not the mention this is inconsistent
^^^
Hmm... "Not to mentions" looks more understandable.
BTW, I am not a native speaker. If this usage is correct, just ignore this
comment.
>as well because what ended up in the movable zone during the boot will
>end up in a kernel zone after hotremove & hotadd without special care.
>
>It would be nice to reuse memblock_is_hotpluggable but the runtime
>hotplug doesn't have that information available because the boot and
>hotplug paths are not shared and it would be really non trivial to
>make them use the same code path because the runtime hotplug doesn't
>play with the memblock allocator at all.
>
>Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
>movable_node is enabled and the range doesn't overlap with the existing
>normal zone. This should provide a reasonable default onlining strategy.
>
>Strictly speaking the semantic is not identical with the boot time
>initialization because find_zone_movable_pfns_for_nodes covers only the
>hotplugable range as described by the BIOS/FW. From my experience this
>is usually a full node though (except for Node0 which is special and
>never goes away completely). If this turns out to be a problem in the
>real life we can tweak the code to store hotplug flag into memblocks
>but let's keep this simple now.
>
Let me try to understand your purpose of this change.
If a memblock has MEMBLOCK_HOTPLU set, it would be in ZONE_MOVABLE during
bootup. While a hotplugged memory_block would be in ZONE_NORMAL without
special care.
So you want to make sure when movable_node is enabled, the hotplugged
memory_block would be in ZONE_MOVABLE. Is this correct?
One more thing is do we have MEMBLOCK_HOTPLU for a hotplugged memory_block?
>Signed-off-by: Michal Hocko <[email protected]>
>---
>
--
Wei Yang
Help you, Help me
On Sat 10-06-17 22:33:56, Wei Yang wrote:
> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >movable_node kernel parameter allows to make hotplugable NUMA
> >nodes to put all the hotplugable memory into movable zone which
> >allows more or less reliable memory hotremove. At least this
> >is the case for the NUMA nodes present during the boot (see
> >find_zone_movable_pfns_for_nodes).
> >
> >This is not the case for the memory hotplug, though.
> >
> > echo online > /sys/devices/system/memory/memoryXYZ/status
> ^^^
>
> Hmm, one typo I think
>
> s/status/state/
right! Thanks for spotting that. I guess Andrew can update the changelog
or should I resubmit?
--
Michal Hocko
SUSE Labs
On Sun 11-06-17 09:45:35, Wei Yang wrote:
> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >movable_node kernel parameter allows to make hotplugable NUMA
> >nodes to put all the hotplugable memory into movable zone which
> >allows more or less reliable memory hotremove. At least this
> >is the case for the NUMA nodes present during the boot (see
> >find_zone_movable_pfns_for_nodes).
> >
> >This is not the case for the memory hotplug, though.
> >
> > echo online > /sys/devices/system/memory/memoryXYZ/status
> >
> >will default to a kernel zone (usually ZONE_NORMAL) unless the
> >particular memblock is already in the movable zone range which is not
> >the case normally when onlining the memory from the udev rule context
> >for a freshly hotadded NUMA node. The only option currently is to have a
> >special udev rule to echo online_movable to all memblocks belonging to
> >such a node which is rather clumsy. Not the mention this is inconsistent
> >as well because what ended up in the movable zone during the boot will
> >end up in a kernel zone after hotremove & hotadd without special care.
> >
>
> A kernel zone here means? Which is the counterpart in zone_type? or a
> combination of several zone_type?
Any zone below ZONE_HIGHMEM. The specific zone depends on the placement,
but it is ZONE_NORMAL in most situations.
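
As an illustration of the "kernel zone" notion described above, here is a minimal userspace sketch in C. It is not kernel code: the enum is a simplified model that assumes the usual ordering of the kernel's zone types with the optional zones configured, and `is_kernel_zone()` is a hypothetical helper name chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the zone ordering; real kernels omit some zones
 * depending on the configuration. */
enum zone_type_model {
	MODEL_ZONE_DMA,
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_HIGHMEM,
	MODEL_ZONE_MOVABLE,
	MODEL_MAX_NR_ZONES
};

/* Hypothetical helper: a "kernel zone" is any zone below ZONE_HIGHMEM. */
static bool is_kernel_zone(enum zone_type_model zone)
{
	assert(zone < MODEL_MAX_NR_ZONES);
	return zone < MODEL_ZONE_HIGHMEM;
}
```

Under this model DMA, DMA32 and NORMAL count as kernel zones, while HIGHMEM and MOVABLE do not.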
--
Michal Hocko
SUSE Labs
On Mon 12-06-17 12:28:32, Wei Yang wrote:
> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >movable_node kernel parameter allows to make hotplugable NUMA
> >nodes to put all the hotplugable memory into movable zone which
> >allows more or less reliable memory hotremove. At least this
> >is the case for the NUMA nodes present during the boot (see
> >find_zone_movable_pfns_for_nodes).
> >
>
> When movable_node is enabled, we would have overlapped zones, right?
It won't based on this patch. See movable_pfn_range
> To be specific, only ZONE_MOVABLE could have memory ranges belongs to other
> zones.
>
> This looks a little different in the whole ZONE design.
>
> >This is not the case for the memory hotplug, though.
> >
> > echo online > /sys/devices/system/memory/memoryXYZ/status
> >
> >will default to a kernel zone (usually ZONE_NORMAL) unless the
> >particular memblock is already in the movable zone range which is not
> ^^^
>
> Here is memblock or a memory_block?
yes
>
> >the case normally when onlining the memory from the udev rule context
> >for a freshly hotadded NUMA node. The only option currently is to have a
>
> So the semantic you want to change here is to make the memory_block in
> ZONE_MOVABLE when movable_node is enabled.
Yes, by default, when the specific range is not associated with any
other zone.
> Besides this, movable_node is enabled, what other requirements? Like, this
> memory_block should next to current ZONE_MOVABLE ? or something else?
no other requirements.
> >special udev rule to echo online_movable to all memblocks belonging to
> >such a node which is rather clumsy. Not the mention this is inconsistent
> ^^^
>
> Hmm... "Not to mentions" looks more understandable.
yes this is a typo
> BTW, I am not a native speaker. If this usage is correct, just ignore this
> comment.
>
> >as well because what ended up in the movable zone during the boot will
> >end up in a kernel zone after hotremove & hotadd without special care.
> >
> >It would be nice to reuse memblock_is_hotpluggable but the runtime
> >hotplug doesn't have that information available because the boot and
> >hotplug paths are not shared and it would be really non trivial to
> >make them use the same code path because the runtime hotplug doesn't
> >play with the memblock allocator at all.
> >
> >Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
> >movable_node is enabled and the range doesn't overlap with the existing
> >normal zone. This should provide a reasonable default onlining strategy.
> >
> >Strictly speaking the semantic is not identical with the boot time
> >initialization because find_zone_movable_pfns_for_nodes covers only the
> >hotplugable range as described by the BIOS/FW. From my experience this
> >is usually a full node though (except for Node0 which is special and
> >never goes away completely). If this turns out to be a problem in the
> >real life we can tweak the code to store hotplug flag into memblocks
> >but let's keep this simple now.
> >
>
> Let me try to understand your purpose of this change.
>
> If a memblock has MEMBLOCK_HOTPLU set, it would be in ZONE_MOVABLE during
> bootup. While a hotplugged memory_block would be in ZONE_NORMAL without
> special care.
>
> So you want to make sure when movable_node is enabled, the hotplugged
> memory_block would be in ZONE_MOVABLE. Is this correct?
yes
> One more thing is do we have MEMBLOCK_HOTPLU for a hotplugged memory_block?
No, we do not, as the changelog mentions. This flag is set in the
memblock allocator (do not confuse that with the memory_block that
hotplug works with; yes, the naming is quite confusing) and that is a
boot-only thing. We do not use it during runtime memory hotplug.
--
Michal Hocko
SUSE Labs
On 06/08/2017 02:23 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> movable_node kernel parameter allows to make hotplugable NUMA
> nodes to put all the hotplugable memory into movable zone which
> allows more or less reliable memory hotremove. At least this
> is the case for the NUMA nodes present during the boot (see
> find_zone_movable_pfns_for_nodes).
>
> This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
> will default to a kernel zone (usually ZONE_NORMAL) unless the
> particular memblock is already in the movable zone range which is not
> the case normally when onlining the memory from the udev rule context
> for a freshly hotadded NUMA node. The only option currently is to have a
> special udev rule to echo online_movable to all memblocks belonging to
> such a node which is rather clumsy. Not the mention this is inconsistent
> as well because what ended up in the movable zone during the boot will
> end up in a kernel zone after hotremove & hotadd without special care.
>
> It would be nice to reuse memblock_is_hotpluggable but the runtime
> hotplug doesn't have that information available because the boot and
> hotplug paths are not shared and it would be really non trivial to
> make them use the same code path because the runtime hotplug doesn't
> play with the memblock allocator at all.
>
> Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
> movable_node is enabled and the range doesn't overlap with the existing
> normal zone. This should provide a reasonable default onlining strategy.
>
> Strictly speaking the semantic is not identical with the boot time
> initialization because find_zone_movable_pfns_for_nodes covers only the
> hotplugable range as described by the BIOS/FW. From my experience this
> is usually a full node though (except for Node0 which is special and
> never goes away completely). If this turns out to be a problem in the
> real life we can tweak the code to store hotplug flag into memblocks
> but let's keep this simple now.
>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
OK, so here is v2 which fixes 2 typos in the changelog spotted by Wei
Yang and Acked-by from Vlastimil added. No functional changes added.
---
From d08c94a4e15e774504f3acdad026d104b18d6543 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 23 May 2017 14:01:24 +0200
Subject: [PATCH] mm, memory_hotplug: support movable_node for hotplugable
nodes
movable_node kernel parameter allows to make hotplugable NUMA
nodes to put all the hotplugable memory into movable zone which
allows more or less reliable memory hotremove. At least this
is the case for the NUMA nodes present during the boot (see
find_zone_movable_pfns_for_nodes).
This is not the case for the memory hotplug, though.
echo online > /sys/devices/system/memory/memoryXYZ/state
will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range which is not
the case normally when onlining the memory from the udev rule context
for a freshly hotadded NUMA node. The only option currently is to have a
special udev rule to echo online_movable to all memblocks belonging to
such a node which is rather clumsy. Not to mention this is inconsistent
as well because what ended up in the movable zone during the boot will
end up in a kernel zone after hotremove & hotadd without special care.
It would be nice to reuse memblock_is_hotpluggable but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared and it would be really non trivial to
make them use the same code path because the runtime hotplug doesn't
play with the memblock allocator at all.
Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone. This should provide a reasonable default onlining strategy.
Strictly speaking the semantic is not identical with the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotplugable range as described by the BIOS/FW. From my experience this
is usually a full node though (except for Node0 which is special and
never goes away completely). If this turns out to be a problem in the
real life we can tweak the code to store hotplug flag into memblocks
but let's keep this simple now.
Changes since v1
- fixed two typos in the changelog as per Wei Yang
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/memory-hotplug.txt | 12 +++++++++---
mm/memory_hotplug.c | 19 ++++++++++++++++---
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 670f3ded0802..5c628e19d6cd 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -282,20 +282,26 @@ offlined it is possible to change the individual block's state by writing to the
% echo online > /sys/devices/system/memory/memoryXXX/state
This onlining will not change the ZONE type of the target memory block,
-If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
+If the memory block doesn't belong to any zone an appropriate kernel zone
+(usually ZONE_NORMAL) will be used unless movable_node kernel command line
+option is specified when ZONE_MOVABLE will be used.
+
+You can explicitly request to associate it with ZONE_MOVABLE by
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)
-And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
+Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL)
+An explicit zone onlining can fail (e.g. when the range is already within
+and existing and incompatible zone already).
+
After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.
-Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
This may be changed in future.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b98fb0b3ae11..74d75583736c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -943,6 +943,19 @@ struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
return &pgdat->node_zones[ZONE_NORMAL];
}
+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
+ unsigned long start_pfn, unsigned long nr_pages)
+{
+ if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
+ MMOP_ONLINE_KERNEL))
+ return true;
+
+ if (!movable_node_is_enabled())
+ return false;
+
+ return !zone_intersects(default_zone, start_pfn, nr_pages);
+}
+
/*
* Associates the given pfn range with the given node and the zone appropriate
* for the given online type.
@@ -958,10 +971,10 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
/*
* MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use
* movable zone if that is not possible (e.g. we are within
- * or past the existing movable zone)
+ * or past the existing movable zone). movable_node overrides
+ * this default and defaults to movable zone
*/
- if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
- MMOP_ONLINE_KERNEL))
+ if (movable_pfn_range(nid, zone, start_pfn, nr_pages))
zone = movable_zone;
} else if (online_type == MMOP_ONLINE_MOVABLE) {
zone = &pgdat->node_zones[ZONE_MOVABLE];
--
2.11.0
--
Michal Hocko
SUSE Labs
On Mon, Jun 12, 2017 at 08:45:02AM +0200, Michal Hocko wrote:
>On Mon 12-06-17 12:28:32, Wei Yang wrote:
>> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>> >From: Michal Hocko <[email protected]>
>> >
>> >movable_node kernel parameter allows to make hotplugable NUMA
>> >nodes to put all the hotplugable memory into movable zone which
>> >allows more or less reliable memory hotremove. At least this
>> >is the case for the NUMA nodes present during the boot (see
>> >find_zone_movable_pfns_for_nodes).
>> >
>>
>> When movable_node is enabled, we would have overlapped zones, right?
>
>It won't based on this patch. See movable_pfn_range
I did grep in source code, but not find movable_pfn_range.
Could you share some light on that?
>
--
Wei Yang
Help you, Help me
On 06/14/2017 11:06 AM, Wei Yang wrote:
> On Mon, Jun 12, 2017 at 08:45:02AM +0200, Michal Hocko wrote:
>> On Mon 12-06-17 12:28:32, Wei Yang wrote:
>>> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>>>> From: Michal Hocko <[email protected]>
>>>>
>>>> movable_node kernel parameter allows to make hotplugable NUMA
>>>> nodes to put all the hotplugable memory into movable zone which
>>>> allows more or less reliable memory hotremove. At least this
>>>> is the case for the NUMA nodes present during the boot (see
>>>> find_zone_movable_pfns_for_nodes).
>>>>
>>>
>>> When movable_node is enabled, we would have overlapped zones, right?
>>
>> It won't based on this patch. See movable_pfn_range
>
> I did grep in source code, but not find movable_pfn_range.
This patch is adding it.
> Could you share some light on that?
>
>>
On Wed, Jun 14, 2017 at 11:07:31AM +0200, Vlastimil Babka wrote:
>On 06/14/2017 11:06 AM, Wei Yang wrote:
>> On Mon, Jun 12, 2017 at 08:45:02AM +0200, Michal Hocko wrote:
>>> On Mon 12-06-17 12:28:32, Wei Yang wrote:
>>>> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>>>>> From: Michal Hocko <[email protected]>
>>>>>
>>>>> movable_node kernel parameter allows to make hotplugable NUMA
>>>>> nodes to put all the hotplugable memory into movable zone which
>>>>> allows more or less reliable memory hotremove. At least this
>>>>> is the case for the NUMA nodes present during the boot (see
>>>>> find_zone_movable_pfns_for_nodes).
>>>>>
>>>>
>>>> When movable_node is enabled, we would have overlapped zones, right?
>>>
>>> It won't based on this patch. See movable_pfn_range
>>
>> I did grep in source code, but not find movable_pfn_range.
>
>This patch is adding it.
>
Oops, what a shame.
>> Could you share some light on that?
>>
>>>
--
Wei Yang
Help you, Help me
On Mon, Jun 12, 2017 at 08:45:02AM +0200, Michal Hocko wrote:
>On Mon 12-06-17 12:28:32, Wei Yang wrote:
>> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>> >From: Michal Hocko <[email protected]>
>> >
>> >movable_node kernel parameter allows to make hotplugable NUMA
>> >nodes to put all the hotplugable memory into movable zone which
>> >allows more or less reliable memory hotremove. At least this
>> >is the case for the NUMA nodes present during the boot (see
>> >find_zone_movable_pfns_for_nodes).
>> >
>>
>> When movable_node is enabled, we would have overlapped zones, right?
>
>It won't based on this patch. See movable_pfn_range
>
OK, I went through the code, and here is a question that may not be
closely related to this patch.
I did some experiments with qemu+kvm and saw the following.
Guest config: 8G RAM, 2 nodes with 4G on each
Guest kernel: 4.11
Guest kernel command: kernelcore=1G
The log message in kernel is:
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x000000023fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Node 0: 0x0000000100000000
[ 0.000000] Node 1: 0x0000000140000000
On the second node (Node 1), ZONE_NORMAL overlaps with ZONE_MOVABLE:
[0x0000000140000000 - 0x000000023fffffff] belongs to both ZONE.
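
The overlap claimed above is plain interval arithmetic on the ranges printed in the log, and can be checked with a small userspace sketch in C. The names `addr_range` and `range_overlap()` are hypothetical, and the upper bound of Node 1's movable range is an assumption (taken to be the end of RAM), since the log only prints where it starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive physical address range, as printed in the boot log. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* Store the intersection of a and b in *out; return false if disjoint. */
static bool range_overlap(struct addr_range a, struct addr_range b,
			  struct addr_range *out)
{
	assert(a.start <= a.end && b.start <= b.end);
	out->start = a.start > b.start ? a.start : b.start;
	out->end = a.end < b.end ? a.end : b.end;
	return out->start <= out->end;
}

/* "Normal" line from the log. */
static const struct addr_range zone_normal = {
	0x100000000ULL, 0x23fffffffULL
};

/* Node 1 movable start from the log; the end is an assumption. */
static const struct addr_range node1_movable = {
	0x140000000ULL, 0x23fffffffULL
};
```

Intersecting `zone_normal` with `node1_movable` gives the range quoted above, which spans 0x100000000 bytes (4 GiB), matching the 4G configured for the node.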
My confusion is:
After we enable ZONE_MOVABLE, whether through "movable_node" or
"kernelcore", will we always face this kind of overlap?
And which zone do the pages in the overlapping range finally belong to?
--
Wei Yang
Help you, Help me
On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>From: Michal Hocko <[email protected]>
>
>movable_node kernel parameter allows to make hotplugable NUMA
>nodes to put all the hotplugable memory into movable zone which
>allows more or less reliable memory hotremove. At least this
>is the case for the NUMA nodes present during the boot (see
>find_zone_movable_pfns_for_nodes).
>
>This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
>will default to a kernel zone (usually ZONE_NORMAL) unless the
>particular memblock is already in the movable zone range which is not
>the case normally when onlining the memory from the udev rule context
>for a freshly hotadded NUMA node. The only option currently is to have a
>special udev rule to echo online_movable to all memblocks belonging to
>such a node which is rather clumsy. Not the mention this is inconsistent
>as well because what ended up in the movable zone during the boot will
>end up in a kernel zone after hotremove & hotadd without special care.
>
>It would be nice to reuse memblock_is_hotpluggable but the runtime
>hotplug doesn't have that information available because the boot and
>hotplug paths are not shared and it would be really non trivial to
>make them use the same code path because the runtime hotplug doesn't
>play with the memblock allocator at all.
>
>Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
>movable_node is enabled and the range doesn't overlap with the existing
>normal zone. This should provide a reasonable default onlining strategy.
>
>Strictly speaking the semantic is not identical with the boot time
>initialization because find_zone_movable_pfns_for_nodes covers only the
>hotplugable range as described by the BIOS/FW. From my experience this
>is usually a full node though (except for Node0 which is special and
>never goes away completely). If this turns out to be a problem in the
>real life we can tweak the code to store hotplug flag into memblocks
>but let's keep this simple now.
>
>Signed-off-by: Michal Hocko <[email protected]>
>---
>
>Hi Andrew,
>I've posted this as an RFC previously [1] and there haven't been any
>objections to the approach so I've dropped the RFC and sending it for
>inclusion. The only change since the last time is the update of the
>documentation to clarify the semantic as suggested by Reza Arbab.
>
>[1] http://lkml.kernel.org/r/[email protected]
>
> Documentation/memory-hotplug.txt | 12 +++++++++---
> mm/memory_hotplug.c | 19 ++++++++++++++++---
> 2 files changed, 25 insertions(+), 6 deletions(-)
>
>diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
>index 670f3ded0802..5c628e19d6cd 100644
>--- a/Documentation/memory-hotplug.txt
>+++ b/Documentation/memory-hotplug.txt
>@@ -282,20 +282,26 @@ offlined it is possible to change the individual block's state by writing to the
> % echo online > /sys/devices/system/memory/memoryXXX/state
>
> This onlining will not change the ZONE type of the target memory block,
>-If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
>+If the memory block doesn't belong to any zone an appropriate kernel zone
>+(usually ZONE_NORMAL) will be used unless movable_node kernel command line
>+option is specified when ZONE_MOVABLE will be used.
>+
>+You can explicitly request to associate it with ZONE_MOVABLE by
>
> % echo online_movable > /sys/devices/system/memory/memoryXXX/state
> (NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)
>
>-And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
>+Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:
>
> % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
> (NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL)
>
>+An explicit zone onlining can fail (e.g. when the range is already within
>+and existing and incompatible zone already).
>+
> After this, memory block XXX's state will be 'online' and the amount of
> available memory will be increased.
>
>-Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
> This may be changed in future.
>
>
>diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>index b98fb0b3ae11..74d75583736c 100644
>--- a/mm/memory_hotplug.c
>+++ b/mm/memory_hotplug.c
>@@ -943,6 +943,19 @@ struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
> 	return &pgdat->node_zones[ZONE_NORMAL];
> }
>
>+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
>+		unsigned long start_pfn, unsigned long nr_pages)
>+{
>+	if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
>+				MMOP_ONLINE_KERNEL))
>+		return true;
>+
>+	if (!movable_node_is_enabled())
>+		return false;
>+
>+	return !zone_intersects(default_zone, start_pfn, nr_pages);
>+}
>+
To be honest, I don't understand this clearly.

move_pfn_range() will choose and move the range to a zone based on the
online_type, where we have two cases:
  1. ONLINE_MOVABLE -> ZONE_MOVABLE will be chosen
  2. ONLINE_KEEP    -> ZONE_NORMAL is the default, while ZONE_MOVABLE will be
     chosen in case movable_pfn_range() returns true.

There are three conditions in movable_pfn_range():
  1. Not allowed in kernel_zone, return true
  2. movable_node not enabled, return false
  3. Range [start_pfn, start_pfn + nr_pages) doesn't intersect with
     default_zone, return true

The first one is inherited from the original code, so let's look at the
other two.

Number 3 is easy to understand: if the hot-added range is already part of
ZONE_NORMAL, use it.

Number 2 confuses me. If movable_node is not enabled, ZONE_NORMAL will
be chosen. If movable_node is enabled, the result still depends on the other
two conditions. So how is a memory_block onlined to ZONE_MOVABLE because
movable_node is enabled? What I see is that you forbid a memory_block from
being onlined to ZONE_MOVABLE when movable_node is not enabled, rather than
onlining a memory_block to ZONE_MOVABLE when movable_node is enabled, which is
what your changelog implies.

BTW, would you mind giving me these two pieces of information?
  1. Which branch is your code based on? I have cloned your
     git(//git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git), but I still
     see some differences.
  2. Is there an example or test case I could use to try your patch and see
     the difference? It would be better if it could run in qemu+kvm.

> /*
>  * Associates the given pfn range with the given node and the zone appropriate
>  * for the given online type.
>@@ -958,10 +971,10 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
> 		/*
> 		 * MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use
> 		 * movable zone if that is not possible (e.g. we are within
>-		 * or past the existing movable zone)
>+		 * or past the existing movable zone). movable_node overrides
>+		 * this default and defaults to movable zone
> 		 */
>-		if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
>-					MMOP_ONLINE_KERNEL))
>+		if (movable_pfn_range(nid, zone, start_pfn, nr_pages))
> 			zone = movable_zone;
> 	} else if (online_type == MMOP_ONLINE_MOVABLE) {
> 		zone = &pgdat->node_zones[ZONE_MOVABLE];
>--
>2.11.0
--
Wei Yang
Help you, Help me
On Thu 15-06-17 11:13:54, Wei Yang wrote:
> On Mon, Jun 12, 2017 at 08:45:02AM +0200, Michal Hocko wrote:
> >On Mon 12-06-17 12:28:32, Wei Yang wrote:
> >> On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
> >> >From: Michal Hocko <[email protected]>
> >> >
> >> >movable_node kernel parameter allows to make hotplugable NUMA
> >> >nodes to put all the hotplugable memory into movable zone which
> >> >allows more or less reliable memory hotremove. At least this
> >> >is the case for the NUMA nodes present during the boot (see
> >> >find_zone_movable_pfns_for_nodes).
> >> >
> >>
> >> When movable_node is enabled, we would have overlapped zones, right?
> >
> >It won't based on this patch. See movable_pfn_range
> >
>
> Ok, I went through the code and here maybe a question not that close related
> to this patch.
Please start a new thread for unrelated questions.
> I did some experiment with qemu+kvm and see this.
>
> Guest config: 8G RAM, 2 nodes with 4G on each
> Guest kernel: 4.11
> Guest kernel command: kernelcore=1G
>
> The log message in kernel is:
>
> [ 0.000000] Zone ranges:
> [ 0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.000000]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000]   Node 0: 0x0000000100000000
> [ 0.000000]   Node 1: 0x0000000140000000
>
> We see on node 2, ZONE_NORMAL overlap with ZONE_MOVABLE.
> [0x0000000140000000 - 0x000000023fffffff] belongs to both ZONE.
Not really. The above output is just a bit confusing. "Zone ranges" prints
the arch_zone_{lowest,highest}_possible_pfn range, while the movable zone
is excluded from that range in adjust_zone_range_for_zone_movable.
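
To illustrate why the printed ranges do not imply overlapping zones, here
is a minimal userspace sketch of the clipping idea. It is a simplified
model, not the kernel's adjust_zone_range_for_zone_movable: struct span and
kernel_zone_span() are hypothetical names, addresses are used instead of
pfns, and the mirrored-kernelcore case is ignored. The node spans used in
any example call are assumptions, since the log above prints only the zone
ranges and the movable start of each node.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open address range [start, end); start == end is empty. */
struct span {
	uint64_t start;
	uint64_t end;
};

/*
 * Effective span of a kernel zone on one node: clamp the architectural
 * zone range (what "Zone ranges" prints) to the node, then cut away
 * everything at or above the node's movable start.
 * movable_start == 0 models "no movable zone on this node".
 */
static struct span kernel_zone_span(struct span arch_zone, struct span node,
				    uint64_t movable_start)
{
	struct span z;

	assert(arch_zone.start <= arch_zone.end);
	assert(node.start <= node.end);

	/* Step 1: intersect the architectural range with the node. */
	z.start = arch_zone.start > node.start ? arch_zone.start : node.start;
	z.end = arch_zone.end < node.end ? arch_zone.end : node.end;
	if (z.start >= z.end) {
		z.end = z.start;	/* zone not present on this node */
		return z;
	}

	if (movable_start == 0)
		return z;

	/* Step 2: exclude the part that belongs to the movable zone. */
	if (z.start >= movable_start)
		z.end = z.start;	/* whole range is movable */
	else if (z.end > movable_start)
		z.end = movable_start;	/* clip at the movable start */

	return z;
}
```

Assuming Node 1 spans 0x140000000-0x23fffffff, the Normal range printed
above combined with a movable start of 0x140000000 yields an empty Normal
zone on that node in this model, so nothing belongs to two zones.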
--
Michal Hocko
SUSE Labs
On Thu 15-06-17 11:29:27, Wei Yang wrote:
[...]
> >+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
> >+		unsigned long start_pfn, unsigned long nr_pages)
> >+{
> >+	if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
> >+				MMOP_ONLINE_KERNEL))
> >+		return true;
> >+
> >+	if (!movable_node_is_enabled())
> >+		return false;
> >+
> >+	return !zone_intersects(default_zone, start_pfn, nr_pages);
> >+}
> >+
>
> To be honest, I don't understand this clearly.
>
> move_pfn_range() will choose and move the range to a zone based on the
> online_type, where we have two cases:
> 1. ONLINE_MOVABLE -> ZONE_MOVABLE will be chosen
> 2. ONLINE_KEEP -> ZONE_NORMAL is the default while ZONE_MOVABLE will be
> chosen in case movable_pfn_range() returns true.
>
> There are three conditions in movable_pfn_range():
> 1. Not allowed in kernel_zone, returns true
> 2. Movable_node not enabled, return false
> 3. Range [start_pfn, start_pfn + nr_pages) doesn't intersect with
> default_zone, return true
>
> The first one is inherited from original code, so lets look at the other two.
>
> Number 3 is easy to understand, if the hot-added range is already part of
> ZONE_NORMAL, use it.
>
> Number 2 makes me confused. If movable_node is not enabled, ZONE_NORMAL will
> be chosen. If movable_node is enabled, it still depends on other two
> condition. So how a memory_block is onlined to ZONE_MOVABLE because
> movable_node is enabled?
This is simple. If movable_node is set, then ONLINE_KEEP defaults to
the movable zone unless the range is already covered by a kernel zone
(read: the Normal zone most of the time).
> What I see is you would forbid a memory_block to be
> onlined to ZONE_MOVABLE when movable_node is not enabled.
Please note that this is ONLINE_KEEP, not ONLINE_MOVABLE, and as such the
movable zone is used only if we are within the movable zone range
already (test 1).
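
To make the three tests concrete, here is a minimal userspace sketch of the
ONLINE_KEEP decision. It is a model rather than the kernel code: struct
pfn_range, range_intersects() and online_keep_zone() are hypothetical
names, and the two booleans stand in for the results of
allow_online_pfn_range(..., MMOP_ONLINE_KERNEL) and
movable_node_is_enabled().

```c
#include <assert.h>
#include <stdbool.h>

enum zone_choice { CHOOSE_KERNEL_ZONE, CHOOSE_MOVABLE_ZONE };

/* Hypothetical stand-in for struct zone: half-open pfn range [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;	/* start == end means the zone is empty */
};

/* Models zone_intersects(): does [start_pfn, start_pfn + nr_pages) touch the zone? */
static bool range_intersects(const struct pfn_range *zone,
			     unsigned long start_pfn, unsigned long nr_pages)
{
	if (zone->start == zone->end)
		return false;
	return start_pfn < zone->end && start_pfn + nr_pages > zone->start;
}

/* Zone picked for MMOP_ONLINE_KEEP, following the order of the three tests. */
static enum zone_choice online_keep_zone(bool kernel_zone_allowed,
					 bool movable_node,
					 const struct pfn_range *default_zone,
					 unsigned long start_pfn,
					 unsigned long nr_pages)
{
	assert(nr_pages > 0);

	/* Test 1: a kernel zone is not possible, so the movable zone is used. */
	if (!kernel_zone_allowed)
		return CHOOSE_MOVABLE_ZONE;

	/* Test 2: without movable_node the kernel zone stays the default. */
	if (!movable_node)
		return CHOOSE_KERNEL_ZONE;

	/*
	 * Test 3: with movable_node the movable zone is the default unless
	 * the range is already covered by the default kernel zone.
	 */
	return range_intersects(default_zone, start_pfn, nr_pages)
		? CHOOSE_KERNEL_ZONE : CHOOSE_MOVABLE_ZONE;
}
```

In this model, a range outside the existing Normal zone goes to the kernel
zone when movable_node is false and to the movable zone when it is true,
which is the only behavior the patch changes.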
> Instead of you would
> online a memory_block to ZONE_MOVABLE when movable_node is enabled, which is
> implied in your change log.
>
> BTW, would you mind giving me these two information?
> 1. Which branch your code is based on? I have cloned your
> git(//git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git), while still see
> some difference.
Yes, this is based on the mmotm tree (use the since-4.11 or auto-latest
branch).
> 2. Any example or test case I could try your patch and see the difference? It
> would be better if it could run in qemu+kvm.
See http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs
On Thu, Jun 08, 2017 at 02:23:18PM +0200, Michal Hocko wrote:
>movable_node kernel parameter allows to make hotplugable NUMA
>nodes to put all the hotplugable memory into movable zone which
>allows more or less reliable memory hotremove. At least this
>is the case for the NUMA nodes present during the boot (see
>find_zone_movable_pfns_for_nodes).
>
>This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
>will default to a kernel zone (usually ZONE_NORMAL) unless the
>particular memblock is already in the movable zone range which is not
>the case normally when onlining the memory from the udev rule context
>for a freshly hotadded NUMA node. The only option currently is to have a
>special udev rule to echo online_movable to all memblocks belonging to
>such a node which is rather clumsy. Not the mention this is inconsistent
>as well because what ended up in the movable zone during the boot will
>end up in a kernel zone after hotremove & hotadd without special care.
>
>It would be nice to reuse memblock_is_hotpluggable but the runtime
>hotplug doesn't have that information available because the boot and
>hotplug paths are not shared and it would be really non trivial to
>make them use the same code path because the runtime hotplug doesn't
>play with the memblock allocator at all.
>
>Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
>movable_node is enabled and the range doesn't overlap with the existing
>normal zone. This should provide a reasonable default onlining strategy.
>
>Strictly speaking the semantic is not identical with the boot time
>initialization because find_zone_movable_pfns_for_nodes covers only the
>hotplugable range as described by the BIOS/FW. From my experience this
>is usually a full node though (except for Node0 which is special and
>never goes away completely). If this turns out to be a problem in the
>real life we can tweak the code to store hotplug flag into memblocks
>but let's keep this simple now.
>
>Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Reza Arbab <[email protected]>
--
Reza Arbab