2017-06-01 12:20:33

by Michal Hocko

Subject: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

From: Michal Hocko <[email protected]>

The movable_node kernel parameter makes hotpluggable NUMA nodes put all
of their hotpluggable memory into the movable zone, which allows more or
less reliable memory hotremove. At least this is the case for the NUMA
nodes present during boot (see find_zone_movable_pfns_for_nodes).

This is not the case for runtime memory hotplug, though.

echo online > /sys/devices/system/memory/memoryXYZ/state

will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range, which is not
normally the case when onlining memory from the udev rule context
for a freshly hotadded NUMA node. The only option currently is to have a
special udev rule that echoes online_movable to all memblocks belonging
to such a node, which is rather clumsy. Not to mention that this is
inconsistent as well, because what ended up in the movable zone during
boot will end up in a kernel zone after hotremove & hotadd without
special care.

It would be nice to reuse memblock_is_hotpluggable, but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared. It would be really non-trivial to
make them use the same code path because the runtime hotplug doesn't
use the memblock allocator at all.

Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone. This should provide a reasonable default onlining strategy.

Strictly speaking, the semantics are not identical to the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotpluggable range as described by the BIOS/FW. In my experience this
is usually a full node, though (except for Node0, which is special and
never goes away completely). If this turns out to be a problem in
real life we can tweak the code to store the hotplug flag in memblocks,
but let's keep this simple for now.

Signed-off-by: Michal Hocko <[email protected]>
---

Hi,
I am sending this as an RFC because this is, strictly speaking, a user
visible change of behavior. I believe it is a desirable change of
behavior, though, and an explicit opt-in (kernel parameter) is
required to see the change, so I do not expect any breakage. I would
still like to hear what other people think about this shift. I have
tested it on memory hotplug capable HW where the whole NUMA node can
be hotremoved/added.

Does anybody see any problem with the proposed semantics?

mm/memory_hotplug.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b98fb0b3ae11..74d75583736c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -943,6 +943,19 @@ struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
 	return &pgdat->node_zones[ZONE_NORMAL];
 }
 
+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
+		unsigned long start_pfn, unsigned long nr_pages)
+{
+	if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
+				MMOP_ONLINE_KERNEL))
+		return true;
+
+	if (!movable_node_is_enabled())
+		return false;
+
+	return !zone_intersects(default_zone, start_pfn, nr_pages);
+}
+
 /*
  * Associates the given pfn range with the given node and the zone appropriate
  * for the given online type.
@@ -958,10 +971,10 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
 		/*
 		 * MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use
 		 * movable zone if that is not possible (e.g. we are within
-		 * or past the existing movable zone)
+		 * or past the existing movable zone). movable_node overrides
+		 * this default and defaults to movable zone
 		 */
-		if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
-					MMOP_ONLINE_KERNEL))
+		if (movable_pfn_range(nid, zone, start_pfn, nr_pages))
 			zone = movable_zone;
 	} else if (online_type == MMOP_ONLINE_MOVABLE) {
 		zone = &pgdat->node_zones[ZONE_MOVABLE];
--
2.11.0
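
The zone choice that the patch describes for MMOP_ONLINE_KEEP can be modelled in user space. What follows is only a rough sketch, not kernel code: struct zone_model, zone_model_intersects, kernel_online_allowed and online_keep_zone are hypothetical names, and the kernel-onlining check is an assumption taken from the comment in the patch ("within or past the existing movable zone") rather than from the real allow_online_pfn_range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct zone: pfns [start_pfn, start_pfn + spanned_pages). */
struct zone_model {
	unsigned long start_pfn;
	unsigned long spanned_pages;	/* 0 means the zone is empty */
};

enum zone_choice { CHOOSE_KERNEL, CHOOSE_MOVABLE };

/* True if [start_pfn, start_pfn + nr_pages) overlaps a non-empty zone. */
static bool zone_model_intersects(const struct zone_model *zone,
				  unsigned long start_pfn,
				  unsigned long nr_pages)
{
	if (zone->spanned_pages == 0)
		return false;
	if (start_pfn >= zone->start_pfn + zone->spanned_pages)
		return false;
	if (start_pfn + nr_pages <= zone->start_pfn)
		return false;
	return true;
}

/*
 * Assumption: onlining into a kernel zone is refused once the range lies
 * within or past an existing movable zone.
 */
static bool kernel_online_allowed(const struct zone_model *movable,
				  unsigned long start_pfn,
				  unsigned long nr_pages)
{
	if (movable->spanned_pages == 0)
		return true;
	return start_pfn + nr_pages <= movable->start_pfn;
}

/* Models the MMOP_ONLINE_KEEP decision with the movable_node override. */
static enum zone_choice online_keep_zone(bool movable_node,
					 const struct zone_model *kernel,
					 const struct zone_model *movable,
					 unsigned long start_pfn,
					 unsigned long nr_pages)
{
	assert(nr_pages > 0);

	/* A kernel zone is not possible, so fall back to movable. */
	if (!kernel_online_allowed(movable, start_pfn, nr_pages))
		return CHOOSE_MOVABLE;

	/* Without movable_node the old default (kernel zone) is kept. */
	if (!movable_node)
		return CHOOSE_KERNEL;

	/* With movable_node, only ranges inside the kernel zone stay there. */
	return zone_model_intersects(kernel, start_pfn, nr_pages) ?
		CHOOSE_KERNEL : CHOOSE_MOVABLE;
}
```

In this model a range that lies inside the existing kernel zone stays in a kernel zone even with movable_node set, while a range outside of it (for example a freshly hotadded node) goes to the movable zone.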


2017-06-01 14:12:08

by Vlastimil Babka

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On 06/01/2017 02:20 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> movable_node kernel parameter allows to make hotplugable NUMA
> nodes to put all the hotplugable memory into movable zone which
> allows more or less reliable memory hotremove. At least this
> is the case for the NUMA nodes present during the boot (see
> find_zone_movable_pfns_for_nodes).
>
> This is not the case for the memory hotplug, though.
>
> echo online > /sys/devices/system/memory/memoryXYZ/status
>
> will default to a kernel zone (usually ZONE_NORMAL) unless the
> particular memblock is already in the movable zone range which is not
> the case normally when onlining the memory from the udev rule context
> for a freshly hotadded NUMA node. The only option currently is to have a
> special udev rule to echo online_movable to all memblocks belonging to
> such a node which is rather clumsy. Not the mention this is inconsistent
> as well because what ended up in the movable zone during the boot will
> end up in a kernel zone after hotremove & hotadd without special care.

Yeah, it would be better if movable_node worked consistently for both
boot and runtime hotplug.

> It would be nice to reuse memblock_is_hotpluggable but the runtime
> hotplug doesn't have that information available because the boot and
> hotplug paths are not shared and it would be really non trivial to
> make them use the same code path because the runtime hotplug doesn't
> play with the memblock allocator at all.
>
> Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
> movable_node is enabled and the range doesn't overlap with the existing
> normal zone. This should provide a reasonable default onlining strategy.
>
> Strictly speaking the semantic is not identical with the boot time
> initialization because find_zone_movable_pfns_for_nodes covers only the
> hotplugable range as described by the BIOS/FW. From my experience this
> is usually a full node though (except for Node0 which is special and
> never goes away completely). If this turns out to be a problem in the
> real life we can tweak the code to store hotplug flag into memblocks
> but let's keep this simple now.

Simple should work, hopefully.
- if memory is hotplugged, it's obviously hotplugable, so we don't have
to rely on BIOS description.
- there shouldn't be a reason to offline a non-removable (part of) node
and online it back (which would move it from Normal to Movable after
your patch?), right?

Vlastimil

2017-06-01 14:22:34

by Michal Hocko

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu 01-06-17 16:11:55, Vlastimil Babka wrote:
> On 06/01/2017 02:20 PM, Michal Hocko wrote:
[...]
> > Strictly speaking the semantic is not identical with the boot time
> > initialization because find_zone_movable_pfns_for_nodes covers only the
> > hotplugable range as described by the BIOS/FW. From my experience this
> > is usually a full node though (except for Node0 which is special and
> > never goes away completely). If this turns out to be a problem in the
> > real life we can tweak the code to store hotplug flag into memblocks
> > but let's keep this simple now.
>
> Simple should work, hopefully.
> - if memory is hotplugged, it's obviously hotplugable, so we don't have
> to rely on BIOS description.

Not sure I understand. We do not have any information about the hotplug
status at the time we do online.

> - there shouldn't be a reason to offline a non-removable (part of) node
> and online it back (which would move it from Normal to Movable after
> your patch?), right?

Not really. If the memblock was inside a kernel zone, it will stay there
on a new online operation because we check for that explicitly.
--
Michal Hocko
SUSE Labs

2017-06-01 15:19:48

by Reza Arbab

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu, Jun 01, 2017 at 04:22:28PM +0200, Michal Hocko wrote:
>On Thu 01-06-17 16:11:55, Vlastimil Babka wrote:
>> Simple should work, hopefully.
>> - if memory is hotplugged, it's obviously hotplugable, so we don't have
>> to rely on BIOS description.
>
>Not sure I understand. We do not have any information about the hotplug
>status at the time we do online.

The x86 SRAT (or the dt, on other platforms) can describe memory as
hotpluggable. See memblock_mark_hotplug(). That's only for memory
present at boot, though.

He's saying that since the memory was added after boot, it is by
definition hotpluggable. There's no need to check for that
marking/description.

--
Reza Arbab

2017-06-01 15:38:46

by Michal Hocko

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu 01-06-17 10:19:36, Reza Arbab wrote:
> On Thu, Jun 01, 2017 at 04:22:28PM +0200, Michal Hocko wrote:
> >On Thu 01-06-17 16:11:55, Vlastimil Babka wrote:
> >>Simple should work, hopefully.
> >>- if memory is hotplugged, it's obviously hotplugable, so we don't have
> >>to rely on BIOS description.
> >
> >Not sure I understand. We do not have any information about the hotplug
> >status at the time we do online.
>
> The x86 SRAT (or the dt, on other platforms) can describe memory as
> hotpluggable. See memblock_mark_hotplug(). That's only for memory present at
> boot, though.

Yes, but we lose that information after the memblock is gone and numa is
fully initialized. Or can we reconstruct that somehow?

> He's saying that since the memory was added after boot, it is by definition
> hotpluggable. There's no need to check for that marking/description.

Yes, but we do not know whether we are onlining memblocks from a boot
time numa node or a fresh one which has been hotadded.
--
Michal Hocko
SUSE Labs

2017-06-01 15:48:03

by Reza Arbab

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu, Jun 01, 2017 at 05:38:38PM +0200, Michal Hocko wrote:
>On Thu 01-06-17 10:19:36, Reza Arbab wrote:
>> The x86 SRAT (or the dt, on other platforms) can describe memory as
>> hotpluggable. See memblock_mark_hotplug(). That's only for memory present at
>> boot, though.
>
>Yes but lose that information after the memblock is gone and numa fully
>initialized. Or can we reconstruct that somehow?

I'm not sure you'd have to. At boot time, those markings are used to
determine the initial boundaries of ZONE_MOVABLE. So if you removed
these memblocks, then readded them, they would still be in ZONE_MOVABLE.

>> He's saying that since the memory was added after boot, it is by
>> definition hotpluggable. There's no need to check for that
>> marking/description.
>
>Yes, but we do not know whether we are onlining memblocks from a boot
>time numa node or a fresh one which has been hotadded.

That's true.

--
Reza Arbab

2017-06-01 15:52:38

by Michal Hocko

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu 01-06-17 10:47:46, Reza Arbab wrote:
> On Thu, Jun 01, 2017 at 05:38:38PM +0200, Michal Hocko wrote:
> >On Thu 01-06-17 10:19:36, Reza Arbab wrote:
> >>The x86 SRAT (or the dt, on other platforms) can describe memory as
> >>hotpluggable. See memblock_mark_hotplug(). That's only for memory present at
> >>boot, though.
> >
> >Yes but lose that information after the memblock is gone and numa fully
> >initialized. Or can we reconstruct that somehow?
>
> I'm not sure you'd have to. At boot time, those markings are used to
> determine the initial boundaries of ZONE_MOVABLE. So if you removed these
> memblocks, then readded them, they would still be in ZONE_MOVABLE.

Yes, but that already works like that. I am more interested in the case
when the node goes away and is added again. echo online > ... would
result in non-movable memory, and that is the inconsistency I tried to
call out in the changelog.

--
Michal Hocko
SUSE Labs

2017-06-01 16:02:39

by Reza Arbab

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu, Jun 01, 2017 at 02:20:04PM +0200, Michal Hocko wrote:
>Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
>movable_node is enabled and the range doesn't overlap with the existing
>normal zone. This should provide a reasonable default onlining strategy.

I like it. If your distro has some auto-onlining udev rule like

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

You could get things onlined as movable just by putting movable_node in
the boot params, without changing/modifying the rule.

--
Reza Arbab
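
Where no udev rule is in place, the same online request can be issued by hand or from a small program. A minimal sketch in C, assuming the usual sysfs layout of /sys/devices/system/memory/memoryN/state; memory_state_path and request_memory_state are hypothetical helper names, and writing to the real file needs root and an offline memory block.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the sysfs state path for a memory block; returns 0, or -1 if buf is too small. */
static int memory_state_path(char *buf, size_t len, unsigned long block)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/memory/memory%lu/state", block);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Writes a request such as "online" or "online_movable" to the given
 * state file. Returns 0 on success and -1 on any failure.
 */
static int request_memory_state(const char *state_path, const char *request)
{
	FILE *f;
	int ok;

	assert(state_path != NULL && request != NULL);

	if (strlen(request) == 0)
		return -1;

	f = fopen(state_path, "w");
	if (f == NULL)
		return -1;

	ok = fputs(request, f) >= 0;
	/* The kernel reports a rejected request when the write is flushed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With the proposed change and movable_node on the command line, passing plain "online" here would be expected to behave like the udev rule above, so "online_movable" would no longer need to be spelled out for a freshly hotadded node.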

2017-06-01 16:04:56

by Reza Arbab

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu, Jun 01, 2017 at 05:52:31PM +0200, Michal Hocko wrote:
>On Thu 01-06-17 10:47:46, Reza Arbab wrote:
>> On Thu, Jun 01, 2017 at 05:38:38PM +0200, Michal Hocko wrote:
>> >On Thu 01-06-17 10:19:36, Reza Arbab wrote:
>> >>The x86 SRAT (or the dt, on other platforms) can describe memory as
>> >>hotpluggable. See memblock_mark_hotplug(). That's only for memory present at
>> >>boot, though.
>> >
>> >Yes but lose that information after the memblock is gone and numa fully
>> >initialized. Or can we reconstruct that somehow?
>>
>> I'm not sure you'd have to. At boot time, those markings are used to
>> determine the initial boundaries of ZONE_MOVABLE. So if you removed these
>> memblocks, then readded them, they would still be in ZONE_MOVABLE.
>
>Yes but that already works like that. I am nore interested in the case
>when the node goes away and it is added again. echo online > ... would
>result in a non-movable memory and that is the inconsistency I tried to
>call out in the changelog

My bad. Should have read closer.

--
Reza Arbab

2017-06-01 16:15:15

by Michal Hocko

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu 01-06-17 11:02:28, Reza Arbab wrote:
> On Thu, Jun 01, 2017 at 02:20:04PM +0200, Michal Hocko wrote:
> >Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
> >movable_node is enabled and the range doesn't overlap with the existing
> >normal zone. This should provide a reasonable default onlining strategy.
>
> I like it. If your distro has some auto-onlining udev rule like
>
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
>
> You could get things onlined as movable just by putting movable_node in
> the boot params, without changing/modifying the rule.

yes this is the primary point of the patch ;)
--
Michal Hocko
SUSE Labs

2017-06-01 16:49:53

by Reza Arbab

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

On Thu, Jun 01, 2017 at 06:14:54PM +0200, Michal Hocko wrote:
>On Thu 01-06-17 11:02:28, Reza Arbab wrote:
>> On Thu, Jun 01, 2017 at 02:20:04PM +0200, Michal Hocko wrote:
>> >Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
>> >movable_node is enabled and the range doesn't overlap with the existing
>> >normal zone. This should provide a reasonable default onlining strategy.
>>
>> I like it. If your distro has some auto-onlining udev rule like
>>
>> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
>>
>> You could get things onlined as movable just by putting movable_node in
>> the boot params, without changing/modifying the rule.
>
>yes this is the primary point of the patch ;)

Ha. What can I say, I like restating the obvious!

At some point after all these cleanups/improvements, it would be worth
making sure Documentation/memory-hotplug.txt is still accurate.

--
Reza Arbab

2017-06-05 08:13:11

by Michal Hocko

Subject: Re: [RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

Are there any further comments? Can I post this for merging? I will
update the documentation as well.

--
Michal Hocko
SUSE Labs