2017-03-02 13:54:35

by Igor Mammedov

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon 27-02-17 16:43:04, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.
> >
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> >
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> >
> > Auto-onlining of standby memory is the last thing we want.
I don't know much about anything other than x86, so all comments
below are from that point of view;
architectures that don't need auto online can keep the current default.

> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.
> >
> > This also is a reason why auto-onlining doesn't seem to be the best way.

When trying to support memory unplug on the guest side in RHEL7,
experience shows otherwise. A simplistic udev rule which onlines an
added block doesn't work in case one wants to online it as movable.
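(For reference, the simplistic rule in question looks something like the
one below; the exact rule text is distro-specific, so treat it as an
illustration only:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
)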

Hotplugged blocks in the current kernel should be onlined in reverse
order to be onlined as movable, depending on the adjacent block's zone.
Which means a simple udev rule isn't usable, since it gets events in
first-to-last hotplugged block order. So now we would have
to write a daemon that would
- watch for all blocks in hotplugged memory to appear (how would it know?)
- online them in the right order (the order might also differ depending
on kernel version)
-- it becomes even more complicated in the NUMA case, when there are
multiple zones and the kernel would have to provide user-space
with information about zone maps

In short, current experience shows that the userspace approach
- doesn't solve the issues that Vitaly has been fixing (i.e. onlining
fast and/or under memory pressure, when udev or something else
might be killed)
- doesn't reduce overall system complexity; it only gets worse,
as the user-space handler needs to know a lot about kernel internals
and implementation details/kernel versions to work properly

It might not be easy, but doing onlining in the kernel, on the other hand, is:
- faster
- more reliable (can't be killed under memory pressure)
- the kernel has access to all the info needed for onlining and to how its
internals work, so it can implement auto-online correctly
- since there is no need to maintain an ABI for user-space
(zone layout/ordering/maybe something else), the kernel is
free to change its internal implementation without breaking userspace
(currently hotplug+unplug doesn't work reliably and we might
need something more flexible than zones)
That's the direction of research in progress, i.e. making the kernel
implementation better instead of putting responsibility on
user-space to deal with kernel shortcomings.

> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.
For x86:
* this config option is enabled by default in recent Fedora,
* RHEL6 has shipped similar downstream patches doing the same thing for years,
* RHEL7 has a udev rule (because there wasn't a kernel-side solution at fork time)
that auto-onlines memory unconditionally; Vitaly might backport the kernel approach
later when he has time.
Not the Linux kernel, but an auto-online policy is also what Windows uses,
both in baremetal and guest configurations.

That somewhat shows that the current upstream defaults on x86
might not be what end-users wish for.

When auto_online_blocks was introduced, Vitaly was
conservative and left the current upstream defaults where they were,
lest it break someone else's setup, while allowing downstreams to
set their own auto-online policy; eventually we might switch it
upstream too.


2017-03-02 15:07:03

by Michal Hocko

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]
> When trying to support memory unplug on the guest side in RHEL7,
> experience shows otherwise. A simplistic udev rule which onlines an
> added block doesn't work in case one wants to online it as movable.
>
> Hotplugged blocks in the current kernel should be onlined in reverse
> order to be onlined as movable, depending on the adjacent block's zone.

Could you be more specific please? Setting online_movable from the udev
rule should just work regardless of the ordering or the state of other
memblocks. If that doesn't work I would call it a bug.

> Which means a simple udev rule isn't usable, since it gets events in
> first-to-last hotplugged block order. So now we would have
> to write a daemon that would
> - watch for all blocks in hotplugged memory to appear (how would it know?)
> - online them in the right order (the order might also differ depending
> on kernel version)
> -- it becomes even more complicated in the NUMA case, when there are
> multiple zones and the kernel would have to provide user-space
> with information about zone maps
>
> In short, current experience shows that the userspace approach
> - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> fast and/or under memory pressure, when udev or something else
> might be killed)

yeah and that is why the patch does the onlining from the kernel.

> > Can you imagine any situation when somebody actually might want to have
> > this knob enabled? From what I understand it doesn't seem to be the
> > case.
> For x86:
> * this config option is enabled by default in recent Fedora,

How do you want to support usecases which really want to online memory
as movable? Do you expect those users to disable the option because
unless I am missing something the in-kernel auto onlining only supports
regular onlining.
--
Michal Hocko
SUSE Labs

2017-03-02 17:17:18

by Igor Mammedov

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko <[email protected]> wrote:

> On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> [...]
> > When trying to support memory unplug on the guest side in RHEL7,
> > experience shows otherwise. A simplistic udev rule which onlines an
> > added block doesn't work in case one wants to online it as movable.
> >
> > Hotplugged blocks in the current kernel should be onlined in reverse
> > order to be onlined as movable, depending on the adjacent block's zone.
>
> Could you be more specific please? Setting online_movable from the udev
> rule should just work regardless of the ordering or the state of other
> memblocks. If that doesn't work I would call it a bug.
It's rather an implementation constraint than a bug;
for details and a workaround patch see
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
The patch attached there is limited by another memory hotplug
issue, which is the NORMAL/MOVABLE zone balance: if the kernel runs
in a configuration where most of the memory is hot-removable, the
kernel might experience a lack of memory in zone NORMAL.

>
> > Which means a simple udev rule isn't usable, since it gets events in
> > first-to-last hotplugged block order. So now we would have
> > to write a daemon that would
> > - watch for all blocks in hotplugged memory to appear (how would it know?)
> > - online them in the right order (the order might also differ depending
> > on kernel version)
> > -- it becomes even more complicated in the NUMA case, when there are
> > multiple zones and the kernel would have to provide user-space
> > with information about zone maps
> >
> > In short, current experience shows that the userspace approach
> > - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> > fast and/or under memory pressure, when udev or something else
> > might be killed)
>
> yeah and that is why the patch does the onlining from the kernel.
onlining in this patch is limited to hyperv, and the patch breaks
auto-online on x86 kvm/vmware/baremetal, as they reuse the same
hotplug path.

> > > Can you imagine any situation when somebody actually might want to have
> > > this knob enabled? From what I understand it doesn't seem to be the
> > > case.
> > For x86:
> > * this config option is enabled by default in recent Fedora,
>
> How do you want to support usecases which really want to online memory
> as movable? Do you expect those users to disable the option because
> unless I am missing something the in-kernel auto onlining only supports
> regular onlining.
the current auto-onlining config option does what it's been designed for,
i.e. it onlines hotplugged memory.
It's possible for a non-average Fedora user to override the default
(commit 86dd995d6) if she/he needs non-default behavior
(i.e. the user knows how to online manually and/or can write
a daemon that would handle all of the nuances of the kernel in use).
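For illustration, the override amounts to something like the following
(a sketch assuming the memhp_default_state= boot parameter and the
auto_online_blocks sysfs knob discussed in this thread; check the
documentation of the kernel in use for the exact names):

memhp_default_state=offline # on the kernel command line, or
echo offline > /sys/devices/system/memory/auto_online_blocks # at runtime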

For the rest, when Fedora is used in the cloud and the user increases memory
via the management interface of whatever cloud she/he uses, it just works.

So it's the choice of the distribution to pick its own default that makes
the majority of its user-base happy, and this patch removes that without
taking it into consideration.

How to online memory is a different issue not related to this patch;
the current default of onlining as ZONE_NORMAL works well for scaling
up VMs.

Memory unplug is rather new and doesn't work reliably so far;
moving onlining to user-space won't really help. Further work
needs to be done so that it works reliably.

Now, about the question of onlining removable memory as movable:
the x86 kernel is able to learn from the ACPI subsystem whether
hotadded memory is removable, and to online it as movable without any
intervention from user-space, where it's hard to do so,
as patch [1] shows.
The problem is still being researched, and when we figure out how to fix
the hot-remove issues we might enable auto-onlining by default for x86.

2017-03-03 09:04:09

by Michal Hocko

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> On Thu, 2 Mar 2017 15:28:16 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > [...]
> > > When trying to support memory unplug on the guest side in RHEL7,
> > > experience shows otherwise. A simplistic udev rule which onlines an
> > > added block doesn't work in case one wants to online it as movable.
> > >
> > > Hotplugged blocks in the current kernel should be onlined in reverse
> > > order to be onlined as movable, depending on the adjacent block's zone.
> >
> > Could you be more specific please? Setting online_movable from the udev
> > rule should just work regardless of the ordering or the state of other
> > memblocks. If that doesn't work I would call it a bug.
> It's rather an implementation constraint than a bug;
> for details and a workaround patch see
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

"You are not authorized to access bug #1314306"

could you paste the reasoning here please?

> The patch attached there is limited by another memory hotplug
> issue, which is the NORMAL/MOVABLE zone balance: if the kernel runs
> in a configuration where most of the memory is hot-removable, the
> kernel might experience a lack of memory in zone NORMAL.

yes and that is an inherent problem of movable memory.

> > > Which means a simple udev rule isn't usable, since it gets events in
> > > first-to-last hotplugged block order. So now we would have
> > > to write a daemon that would
> > > - watch for all blocks in hotplugged memory to appear (how would it know?)
> > > - online them in the right order (the order might also differ depending
> > > on kernel version)
> > > -- it becomes even more complicated in the NUMA case, when there are
> > > multiple zones and the kernel would have to provide user-space
> > > with information about zone maps
> > >
> > > In short, current experience shows that the userspace approach
> > > - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> > > fast and/or under memory pressure, when udev or something else
> > > might be killed)
> >
> > yeah and that is why the patch does the onlining from the kernel.
> onlining in this patch is limited to hyperv, and the patch breaks
> auto-online on x86 kvm/vmware/baremetal, as they reuse the same
> hotplug path.

Those can use the udev or do you see any reason why they couldn't?

> > > > Can you imagine any situation when somebody actually might want to have
> > > > this knob enabled? From what I understand it doesn't seem to be the
> > > > case.
> > > For x86:
> > > * this config option is enabled by default in recent Fedora,
> >
> > How do you want to support usecases which really want to online memory
> > as movable? Do you expect those users to disable the option because
> > unless I am missing something the in-kernel auto onlining only supports
> > regular onlining.
>
> the current auto-onlining config option does what it's been designed for,
> i.e. it onlines hotplugged memory.
> It's possible for a non-average Fedora user to override the default
> (commit 86dd995d6) if she/he needs non-default behavior
> (i.e. the user knows how to online manually and/or can write
> a daemon that would handle all of the nuances of the kernel in use).
>
> For the rest, when Fedora is used in the cloud and the user increases memory
> via the management interface of whatever cloud she/he uses, it just works.
>
> So it's the choice of the distribution to pick its own default that makes
> the majority of its user-base happy, and this patch removes that without
> taking it into consideration.

You still can have a udev rule to achieve the same thing for
non-ballooning based hotplug.

> How to online memory is a different issue not related to this patch;
> the current default of onlining as ZONE_NORMAL works well for scaling
> up VMs.
>
> Memory unplug is rather new and doesn't work reliably so far;
> moving onlining to user-space won't really help. Further work
> needs to be done so that it works reliably.

The main problem I have with this is that this is a limited
usecase-driven configuration knob which doesn't work properly for other
usecases (namely movable onlining once your distribution chooses to set
the config option to auto online). There is a userspace solution for
this, so this shouldn't have been merged in the first place! It sneaked
past a proper review process (linux-api wasn't CCed to get broader
attention), which is really sad.

So unless this causes a major regression which would be hard to fix I
will submit the patch for inclusion.
--
Michal Hocko
SUSE Labs

2017-03-03 17:34:55

by Igor Mammedov

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko <[email protected]> wrote:

> On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > On Thu, 2 Mar 2017 15:28:16 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > [...]
> > > > When trying to support memory unplug on the guest side in RHEL7,
> > > > experience shows otherwise. A simplistic udev rule which onlines an
> > > > added block doesn't work in case one wants to online it as movable.
> > > >
> > > > Hotplugged blocks in the current kernel should be onlined in reverse
> > > > order to be onlined as movable, depending on the adjacent block's zone.
> > >
> > > Could you be more specific please? Setting online_movable from the udev
> > > rule should just work regardless of the ordering or the state of other
> > > memblocks. If that doesn't work I would call it a bug.
> > It's rather an implementation constraint than a bug;
> > for details and a workaround patch see
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
>
> "You are not authorized to access bug #1314306"
Sorry,
I've made it public; related comments and the patch should be accessible now
(code snippets in the BZ are based on an older kernel, but the logic is still the same upstream).

> could you paste the reasoning here please?
sure, here is a reproducer:
start a VM with this CLI:
qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
-object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
/path/to/guest_image

then in the guest, dimm1's blocks are 32-39

echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

in the current mainline kernel it triggers the following code path:

online_pages()
    ...
    if (online_type == MMOP_ONLINE_KERNEL) {
            if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
                    return -EINVAL;

zone_can_shift()
    ...
    if (idx < target) {
            /* pages must be at end of current zone */
            if (pfn + nr_pages != zone_end_pfn(zone))
                    return false;

since we are trying to online, as movable, a section that is not the
last one in ZONE_NORMAL.

Here is what makes hotplugged memory end up in ZONE_NORMAL:
acpi_memory_enable_device() -> add_memory() -> add_memory_resource() ->
-> arch/x86/mm/init_64.c

/*
 * Memory is added always to NORMAL zone. This means you will never get
 * additional DMA/DMA32 memory.
 */
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
        ...
        struct zone *zone = pgdat->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);

i.e. all hot-plugged memory modules always go to ZONE_NORMAL,
and only the first/last block in a zone is allowed to be moved
to another zone. Patch [1] tries to fix the issue by assigning
removable memory resources to the movable zone, so hotplugged+removable
blocks look like:
    normal, movable, movable, movable
instead of the current:
    normal, normal, normal, movable

but then, with this fixed as suggested, auto online by default
should work just fine in a kernel with normal and movable zones,
without any need for user-space.

> > The patch attached there is limited by another memory hotplug
> > issue, which is the NORMAL/MOVABLE zone balance: if the kernel runs
> > in a configuration where most of the memory is hot-removable, the
> > kernel might experience a lack of memory in zone NORMAL.
>
> yes and that is an inherent problem of movable memory.
>
> > > > Which means a simple udev rule isn't usable, since it gets events in
> > > > first-to-last hotplugged block order. So now we would have
> > > > to write a daemon that would
> > > > - watch for all blocks in hotplugged memory to appear (how would it know?)
> > > > - online them in the right order (the order might also differ depending
> > > > on kernel version)
> > > > -- it becomes even more complicated in the NUMA case, when there are
> > > > multiple zones and the kernel would have to provide user-space
> > > > with information about zone maps
> > > >
> > > > In short, current experience shows that the userspace approach
> > > > - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> > > > fast and/or under memory pressure, when udev or something else
> > > > might be killed)
> > >
> > > yeah and that is why the patch does the onlining from the kernel.
> > onlining in this patch is limited to hyperv, and the patch breaks
> > auto-online on x86 kvm/vmware/baremetal, as they reuse the same
> > hotplug path.
>
> Those can use the udev or do you see any reason why they couldn't?
Reasons are above, under the >>>> and >> quotations: the patch breaks
what Vitaly fixed (including the kvm/vmware usecases), i.e. udev/some
user-space process could be killed if hotplugged memory isn't onlined
fast enough, leading to service termination and/or memory not
being onlined at all (if udev is killed).

Currently the udev rule is not usable, and one needs a daemon
which would correctly do the onlining and keep the zone balance
even for the simple usecase of 1 normal and 1 movable zone.
And it gets more complicated in the case of multiple NUMA nodes
with multiple zones.

> > > > > Can you imagine any situation when somebody actually might want to have
> > > > > this knob enabled? From what I understand it doesn't seem to be the
> > > > > case.
> > > > For x86:
> > > > * this config option is enabled by default in recent Fedora,
> > >
> > > How do you want to support usecases which really want to online memory
> > > as movable? Do you expect those users to disable the option because
> > > unless I am missing something the in-kernel auto onlining only supports
> > > regular onlining.
> >
> > the current auto-onlining config option does what it's been designed for,
> > i.e. it onlines hotplugged memory.
> > It's possible for a non-average Fedora user to override the default
> > (commit 86dd995d6) if she/he needs non-default behavior
> > (i.e. the user knows how to online manually and/or can write
> > a daemon that would handle all of the nuances of the kernel in use).
> >
> > For the rest, when Fedora is used in the cloud and the user increases memory
> > via the management interface of whatever cloud she/he uses, it just works.
> >
> > So it's the choice of the distribution to pick its own default that makes
> > the majority of its user-base happy, and this patch removes that without
> > taking it into consideration.
>
> You still can have a udev rule to achieve the same thing for
> non-ballooning based hotplug.
not in the case when the system is under load; the udev path might be slow,
and udev might be killed by the OOM killer, leading to permanent disablement
of memory onlining.

> > How to online memory is a different issue not related to this patch;
> > the current default of onlining as ZONE_NORMAL works well for scaling
> > up VMs.
> >
> > Memory unplug is rather new and doesn't work reliably so far;
> > moving onlining to user-space won't really help. Further work
> > needs to be done so that it works reliably.
>
> The main problem I have with this is that this is a limited
> usecase-driven configuration knob which doesn't work properly for other
> usecases (namely movable onlining once your distribution chooses to set
> the config option to auto online).
it works for the default usecase in Fedora, and non-default
movable can be used with
1) removable memory auto-onlined as movable in the kernel, like
patch [1] would make hotplugged memory movable
(when I have time I'll try to work on it)
2) (or, in the worst case, due to the lack of an alternative) explicitly
disabled auto-online on the kernel CLI + an onlining daemon
(since udev isn't working in the current kernel due to the ordering issue)

> There is a userspace solution for this so this
> shouldn't have been merged in the first place!
Sorry, currently the user-space udev solution doesn't work, nor
will it work reliably in extreme conditions.

> It sneaked past a proper review
> process (linux-api wasn't CCed to get broader attention), which is
> really sad.
get_maintainer.pl doesn't list linux-api for 31bc3858ea3e;
MAINTAINERS should be fixed if linux-api were to be CCed.

> So unless this causes a major regression which would be hard to fix I
> will submit the patch for inclusion.
it will be a major regression due to the lack of a daemon that
could online fast and can't be killed on OOM. So this
clean-up patch does break a used feature without providing
a viable alternative.

I wouldn't object to removing the config option as in this patch
if memory were onlined by default for x86, but that's not the case yet.


[1] https://bugzilla.redhat.com/attachment.cgi?id=1146332

2017-03-06 15:05:55

by Michal Hocko

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> On Fri, 3 Mar 2017 09:27:23 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
> > > > [...]
> > > > > When trying to support memory unplug on the guest side in RHEL7,
> > > > > experience shows otherwise. A simplistic udev rule which onlines an
> > > > > added block doesn't work in case one wants to online it as movable.
> > > > >
> > > > > Hotplugged blocks in the current kernel should be onlined in reverse
> > > > > order to be onlined as movable, depending on the adjacent block's zone.
> > > >
> > > > Could you be more specific please? Setting online_movable from the udev
> > > > rule should just work regardless of the ordering or the state of other
> > > > memblocks. If that doesn't work I would call it a bug.
> > > It's rather an implementation constraint than a bug;
> > > for details and a workaround patch see
> > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
> >
> > "You are not authorized to access bug #1314306"
> Sorry,
> I've made it public; related comments and the patch should be accessible now
> (code snippets in the BZ are based on an older kernel, but the logic is still the same upstream).
>
> > could you paste the reasoning here please?
> sure, here is a reproducer:
> start a VM with this CLI:
> qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
> -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
> /path/to/guest_image
>
> then in the guest, dimm1's blocks are 32-39
>
> echo online_movable > /sys/devices/system/memory/memory32/state
> -bash: echo: write error: Invalid argument
>
> in the current mainline kernel it triggers the following code path:
>
> online_pages()
>     ...
>     if (online_type == MMOP_ONLINE_KERNEL) {
>             if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
>                     return -EINVAL;

Are you sure? I would expect MMOP_ONLINE_MOVABLE here

> zone_can_shift()
>     ...
>     if (idx < target) {
>             /* pages must be at end of current zone */
>             if (pfn + nr_pages != zone_end_pfn(zone))
>                     return false;
>
> since we are trying to online, as movable, a section that is not the
> last one in ZONE_NORMAL.
>
> Here is what makes hotplugged memory end up in ZONE_NORMAL:
> acpi_memory_enable_device() -> add_memory() -> add_memory_resource() ->
> -> arch/x86/mm/init_64.c
>
> /*
>  * Memory is added always to NORMAL zone. This means you will never get
>  * additional DMA/DMA32 memory.
>  */
> int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> {
>         ...
>         struct zone *zone = pgdat->node_zones +
>                 zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
>
> i.e. all hot-plugged memory modules always go to ZONE_NORMAL,
> and only the first/last block in a zone is allowed to be moved
> to another zone. Patch [1] tries to fix the issue by assigning
> removable memory resources to the movable zone, so hotplugged+removable
> blocks look like:
>     normal, movable, movable, movable
> instead of the current:
>     normal, normal, normal, movable

Hmm, this code is confusing and about as clear as mud. I have to stare at it
some more, but AFAIK zones shouldn't have problems with holes, so the only
thing we have to guarantee is that different zones do not overlap. So
this smells like a bug rather than an inherent implementation
limitation.

[...]
> > > > > Which means a simple udev rule isn't usable, since it gets events in
> > > > > first-to-last hotplugged block order. So now we would have
> > > > > to write a daemon that would
> > > > > - watch for all blocks in hotplugged memory to appear (how would it know?)
> > > > > - online them in the right order (the order might also differ depending
> > > > > on kernel version)
> > > > > -- it becomes even more complicated in the NUMA case, when there are
> > > > > multiple zones and the kernel would have to provide user-space
> > > > > with information about zone maps
> > > > >
> > > > > In short, current experience shows that the userspace approach
> > > > > - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> > > > > fast and/or under memory pressure, when udev or something else
> > > > > might be killed)
> > > >
> > > > yeah and that is why the patch does the onlining from the kernel.
> > > onlining in this patch is limited to hyperv, and the patch breaks
> > > auto-online on x86 kvm/vmware/baremetal, as they reuse the same
> > > hotplug path.
> >
> > Those can use the udev or do you see any reason why they couldn't?
>
> Reasons are above, under the >>>> and >> quotations: the patch breaks
> what Vitaly fixed (including the kvm/vmware usecases), i.e. udev/some
> user-space process could be killed if hotplugged memory isn't onlined
> fast enough, leading to service termination and/or memory not
> being onlined at all (if udev is killed).

OK, so from the discussion so far I have learned that this would be a
problem _only_ if we are trying to hotplug a _lot_ of memory at once
(~1.5% of the onlined memory is needed for its memmap, i.e. roughly 64
bytes of struct page per 4KiB page). I am kind of skeptical this is
a reasonable usecase. Who is going to hotadd 8G to a 256M machine (whose
memmap would eat half of the available memory, which is still quite far
from OOM)? Even if memory ballooning uses hotplug, such growth sounds
a bit excessive.

> Currently the udev rule is not usable, and one needs a daemon
> which would correctly do the onlining and keep the zone balance
> even for the simple usecase of 1 normal and 1 movable zone.
> And it gets more complicated in the case of multiple NUMA nodes
> with multiple zones.

That sounds more related to the current implementation than
anything else, and as such is not a reason to invent a specific
user-visible API. Btw. you are talking about movable zones, but the
auto onlining doesn't allow auto onlining movable memory. So either I
miss your point or I am utterly confused.

[...]
> > > Memory unplug is rather new and doesn't work reliably so far;
> > > moving onlining to user-space won't really help. Further work
> > > needs to be done so that it works reliably.
> >
> > The main problem I have with this is that this is a limited
> > usecase-driven configuration knob which doesn't work properly for other
> > usecases (namely movable onlining once your distribution chooses to set
> > the config option to auto online).
>
> it works for the default usecase in Fedora, and non-default
> movable can be used with
> 1) removable memory auto-onlined as movable in the kernel, like
> patch [1] would make hotplugged memory movable
> (when I have time I'll try to work on it)
> 2) (or, in the worst case, due to the lack of an alternative) explicitly
> disabled auto-online on the kernel CLI + an onlining daemon
> (since udev isn't working in the current kernel due to the ordering issue)

So I fail to see how this can work. Say the default will auto online
all the added memory. This will all be in Zone Normal, because the auto
onlining doesn't do online_movable. Now I would like to online the
memory as movable, but that might fail because some kernel objects might
already be sitting there, so the offline would fail.

> > There is a userspace solution for this so this
> > shouldn't have been merged in the first place!
> Sorry, currently the user-space udev solution doesn't work, nor
> will it work reliably in extreme conditions.
>
> > It sneaked past a proper review
> > process (linux-api wasn't CCed to get broader attention), which is
> > really sad.
>
> get_maintainer.pl doesn't list linux-api for 31bc3858ea3e;
> MAINTAINERS should be fixed if linux-api were to be CCed.

User-visible APIs _should_ be discussed on this mailing list regardless
of what get_maintainer.pl says. This is not about who the maintainer is,
but about getting as wide an audience as possible for things that would
have to be maintained basically forever.

> > So unless this causes a major regression which would be hard to fix I
> > will submit the patch for inclusion.
> it will be a major regression due to the lack of a daemon that
> could online fast and can't be killed on OOM. So this
> clean-up patch does break a used feature without providing
> a viable alternative.

So let's discuss the current memory hotplug shortcomings and get rid of
the crud which has developed on top. I will start by splitting up the patch
into 3 parts: do the auto online thing from the HyperV and Xen ballooning
drivers, drop the config option, and finally drop the sysfs knob.
The last patch might be NAKed, and I can live with that as long as the
reasoning is proper and there is a general consensus on that.
--
Michal Hocko
SUSE Labs

2017-03-07 12:42:07

by Igor Mammedov

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon, 6 Mar 2017 15:54:17 +0100
Michal Hocko <[email protected]> wrote:

> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> > On Fri, 3 Mar 2017 09:27:23 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
> > > > On Thu, 2 Mar 2017 15:28:16 +0100
> > > > Michal Hocko <[email protected]> wrote:
> > > >
> > > > > On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]

> > > > > memblocks. If that doesn't work I would call it a bug.
> > > > It's rather an implementation constraint than a bug;
> > > > for details and a workaround patch see
> > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
> > >
> > > "You are not authorized to access bug #1314306"
> > Sorry,
> > I've made it public; related comments and the patch should be accessible now
> > (code snippets in the BZ are based on an older kernel, but the logic is still the same upstream).
> >
> > > could you paste the reasoning here please?
> > sure, here is a reproducer:
> > start a VM with this CLI:
> > qemu-system-x86_64 -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
> > -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
> > /path/to/guest_image
> >
> > then in the guest, dimm1's blocks are 32-39
> >
> > echo online_movable > /sys/devices/system/memory/memory32/state
> > -bash: echo: write error: Invalid argument
> >
> > in the current mainline kernel it triggers the following code path:
> >
> > online_pages()
> >     ...
> >     if (online_type == MMOP_ONLINE_KERNEL) {
> >             if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> >                     return -EINVAL;
>
> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
pretty much; the reproducer is above, so try and see for yourself

[...]
> [...]
> > > > > > Which means a simple udev rule isn't usable, since it gets events in
> > > > > > first-to-last hotplugged block order. So now we would have
> > > > > > to write a daemon that would
> > > > > > - watch for all blocks in hotplugged memory to appear (how would it know?)
> > > > > > - online them in the right order (the order might also differ depending
> > > > > > on kernel version)
> > > > > > -- it becomes even more complicated in the NUMA case, when there are
> > > > > > multiple zones and the kernel would have to provide user-space
> > > > > > with information about zone maps
> > > > > >
> > > > > > In short, current experience shows that the userspace approach
> > > > > > - doesn't solve the issues that Vitaly has been fixing (i.e. onlining
> > > > > > fast and/or under memory pressure, when udev or something else
> > > > > > might be killed)
> > > > >
> > > > > yeah and that is why the patch does the onlining from the kernel.
> > > > onlining in this patch is limited to hyperv, and the patch breaks
> > > > auto-online on x86 kvm/vmware/baremetal, as they reuse the same
> > > > hotplug path.
> > >
> > > Those can use the udev or do you see any reason why they couldn't?
> >
> > Reasons are above, under the >>>> and >> quotations: the patch breaks
> > what Vitaly fixed (including the kvm/vmware usecases), i.e. udev/some
> > user-space process could be killed if hotplugged memory isn't onlined
> > fast enough, leading to service termination and/or memory not
> > being onlined at all (if udev is killed).
>
> OK, so from the discussion so far I have learned that this would be a
> problem _only_ if we are trying to hotplug a _lot_ of memory at once
> (~1.5% of the onlined memory is needed for its memmap, i.e. roughly 64
> bytes of struct page per 4KiB page). I am kind of skeptical this is
> a reasonable usecase. Who is going to hotadd 8G to a 256M machine (whose
> memmap would eat half of the available memory, which is still quite far
> from OOM)? Even if memory ballooning uses hotplug, such growth sounds
> a bit excessive.
The slow-and-killable-udev issue doesn't really depend on the
amount of hotplugged memory, since it's onlined in blocks
(128M on x86_64). Considering that it's currently onlined
as zone Normal, the kernel doesn't have any issues adding more
follow-up blocks of memory.
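(For reference, the block size is visible in sysfs; the value below is
the usual x86_64 default, shown just for illustration:

# cat /sys/devices/system/memory/block_size_bytes
8000000

i.e. 0x8000000 = 128M.)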

> > Currently the udev rule is not usable, and one needs a daemon
> > which would correctly do the onlining and keep the zone balance
> > even for the simple usecase of 1 normal and 1 movable zone.
> > And it gets more complicated in the case of multiple NUMA nodes
> > with multiple zones.
>
> That sounds more related to the current implementation than
> anything else, and as such is not a reason to invent a specific
> user-visible API. Btw. you are talking about movable zones, but the
> auto onlining doesn't allow auto onlining movable memory. So either I
> miss your point or I am utterly confused.
in the current state, neither does the udev rule, as memory is onlined
as NORMAL (in the x86_64 variant at least), which is the same as auto
online does now.

We are discussing 2 different issues here, and the thread got pretty
hard to follow. I'll try to sum up the results at the end.

> [...]
> > > > Memory unplug is rather new and doesn't work reliably so far;
> > > > moving onlining to user-space won't really help. Further work
> > > > needs to be done so that it works reliably.
> > >
> > > The main problem I have with this is that this is a limited
> > > usecase-driven configuration knob which doesn't work properly for other
> > > usecases (namely movable onlining once your distribution chooses to set
> > > the config option to auto online).
> >
> > it works for the default usecase in Fedora, and non-default
> > movable can be used with
> > 1) removable memory auto-onlined as movable in the kernel, like
> > patch [1] would make hotplugged memory movable
> > (when I have time I'll try to work on it)
> > 2) (or, in the worst case, due to the lack of an alternative) explicitly
> > disabled auto-online on the kernel CLI + an onlining daemon
> > (since udev isn't working in the current kernel due to the ordering issue)
>
> So I fail to see how this can work. Say the default will auto online
> all the added memory. This will all be in Zone Normal, because the auto
> onlining doesn't do online_movable. Now I would like to online the
> memory as movable, but that might fail because some kernel objects might
> already be sitting there, so the offline would fail.
the same happens with the udev rule, as currently it can online blocks,
from the first to the last hotplugged one, only as Zone Normal;
see
arch/x86/mm/init_64.c:
acpi_memory_enable_device() -> add_memory() -> add_memory_resource()

Both auto online and udev could be fixed by the patch
https://bugzilla.redhat.com/attachment.cgi?id=1146332
to online removable memory as movable.

> > > There is a userspace solution for this so this
> > > shouldn't have been merged in the first place!
> > Sorry, currently the user-space udev solution doesn't work, nor
> > will it work reliably in extreme conditions.
> >
> > > It sneaked past a proper review
> > > process (linux-api wasn't CCed to get broader attention), which is
> > > really sad.
> >
> > get_maintainer.pl doesn't list linux-api for 31bc3858ea3e;
> > MAINTAINERS should be fixed if linux-api were to be CCed.
>
> User-visible APIs _should_ be discussed on this mailing list regardless
> of what get_maintainer.pl says. This is not about who the maintainer is,
> but about getting as wide an audience as possible for things that would
> have to be maintained basically forever.
How would a random contributor know which list to CC?

> > > So unless this causes a major regression which would be hard to fix I
> > > will submit the patch for inclusion.
> > it will be a major regression due to the lack of a daemon that
> > could online fast and can't be killed on OOM. So this
> > clean-up patch does break a used feature without providing
> > a viable alternative.
>
> So let's discuss the current memory hotplug shortcomings and get rid of
> the crud which has developed on top. I will start by splitting up the patch
> into 3 parts: do the auto online thing from the HyperV and Xen ballooning
> drivers, drop the config option, and finally drop the sysfs knob.
> The last patch might be NAKed, and I can live with that as long as the
> reasoning is proper and there is a general consensus on that.
PS: CC me on those patches too

It's a major regression if you remove auto online in kernels that
run on top of x86 kvm/vmware hypervisors; making API cleanups
while breaking useful functionality doesn't make sense.

I would ACK the config option removal if auto online keeps working
for all x86 hypervisors (hyperv/xen aren't the only ones who need it)
and we keep the kernel CLI option to override the default.

That doesn't mean that others will agree with flipping the default;
that's why the config option was added.

Now, to sum up what's been discussed in this thread, there were 2
different issues discussed:
1) memory hotplug: remove in-kernel auto online for all
except hyperv/xen

- the suggested RFC is not acceptable from the virt point of view,
as it regresses guests on top of x86 kvm/vmware, which
both use ACPI-based memory hotplug.

- the udev/userspace solution doesn't work in practice, as it's
too slow and unreliable when the system is under load, which
is quite common in the virt usecase. That's why auto online
was introduced in the first place.

2) memory unplug: online memory as movable

- doesn't currently work with the udev rule, due to kernel
issues: https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

- could be fixed both for in-kernel auto online and udev
with the following patch:
https://bugzilla.redhat.com/attachment.cgi?id=1146332
but fixing it this way exposes zone imbalance issues,
which are not present in the current kernel, as blocks are
onlined into Zone Normal. So this is an area to work on and
improve.

- currently, if one wants to use online_movable,
one has to either
* disable auto online in the kernel OR
* remove the udev rule that the distro ships,
AND write a custom daemon that will be able to online
blocks in the right zone/order. So currently the whole
online_movable thing isn't working by default,
regardless of who onlines memory.

I'm in favor of implementing that in the kernel, as it keeps
kernel internals inside the kernel and doesn't need
the kernel API to be involved (memory blocks in sysfs,
online_kernel, online_movable).
There would be no need for userspace which would have to
deal with the kernel zoo and maintain that as well.

2017-03-09 12:54:10

by Michal Hocko

Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> On Mon, 6 Mar 2017 15:54:17 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
[...]
> > > in the current mainline kernel it triggers the following code path:
> > >
> > > online_pages()
> > >     ...
> > >     if (online_type == MMOP_ONLINE_KERNEL) {
> > >             if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> > >                     return -EINVAL;
> >
> > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> pretty much; the reproducer is above, so try and see for yourself

I will play with this...

[...]
> > > get_maintainer.pl doesn't list linux-api for 31bc3858ea3e;
> > > MAINTAINERS should be fixed if linux-api were to be CCed.
> >
> > User-visible APIs _should_ be discussed on this mailing list regardless
> > of what get_maintainer.pl says. This is not about who the maintainer is,
> > but about getting as wide an audience as possible for things that would
> > have to be maintained basically forever.
>
> How would a random contributor know which list to CC?

This should have been brought up during the review process which was
less than sufficient in this case.

> > > > So unless this causes a major regression which would be hard to fix I
> > > > will submit the patch for inclusion.
> > > it will be a major regression due to the lack of a daemon that
> > > could online fast and can't be killed on OOM. So this
> > > clean-up patch does break a used feature without providing
> > > a viable alternative.
> >
> > So let's discuss the current memory hotplug shortcomings and get rid of
> > the crud which has developed on top. I will start by splitting up the patch
> > into 3 parts: do the auto online thing from the HyperV and Xen ballooning
> > drivers, drop the config option, and finally drop the sysfs knob.
> > The last patch might be NAKed, and I can live with that as long as the
> > reasoning is proper and there is a general consensus on that.
> PS: CC me on those patches too
>
> It's a major regression if you remove auto online in kernels that
> run on top of x86 kvm/vmware hypervisors; making API cleanups
> while breaking useful functionality doesn't make sense.
>
> I would ACK the config option removal if auto online keeps working
> for all x86 hypervisors (hyperv/xen aren't the only ones who need it)
> and we keep the kernel CLI option to override the default.
>
> That doesn't mean that others will agree with flipping the default;
> that's why the config option was added.
>
> Now, to sum up what's been discussed in this thread, there were 2
> different issues discussed:
> 1) memory hotplug: remove in-kernel auto online for all
> except hyperv/xen
>
> - the suggested RFC is not acceptable from the virt point of view,
> as it regresses guests on top of x86 kvm/vmware, which
> both use ACPI-based memory hotplug.
>
> - the udev/userspace solution doesn't work in practice, as it's
> too slow and unreliable when the system is under load, which
> is quite common in the virt usecase. That's why auto online
> was introduced in the first place.

Please try to be more specific about why "too slow" is a problem. Also,
how much slower are we talking about?

> 2) memory unplug: online memory as movable
>
> - doesn't currently work with the udev rule, due to kernel
> issues: https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

These should be fixed

> - could be fixed both for in-kernel auto online and udev
> with the following patch:
> https://bugzilla.redhat.com/attachment.cgi?id=1146332
> but fixing it this way exposes zone imbalance issues,
> which are not present in the current kernel, as blocks are
> onlined into Zone Normal. So this is an area to work on and
> improve.
>
> - currently, if one wants to use online_movable,
> one has to either
> * disable auto online in the kernel OR

which might not just work because an unmovable allocation could have
made the memblock pinned.

> * remove the udev rule that the distro ships,
> AND write a custom daemon that will be able to online
> blocks in the right zone/order. So currently the whole
> online_movable thing isn't working by default,
> regardless of who onlines memory.

My experience with onlining full nodes as movable shows this works just
fine (with all the limitations of movable zones, but that is a
separate thing). I haven't played with configurations where movable
zones share the node with other zones.

> I'm in favor of implementing that in the kernel, as it keeps
> kernel internals inside the kernel and doesn't need
> the kernel API to be involved (memory blocks in sysfs,
> online_kernel, online_movable).
> There would be no need for userspace which would have to
> deal with the kernel zoo and maintain that as well.

The kernel is supposed to provide a proper API, and that is sysfs
currently. I am not entirely happy about it either, but pulling a lot of
code into the kernel is not the right thing to do, especially when
different usecases require different treatment.
--
Michal Hocko
SUSE Labs

2017-03-10 13:58:19

by Michal Hocko

Subject: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

Let's CC people touching this logic. A short summary is that onlining
memory via udev is currently unusable for online_movable, because blocks
are added (and get their events) from lower addresses up, while only the
last blocks are allowed to be onlined as movable. More below.

On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > On Mon, 6 Mar 2017 15:54:17 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> [...]
> > > > in current mainline kernel it triggers following code path:
> > > >
> > > > online_pages()
> > > > ...
> > > > if (online_type == MMOP_ONLINE_KERNEL) {
> > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> > > > return -EINVAL;
> > >
> > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > pretty much; the reproducer is above, so try and see for yourself
>
> I will play with this...

OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
[...]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
[ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
[ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
[ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]

so there is neither a Normal zone nor a Movable one at boot time.
Then I hotplugged a 1G slot
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

unfortunately the memory didn't show up automatically and I got
[ 116.375781] acpi PNP0C80:00: Enumeration failure

so I had to probe it manually (probably the BIOS my qemu uses doesn't
support auto probing - I haven't really dug further). Anyway the SRAT
table printed during the boot told that we should start at 0x100000000

# echo 0x100000000 > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

which looks reasonable, right? Both Normal and Movable zones are allowed

# echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

Huh, so our valid_zones have changed under our feet...

# echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

and again. So only the last memblock is considered movable. Let's try to
online them now.

# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This would explain why onlining from the last block actually works, but
to me this sounds like completely crappy behavior. All we need to
guarantee AFAICS is that the Normal and Movable zones do not overlap. I
believe there is even no real requirement about the ordering of the physical
memory in Normal vs. Movable zones as long as they do not overlap. But
let's keep it simple for the start and always enforce the current status
quo that the Normal zone physically precedes the Movable zone.
Can somebody explain why we cannot have a simple rule for Normal vs.
Movable which would be:
- block [pfn, pfn+block_size] can be Normal if
!zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
- block [pfn, pfn+block_size] can be Movable if
!zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
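
In pseudo-C, a sketch of that rule (populated_zone() and zone_end_pfn()
are the existing helpers; the two functions themselves are illustrative
only, not an actual patch) could look like:

/* illustrative only: may this block be onlined into ZONE_NORMAL? */
static bool block_can_be_normal(unsigned long pfn, unsigned long nr_pages,
				struct zone *movable)
{
	/* Normal is fine if there is no populated Movable zone yet,
	 * or the block ends below the Movable zone's start */
	return !populated_zone(movable) ||
		pfn + nr_pages < movable->zone_start_pfn;
}

/* illustrative only: may this block be onlined into ZONE_MOVABLE? */
static bool block_can_be_movable(unsigned long pfn, struct zone *normal)
{
	/* Movable is fine if there is no populated Normal zone yet,
	 * or the block starts past the Normal zone's end */
	return !populated_zone(normal) || pfn > zone_end_pfn(normal);
}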

I haven't fully grokked all the restrictions on the movable zone size
based on the kernel parameters (find_zone_movable_pfns_for_nodes), but
this shouldn't really make the situation much more complicated, I
believe, because those parameters should be mostly about static
initialization rather than hotplug, but I might easily be missing
something.
--
Michal Hocko
SUSE Labs

2017-03-10 15:53:57

by Michal Hocko

Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Fri 10-03-17 14:58:07, Michal Hocko wrote:
[...]
> This would explain why onlining from the last block actually works, but
> to me this sounds like completely crappy behavior. All we need to
> guarantee AFAICS is that the Normal and Movable zones do not overlap. I
> believe there is even no real requirement about the ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that the Normal zone physically precedes the Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
> - block [pfn, pfn+block_size] can be Normal if
> !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> - block [pfn, pfn+block_size] can be Movable if
> !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to the zone Normal
where they are accounted as spanned but not present. When we do
online_movable we just cut from the end of the Normal zone and move it
to the Movable zone. This sounds really awkward. What was the reason to go
this way? Why can't we simply add those pages to the zone at online time?
--
Michal Hocko
SUSE Labs

2017-03-10 17:39:42

by YASUAKI ISHIMATSU

Subject: Re: WTH is going on with memory hotplug sysfs interface



On 03/10/2017 08:58 AM, Michal Hocko wrote:
> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable, because blocks
> are added (and get their events) from lower addresses up, while only the
> last blocks are allowed to be onlined as movable. More below.
>
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
>> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
>>> On Mon, 6 Mar 2017 15:54:17 +0100
>>> Michal Hocko <[email protected]> wrote:
>>>
>>>> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
>> [...]
>>>>> in the current mainline kernel it triggers the following code path:
>>>>>
>>>>> online_pages()
>>>>>     ...
>>>>>     if (online_type == MMOP_ONLINE_KERNEL) {
>>>>>             if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
>>>>>                     return -EINVAL;
>>>>
>>>> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
>>> pretty much; the reproducer is above, so try and see for yourself
>>
>> I will play with this...
>
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
> [...]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
>
> so there is neither a Normal zone nor a Movable one at boot time.
> Then I hotplugged a 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>
> unfortunately the memory didn't show up automatically and I got
> [ 116.375781] acpi PNP0C80:00: Enumeration failure
>
> so I had to probe it manually (probably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x100000000
>
> # echo 0x100000000 > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory32/valid_zones
> Normal Movable
>
> which looks reasonable, right? Both Normal and Movable zones are allowed
>
> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>
> Huh, so our valid_zones have changed under our feet...
>
> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>
> and again. So only the last memblock is considered movable. Let's try to
> online them now.
>
> # echo online_movable > /sys/devices/system/memory/memory34/state
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>

I think there is no strong reason why the kernel has this restriction.
Setting the restriction seems to have made management of these zone
structs simpler.

Thanks,
Yasuaki Ishimatsu

> This would explain why onlining from the last block actually works but
> to me this sounds like a completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is even no real requirement about ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that Normal zone is physically preceding Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
> - block [pfn, pfn+block_size] can be Normal if
> !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> - block [pfn, pfn+block_size] can be Movable if
> !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
>
> I haven't fully grokked all the restrictions on the movable zone size
> based on the kernel parameters (find_zone_movable_pfns_for_nodes) but
> this shouldn't really make the situation much more complicated I
> believe because those parameters should be mostly about static
> initialization rather than hotplug but I might be easily missing
> something.
>

2017-03-10 19:01:09

by Reza Arbab

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
>OK, so while I was playing with this setup some more I probably got why
>this is done this way. All new memblocks are added to the zone Normal
>where they are accounted as spanned but not present.

It's not always zone Normal. See zone_for_memory(). This leads to a
workaround for having to do online_movable in descending block order.
Instead of this:

1. probe block 34, probe block 33, probe block 32, ...
2. online_movable 34, online_movable 33, online_movable 32, ...

you can online_movable the first block before adding the rest:

1. probe block 32, online_movable 32
2. probe block 33, probe block 34, ...
- zone_for_memory() will cause these to start Movable
3. online 33, online 34, ...
- they're already in Movable, so online_movable is equivalent
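
A minimal transcript of that workaround, reusing the probe addresses
from the example above (the block numbers and the 128M block size are
assumptions based on that setup):

# echo 0x100000000 > /sys/devices/system/memory/probe
# echo online_movable > /sys/devices/system/memory/memory32/state
# echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
# echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
# echo online > /sys/devices/system/memory/memory33/state
# echo online > /sys/devices/system/memory/memory34/state

Once block 32 is Movable, zone_for_memory() should start the later
blocks out as Movable, so the plain online keeps them there.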

I agree with your general sentiment that this stuff is very
nonintuitive.

--
Reza Arbab

2017-03-10 22:01:08

by Daniel Kiper

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

Hey,

On Mon, Mar 06, 2017 at 03:54:17PM +0100, Michal Hocko wrote:

[...]

> So let's discuss the current memory hotplug shortcomings and get rid of
> the crud which developed on top. I will start by splitting up the patch
> into 3 parts. Do the auto online thing from the HyperV and xen ballooning
> drivers and dropping the config option and finally drop the sysfs knob.
> The last patch might be NAKed and I can live with that as long as the
> reasoning is proper and there is a general consensus on that.

If it is not too late please CC me on the patches and relevant threads.

Daniel

2017-03-13 09:20:50

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface

On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
> On 03/10/2017 08:58 AM, Michal Hocko wrote:
[...]
> >OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
> >[...]
> >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> >[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> >[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> >[ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> >[ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> >[ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> >[ 0.000000] Zone ranges:
> >[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> >[ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> >[ 0.000000] Normal empty
> >[ 0.000000] Movable zone start for each node
> >[ 0.000000] Early memory node ranges
> >[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> >[ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> >[ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
> >
> >so there is neither any normal zone nor movable one at the boot time.
> >Then I hotplugged 1G slot
> >(qemu) object_add memory-backend-ram,id=mem1,size=1G
> >(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> >
> >unfortunately the memory didn't show up automatically and I got
> >[ 116.375781] acpi PNP0C80:00: Enumeration failure
> >
> >so I had to probe it manually (probably the BIOS my qemu uses doesn't
> >support auto probing - I haven't really dug further). Anyway the SRAT
> >table printed during the boot told that we should start at 0x100000000
> >
> ># echo 0x100000000 > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory32/valid_zones
> >Normal Movable
> >
> >which looks reasonably right? Both Normal and Movable zones are allowed
> >
> ># echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >
> >Huh, so our valid_zones have changed under our feet...
> >
> ># echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal
> >/sys/devices/system/memory/memory34/valid_zones:Normal Movable
> >
> >and again. So only the last memblock is considered movable. Let's try to
> >online them now.
> >
> ># echo online_movable > /sys/devices/system/memory/memory34/state
> ># grep . /sys/devices/system/memory/memory3?/valid_zones
> >/sys/devices/system/memory/memory32/valid_zones:Normal
> >/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >/sys/devices/system/memory/memory34/valid_zones:Movable Normal
> >
>
> I think there is no strong reason why the kernel has this restriction.
> Setting the restriction seems to have made management of these zone
> structs simpler.

Could you be more specific please? How could this make management any
easier when udev is basically racing with the physical hotplug and the
result is basically undefined?
--
Michal Hocko
SUSE Labs

2017-03-13 09:22:46

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> >OK, so while I was playing with this setup some more I probably got why
> >this is done this way. All new memblocks are added to the zone Normal
> >where they are accounted as spanned but not present.
>
> It's not always zone Normal. See zone_for_memory(). This leads to a
> workaround for having to do online_movable in descending block order.
> Instead of this:
>
> 1. probe block 34, probe block 33, probe block 32, ...
> 2. online_movable 34, online_movable 33, online_movable 32, ...
>
> you can online_movable the first block before adding the rest:

How do I enforce that behavior when the probe happens automagically?

> 1. probe block 32, online_movable 32
> 2. probe block 33, probe block 34, ...
> - zone_for_memory() will cause these to start Movable
> 3. online 33, online 34, ...
> - they're already in Movable, so online_movable is equivalent
>
> I agree with your general sentiment that this stuff is very nonintuitive.

My criterion for nonintuitive is probably different because I would call
this _completely_unusable_. Sorry for being so loud about this but the
more I look into this area the more WTF code I see. This has seen close
to zero review and seems to be building up more single-usecase
code on top of the previous. We need to change this, seriously!
--
Michal Hocko
SUSE Labs

2017-03-13 10:31:30

by Igor Mammedov

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Fri, 10 Mar 2017 14:58:07 +0100
Michal Hocko <[email protected]> wrote:

> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
>
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> > On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > > On Mon, 6 Mar 2017 15:54:17 +0100
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> > [...]
> > > > > in current mainline kernel it triggers following code path:
> > > > >
> > > > > online_pages()
> > > > > ...
> > > > > if (online_type == MMOP_ONLINE_KERNEL) {
> > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> > > > > return -EINVAL;
> > > >
> > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > > pretty much, reproducer is above so try and see for yourself
> >
> > I will play with this...
>
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
'mem' here distributes boot memory specified by "-m 2G" and does not
include memory specified by -device pc-dimm.
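
For instance (an illustrative invocation; the ids are made up), boot
memory and a hotpluggable DIMM would be declared like this:

qemu-system-x86_64 -m 2G,slots=4,maxmem=4G \
    -numa node,mem=1G -numa node,mem=1G \
    -object memory-backend-ram,id=mem1,size=1G \
    -device pc-dimm,id=dimm1,memdev=mem1

where the two mem=1G chunks split the -m 2G boot memory and the
pc-dimm occupies one of the 4 hotplug slots within maxmem.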

> [...]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
>
> so there is neither any normal zone nor movable one at the boot time.
it could be if hotpluggable memory were present at boot time in the E820 table
(if I remember right, when running on hyperv there is a movable zone at boot time),

but in qemu hotpluggable memory isn't put into E820,
so the zone is allocated later when the memory is enumerated
by the ACPI subsystem and onlined.
It causes fewer issues wrt the movable zone and works for
different versions of linux/windows as well.

That's where in-kernel auto-onlining could also be useful,
since the user would be able to start up with a small amount of
non-removable memory plus several removable DIMMs
and have all the memory onlined/available by the time
the initrd is loaded. (The missing piece here is onlining
removable memory as movable by default.)
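
A boot command line fragment for that scenario would be (using the
existing override knob; 'online' is one of its documented values):

memhp_default_state=online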


> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
You can also specify the node a pc-dimm goes to with the 'node' property
if it should go to a node other than node 0.

device_add pc-dimm,id=dimm1,memdev=mem1,node=1


> unfortunately the memory didn't show up automatically and I got
> [ 116.375781] acpi PNP0C80:00: Enumeration failure
it should work,
do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?
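
A quick way to check, assuming the config is exposed in the usual
places:

# grep ACPI_HOTPLUG_MEMORY /boot/config-$(uname -r)
# zgrep ACPI_HOTPLUG_MEMORY /proc/config.gz   (needs CONFIG_IKCONFIG_PROC)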

> so I had to probe it manually (probably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x100000000

2017-03-13 10:43:26

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> On Fri, 10 Mar 2017 14:58:07 +0100
[...]
> > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> > [ 0.000000] Normal empty
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
> >
> > so there is neither any normal zone nor movable one at the boot time.
> it could be if hotpluggable memory were present at boot time in the E820 table
> (if I remember right, when running on hyperv there is a movable zone at boot time),
>
> but in qemu hotpluggable memory isn't put into E820,
> so the zone is allocated later when the memory is enumerated
> by the ACPI subsystem and onlined.
> It causes fewer issues wrt the movable zone and works for
> different versions of linux/windows as well.
>
> That's where in-kernel auto-onlining could also be useful,
> since the user would be able to start up with a small amount of
> non-removable memory plus several removable DIMMs
> and have all the memory onlined/available by the time
> the initrd is loaded. (The missing piece here is onlining
> removable memory as movable by default.)

Why should we even care to online that memory that early rather than
making it available via e820?

> > Then I hotplugged 1G slot
> > (qemu) object_add memory-backend-ram,id=mem1,size=1G
> > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> You can also specify node a pc-dimm goes to with 'node' property
> if it should go to other then node 0.
>
> device_add pc-dimm,id=dimm1,memdev=mem1,node=1

thanks for the tip

> > unfortunately the memory didn't show up automatically and I got
> > [ 116.375781] acpi PNP0C80:00: Enumeration failure
> it should work,
> do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?

No I didn't. Thanks, good to know!
--
Michal Hocko
SUSE Labs

2017-03-13 10:56:11

by Igor Mammedov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Thu, 9 Mar 2017 13:54:00 +0100
Michal Hocko <[email protected]> wrote:

[...]
> > It's major regression if you remove auto online in kernels that
> > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > while breaking useful functionality doesn't make sense.
> >
> > I would ACK config option removal if auto online keeps working
> > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > and keep kernel CLI option to override default.
> >
> > That doesn't mean that others will agree with flipping default,
> > that's why config option has been added.
> >
> > Now to sum up what's been discussed on this thread, there were 2
> > different issues discussed:
> > 1) memory hotplug: remove in kernel auto online for all
> > except of hyperv/xen
> >
> > - suggested RFC is not acceptable from virt point of view
> > as it regresses guests on top of x86 kvm/vmware which
> > both use ACPI based memory hotplug.
> >
> > - udev/userspace solution doesn't work in practice as it's
> > too slow and unreliable when system is under load which
> > is quite common in virt usecase. That's why auto online
> > has been introduced in the first place.
>
> Please try to be more specific why "too slow" is a problem. Also how
> much slower are we talking about?
In virt case on host with lots VMs, userspace handler
processing could be scheduled late enough to trigger a race
between (guest memory going away/OOM handler) and memory
coming online.

>
> > 2) memory unplug: online memory as movable
> >
> > - doesn't work currently with udev rule due to kernel
> > issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7
>
> These should be fixed
>
> > - could be fixed both for in kernel auto online and udev
> > with following patch:
> > https://bugzilla.redhat.com/attachment.cgi?id=1146332
> > but fixing it this way exposes zone disbalance issues,
> > which are not present in current kernel as blocks are
> > onlined in Zone Normal. So this is area to work and
> > improve on.
> >
> > - currently if one wants to use online_movable,
> > one has to either
> > * disable auto online in kernel OR
>
> which might not just work because an unmovable allocation could have
> made the memblock pinned.
With memhp_default_state=offline on kernel CLI there won't be any
unmovable allocation as hotplugged memory won't be onlined and
user can online it manually. So it works for non default usecase
of playing with memory hot-unplug.

> > * remove udev rule that distro ships
> > AND write custom daemon that will be able to online
> > block in right zone/order. So currently whole
> > online_movable thing isn't working by default
> > regardless of who onlines memory.
>
> my experience with onlining full nodes as movable shows this works just
> fine (with all the limitations of the movable zones but that is a
> separate thing). I haven't played with configurations where movable
> zones are sharing the node with other zones.
I don't have access to such a baremetal configuration to play
with anymore.


> > I'm in favor of implementing that in kernel as it keeps
> > kernel internals inside kernel and doesn't need
> > kernel API to be involved (memory blocks in sysfs,
> > online_kernel, online_movable)
> > There would be no need in userspace which would have to
> > deal with kernel zoo and maintain that as well.
>
> The kernel is supposed to provide a proper API and that is sysfs
> currently. I am not entirely happy about it either but pulling a lot of
> code into the kernel is not the right thing to do. Especially when
> different usecases require different treatment.
If it could be done from the kernel side alone, it looks like a better way
to me not to involve userspace at all. And for ACPI based x86/ARM it's
possible to implement without adding a lot of kernel code.
That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
so we could continue improving kernel-only auto-onlining
and fixing current memory hot(un)plug issues without affecting
other platforms/users that are not interested in it.
(PS: I don't care much about sysfs knob for setting auto-onlining,
as kernel CLI override with memhp_default_state seems
sufficient to me)

2017-03-13 12:28:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> On Thu, 9 Mar 2017 13:54:00 +0100
> Michal Hocko <[email protected]> wrote:
>
> [...]
> > > It's major regression if you remove auto online in kernels that
> > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > while breaking useful functionality doesn't make sense.
> > >
> > > I would ACK config option removal if auto online keeps working
> > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > and keep kernel CLI option to override default.
> > >
> > > That doesn't mean that others will agree with flipping default,
> > > that's why config option has been added.
> > >
> > > Now to sum up what's been discussed on this thread, there were 2
> > > different issues discussed:
> > > 1) memory hotplug: remove in kernel auto online for all
> > > except of hyperv/xen
> > >
> > > - suggested RFC is not acceptable from virt point of view
> > > as it regresses guests on top of x86 kvm/vmware which
> > > both use ACPI based memory hotplug.
> > >
> > > - udev/userspace solution doesn't work in practice as it's
> > > too slow and unreliable when system is under load which
> > > is quite common in virt usecase. That's why auto online
> > > has been introduced in the first place.
> >
> > Please try to be more specific why "too slow" is a problem. Also how
> > much slower are we talking about?
>
> In virt case on host with lots VMs, userspace handler
> processing could be scheduled late enough to trigger a race
> between (guest memory going away/OOM handler) and memory
> coming online.

Either you are mixing two things together or this doesn't really make
much sense. So is this ballooning based on memory hotplug (aka active
memory hotadd initiated between guest and host automatically) or a guest
asking for additional memory by other means (pay more for memory etc.)?
Because if this is an administrative operation then I seriously question
this reasoning.

[...]
> > > - currently if one wants to use online_movable,
> > > one has to either
> > > * disable auto online in kernel OR
> >
> > which might not just work because an unmovable allocation could have
> > made the memblock pinned.
>
> With memhp_default_state=offline on kernel CLI there won't be any
> unmovable allocation as hotplugged memory won't be onlined and
> user can online it manually. So it works for non default usecase
> of playing with memory hot-unplug.

I was talking about the case when the auto_online was true, of course,
e.g. depending on the config option which you've said is enabled in
Fedora kernels.

[...]
> > > I'm in favor of implementing that in kernel as it keeps
> > > kernel internals inside kernel and doesn't need
> > > kernel API to be involved (memory blocks in sysfs,
> > > online_kernel, online_movable)
> > > There would be no need in userspace which would have to
> > > deal with kernel zoo and maintain that as well.
> >
> > The kernel is supposed to provide a proper API and that is sysfs
> > currently. I am not entirely happy about it either but pulling a lot of
> > code into the kernel is not the right thing to do. Especially when
> > different usecases require different treatment.
>
> If it could be done from the kernel side alone, it looks like a better way
> to me not to involve userspace at all. And for ACPI based x86/ARM it's
> possible to implement without adding a lot of kernel code.

But this is not how we do the kernel development. We provide the API so
that userspace can implement the appropriate policy on top. We do not
add random knobs to implement the same thing in the kernel. Different
users might want to implement different onlining strategies and that is
hardly describable by a single global knob. Just look at the s390
example provided earlier. Please try to think out of your usecase scope.

> That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
> so we could continue improving kernel-only auto-onlining
> and fixing current memory hot(un)plug issues without affecting
> other platforms/users that are not interested in it.

I really do not see any reason to keep the config option. Setting
this to enabled is the _wrong_ thing to do in a general purpose
(distribution) kernel, and a kernel for the specific usecase can achieve
the same thing via the boot command line.

> (PS: I don't care much about sysfs knob for setting auto-onlining,
> as kernel CLI override with memhp_default_state seems
> sufficient to me)

That is good to hear! I would be OK with keeping the kernel command line
option until we resolve all the current issues with the hotplug.

--
Michal Hocko
SUSE Labs

2017-03-13 12:56:07

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

Michal Hocko <[email protected]> writes:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> > >
>> > > - suggested RFC is not acceptable from virt point of view
>> > > as it regresses guests on top of x86 kvm/vmware which
>> > > both use ACPI based memory hotplug.
>> > >
>> > > - udev/userspace solution doesn't work in practice as it's
>> > > too slow and unreliable when system is under load which
>> > > is quite common in virt usecase. That's why auto online
>> > > has been introduced in the first place.
>> >
>> > Please try to be more specific why "too slow" is a problem. Also how
>> > much slower are we talking about?
>>
>> In virt case on host with lots VMs, userspace handler
>> processing could be scheduled late enough to trigger a race
>> between (guest memory going away/OOM handler) and memory
>> coming online.
>
> Either you are mixing two things together or this doesn't really make
> much sense. So is this ballooning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.

I'm probably repeating myself but it seems this point was lost:

This is not really a 'ballooning', it is just a pure memory
hotplug. People may have any tools monitoring their VM memory usage and
when a VM is running low on memory they may want to hotplug more memory
to it. With udev-style memory onlining they should be aware of page
tables and other in-kernel structures which require allocation so they
need to add memory slowly and gradually or they risk running into OOM
(at least getting some processes killed and these processes may be
important). With in-kernel memory hotplug everything happens
synchronously and no 'slowly and gradually' algorithm is required in
all tools which may trigger memory hotplug.
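
For reference, the udev-style onlining in question is typically a
single rule along these lines (a sketch of the variant distros ship;
the exact match keys vary):

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"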

It's not about slowness, it's about being synchronous or
asynchronous. This is not related to the virtualization technology used,
the use-case is the same for all of them which support memory hotplug.

--
Vitaly

2017-03-13 13:19:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> > >
> >> > > - suggested RFC is not acceptable from virt point of view
> >> > > as it regresses guests on top of x86 kvm/vmware which
> >> > > both use ACPI based memory hotplug.
> >> > >
> >> > > - udev/userspace solution doesn't work in practice as it's
> >> > > too slow and unreliable when system is under load which
> >> > > is quite common in virt usecase. That's why auto online
> >> > > has been introduced in the first place.
> >> >
> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> > much slower are we talking about?
> >>
> >> In virt case on host with lots VMs, userspace handler
> >> processing could be scheduled late enough to trigger a race
> >> between (guest memory going away/OOM handler) and memory
> >> coming online.
> >
> > Either you are mixing two things together or this doesn't really make
> > much sense. So is this ballooning based on memory hotplug (aka active
> > memory hotadd initiated between guest and host automatically) or a guest
> > asking for additional memory by other means (pay more for memory etc.)?
> > Because if this is an administrative operation then I seriously question
> > this reasoning.
>
> I'm probably repeating myself but it seems this point was lost:
>
> This is not really a 'ballooning', it is just a pure memory
> hotplug. People may have any tools monitoring their VM memory usage and
> when a VM is running low on memory they may want to hotplug more memory
> to it.

What is the API those guests use to ask for the memory? And who is actually
responsible to ask for that memory? Is it a kernel or userspace
solution?

> With udev-style memory onlining they should be aware of page
> tables and other in-kernel structures which require allocation so they
> need to add memory slowly and gradually or they risk running into OOM
> (at least getting some processes killed and these processes may be
> important). With in-kernel memory hotplug everything happens
> synchronously and no 'slowly and gradually' algorithm is required in
> all tools which may trigger memory hotplug.

What prevents those APIs from being used reasonably, asking only for as much
memory as they can afford? I mean, the 1.5% of available memory necessary for
the hotplug is not all that much. Or more precisely, what prevents asking
for this additional memory in a synchronous way?

And just to prevent further confusion: I can see why _in-kernel_
hotplug based 'ballooning' (or whatever you call an on-demand memory hotplug
based on memory pressure) wants to be synchronous, and that is why
my patch changed those drivers to online the memory explicitly. I am
questioning whether memory hotplug requested by an admin/userspace component
needs any special kernel assistance, because it is only a shortcut which can
be implemented from userspace. I hope I've made myself clear
finally.
--
Michal Hocko
SUSE Labs

2017-03-13 13:43:08

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

Michal Hocko <[email protected]> writes:

> On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
>> >> > >
>> >> > > - suggested RFC is not acceptable from virt point of view
>> >> > > as it regresses guests on top of x86 kvm/vmware which
>> >> > > both use ACPI based memory hotplug.
>> >> > >
>> >> > > - udev/userspace solution doesn't work in practice as it's
>> >> > > too slow and unreliable when system is under load which
>> >> > > is quite common in virt usecase. That's why auto online
>> >> > > has been introduced in the first place.
>> >> >
>> >> > Please try to be more specific why "too slow" is a problem. Also how
>> >> > much slower are we talking about?
>> >>
>> >> In virt case on host with lots VMs, userspace handler
>> >> processing could be scheduled late enough to trigger a race
>> >> between (guest memory going away/OOM handler) and memory
>> >> coming online.
>> >
>> > Either you are mixing two things together or this doesn't really make
>> > much sense. So is this ballooning based on memory hotplug (aka active
>> > memory hotadd initiated between guest and host automatically) or a guest
>> > asking for additional memory by other means (pay more for memory etc.)?
>> > Because if this is an administrative operation then I seriously question
>> > this reasoning.
>>
>> I'm probably repeating myself but it seems this point was lost:
>>
>> This is not really a 'ballooning', it is just a pure memory
>> hotplug. People may have any tools monitoring their VM memory usage and
>> when a VM is running low on memory they may want to hotplug more memory
>> to it.
>
> What is the API those guests use to ask for the memory? And who is actually
> responsible to ask for that memory? Is it a kernel or userspace
> solution?

Whatever, this can even be a system administrator running
'free'. Hyper-V driver sends si_mem_available() and
vm_memory_committed() metrics to the host every second and this can be
later queried by any tool (e.g. powershell script).
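
For reference, the guest-side counterparts of those two metrics are
visible in /proc/meminfo: MemAvailable is computed by si_mem_available()
and Committed_AS by vm_memory_committed().

# grep -E 'MemAvailable|Committed_AS' /proc/meminfo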

>
>> With udev-style memory onlining they should be aware of page
>> tables and other in-kernel structures which require allocation so they
>> need to add memory slowly and gradually or they risk running into OOM
>> (at least getting some processes killed and these processes may be
>> important). With in-kernel memory hotplug everything happens
>> synchronously and no 'slowly and gradually' algorithm is required in
>> all tools which may trigger memory hotplug.
>
> What prevents those APIs from being used reasonably, asking only for as much
> memory as they can afford? I mean, the 1.5% of available memory necessary for
> the hotplug is not all that much. Or more precisely, what prevents asking
> for this additional memory in a synchronous way?

The knowledge about the fact that we need to add memory slowly and
wait till it gets onlined is not obvious. AFAIR when you hotplug memory
to Windows VMs there is no such thing as 'onlining', and no brain is
required, a simple script 'low memory -> add more memory' always
works. Asking all these script writers to think twice before issuing a
memory add command sounds like too much (to me).
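
The kind of automation meant here is literally a couple of monitor
commands fired whenever the VM looks low on memory (the ids below are
made up, following the earlier example):

(qemu) object_add memory-backend-ram,id=mem2,size=1G
(qemu) device_add pc-dimm,id=dimm2,memdev=mem2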

--
Vitaly

2017-03-13 13:57:30

by Igor Mammedov

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Mon, 13 Mar 2017 11:43:02 +0100
Michal Hocko <[email protected]> wrote:

> On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > On Fri, 10 Mar 2017 14:58:07 +0100
> [...]
> > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > > [ 0.000000] Zone ranges:
> > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> > > [ 0.000000] Normal empty
> > > [ 0.000000] Movable zone start for each node
> > > [ 0.000000] Early memory node ranges
> > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> > > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
> > >
> > > so there is neither any normal zone nor movable one at the boot time.
> > it could be if hotpluggable memory were present at boot time in the E820 table
> > (if I remember right, when running on hyperv there is a movable zone at boot time),
> >
> > but in qemu hotpluggable memory isn't put into E820,
> > so the zone is allocated later when the memory is enumerated
> > by the ACPI subsystem and onlined.
> > It causes fewer issues wrt the movable zone and works for
> > different versions of linux/windows as well.
> >
> > That's where in-kernel auto-onlining could also be useful,
> > since the user would be able to start up with a small amount of
> > non-removable memory plus several removable DIMMs
> > and have all the memory onlined/available by the time
> > the initrd is loaded. (The missing piece here is onlining
> > removable memory as movable by default.)
>
> Why should we even care to online that memory that early rather than
> making it available via e820?

It's not forbidden by the spec and has fewer complications
when it comes to removable memory. Declaring it in E820
would add the following limitations/drawbacks:
- firmware should be able to exclude removable memory
  from its usage (currently neither SeaBIOS nor EFI has to
  know/care about it) => less qemu-guest ABI to maintain.
- the OS should be taught to avoid/move (early) nonmovable
  allocations from removable address ranges.
  There were patches targeting that in recent kernels,
  but it won't work with older kernels that don't have them,
  limiting the range of OSes that could run on QEMU
  and do memory removal.

The E820-less approach works reasonably well with a wide range
of guest OSes and is less complex than if removable memory
were present in E820. Hence I don't have a compelling
reason to introduce removable memory in E820 as it
only adds to hot(un)plug issues.

I have an off-tree QEMU hack that puts hotremovable
memory added with "-device pc-dimm" on CLI into E820
to experiment with. It could be useful to play
with zone layouts at boot time, so if you are
interested I can rebase it on top of current master
and post it here to play with.

2017-03-13 14:32:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Mon 13-03-17 13:54:59, Vitaly Kuznetsov wrote:
> >> Michal Hocko <[email protected]> writes:
> >>
> >> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> >> >> > >
> >> >> > > - suggested RFC is not acceptable from virt point of view
> >> >> > > as it regresses guests on top of x86 kvm/vmware which
> >> >> > > both use ACPI based memory hotplug.
> >> >> > >
> >> >> > > - udev/userspace solution doesn't work in practice as it's
> >> >> > > too slow and unreliable when system is under load which
> >> >> > > is quite common in virt usecase. That's why auto online
> >> >> > > has been introduced in the first place.
> >> >> >
> >> >> > Please try to be more specific why "too slow" is a problem. Also how
> >> >> > much slower are we talking about?
> >> >>
> >> >> In virt case on host with lots VMs, userspace handler
> >> >> processing could be scheduled late enough to trigger a race
> >> >> between (guest memory going away/OOM handler) and memory
> >> >> coming online.
> >> >
> >> > Either you are mixing two things together or this doesn't really make
> >> > much sense. So is this ballooning based on memory hotplug (aka active
> >> > memory hotadd initiated between guest and host automatically) or a guest
> >> > asking for additional memory by other means (pay more for memory etc.)?
> >> > Because if this is an administrative operation then I seriously question
> >> > this reasoning.
> >>
> >> I'm probably repeating myself but it seems this point was lost:
> >>
> >> This is not really a 'ballooning', it is just a pure memory
> >> hotplug. People may have any tools monitoring their VM memory usage and
> >> when a VM is running low on memory they may want to hotplug more memory
> >> to it.
> >
> > What is the API those guests use to ask for the memory? And who is actually
> > responsible to ask for that memory? Is it a kernel or userspace
> > solution?
>
> Whatever, this can even be a system administrator running
> 'free'.

I am pretty sure that 'free' will not give you additional memory but
let's be serious here... If this is solely about monitoring from
userspace and requesting more memory from the userspace then I would
consider arguing about timely hotplug operation as void because there is
absolutely no guarantee to do the request itself in a timely fashion.

> Hyper-V driver sends si_mem_available() and
> vm_memory_committed() metrics to the host every second and this can be
> later queried by any tool (e.g. powershell script).

And how exactly is this related to the acpi hotplug which you were
arguing needs the timely handling as well?

> >> With udev-style memory onlining they should be aware of page
> >> tables and other in-kernel structures which require allocation so they
> >> need to add memory slowly and gradually or they risk running into OOM
> >> (at least getting some processes killed and these processes may be
> >> important). With in-kernel memory hotplug everything happens
> >> synchronously and no 'slowly and gradually' algorithm is required in
> >> all tools which may trigger memory hotplug.
> >
> > What prevents those APIs from being used reasonably, asking only for as much
> > memory as they can afford? I mean, the 1.5% of available memory necessary for
> > the hotplug is not all that much. Or more precisely, what prevents asking
> > for this additional memory in a synchronous way?
>
> The knowledge about the fact that we need to add memory slowly and
> wait till it gets onlined is not obvious.

yes it is, and we cannot afford to give a better experience with an
implementation that requires having memory to online memory.

> AFAIR when you hotplug memory
> to Windows VMs there is no such thing as 'onlining', and no brain is
> required, a simple script 'low memory -> add more memory' always
> works. Asking all these script writers to think twice before issuing a
> memory add command sounds like too much (to me).

Pardon me, but not requiring a brain while doing something on Windows
VMs is not really an argument...
--
Michal Hocko
SUSE Labs

2017-03-13 14:36:29

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Mon 13-03-17 14:57:12, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 11:43:02 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > > On Fri, 10 Mar 2017 14:58:07 +0100
> > [...]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > > > [ 0.000000] Zone ranges:
> > > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> > > > [ 0.000000] Normal empty
> > > > [ 0.000000] Movable zone start for each node
> > > > [ 0.000000] Early memory node ranges
> > > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > > > [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> > > > [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
> > > >
> > > > so there is neither any normal zone nor movable one at the boot time.
> > > it could be if hotpluggable memory were present at boot time in the E820 table
> > > (if I remember right, when running on hyperv there is a movable zone at boot time),
> >
> > > but in qemu hotpluggable memory isn't put into E820,
> > > so the zone is allocated later when the memory is enumerated
> > > by the ACPI subsystem and onlined.
> > > It causes fewer issues wrt the movable zone and works for
> > > different versions of linux/windows as well.
> >
> > > That's where in-kernel auto-onlining could also be useful,
> > > since the user would be able to start up with a small amount of
> > > non-removable memory plus several removable DIMMs
> > > and have all the memory onlined/available by the time
> > > the initrd is loaded. (The missing piece here is onlining
> > > removable memory as movable by default.)
> >
> > Why should we even care to online that memory that early rather than
> > making it available via e820?
>
> It's not forbidden by the spec and has fewer complications
> when it comes to removable memory. Declaring it in E820
> would add the following limitations/drawbacks:
> - firmware should be able to exclude removable memory
>   from its usage (currently neither SeaBIOS nor EFI has to
>   know/care about it) => less qemu-guest ABI to maintain.
> - the OS should be taught to avoid/move (early) nonmovable
>   allocations from removable address ranges.
>   There were patches targeting that in recent kernels,
>   but it won't work with older kernels that don't have them,
>   limiting the range of OSes that could run on QEMU
>   and do memory removal.
>
> The E820-less approach works reasonably well with a wide range
> of guest OSes and is less complex than if removable memory
> were present in E820. Hence I don't have a compelling
> reason to introduce removable memory in E820 as it
> only adds to hot(un)plug issues.

OK, I see, and that sounds like an argument to not put those ranges into
E820. I still fail to see why we have to online the memory early during
boot and cannot wait for userspace to run?

--
Michal Hocko
SUSE Labs

2017-03-13 14:58:49

by Reza Arbab

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
>> I agree with your general sentiment that this stuff is very
>> nonintuitive.
>
>My criterion for nonintuitive is probably different because I would
>call this _completely_unusable_. Sorry for being so loud about this but
>the more I look into this area the more WTF code I see. This has seen
>close to zero review and seems to be building up more single-usecase
>code on top of the previous. We need to change this, seriously!

No argument here. I'm happy to help however I can.

--
Reza Arbab

2017-03-13 15:10:49

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

Michal Hocko <[email protected]> writes:

> On Mon 13-03-17 14:42:37, Vitaly Kuznetsov wrote:
>> >
>> > What is the API those guests use to ask for the memory? And who is actually
>> > responsible to ask for that memory? Is it a kernel or userspace
>> > solution?
>>
>> Whatever, this can even be a system administrator running
>> 'free'.
>
> I am pretty sure that 'free' will not give you additional memory but
> let's be serious here...
> If this is solely about monitoring from
> userspace and requesting more memory from the userspace then I would
> consider arguing about timely hotplug operation as void because there is
> absolutely no guarantee to do the request itself in a timely fashion.
>
>> Hyper-V driver sends si_mem_available() and
>> vm_memory_committed() metrics to the host every second and this can be
>> later queried by any tool (e.g. powershell script).
>
> And how exactly is this related to the acpi hotplug which you were
> arguing needs the timely handling as well?
>

What I meant to say is that there is no single 'right' way to get memory
usage from a VM, make a decision somewhere (in the hypervisor, on some
other host or even in someone's head) and issue a command to add more
memory. I don't know what particular tools people use with ESX/KVM VMs
but I think that multiple options are available.

>> >> With udev-style memory onlining they should be aware of page
>> >> tables and other in-kernel structures which require allocation so they
>> >> need to add memory slowly and gradually or they risk running into OOM
>> >> (at least getting some processes killed and these processes may be
>> >> important). With in-kernel memory hotplug everything happens
>> >> synchronously and no 'slowly and gradually' algorithm is required in
>> >> all tools which may trigger memory hotplug.
>> >
>> > What prevents those APIs from being used reasonably, asking only for as much
>> > memory as they can afford? I mean, the 1.5% of available memory necessary for
>> > the hotplug is not all that much. Or more precisely, what prevents asking
>> > for this additional memory in a synchronous way?
>>
>> The knowledge about the fact that we need to add memory slowly and
>> wait till it gets onlined is not obvious.
>
> yes it is, and we cannot afford to give a better experience with an
> implementation that requires having memory to online memory.

Actually, we need memory to add memory, not to online it. And as I said
before, this is a real issue which requires addressing, it should always
be possible to add more memory (and to online already added memory if
these events are separated).
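
(For scale: assuming the usual 64-byte struct page and 4 KiB pages on
x86_64, the metadata alone is 64/4096 = ~1.56% of the hotplugged range,
which is where the ~1.5% figure above comes from.)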

>
>> AFAIR when you hotplug memory
>> to Windows VMs there is no such thing as 'onlining', and no brain is
>> required, a simple script 'low memory -> add more memory' always
>> works. Asking all these script writers to think twice before issuing a
>> memory add command sounds like too much (to me).
>
> Pardon me, but not requiring a brain while doing something on Windows
> VMs is not really an argument...

Why? Windows (or any other OS) is just an example that things can be
done in a different way; otherwise we'll end up with arguments like "it
was always like that in Linux, so it's good".

--
Vitaly

2017-03-13 15:11:14

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

Let's add Andi

On Fri 10-03-17 16:53:33, Michal Hocko wrote:
> On Fri 10-03-17 14:58:07, Michal Hocko wrote:
> [...]
> > This would explain why onlining from the last block actually works but
> > to me this sounds like a completely crappy behavior. All we need to
> > guarantee AFAICS is that Normal and Movable zones do not overlap. I
> > believe there is even no real requirement about ordering of the physical
> > memory in Normal vs. Movable zones as long as they do not overlap. But
> > let's keep it simple for the start and always enforce the current status
> > quo that Normal zone is physically preceding Movable zone.
> > Can somebody explain why we cannot have a simple rule for Normal vs.
> > Movable which would be:
> > - block [pfn, pfn+block_size] can be Normal if
> > !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> > - block [pfn, pfn+block_size] can be Movable if
> > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
>
> OK, so while I was playing with this setup some more I probably got why
> this is done this way. All new memblocks are added to the zone Normal
> where they are accounted as spanned but not present. When we do
> online_movable we just cut from the end of the Normal zone and move it
> to Movable zone. This sounds really awkward. What was the reason to go
> this way? Why cannot we simply add those pages to the zone at the online
> time?

Answering myself. So the reason seems to be 9d99aaa31f59 ("[PATCH]
x86_64: Support memory hotadd without sparsemem") which is no longer
true because
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on COMPILE_TEST || !KASAN

so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled.
config X86_64_ACPI_NUMA
def_bool y
prompt "ACPI NUMA detection"
depends on X86_64 && NUMA && ACPI && PCI
select ACPI_NUMA

But I do not see any way to enable anything but SPARSEMEM for x86_64
choice
prompt "Memory model"
depends on SELECT_MEMORY_MODEL
default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
default FLATMEM_MANUAL

ARCH_DISCONTIGMEM_DEFAULT is 32b only
config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
depends on NUMA && X86_32

and ARCH_SPARSEMEM_DEFAULT is enabled on 64b. So I guess whatever was
the reason to add this code back in 2006 is not true anymore. So I am
really wondering. Do we absolutely need to assign pages which are not
onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
out of any zone and wait for memory online operation to put them where
requested?
--
Michal Hocko
SUSE Labs

2017-03-13 23:16:14

by Andi Kleen

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

> and ARCH_SPARSEMEM_DEFAULT is enabled on 64b. So I guess whatever was
> the reason to add this code back in 2006 is not true anymore. So I am
> really wondering. Do we absolutely need to assign pages which are not
> onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
> out of any zone and wait for memory online operation to put them where
> requested?

I don't remember any specific reason. May have just been simplification.
Should be fine to use any zone.

-Andi

2017-03-14 13:20:34

by Igor Mammedov

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Mon, 13 Mar 2017 13:28:25 +0100
Michal Hocko <[email protected]> wrote:

> On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > On Thu, 9 Mar 2017 13:54:00 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > [...]
> > > > It's major regression if you remove auto online in kernels that
> > > > run on top of x86 kvm/vmware hypervisors, making API cleanups
> > > > while breaking useful functionality doesn't make sense.
> > > >
> > > > I would ACK config option removal if auto online keeps working
> > > > for all x86 hypervisors (hyperv/xen isn't the only who needs it)
> > > > and keep kernel CLI option to override default.
> > > >
> > > > That doesn't mean that others will agree with flipping default,
> > > > that's why config option has been added.
> > > >
> > > > Now to sum up what's been discussed on this thread, there were 2
> > > > different issues discussed:
> > > > 1) memory hotplug: remove in kernel auto online for all
> > > > except of hyperv/xen
> > > >
> > > > - suggested RFC is not acceptable from virt point of view
> > > > as it regresses guests on top of x86 kvm/vmware which
> > > > both use ACPI based memory hotplug.
> > > >
> > > > - udev/userspace solution doesn't work in practice as it's
> > > > too slow and unreliable when system is under load which
> > > > is quite common in virt usecase. That's why auto online
> > > > has been introduced in the first place.
> > >
> > > Please try to be more specific why "too slow" is a problem. Also how
> > > much slower are we talking about?
> >
> > In virt case on host with lots VMs, userspace handler
> > processing could be scheduled late enough to trigger a race
> > between (guest memory going away/OOM handler) and memory
> > coming online.
>
> Either you are mixing two things together or this doesn't really make
> much sense. So is this ballooning based on memory hotplug (aka active
> memory hotadd initiated between guest and host automatically) or a guest
> asking for additional memory by other means (pay more for memory etc.)?
> Because if this is an administrative operation then I seriously question
> this reasoning.
It doesn't have to be a user-initiated action.
Think of pay-as-you-go phone plans; the same use case
applies to cloud environments, where what is typically a user-initiated
hotplug action on baremetal can easily be automated to hotplug on demand.


> [...]
> > > > - currently if one wants to use online_movable,
> > > > one has to either
> > > > * disable auto online in kernel OR
> > >
> > > which might not just work because an unmovable allocation could have
> > > made the memblock pinned.
> >
> > With memhp_default_state=offline on kernel CLI there won't be any
> > unmovable allocation as hotplugged memory won't be onlined and
> > user can online it manually. So it works for non default usecase
> > of playing with memory hot-unplug.
>
> I was talking about the case when the auto_online was true, of course,
> e.g. depending on the config option which you've said is enabled in
> Fedora kernels.
>
> [...]
> > > > I'm in favor of implementing that in kernel as it keeps
> > > > kernel internals inside kernel and doesn't need
> > > > kernel API to be involved (memory blocks in sysfs,
> > > > online_kernel, online_movable)
> > > > There would be no need in userspace which would have to
> > > > deal with kernel zoo and maintain that as well.
> > >
> > > The kernel is supposed to provide a proper API and that is sysfs
> > > currently. I am not entirely happy about it either but pulling a lot of
> > code into the kernel is not the right thing to do. Especially when
> > > different usecases require different treatment.
> >
> > If it could be done from the kernel side alone, it looks like a better way
> > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > possible to implement without adding a lot of kernel code.
>
> But this is not how we do the kernel development. We provide the API so
> that userspace can implement the appropriate policy on top. We do not
> add random knobs to implement the same thing in the kernel. Different
> users might want to implement different onlining strategies and that is
> hardly describable by a single global knob. Just look at the s390
> example provided earlier. Please try to think out of your usecase scope.
And could you think outside the scope of the legacy sysfs-based onlining usecase?

I don't think comparing s390 with x86 is correct, as the platforms
and hardware implementations of memory hotplug are different, with
correspondingly different requirements; hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
was introduced to allow a platform to specify the behavior.

For x86/ARM (+ACPI) it's possible to implement hotplug in a race-free
way inside the kernel without userspace intervention, onlining memory
using a hardware-vendor-defined policy (the ACPI SRAT/Memory device
describe memory sufficiently to do it) so the user won't have to do it
manually. The config option is a convenient way to enable the new
feature for platforms that can support it.
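
To illustrate the idea (a rough sketch only, not actual kernel code:
online_block_range() is a made-up helper standing in for the real
onlining path, and the SRAT flag check is simplified):

	/* on ACPI memory device hot-add, after the blocks are created */
	if (mem_affinity->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
		/* vendor marked the range hot-pluggable => keep it removable */
		online_block_range(start_pfn, nr_pages, MMOP_ONLINE_MOVABLE);
	else
		online_block_range(start_pfn, nr_pages, MMOP_ONLINE_KEEP);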

It's good to maintain a uniform API to userspace as long as the API
does the job, but staying stuck with the legacy way isn't good when
there is a way (even though it's limited to a subset of platforms)
to improve it by removing the need for the API, making the overall
system less complex and race-free (more reliable).

> > That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
> > so we could continue improving kernel-only auto-onlining
> > and fixing current memory hot(un)plug issues without affecting
> > other platforms/users that are not interested in it.
>
> I really do not see any reason to keep the config option. Setting up
> this to enabled is _wrong_ thing to do in general purpose
> (distribution) kernel and a kernel for the specific usecase can achieve
> the same thing via boot command line.
I have to disagree with you that setting the policy 'not online by default'
in the kernel is more valid than the opposite policy, 'online by default'.
It may work for your usecases, but that doesn't mean it suits the
needs of others.

As an example, the RHEL distribution (x86) ships with the memory
auto-online policy enabled by default, as that is what customers ask for.

And onlining memory as removable is considered a specific usecase,
since arguably the number of users where physical memory removal is
supported is smaller than the number where just hot-add is
supported; moreover, the virt usecase alone adds a huge userbase to
the latter, as it's easily available/accessible compared to bare-metal
hotplug.

So the default depends on the target audience, and distributions need
a config option to pick the default that suits their customers' needs.
If we don't provide a reliably working memory hot-add solution,
customers will just move to an OS that does (Windows, or, with your
patch, a hyperv/xen-based cloud instead of KVM/VMware).

> > (PS: I don't care much about sysfs knob for setting auto-onlining,
> > as kernel CLI override with memhp_default_state seems
> > sufficient to me)
>
> That is good to hear! I would be OK with keeping the kernel command line
> option until we resolve all the current issues with the hotplug.
Your RFC doesn't fix anything except cleaning up the config option,
and even at that it does it inconsistently, breaking both kinds of
userspace:
- one that does expect auto-online:
  a kernel update on Fedora will break memory hot-add
  (on KVM/VMware hosts), since userspace doesn't ship any
  scripts that would do it, while it will continue to work on
  hyperv/xen hosts.
- another that doesn't expect auto-online:
  no change for KVM/VMware, but suddenly hyperv/xen would
  start auto-onlining memory.
So users would have broken VMs until the regression is noticed,
and would have to manually fix the userspace they use to
accommodate the 'improved' kernel.

Under the guise of a cleanup, this RFC removes the default choice
from distributions and actually cripples Linux guests on top
of KVM/VMware hosts while favoring xen/hyperv hosts.
IMHO it doesn't make any sense.

2017-03-14 16:06:41

by YASUAKI ISHIMATSU

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface



On 03/13/2017 05:19 AM, Michal Hocko wrote:
> On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
>> On 03/10/2017 08:58 AM, Michal Hocko wrote:
> [...]
>>> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated
>>> [...]
>>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
>>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
>>> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
>>> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
>>> [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
>>> [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
>>> [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
>>> [ 0.000000] Zone ranges:
>>> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
>>> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
>>> [ 0.000000] Normal empty
>>> [ 0.000000] Movable zone start for each node
>>> [ 0.000000] Early memory node ranges
>>> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
>>> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
>>> [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
>>>
>>> so there is neither any normal zone nor movable one at the boot time.
>>> Then I hotplugged 1G slot
>>> (qemu) object_add memory-backend-ram,id=mem1,size=1G
>>> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>>>
>>> unfortunately the memory didn't show up automatically and I got
>>> [ 116.375781] acpi PNP0C80:00: Enumeration failure
>>>
>>> so I had to probe it manually (probably the BIOS my qemu uses doesn't
>>> support auto probing - I haven't really dug further). Anyway the SRAT
>>> table printed during the boot told that we should start at 0x100000000
>>>
>>> # echo 0x100000000 > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory32/valid_zones
>>> Normal Movable
>>>
>>> which looks reasonably right? Both Normal and Movable zones are allowed
>>>
>>> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>>>
>>> Huh, so our valid_zones have changed under our feet...
>>>
>>> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal
>>> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>>>
>>> and again. So only the last memblock is considered movable. Let's try to
>>> online them now.
>>>
>>> # echo online_movable > /sys/devices/system/memory/memory34/state
>>> # grep . /sys/devices/system/memory/memory3?/valid_zones
>>> /sys/devices/system/memory/memory32/valid_zones:Normal
>>> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>>> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>>>
>>
>> I think there is no strong reason why the kernel has the restriction.
>> Setting the restriction seems to have made management of
>> these zone structs simple.
>
> Could you be more specific please? How could this make management any
> easier when udev is basically racing with the physical hotplug and the
> result is basically undefined?
>

When changing a zone from NORMAL (N) to MOVABLE (M), we must resize both
zones, i.e. adjust zone->zone_start_pfn and zone->spanned_pages. With the
current restriction in place, we can simply do:

zone(N)->spanned_pages -= nr_pages
zone(M)->zone_start_pfn -= nr_pages

But if any memory could change zone with no restriction, we would have to
recalculate both zones' spanned_pages and zone_start_pfn as follows:

memory section #
 1 2 3 4 5 6 7
|N|M|N|N|N|M|M|
   |
   v
|N|N|N|N|N|M|M|
* change memory section #2 from MOVABLE to NORMAL;
then we must find the next movable memory section (#6) and resize both zones.

I think when movable memory was implemented there was no requirement for
this, so the kernel has the current restriction.
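
To make the unrestricted case concrete, here is a minimal user-space
sketch (illustrative only, not kernel code; the section size and the
layout are made up) of the rescan needed to recompute a zone's span
from a per-section zone map:

#include <stdio.h>

#define SECTION_PAGES 32768UL	/* one 128MB section with 4KB pages */

struct zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Recompute a zone's span by scanning every section: O(sections),
 * versus the two O(1) updates the current restriction allows. */
static void recalc_span(const char *sec, int nr_sec, char which,
			struct zone *z)
{
	int first = -1, last = -1, i;

	for (i = 0; i < nr_sec; i++) {
		if (sec[i] != which)
			continue;
		if (first < 0)
			first = i;
		last = i;
	}
	z->zone_start_pfn = first < 0 ? 0 : first * SECTION_PAGES;
	z->spanned_pages = first < 0 ? 0 : (last - first + 1) * SECTION_PAGES;
}

int main(void)
{
	/* |N|M|N|N|N|M|M| after changing section #2 to NORMAL: */
	char sec[] = { 'N', 'N', 'N', 'N', 'N', 'M', 'M' };
	struct zone n, m;

	recalc_span(sec, 7, 'N', &n);
	recalc_span(sec, 7, 'M', &m);
	printf("N: start %lu span %lu, M: start %lu span %lu\n",
	       n.zone_start_pfn, n.spanned_pages,
	       m.zone_start_pfn, m.spanned_pages);
	return 0;
}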

Thanks,
Yasuaki Ishimatsu

2017-03-14 16:20:14

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface

On Tue 14-03-17 12:05:59, YASUAKI ISHIMATSU wrote:
>
>
> On 03/13/2017 05:19 AM, Michal Hocko wrote:
> >On Fri 10-03-17 12:39:27, Yasuaki Ishimatsu wrote:
> >>On 03/10/2017 08:58 AM, Michal Hocko wrote:
[...]
> >>># echo online_movable > /sys/devices/system/memory/memory34/state
> >>># grep . /sys/devices/system/memory/memory3?/valid_zones
> >>>/sys/devices/system/memory/memory32/valid_zones:Normal
> >>>/sys/devices/system/memory/memory33/valid_zones:Normal Movable
> >>>/sys/devices/system/memory/memory34/valid_zones:Movable Normal
> >>>
> >>
> >>I think there is no strong reason why the kernel has the restriction.
> >>Setting the restriction seems to have made management of
> >>these zone structs simple.
> >
> >Could you be more specific please? How could this make management any
> >easier when udev is basically racing with the physical hotplug and the
> >result is basically undefined?
> >
>
> When changing a zone from NORMAL (N) to MOVABLE (M), we must resize both
> zones, i.e. adjust zone->zone_start_pfn and zone->spanned_pages. With the
> current restriction in place, we can simply do:
>
> zone(N)->spanned_pages -= nr_pages
> zone(M)->zone_start_pfn -= nr_pages

Yes I understand how this made the implementation simpler. I was
questioning how this made user management any easier. Changing
valid zones which races with the hotplug consumer (e.g. udev) sounds
like a terrible idea to me.

Anyway, it seems that the initial assumption/restriction that all
pages have to start in zone Normal is not really needed. I have a
preliminary patch which removes that and associates newly added pages
with a zone at online time, and it seems to be working reasonably
well. I have to iron out some corners before I post it.
--
Michal Hocko
SUSE Labs

2017-03-14 19:35:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

Hello,

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > >OK, so while I was playing with this setup some more I probably got why
> > >this is done this way. All new memblocks are added to the zone Normal
> > >where they are accounted as spanned but not present.
> >
> > It's not always zone Normal. See zone_for_memory(). This leads to a
> > workaround for having to do online_movable in descending block order.
> > Instead of this:
> >
> > 1. probe block 34, probe block 33, probe block 32, ...
> > 2. online_movable 34, online_movable 33, online_movable 32, ...
> >
> > you can online_movable the first block before adding the rest:
>
> How do I enforce that behavior when the probe happens automagically?

What I provided as a guide for onlining as movable on current and older
kernels is:

1) Remove udev rule
2) After adding memory with qemu/libvirt API run in the guest

------- workaround start ----
#!/bin/bash
# online still-offline blocks as movable, highest block number first
for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do
	[ "`cat $i/state`" == "offline" ] && echo online_movable > $i/state
done
------- workaround end ----

That's how bad onlining as movable is for memory hot-unplug.

> > 1. probe block 32, online_movable 32
> > 2. probe block 33, probe block 34, ...
> > - zone_for_memory() will cause these to start Movable
> > 3. online 33, online 34, ...
> > - they're already in Movable, so online_movable is equivalent
> >
> > I agree with your general sentiment that this stuff is very nonintuitive.
>
> My criterion for nonintuitive is probably different because I would call
> this _completely_unusable_. Sorry for being so loud about this but the
> more I look into this area the more WTF code I see. This has seen close
> to zero review and seems to be building up more single usecase code on
> top of previous. We need to change this, seriously!

It's not a bug, but when I found it I called it a "constraint" and
when I debugged it (IIRC) it originated here:

} else if (online_type == MMOP_ONLINE_MOVABLE) {
	if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
		return -EINVAL;
}

Fixing it so you could online as movable even if it wasn't the last
block was in my todo list but then we had other plans.

Unfortunately unless pfn+nr_pages of the newly created Movable zone
matches the end of the normal zone (or whatever was there that has to
be converted to Movable), it will fail onlining as movable.

To fix it, it is enough to check that everything from pfn to the end of
whatever zone existed there (or to the end of the node, which is perhaps
safer) can be converted to movable, and then to convert the whole thing
to movable instead of stopping at pfn+nr_pages.
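
In pseudo-code form (just a sketch of the idea against the snippet
above, not a real patch: convert_to_movable() is a made-up helper, and
zone_end stands for the end of whatever zone spans the range):

	/* check the whole tail [pfn, zone_end), not just [pfn, pfn + nr_pages) */
	if (!zone_can_shift(pfn, zone_end - pfn, ZONE_MOVABLE, &zone_shift))
		return -EINVAL;
	/* ... and convert all of it, so ZONE_MOVABLE still ends up contiguous */
	convert_to_movable(pfn, zone_end - pfn);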

Also note that onlining highmem physical ranges as movable requires yet
another config option to be set for the check below to pass
(CONFIG_MOVABLE_NODE=y). I'm not exactly sure why anybody would
want to set it to =n; it would perhaps be a candidate for dropping well
before considering dropping _DEFAULT_ONLINE, and at best it should be
replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y
behavior.

if ((zone_idx(zone) > ZONE_NORMAL ||
     online_type == MMOP_ONLINE_MOVABLE) &&
    !can_online_high_movable(zone))
	return -EINVAL;

I would suggest dropping the constraints in online_pages, and perhaps
also the CONFIG_MOVABLE_NODE option, before going on to drop the
automatic onlining in the kernel.

Because of the above constraint the udev rule is unusable anyway for
onlining memory as movable so that it can be hot-unplugged reliably
(well, not so reliably: page migration does a loop and tries many times,
but it happened to me that offlining failed once and worked at the next
try; temporary page pins with O_DIRECT interfere with page migration,
rightfully so, and there is no easy fix).

After the above constraint is fixed, I suggest looking into fixing the
async onlining by making the udev rule run synchronously. Adding 4T to
an 8G guest is a perfectly reasonable and normal operation; there is no
excuse for it to fail, as long as you don't pretend that such 4T must be
unpluggable later too (which we won't).

As I understand it, the whole point of _DEFAULT_ONLINE=y is precisely
that it's easier to fix this in the kernel than to fix the udev
rule. Furthermore, the above constraint on the zone shifting, which
breaks online_movable in udev, is not an issue for _DEFAULT_ONLINE=y,
because kernel onlining is synchronous by design (no special
synchronous udev rule fix is required), so it can cope fine with the
existing constraint by onlining as movable from the end of the
last zone in each node, so that the constraint never gets in the way.

Extending _DEFAULT_ONLINE=y so that it can also online as movable has
been worked on.

On a side note, kernelcore=xxx passed to a kernel with
_DEFAULT_ONLINE=y should already allow creating lots of
hot-unpluggable memory onlined automatically as movable (never tested,
but I would expect it to work).
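
For illustration only (untested, and the size is made up), that would
mean booting a CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y kernel with
something like:

	kernelcore=2G

so that 2G stays usable for kernel (unmovable) allocations while the
balance, including automatically onlined hotplugged blocks, would end
up movable.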

After the udev rule can handle adding 4T to an 8G guest, and it can
handle onlining memory as movable reliably by just doing
s/online/online_movable/, I think we can then restart the discussion
about the removal of _DEFAULT_ONLINE=y. As far as I can tell, there
are higher-priority and more interesting things to fix in this area
before _DEFAULT_ONLINE=y can be removed. Either udev gets fixed, and
it's reasonably simple to fix (it will remain slower), or we should
eventually obsolete the udev rule instead, which is why the focus
hasn't been on fixing the udev rule but on replacing it.

To be clear, I'm not necessarily against eventually removing
_DEFAULT_ONLINE, but an equally reliable and comparably fast
alternative should be provided first, and there's no such alternative
right now.

If s390 has special issues requiring an admin or a software hotplug
manager app to enable blocks of memory by hand, _DEFAULT_ONLINE
could then be an option selected or not selected by
arch/*/Kconfig. The udev rule is still an automatic action, so it's 1:1
with the in-kernel feature. If the in-kernel automatic onlining is not
workable on s390, I would assume the udev rule is also not workable.

Thanks,
Andrea

2017-03-15 07:54:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

On Tue 14-03-17 14:20:14, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 13:28:25 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Mon 13-03-17 11:55:54, Igor Mammedov wrote:
> > > On Thu, 9 Mar 2017 13:54:00 +0100
> > > Michal Hocko <[email protected]> wrote:
[...]
> > > > The kernel is supposed to provide a proper API and that is sysfs
> > > > currently. I am not entirely happy about it either but pulling a lot of
> > > > code into the kernel is not the right thing to do. Especially when
> > > > different usecases require different treatment.
> > >
> > > If it could be done from kernel side alone, it looks like a better way
> > > to me not to involve userspace at all. And for ACPI based x86/ARM it's
> > > possible to implement without adding a lot of kernel code.
> >
> > But this is not how we do the kernel development. We provide the API so
> > that userspace can implement the appropriate policy on top. We do not
> > add random knobs to implement the same thing in the kernel. Different
> > users might want to implement different onlining strategies and that is
> > hardly describable by a single global knob. Just look at the s390
> > example provided earlier. Please try to think out of your usecase scope.
>
> And could you think outside the scope of the legacy sysfs-based onlining usecase?

Well, I always prefer a more generic solution which supports more
usecases, and I am trying really hard to understand the usecases you are
coming up with. So far I have heard that the current sysfs behavior is
broken (which is true!) and some very vague arguments about why we need
to online as quickly as possible, to the point that userspace handling is
an absolute no-go.

To be honest, I still consider the latter a non-issue. If the only thing
you care about is the memory footprint of the first phase, then I believe
this is fixable. Memblock and section descriptors should be the only
necessary things to allocate, and that is not much.

As an aside, the more I think about the way the original authors
separated the physical hotadd from the onlining, the more I appreciate
that decision, because the way memory can be onlined is definitely not
carved in stone and evolves with usecases. I believe nobody expected
back then that memory could be onlined as movable, and I am pretty sure
new ways will emerge over time.

> I don't think comparing s390 with x86 is correct, as the platforms
> and hardware implementations of memory hotplug are different, with
> correspondingly different requirements; hence CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> was introduced to allow a platform to specify the behavior.

There are different usecases which are arch-agnostic, e.g. deciding the
movability based on some criterion (a specific node, a physical address
range and what not). Global auto onlining cannot handle those for obvious
reasons, and a config option will not achieve that for the same
reason.

> For x86/ARM (+ACPI) it's possible to implement hotplug in a race-free
> way inside the kernel without userspace intervention, onlining memory
> using a hardware-vendor-defined policy (the ACPI SRAT/Memory device
> describe memory sufficiently to do it) so the user won't have to do it
> manually. The config option is a convenient way to enable the new
> feature for platforms that can support it.

Sigh. Can you see the actual difference between the global kernel policy
and the policy coming from the specific hardware (ACPI etc...)? I am not
opposing auto onlining based on the ACPI attributes. But what we have
now is a policy _in_the_kernel_. This is almost always a bad idea and
I do not see any strong argument why it would be any different in this
particular case. Actually your current default in Fedora makes it harder
for anybody to use movable zones/nodes.

> It's good to maintain a uniform API to userspace as long as the API
> does the job, but staying stuck with the legacy way isn't good when
> there is a way (even though it's limited to a subset of platforms)
> to improve it by removing the need for the API, making the overall
> system less complex and race-free (more reliable).

then convince your virtualization platform to provide the necessary data
for the memory auto onlining via ACPI etc...

> > > That's one more reason to keep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
> > > so we could continue improving kernel-only auto-onlining
> > > and fixing current memory hot(un)plug issues without affecting
> > > other platforms/users that are not interested in it.
> >
> > I really do not see any reason to keep the config option. Setting up
> > this to enabled is _wrong_ thing to do in general purpose
> > (distribution) kernel and a kernel for the specific usecase can achieve
> > the same thing via boot command line.
>
> I have to disagree with you that setting the policy 'not online by default'
> in the kernel is more valid than the opposite policy, 'online by default'.
> It may work for your usecases, but that doesn't mean it suits the
> needs of others.

Well, as described above, there are good reasons not to hardwire any
policy into the kernel, because things tend to evolve and come with many
surprising usecases the original authors haven't anticipated at all.

On the other hand we have your auto_online policy, which handles _one_
particular class of usecases and which I believe could have been addressed
by enhancing the implementation of the current interface: e.g. allocate
less memory in the initial phase, preemptively fail the first phase
when there is too much memory waiting for onlining, or even help udev
react faster by having preallocated workers to handle events. Instead, I
suspect, you have chosen the path of least resistance/effort, and now
we've ended up with a global policy with known limitations. I cannot say
I would be happy about that.

> As an example, the RHEL distribution (x86) ships with the memory
> auto-online policy enabled by default, as that is what customers ask for.
>
> And onlining memory as removable is considered a specific usecase,
> since arguably the number of users where physical memory removal is
> supported is smaller than the number where just hot-add is
> supported; moreover, the virt usecase alone adds a huge userbase to
> the latter, as it's easily available/accessible compared to bare-metal
> hotplug.

this might be the case now, but it might turn out to be a completely wrong
thing to do in a few years when overhyped^Wcloud workloads won't be all
that interesting anymore.

> So the default depends on the target audience, and distributions need
> a config option to pick the default that suits their customers' needs.

Well, I would hope that such a thing could be achieved by more flexible
means than the kernel config... e.g. pre-defined defaults that I can
install as a package, rather than enforcing a particular policy on
everybody.

> If we don't provide a reliably working memory hot-add solution,
> customers will just move to an OS that does (Windows, or, with your
> patch, a hyperv/xen-based cloud instead of KVM/VMware).
>
> > > (PS: I don't care much about sysfs knob for setting auto-onlining,
> > > as kernel CLI override with memhp_default_state seems
> > > sufficient to me)
> >
> > That is good to hear! I would be OK with keeping the kernel command line
> > option until we resolve all the current issues with the hotplug.
>
> Your RFC doesn't fix anything except cleaning up the config option,
> and even at that it does it inconsistently, breaking both kinds of
> userspace:
> - one that does expect auto-online:
>   a kernel update on Fedora will break memory hot-add
>   (on KVM/VMware hosts), since userspace doesn't ship any
>   scripts that would do it, while it will continue to work on
>   hyperv/xen hosts.

that is actually trivial to address: provide a userspace fix while
the kernel still offers the functionality, and remove the kernel
functionality later. Nobody talks about removing the whole thing at
once. API changes are not that simple at all.

> - another that doesn't expect auto-online:
>   no change for KVM/VMware, but suddenly hyperv/xen would
>   start auto-onlining memory.

I would argue that removing a policy which covers only some usecases is
itself a fix, but whatever. We obviously disagree here...

Anyway, I consider "never break the userspace" to be a hard rule and I
do not want to break any usecase, of course. I thought this RFC would
help trigger a constructive discussion with some reasonable outcome,
where we would eventually get rid of the cruft. It seems this will not
be the case, because an immediate half-solution is preferred
much more than exhausting all the potential options these days.

I am sorry, but I have to say I really hate the way this all sneaked in
without a wider review. If this had gone through a proper review
process, it would have gotten a straight NAK, from me at least, I believe.
--
Michal Hocko
SUSE Labs

2017-03-15 07:58:01

by Michal Hocko

[permalink] [raw]
Subject: Re: WTH is going on with memory hotplug sysfs interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> > On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > > >OK, so while I was playing with this setup some more I probably got why
> > > >this is done this way. All new memblocks are added to the zone Normal
> > > >where they are accounted as spanned but not present.
> > >
> > > It's not always zone Normal. See zone_for_memory(). This leads to a
> > > workaround for having to do online_movable in descending block order.
> > > Instead of this:
> > >
> > > 1. probe block 34, probe block 33, probe block 32, ...
> > > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > >
> > > you can online_movable the first block before adding the rest:
> >
> > How do I enforce that behavior when the probe happens automagically?
>
> What I provided as a guide for onlining as movable on current and older
> kernels is:
>
> 1) Remove udev rule
> 2) After adding memory with qemu/libvirt API run in the guest
>
> ------- workaround start ----
> #!/bin/bash
> # online still-offline blocks as movable, highest block number first
> for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do
> 	[ "`cat $i/state`" == "offline" ] && echo online_movable > $i/state
> done
> ------- workaround end ----
>
> That's how bad onlining as movable is for memory hot-unplug.

Yeah, this is really yucky. Fortunately, I already have a working prototype
which removes this restriction altogether.

> > > 1. probe block 32, online_movable 32
> > > 2. probe block 33, probe block 34, ...
> > > - zone_for_memory() will cause these to start Movable
> > > 3. online 33, online 34, ...
> > > - they're already in Movable, so online_movable is equivalent
> > >
> > > I agree with your general sentiment that this stuff is very nonintuitive.
> >
> > My criterion for nonintuitive is probably different because I would call
> > this _completely_unusable_. Sorry for being so loud about this but the
> > more I look into this area the more WTF code I see. This has seen close
> > to zero review and seems to be building up more single usecase code on
> > top of previous. We need to change this, seriously!
>
> It's not a bug, but when I found it I called it a "constraint" and
> when I debugged it (IIRC) it originated here:
>
> } else if (online_type == MMOP_ONLINE_MOVABLE) {
> 	if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
> 		return -EINVAL;
> }
>
> Fixing it so you could online as movable even if it wasn't the last
> block was in my todo list but then we had other plans.
>
> Unfortunately unless pfn+nr_pages of the newly created Movable zone
> matches the end of the normal zone (or whatever was there that has to
> be converted to Movable), it will fail onlining as movable.

Well, I think the main problem is that we associate pages added in the
first phase (probe) with the Normal zone. Everything else is just
fallout and fiddling to make it work somehow.

[...]

> To be clear, I'm not necessarily against eventually removing
> _DEFAULT_ONLINE, but an equally reliable and comparably fast
> alternative should be provided first, and there's no such alternative
> right now.

As pointed out in the other reply, that is not the intention here. I
primarily wanted to understand the scope of the problem. I believe this
config option was rushed into the kernel without considering other
alternatives which would make the hotplug more usable for others as well.

> If s390 has special issues requiring an admin or a software hotplug
> manager app to enable blocks of memory by hand, _DEFAULT_ONLINE
> could then be an option selected or not selected by
> arch/*/Kconfig. The udev rule is still an automatic action, so it's 1:1
> with the in-kernel feature. If the in-kernel automatic onlining is not
> workable on s390, I would assume the udev rule is also not workable.

But this is not about s390. It is about different usecases which require
different onlining policies, and that is the main problem with the
hardcoded one in the kernel. See the other reply.
--
Michal Hocko
SUSE Labs