2013-07-19 18:45:11

by KY Srinivasan

Subject: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

The current machinery for hot-adding memory requires having udev
rules to bring the memory segments online. Export the necessary functionality
to bring the memory segment online without involving user space code.

Signed-off-by: K. Y. Srinivasan <[email protected]>
---
drivers/base/memory.c | 5 ++++-
include/linux/memory.h | 4 ++++
2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b7813e..a8204ac 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct memory_block *mem,
return ret;
}

-static int memory_block_change_state(struct memory_block *mem,
+int memory_block_change_state(struct memory_block *mem,
unsigned long to_state, unsigned long from_state_req,
int online_type)
{
@@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block *mem,

return ret;
}
+EXPORT_SYMBOL(memory_block_change_state);
+
static ssize_t
store_mem_state(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
@@ -540,6 +542,7 @@ struct memory_block *find_memory_block(struct mem_section *section)
{
return find_memory_block_hinted(section, NULL);
}
+EXPORT_SYMBOL(find_memory_block);

static struct attribute *memory_memblk_attrs[] = {
&dev_attr_phys_index.attr,
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 85c31a8..8e3ede5 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -115,6 +115,10 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
extern int register_memory_isolate_notifier(struct notifier_block *nb);
extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
extern int register_new_memory(int, struct mem_section *);
+extern int memory_block_change_state(struct memory_block *mem,
+ unsigned long to_state, unsigned long from_state_req,
+ int online_type);
+
#ifdef CONFIG_MEMORY_HOTREMOVE
extern int unregister_memory_section(struct mem_section *);
#endif
--
1.7.4.1
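
For concreteness, a kernel-side consumer of these two exports might look roughly like
the sketch below. This is not part of the patch or of any driver in the thread: the
function name is invented, and the locking questions raised later in the thread are
deliberately ignored here.

#include <linux/errno.h>
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memory.h>
#include <linux/memory_hotplug.h>

/* Hypothetical caller: bring the memory block backing a freshly hot-added
 * section online from kernel context, using only the exported symbols. */
static int example_online_section(struct mem_section *section)
{
	struct memory_block *mem;
	int ret;

	mem = find_memory_block(section);	/* takes a device reference */
	if (!mem)
		return -EINVAL;

	ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE,
					ONLINE_KEEP);
	put_device(&mem->dev);			/* drop that reference */
	return ret;
}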


2013-07-22 03:17:12

by Jason Wang

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/20/2013 03:23 AM, K. Y. Srinivasan wrote:
> The current machinery for hot-adding memory requires having udev
> rules to bring the memory segments online. Export the necessary functionality
> to bring the memory segment online without involving user space code.

According to the udev guys, udev won't provide an unconditional, always-enabled
kernel policy. This is really useful for a driver to be able to online the pages
without user-space involvement.

Acked-by: Jason Wang <[email protected]>
> Signed-off-by: K. Y. Srinivasan <[email protected]>
> ---
> drivers/base/memory.c | 5 ++++-
> include/linux/memory.h | 4 ++++
> 2 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 2b7813e..a8204ac 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct memory_block *mem,
> return ret;
> }
>
> -static int memory_block_change_state(struct memory_block *mem,
> +int memory_block_change_state(struct memory_block *mem,
> unsigned long to_state, unsigned long from_state_req,
> int online_type)
> {
> @@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block *mem,
>
> return ret;
> }
> +EXPORT_SYMBOL(memory_block_change_state);
> +
> static ssize_t
> store_mem_state(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> @@ -540,6 +542,7 @@ struct memory_block *find_memory_block(struct mem_section *section)
> {
> return find_memory_block_hinted(section, NULL);
> }
> +EXPORT_SYMBOL(find_memory_block);
>
> static struct attribute *memory_memblk_attrs[] = {
> &dev_attr_phys_index.attr,
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 85c31a8..8e3ede5 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -115,6 +115,10 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
> extern int register_memory_isolate_notifier(struct notifier_block *nb);
> extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
> extern int register_new_memory(int, struct mem_section *);
> +extern int memory_block_change_state(struct memory_block *mem,
> + unsigned long to_state, unsigned long from_state_req,
> + int online_type);
> +
> #ifdef CONFIG_MEMORY_HOTREMOVE
> extern int unregister_memory_section(struct mem_section *);
> #endif

2013-07-22 12:37:21

by Michal Hocko

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> The current machinery for hot-adding memory requires having udev
> rules to bring the memory segments online. Export the necessary functionality
> to bring the memory segment online without involving user space code.

Why? Who is going to use it and for what purpose?
If you need to do it from the kernel, can't you use a usermode helper
thread?
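
For reference, the usermode helper route being suggested would look roughly like the
sketch below from a driver; the command string and the block-number handling are
illustrative only, not code from this thread.

#include <linux/kernel.h>
#include <linux/kmod.h>

/* Illustrative only: have a userspace shell online one memory block by
 * writing "online" to its sysfs state file, and wait for it to finish. */
static int online_block_via_umh(unsigned long block_id)
{
	char cmd[128];
	char *argv[] = { "/bin/sh", "-c", cmd, NULL };
	static char *envp[] = { "HOME=/",
				"PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

	snprintf(cmd, sizeof(cmd),
		 "echo online > /sys/devices/system/memory/memory%lu/state",
		 block_id);

	return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}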

Besides that, this is far from being complete. memory_block_change_state
seems to depend on device_hotplug_lock, and find_memory_block is
currently called with mem_sysfs_mutex held. Neither of them is exported
AFAICS.

> Signed-off-by: K. Y. Srinivasan <[email protected]>
> ---
> drivers/base/memory.c | 5 ++++-
> include/linux/memory.h | 4 ++++
> 2 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 2b7813e..a8204ac 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct memory_block *mem,
> return ret;
> }
>
> -static int memory_block_change_state(struct memory_block *mem,
> +int memory_block_change_state(struct memory_block *mem,
> unsigned long to_state, unsigned long from_state_req,
> int online_type)
> {
> @@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block *mem,
>
> return ret;
> }
> +EXPORT_SYMBOL(memory_block_change_state);
> +
> static ssize_t
> store_mem_state(struct device *dev,
> struct device_attribute *attr, const char *buf, size_t count)
> @@ -540,6 +542,7 @@ struct memory_block *find_memory_block(struct mem_section *section)
> {
> return find_memory_block_hinted(section, NULL);
> }
> +EXPORT_SYMBOL(find_memory_block);
>
> static struct attribute *memory_memblk_attrs[] = {
> &dev_attr_phys_index.attr,
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 85c31a8..8e3ede5 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -115,6 +115,10 @@ extern void unregister_memory_notifier(struct notifier_block *nb);
> extern int register_memory_isolate_notifier(struct notifier_block *nb);
> extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
> extern int register_new_memory(int, struct mem_section *);
> +extern int memory_block_change_state(struct memory_block *mem,
> + unsigned long to_state, unsigned long from_state_req,
> + int online_type);
> +
> #ifdef CONFIG_MEMORY_HOTREMOVE
> extern int unregister_memory_section(struct mem_section *);
> #endif
> --
> 1.7.4.1
>

--
Michal Hocko
SUSE Labs

2013-07-23 14:54:36

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
> Sent: Monday, July 22, 2013 8:37 AM
> To: KY Srinivasan
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> > The current machinery for hot-adding memory requires having udev
> > rules to bring the memory segments online. Export the necessary functionality
> > to bring the memory segment online without involving user space code.
>
> Why? Who is going to use it and for what purpose?
> If you need to do it from the kernel cannot you use usermod helper
> thread?
>
> Besides that this is far from being complete. memory_block_change_state
> seems to depend on device_hotplug_lock and find_memory_block is
> currently called with mem_sysfs_mutex held. None of them is exported
> AFAICS.

You are right; not all of the required symbols are exported (yet). Let me answer your
other questions first:

The Hyper-V balloon driver can use this functionality. I have prototyped the in-kernel
"onlining" of hot-added memory without requiring any help from user level code, and it
performs significantly better than having user level code involved in the hot-add process.
With this change, I am able to successfully hot-add and online the hot-added memory
even under extreme memory pressure, which is what you would want given that we are
hot-adding memory to alleviate memory pressure. The current scheme of involving user
level code to close this loop obviously does not perform well under high memory pressure.


I can, if you prefer, export all of the necessary functionality in one patch.


Regards,

K. Y

2013-07-23 15:09:37

by Michal Hocko

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On Tue 23-07-13 14:52:36, KY Srinivasan wrote:
>
>
> > -----Original Message-----
> > From: Michal Hocko [mailto:[email protected]]
> > Sent: Monday, July 22, 2013 8:37 AM
> > To: KY Srinivasan
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> > memory blocks
> >
> > On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> > > The current machinery for hot-adding memory requires having udev
> > > rules to bring the memory segments online. Export the necessary functionality
> > > to bring the memory segment online without involving user space code.
> >
> > Why? Who is going to use it and for what purpose?
> > If you need to do it from the kernel cannot you use usermod helper
> > thread?
> >
> > Besides that this is far from being complete. memory_block_change_state
> > seems to depend on device_hotplug_lock and find_memory_block is
> > currently called with mem_sysfs_mutex held. None of them is exported
> > AFAICS.
>
> You are right; not all of the required symbols are exported (yet). Let
> me answer your other questions first:
>
> The Hyper-V balloon driver can use this functionality. I have
> prototyped the in-kernel "onlining" of hot added memory without
> requiring any help from user level code that performs significantly
> better than having user level code involved in the hot add process.

What does significantly better mean here?

> With this change, I am able to successfully hot-add and online the
> hot-added memory even under extreme memory pressure which is what you
> would want given that we are hot-adding memory to alleviate memory
> pressure. The current scheme of involving user level code to close
> this loop obviously does not perform well under high memory pressure.

Hmm, this is really unexpected. Why does high memory pressure matter
here? Userspace only needs to access a sysfs file and echo a simple string
into it. The rest is the same regardless of whether you do it from userspace.
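
For context, the userspace side described here is typically just a udev rule (or an
equivalent hotplug listener) doing the following for every added memory block; the rule
wording and the block number are illustrative:

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

which amounts to, per block:

  echo online > /sys/devices/system/memory/memoryXX/state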

> I can, if you prefer export all of the necessary functionality in one
> patch.

If this really turns out to be a valid use case then I would prefer exporting
a high-level function which would hide all the locking and direct
manipulation of memory blocks.
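
A rough sketch of the kind of high-level helper meant here, assuming it lives in
drivers/base/memory.c where the locks are visible; the name, signature and the locking
shown are schematic, not a worked-out scheme from this thread.

/* Invented interface: one exported entry point that a driver such as
 * hv_balloon could call, keeping the memory_block lookup, refcounting
 * and locking details inside drivers/base/memory.c. */
int online_memory_section_nr(unsigned long section_nr)
{
	struct memory_block *mem;
	int ret;

	if (!present_section_nr(section_nr))
		return -EINVAL;

	mutex_lock(&mem_sysfs_mutex);	/* rule for find_memory_block() callers */
	mem = find_memory_block(__nr_to_section(section_nr));
	mutex_unlock(&mem_sysfs_mutex);
	if (!mem)
		return -EINVAL;

	ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE,
					ONLINE_KEEP);
	put_device(&mem->dev);
	return ret;
}
EXPORT_SYMBOL_GPL(online_memory_section_nr);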

--
Michal Hocko
SUSE Labs

2013-07-23 15:28:42

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/23/2013 07:52 AM, KY Srinivasan wrote:
> The current scheme of involving user
> level code to close this loop obviously does not perform well under high memory pressure.

Adding memory usually requires allocating some large, contiguous areas
of memory for use as mem_map[] and other VM structures. That's really
hard to do under heavy memory pressure. How are you accomplishing this?

2013-07-23 15:52:45

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
> Sent: Tuesday, July 23, 2013 11:10 AM
> To: KY Srinivasan
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On Tue 23-07-13 14:52:36, KY Srinivasan wrote:
> >
> >
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:[email protected]]
> > > Sent: Monday, July 22, 2013 8:37 AM
> > > To: KY Srinivasan
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected];
> [email protected];
> > > [email protected]; [email protected]
> > > Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> > > memory blocks
> > >
> > > On Fri 19-07-13 12:23:05, K. Y. Srinivasan wrote:
> > > > The current machinery for hot-adding memory requires having udev
> > > > rules to bring the memory segments online. Export the necessary functionality
> > > > to bring the memory segment online without involving user space code.
> > >
> > > Why? Who is going to use it and for what purpose?
> > > If you need to do it from the kernel cannot you use usermod helper
> > > thread?
> > >
> > > Besides that this is far from being complete. memory_block_change_state
> > > seems to depend on device_hotplug_lock and find_memory_block is
> > > currently called with mem_sysfs_mutex held. None of them is exported
> > > AFAICS.
> >
> > You are right; not all of the required symbols are exported (yet). Let
> > me answer your other questions first:
> >
> > The Hyper-V balloon driver can use this functionality. I have
> > prototyped the in-kernel "onlining" of hot added memory without
> > requiring any help from user level code that performs significantly
> > better than having user level code involved in the hot add process.
>
> What does significantly better mean here?

Fewer failures than before.
>
> > With this change, I am able to successfully hot-add and online the
> > hot-added memory even under extreme memory pressure which is what you
> > would want given that we are hot-adding memory to alleviate memory
> > pressure. The current scheme of involving user level code to close
> > this loop obviously does not perform well under high memory pressure.
>
> Hmm, this is really unexpected. Why the high memory pressure matters
> here? Userspace only need to access sysfs file and echo a simple string
> into a file. The reset is same regardless you do it from the userspace.

Could it be that we fail to launch the user-space thread? The host presents
a large chunk of memory for "hot adding". Within the guest, I break this up and hot-add
128MB chunks; as I loop through this process, I wait for the onlining to occur before proceeding
with the next hot-add operation. With user space code involved in the onlining process, I would
frequently time out waiting for onlining to complete (under high memory load).
After I switched over to not involving the user space code, this problem does not exist, since
onlining is done "in context".
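
The loop being described is roughly the following shape (an illustrative sketch, not
the actual hv_balloon code; the completion would be signalled by whatever notices the
block going online, and the 5 second timeout matches the figure mentioned later in the
thread):

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/memory_hotplug.h>

/* Hot-add "size" bytes in 128MB chunks, waiting after each chunk for the
 * new block to be brought online before adding the next one. */
static int hot_add_in_chunks(int nid, u64 start, u64 size,
			     struct completion *online_done)
{
	const u64 chunk = 128ULL << 20;
	int ret;

	for (; size >= chunk; size -= chunk, start += chunk) {
		ret = add_memory(nid, start, chunk);
		if (ret)
			return ret;	/* e.g. -ENOMEM: report a transient failure to the host */

		/* Wait for the block to be onlined (by udev today, or in-kernel). */
		if (!wait_for_completion_timeout(online_done, 5 * HZ))
			return -ETIMEDOUT;
	}
	return 0;
}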

>
> > I can, if you prefer export all of the necessary functionality in one
> > patch.
>
> If this turns out really a valid use case then I would prefer exporting
> a high level function which would hide all the locking and direct
> manipulation with mem blocks.

I will take a crack at defining wrappers to hide some of the details. I will also
post the Hyper-V balloon driver patch that uses this functionality.

Regards,

K. Y


2013-07-23 15:55:30

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Tuesday, July 23, 2013 11:28 AM
> To: KY Srinivasan
> Cc: Michal Hocko; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On 07/23/2013 07:52 AM, KY Srinivasan wrote:
> > The current scheme of involving user
> > level code to close this loop obviously does not perform well under high memory pressure.
>
> Adding memory usually requires allocating some large, contiguous areas
> of memory for use as mem_map[] and other VM structures. That's really
> hard to do under heavy memory pressure. How are you accomplishing this?

I cannot avoid failures because of lack of memory. In this case I notify the host of
the failure and also tag the failure as transient. Host retries the operation after some
delay. There is no guarantee it will succeed though.

K. Y
>
>
>


2013-07-23 16:01:07

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/23/2013 08:54 AM, KY Srinivasan wrote:
>> > Adding memory usually requires allocating some large, contiguous areas
>> > of memory for use as mem_map[] and other VM structures. That's really
>> > hard to do under heavy memory pressure. How are you accomplishing this?
> I cannot avoid failures because of lack of memory. In this case I notify the host of
> the failure and also tag the failure as transient. Host retries the operation after some
> delay. There is no guarantee it will succeed though.

You didn't really answer the question.

You have allocated some large, physically contiguous areas of memory
under heavy pressure. But you also contend that there is too much
memory pressure to run a small userspace helper. Under heavy memory
pressure, I'd expect large, kernel allocations to fail much more often
than running a small userspace helper.

It _sounds_ like you really want to be able to have the host retry the
operation if it fails, and you return success/failure from inside the
kernel. It's hard for you to tell if running the userspace helper
failed, so your solution is to move what was previously done in
userspace into the kernel so that you can more easily tell if it failed
or succeeded.

Is that right?

2013-07-23 16:02:00

by Greg KH

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On Fri, Jul 19, 2013 at 12:23:05PM -0700, K. Y. Srinivasan wrote:
> The current machinery for hot-adding memory requires having udev
> rules to bring the memory segments online. Export the necessary functionality
> to bring the memory segment online without involving user space code.
>
> Signed-off-by: K. Y. Srinivasan <[email protected]>
> ---
> drivers/base/memory.c | 5 ++++-
> include/linux/memory.h | 4 ++++
> 2 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 2b7813e..a8204ac 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct memory_block *mem,
> return ret;
> }
>
> -static int memory_block_change_state(struct memory_block *mem,
> +int memory_block_change_state(struct memory_block *mem,
> unsigned long to_state, unsigned long from_state_req,
> int online_type)
> {
> @@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block *mem,
>
> return ret;
> }
> +EXPORT_SYMBOL(memory_block_change_state);

EXPORT_SYMBOL_GPL() for all of these please.
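
That is, the two exports in the patch would become:

EXPORT_SYMBOL_GPL(memory_block_change_state);
EXPORT_SYMBOL_GPL(find_memory_block);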

And as others have pointed out, I can't export symbols without a user of
those symbols going into the tree at the same time. So I'll drop this
patch for now and wait for your consumer of these symbols to be
submitted.

thanks,

greg k-h

2013-07-23 16:56:01

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Greg KH [mailto:[email protected]]
> Sent: Tuesday, July 23, 2013 12:02 PM
> To: KY Srinivasan
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On Fri, Jul 19, 2013 at 12:23:05PM -0700, K. Y. Srinivasan wrote:
> > The current machinery for hot-adding memory requires having udev
> > rules to bring the memory segments online. Export the necessary functionality
> > to bring the memory segment online without involving user space code.
> >
> > Signed-off-by: K. Y. Srinivasan <[email protected]>
> > ---
> > drivers/base/memory.c | 5 ++++-
> > include/linux/memory.h | 4 ++++
> > 2 files changed, 8 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 2b7813e..a8204ac 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -328,7 +328,7 @@ static int __memory_block_change_state_uevent(struct memory_block *mem,
> > return ret;
> > }
> >
> > -static int memory_block_change_state(struct memory_block *mem,
> > +int memory_block_change_state(struct memory_block *mem,
> > unsigned long to_state, unsigned long from_state_req,
> > int online_type)
> > {
> > @@ -341,6 +341,8 @@ static int memory_block_change_state(struct memory_block *mem,
> >
> > return ret;
> > }
> > +EXPORT_SYMBOL(memory_block_change_state);
>
> EXPORT_SYMBOL_GPL() for all of these please.

Will do.
>
> And as others have pointed out, I can't export symbols without a user of
> those symbols going into the tree at the same time. So I'll drop this
> patch for now and wait for your consumer of these symbols to be
> submitted.

I will submit the consumer as well.

Thanks,

K. Y

> greg k-h
>
>


2013-07-23 17:24:09

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Tuesday, July 23, 2013 12:01 PM
> To: KY Srinivasan
> Cc: Michal Hocko; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On 07/23/2013 08:54 AM, KY Srinivasan wrote:
> >> > Adding memory usually requires allocating some large, contiguous areas
> >> > of memory for use as mem_map[] and other VM structures. That's really
> >> > hard to do under heavy memory pressure. How are you accomplishing this?
> > I cannot avoid failures because of lack of memory. In this case I notify the host of
> > the failure and also tag the failure as transient. Host retries the operation after some
> > delay. There is no guarantee it will succeed though.
>
> You didn't really answer the question.
>
> You have allocated some large, physically contiguous areas of memory
> under heavy pressure. But you also contend that there is too much
> memory pressure to run a small userspace helper. Under heavy memory
> pressure, I'd expect large, kernel allocations to fail much more often
> than running a small userspace helper.

I am only reporting what I am seeing. Broadly, I have two main failure conditions to
deal with: (a) resource related failure (add_memory() returning -ENOMEM) and (b) not being
able to online a segment that has been successfully hot-added. I have seen both these failures
under high memory pressure. By supporting "in context" onlining, we can eliminate one failure
case. Our inability to online is not a recoverable failure from the host's point of view - the memory
is committed to the guest (since hot add succeeded) but is not usable since it is not onlined.
>
> It _sounds_ like you really want to be able to have the host retry the
> operation if it fails, and you return success/failure from inside the
> kernel. It's hard for you to tell if running the userspace helper
> failed, so your solution is to move what what previously done in
> userspace in to the kernel so that you can more easily tell if it failed
> or succeeded.
>
> Is that right?

No; I am able to get the proper error code for recoverable failures (hot add failures
because of lack of memory). By doing what I am proposing here, we can avoid one class
of failures completely and I think this is what resulted in a better "hot add" experience in the
guest.

K. Y
>
>


2013-07-24 16:43:14

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/23/2013 10:21 AM, KY Srinivasan wrote:
>> You have allocated some large, physically contiguous areas of memory
>> under heavy pressure. But you also contend that there is too much
>> memory pressure to run a small userspace helper. Under heavy memory
>> pressure, I'd expect large, kernel allocations to fail much more often
>> than running a small userspace helper.
>
> I am only reporting what I am seeing. Broadly, I have two main failure conditions to
> deal with: (a) resource related failure (add_memory() returning -ENOMEM) and (b) not being
> able to online a segment that has been successfully hot-added. I have seen both these failures
> under high memory pressure. By supporting "in context" onlining, we can eliminate one failure
> case. Our inability to online is not a recoverable failure from the host's point of view - the memory
> is committed to the guest (since hot add succeeded) but is not usable since it is not onlined.

Could you please precisely report on what you are seeing in detail?
Where are the -ENOMEMs coming from? Which allocation site? Are you
seeing OOMs or page allocation failure messages on the console?

The operation was split up into two parts for good reason. It's
actually for your _precise_ use case.

A system under memory pressure is going to have troubles doing a
hot-add. You need memory to add memory. Of the two operations ("add"
and "online"), "add" is the one vastly more likely to fail. It has to
allocate several large swaths of contiguous physical memory. For that
reason, the system was designed so that you could "add" and "online"
separately. The intention was that you could "add" far in advance and
then "online" under memory pressure, with the "online" having *VASTLY*
smaller memory requirements and being much more likely to succeed.
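
To put rough numbers on those "large swaths" (typical x86_64 figures, not from the
thread): one 128 MB section is 128 MB / 4 KB = 32768 pages, and at 64 bytes per
struct page that is 32768 * 64 B = 2 MB of mem_map[] which "add" has to allocate for
every section, before any other metadata. "online" allocates nothing of that order.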

You're lumping the "allocate several large swaths of contiguous physical
memory" failures into the same class as "run a small userspace helper".
They are _really_ different problems. Both prone to allocation
failures for sure, but _very_ separate problems. Please don't conflate
them.

>> It _sounds_ like you really want to be able to have the host retry the
>> operation if it fails, and you return success/failure from inside the
>> kernel. It's hard for you to tell if running the userspace helper
>> failed, so your solution is to move what what previously done in
>> userspace in to the kernel so that you can more easily tell if it failed
>> or succeeded.
>>
>> Is that right?
>
> No; I am able to get the proper error code for recoverable failures (hot add failures
> because of lack of memory). By doing what I am proposing here, we can avoid one class
> of failures completely and I think this is what resulted in a better "hot add" experience in the
> guest.

I think you're taking a huge leap here: "We could not online memory,
thus we must take userspace out of the loop."

You might be right. There might be only one way out of this situation.
But you need to provide a little more supporting evidence before we all
arrive at the same conclusion.

BTW, it doesn't _require_ udev. There could easily be another listener
for hotplug events.

2013-07-24 19:47:37

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Wednesday, July 24, 2013 12:43 PM
> To: KY Srinivasan
> Cc: Dave Hansen; Michal Hocko; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On 07/23/2013 10:21 AM, KY Srinivasan wrote:
> >> You have allocated some large, physically contiguous areas of memory
> >> under heavy pressure. But you also contend that there is too much
> >> memory pressure to run a small userspace helper. Under heavy memory
> >> pressure, I'd expect large, kernel allocations to fail much more often
> >> than running a small userspace helper.
> >
> > I am only reporting what I am seeing. Broadly, I have two main failure conditions to
> > deal with: (a) resource related failure (add_memory() returning -ENOMEM) and (b) not being
> > able to online a segment that has been successfully hot-added. I have seen both these failures
> > under high memory pressure. By supporting "in context" onlining, we can eliminate one failure
> > case. Our inability to online is not a recoverable failure from the host's point of view - the memory
> > is committed to the guest (since hot add succeeded) but is not usable since it is not onlined.
>
> Could you please precisely report on what you are seeing in detail?
> Where are the -ENOMEMs coming from? Which allocation site? Are you
> seeing OOMs or page allocation failure messages on the console?

The ENOMEM failure I see comes from the call to hot add memory - the call to
add_memory(). Usually I don't see any OOM messages on the console.

>
> The operation was split up in to two parts for good reason. It's
> actually for your _precise_ use case.

I agree and without this split, I could not implement the balloon driver with
hot-add.

>
> A system under memory pressure is going to have troubles doing a
> hot-add. You need memory to add memory. Of the two operations ("add"
> and "online"), "add" is the one vastly more likely to fail. It has to
> allocate several large swaths of contiguous physical memory. For that
> reason, the system was designed so that you could "add" and "online"
> separately. The intention was that you could "add" far in advance and
> then "online" under memory pressure, with the "online" having *VASTLY*
> smaller memory requirements and being much more likely to succeed.
>
> You're lumping the "allocate several large swaths of contiguous physical
> memory" failures in to the same class as "run a small userspace helper".
> They are _really_ different problems. Both prone to allocation
> failures for sure, but _very_ separate problems. Please don't conflate
> them.

I don't think I am conflating these two issues; I am sorry if I gave that
impression. All I am saying is that I see two classes of failures: (a) Our
inability to allocate memory to manage the memory that is being hot added
and (b) Our inability to bring the hot added memory online within a reasonable
amount of time. I am not sure of the cause of (b) and I was just speculating that
this could be memory related. What is interesting is that I have seen failures related
to our inability to online the memory after having succeeded in hot adding the
memory.

>
> >> It _sounds_ like you really want to be able to have the host retry the
> >> operation if it fails, and you return success/failure from inside the
> >> kernel. It's hard for you to tell if running the userspace helper
> >> failed, so your solution is to move what what previously done in
> >> userspace in to the kernel so that you can more easily tell if it failed
> >> or succeeded.
> >>
> >> Is that right?
> >
> > No; I am able to get the proper error code for recoverable failures (hot add failures
> > because of lack of memory). By doing what I am proposing here, we can avoid one class
> > of failures completely and I think this is what resulted in a better "hot add" experience in the
> > guest.
>
> I think you're taking a huge leap here: "We could not online memory,
> thus we must take userspace out of the loop."
>
> You might be right. There might be only one way out of this situation.
> But you need to provide a little more supporting evidence before we all
> arrive at the same conclusion.

I am not even suggesting that. All I am saying is that there should be a mechanism
for "in context" onlining of memory from a kernel context, in addition to the existing
sysfs mechanism for bringing memory online. The Hyper-V balloon driver
can certainly use this functionality. I should be sending out the patches for this
shortly.
>
> BTW, it doesn't _require_ udev. There could easily be another listener
> for hotplug events.

Agreed; but structurally it is identical to having a udev rule.

Regards,

K. Y


2013-07-24 21:03:01

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> All I am saying is that I see two classes of failures: (a) Our
> inability to allocate memory to manage the memory that is being hot added
> and (b) Our inability to bring the hot added memory online within a reasonable
> amount of time. I am not sure the cause for (b) and I was just speculating that
> this could be memory related. What is interesting is that I have seen failure related
> to our inability to online the memory after having succeeded in hot adding the
> memory.

I think we should hold off on this patch and others like it until we've
been sufficiently able to explain how (b) happens.

2013-07-25 07:57:16

by Michal Hocko

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On Wed 24-07-13 14:02:32, Dave Hansen wrote:
> On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> > All I am saying is that I see two classes of failures: (a) Our
> > inability to allocate memory to manage the memory that is being hot added
> > and (b) Our inability to bring the hot added memory online within a reasonable
> > amount of time. I am not sure the cause for (b) and I was just speculating that
> > this could be memory related. What is interesting is that I have seen failure related
> > to our inability to online the memory after having succeeded in hot adding the
> > memory.
>
> I think we should hold off on this patch and other like it until we've
> been sufficiently able to explain how (b) happens.

Agreed.

--
Michal Hocko
SUSE Labs

2013-07-25 11:15:41

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
> Sent: Thursday, July 25, 2013 3:57 AM
> To: Dave Hansen
> Cc: KY Srinivasan; Dave Hansen; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On Wed 24-07-13 14:02:32, Dave Hansen wrote:
> > On 07/24/2013 12:45 PM, KY Srinivasan wrote:
> > > All I am saying is that I see two classes of failures: (a) Our
> > > inability to allocate memory to manage the memory that is being hot added
> > > and (b) Our inability to bring the hot added memory online within a reasonable
> > > amount of time. I am not sure the cause for (b) and I was just speculating that
> > > this could be memory related. What is interesting is that I have seen failure related
> > > to our inability to online the memory after having succeeded in hot adding the
> > > memory.
> >
> > I think we should hold off on this patch and other like it until we've
> > been sufficiently able to explain how (b) happens.
>
> Agreed.

As promised, I have sent out the patches for (a) an implementation of an in-kernel API
for onlining and (b) a consumer for this API. While I don't know the exact reason why the
user mode code is delayed (under some low memory conditions), what is the harm in having
a mechanism to online memory that has been hot added without involving user space code?
Based on Michal's feedback, the onlining API hides all of the internal details and presents a
generic interface.

Regards,

K. Y


2013-07-25 15:03:58

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/25/2013 04:14 AM, KY Srinivasan wrote:
> As promised, I have sent out the patches for (a) an implementation of an in-kernel API
> for onlining and a consumer for this API. While I don't know the exact reason why the
> user mode code is delayed (under some low memory conditions), what is the harm in having
> a mechanism to online memory that has been hot added without involving user space code.

KY, your potential problem, not being able to online more memory because
of a shortage of memory, is a serious one.

However, this potential issue exists in long-standing code, and
potentially affects all users of memory hotplug. The problem has not
been described in sufficient detail for the rest of the developers to
tell if you are facing a new problem, or whether *any* proposed solution
will help the problem you face.

Your proposed solution changes the semantics of existing user/kernel
interfaces, duplicates existing functionality, and adds code complexity
to the kernel.

2013-07-25 15:15:35

by Kay Sievers

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On Thu, Jul 25, 2013 at 5:03 PM, Dave Hansen <[email protected]> wrote:
> On 07/25/2013 04:14 AM, KY Srinivasan wrote:
>> As promised, I have sent out the patches for (a) an implementation of an in-kernel API
>> for onlining and a consumer for this API. While I don't know the exact reason why the
>> user mode code is delayed (under some low memory conditions), what is the harm in having
>> a mechanism to online memory that has been hot added without involving user space code.
>
> KY, your potential problem, not being able to online more memory because
> of a shortage of memory, is a serious one.
>
> However, this potential issue exists in long-standing code, and
> potentially affects all users of memory hotplug. The problem has not
> been described in sufficient detail for the rest of the developers to
> tell if you are facing a new problem, or whether *any* proposed solution
> will help the problem you face.
>
> Your propsed solution changes the semantics of existing user/kernel
> interfaces, duplicates existing functionality, and adds code complexity
> to the kernel.

Complexity, well, it's just a bit of code which belongs in the kernel.
The mentioned unconditional hotplug loop through userspace is
absolutely pointless. Such defaults never belong in userspace tools if
they do not involve data that is only available in userspace, on which
something would base a decision. Saying "hello" to
userspace while userspace has a hardcoded "yes" in return makes no sense
at all. The kernel can just go ahead and do its job, like it does for
all other devices it finds too.

Kay

2013-07-25 15:50:55

by KY Srinivasan

Subject: RE: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks



> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Thursday, July 25, 2013 11:04 AM
> To: KY Srinivasan
> Cc: Michal Hocko; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining
> memory blocks
>
> On 07/25/2013 04:14 AM, KY Srinivasan wrote:
> > As promised, I have sent out the patches for (a) an implementation of an in-kernel API
> > for onlining and a consumer for this API. While I don't know the exact reason why the
> > user mode code is delayed (under some low memory conditions), what is the harm in having
> > a mechanism to online memory that has been hot added without involving user space code.
>
> KY, your potential problem, not being able to online more memory because
> of a shortage of memory, is a serious one.

All I can say is that the onlining is not happening within the allowed time (5 seconds in
the current code).
>
> However, this potential issue exists in long-standing code, and
> potentially affects all users of memory hotplug. The problem has not
> been described in sufficient detail for the rest of the developers to
> tell if you are facing a new problem, or whether *any* proposed solution
> will help the problem you face.
>
> Your propsed solution changes the semantics of existing user/kernel
> interfaces, duplicates existing functionality, and adds code complexity
> to the kernel.

How so? All I am doing is using the existing infrastructure to online
memory. The only change I have made is to export an API that allows
onlining without involving any user space code. I don't see how this
adds complexity to the kernel. This would be a useful extension, as can
be seen from its usage in the Hyper-V balloon driver.

In my particular use case, I need to wait for the memory to be onlined before
I can proceed to hot add the next chunk. This synchronization can be completely
avoided if we can avoid the involvement of user level code. I would submit to you that
this is a valid use case that we ought to be able to support.

Regards,

K. Y


2013-07-25 16:35:27

by Dave Hansen

Subject: Re: [PATCH 1/1] Drivers: base: memory: Export symbols for onlining memory blocks

On 07/25/2013 08:15 AM, Kay Sievers wrote:
> Complexity, well, it's just a bit of code which belongs in the kernel.
> The mentioned unconditional hotplug loop through userspace is
> absolutely pointless. Such defaults never belong in userspace tools if
> they do not involve data that is only available in userspace and
> something would make a decision about that. Saying "hello" to
> userspace and usrspace has a hardcoded "yes" in return makes no sense
> at all. The kernel can just go ahead and do its job, like it does for
> all other devices it finds too.

Sorry, but memory is different from all other devices. You never need a
mouse in order to add another mouse to the kernel.

I'll repaste something I said earlier in this thread:

> A system under memory pressure is going to have troubles doing a
> hot-add. You need memory to add memory. Of the two operations ("add"
> and "online"), "add" is the one vastly more likely to fail. It has to
> allocate several large swaths of contiguous physical memory. For that
> reason, the system was designed so that you could "add" and "online"
> separately. The intention was that you could "add" far in advance and
> then "online" under memory pressure, with the "online" having *VASTLY*
> smaller memory requirements and being much more likely to succeed.

So, no, it makes no sense to just have userspace always unconditionally
online all the memory that the kernel adds. But, the way it's set up,
we _have_ a method that can work under lots of memory pressure, and it is
available for users that want it. It was designed 10 years ago, and
maybe it's outdated, or history has proved that nobody is going to use
it the way it was designed.

If I had it to do over again, I'd probably set up configurable per-node
sets of spare kernel metadata. That way, you could say "make sure we
have enough memory reserved to add $FOO sections to node $BAR". Use
that for the largest allocations, then depend on PF_MEMALLOC to get us
enough for the other little bits along the way.

Also, if this is a problem, it's going to be a problem for *EVERY* user
of memory hotplug, not just hyperv. So, let's see it fixed generically
for *EVERY* user. Something along the lines of:

1. Come up with an interface that specifies a default policy for
newly-added memory sections. Currently, added memory gets "added",
but not "onlined", and the default should stay doing that.
2. Make sure that we at least WARN_ONCE() if someone tries to online an
already-kernel-onlined memory section (a sketch follows this list). That way, if someone trips
over this new policy, we have a _chance_ of explaining to them what
is going on.
3. Think about what we do in the failure case where we are able to
"add", but fail to "online" in the kernel. Do we tear the
newly-added structures down and back out of the operation, or do
we leave the memory added, but offline (what happens in the normal
case now)?
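
A minimal sketch of the check in (2), assuming the policy knob from (1) exists; the
placement inside store_mem_state() and the message text are illustrative:

	/* Before acting on a userspace "online" request: */
	if (mem->state == MEM_ONLINE) {
		WARN_ONCE(1,
			  "memory%lu already onlined by kernel policy; userspace request ignored\n",
			  mem->start_section_nr);
		return -EINVAL;
	}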