2021-10-29 14:23:28

by Jürgen Groß

[permalink] [raw]
Subject: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done

When running as PVH or HVM guest with actual memory < max memory the
hypervisor is using "populate on demand" in order to allow the guest
to balloon down from its maximum memory size. For this to work
correctly the guest must not touch more memory pages than its target
memory size as otherwise the PoD cache will be exhausted and the guest
is crashed as a result of that.

In extreme cases ballooning down might not be finished today before
the init process is started, which can consume lots of memory.

In order to avoid random boot crashes in such cases, add a late init
call to wait for ballooning down having finished for PVH/HVM guests.

Warn on console if initial ballooning fails, panic() after stalling
for more than 3 minutes per default. Add a module parameter for
changing this timeout.

Cc: <[email protected]>
Reported-by: Marek Marczykowski-Górecki <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- add warning and panic() when stalling (Marek Marczykowski-Górecki)
- don't wait if credit > 0
V3:
- issue warning only after ballooning failed (Marek Marczykowski-Górecki)
- make panic() timeout configurable via parameter
---
.../stable/sysfs-devices-system-xen_memory | 10 +++
drivers/xen/balloon.c | 63 +++++++++++++++----
drivers/xen/xen-balloon.c | 2 +
include/xen/balloon.h | 1 +
4 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-xen_memory b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
index 6d83f95a8a8e..2da062e2c94a 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
+++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
@@ -84,3 +84,13 @@ Description:
Control scrubbing pages before returning them to Xen for others domains
use. Can be set with xen_scrub_pages cmdline
parameter. Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
+
+What: /sys/devices/system/xen_memory/xen_memory0/boot_timeout
+Date: November 2021
+KernelVersion: 5.16
+Contact: [email protected]
+Description:
+ The time (in seconds) to wait before giving up to boot in case
+ initial ballooning fails to free enough memory. Applies only
+ when running as HVM or PVH guest and started with less memory
+ configured than allowed at max.
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 3a50f097ed3e..98fae43d4cec 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -125,12 +125,12 @@ static struct ctl_table xen_root[] = {
* BP_ECANCELED: error, balloon operation canceled.
*/

-enum bp_state {
+static enum bp_state {
BP_DONE,
BP_WAIT,
BP_EAGAIN,
BP_ECANCELED
-};
+} balloon_state = BP_DONE;

/* Main waiting point for xen-balloon thread. */
static DECLARE_WAIT_QUEUE_HEAD(balloon_thread_wq);
@@ -494,9 +494,9 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
* Stop waiting if either state is BP_DONE and ballooning action is
* needed, or if the credit has changed while state is not BP_DONE.
*/
-static bool balloon_thread_cond(enum bp_state state, long credit)
+static bool balloon_thread_cond(long credit)
{
- if (state == BP_DONE)
+ if (balloon_state == BP_DONE)
credit = 0;

return current_credit() != credit || kthread_should_stop();
@@ -510,13 +510,12 @@ static bool balloon_thread_cond(enum bp_state state, long credit)
*/
static int balloon_thread(void *unused)
{
- enum bp_state state = BP_DONE;
long credit;
unsigned long timeout;

set_freezable();
for (;;) {
- switch (state) {
+ switch (balloon_state) {
case BP_DONE:
case BP_ECANCELED:
timeout = 3600 * HZ;
@@ -532,7 +531,7 @@ static int balloon_thread(void *unused)
credit = current_credit();

wait_event_freezable_timeout(balloon_thread_wq,
- balloon_thread_cond(state, credit), timeout);
+ balloon_thread_cond(credit), timeout);

if (kthread_should_stop())
return 0;
@@ -543,22 +542,23 @@ static int balloon_thread(void *unused)

if (credit > 0) {
if (balloon_is_inflated())
- state = increase_reservation(credit);
+ balloon_state = increase_reservation(credit);
else
- state = reserve_additional_memory();
+ balloon_state = reserve_additional_memory();
}

if (credit < 0) {
long n_pages;

n_pages = min(-credit, si_mem_available());
- state = decrease_reservation(n_pages, GFP_BALLOON);
- if (state == BP_DONE && n_pages != -credit &&
+ balloon_state = decrease_reservation(n_pages,
+ GFP_BALLOON);
+ if (balloon_state == BP_DONE && n_pages != -credit &&
n_pages < totalreserve_pages)
- state = BP_EAGAIN;
+ balloon_state = BP_EAGAIN;
}

- state = update_schedule(state);
+ balloon_state = update_schedule(balloon_state);

mutex_unlock(&balloon_mutex);

@@ -731,6 +731,7 @@ static int __init balloon_init(void)
balloon_stats.max_schedule_delay = 32;
balloon_stats.retry_count = 1;
balloon_stats.max_retry_count = 4;
+ balloon_stats.boot_timeout = 180;

#ifdef CONFIG_XEN_BALLOON_MEMORY_HOTPLUG
set_online_page_callback(&xen_online_page);
@@ -765,3 +766,39 @@ static int __init balloon_init(void)
return 0;
}
subsys_initcall(balloon_init);
+
+static int __init balloon_wait_finish(void)
+{
+ long credit, last_credit = 0;
+ unsigned long last_changed;
+
+ if (!xen_domain())
+ return -ENODEV;
+
+ /* PV guests don't need to wait. */
+ if (xen_pv_domain() || !current_credit())
+ return 0;
+
+ pr_info("Waiting for initial ballooning down having finished.\n");
+
+ while ((credit = current_credit()) < 0) {
+ if (credit != last_credit) {
+ last_changed = jiffies;
+ last_credit = credit;
+ }
+ if (balloon_state == BP_ECANCELED) {
+ pr_warn_once("Initial ballooning failed, %ld pages need to be freed.\n",
+ -credit);
+ if (jiffies - last_changed >=
+ HZ * balloon_stats.boot_timeout)
+ panic("Initial ballooning failed!\n");
+ }
+
+ schedule_timeout_interruptible(HZ / 10);
+ }
+
+ pr_info("Initial ballooning down finished.\n");
+
+ return 0;
+}
+late_initcall_sync(balloon_wait_finish);
diff --git a/drivers/xen/xen-balloon.c b/drivers/xen/xen-balloon.c
index 8cd583db20b1..6e5db50ede0f 100644
--- a/drivers/xen/xen-balloon.c
+++ b/drivers/xen/xen-balloon.c
@@ -150,6 +150,7 @@ static DEVICE_ULONG_ATTR(schedule_delay, 0444, balloon_stats.schedule_delay);
static DEVICE_ULONG_ATTR(max_schedule_delay, 0644, balloon_stats.max_schedule_delay);
static DEVICE_ULONG_ATTR(retry_count, 0444, balloon_stats.retry_count);
static DEVICE_ULONG_ATTR(max_retry_count, 0644, balloon_stats.max_retry_count);
+static DEVICE_ULONG_ATTR(boot_timeout, 0644, balloon_stats.boot_timeout);
static DEVICE_BOOL_ATTR(scrub_pages, 0644, xen_scrub_pages);

static ssize_t target_kb_show(struct device *dev, struct device_attribute *attr,
@@ -211,6 +212,7 @@ static struct attribute *balloon_attrs[] = {
&dev_attr_max_schedule_delay.attr.attr,
&dev_attr_retry_count.attr.attr,
&dev_attr_max_retry_count.attr.attr,
+ &dev_attr_boot_timeout.attr.attr,
&dev_attr_scrub_pages.attr.attr,
NULL
};
diff --git a/include/xen/balloon.h b/include/xen/balloon.h
index 6dbdb0b3fd03..95a4187f263b 100644
--- a/include/xen/balloon.h
+++ b/include/xen/balloon.h
@@ -20,6 +20,7 @@ struct balloon_stats {
unsigned long max_schedule_delay;
unsigned long retry_count;
unsigned long max_retry_count;
+ unsigned long boot_timeout;
};

extern struct balloon_stats balloon_stats;
--
2.26.2


2021-10-29 21:51:38

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done


On 10/29/21 10:20 AM, Juergen Gross wrote:
> --- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
> +++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
> @@ -84,3 +84,13 @@ Description:
> Control scrubbing pages before returning them to Xen for others domains
> use. Can be set with xen_scrub_pages cmdline
> parameter. Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
> +
> +What: /sys/devices/system/xen_memory/xen_memory0/boot_timeout
> +Date: November 2021
> +KernelVersion: 5.16
> +Contact: [email protected]
> +Description:
> + The time (in seconds) to wait before giving up to boot in case
> + initial ballooning fails to free enough memory. Applies only
> + when running as HVM or PVH guest and started with less memory
> + configured than allowed at max.


How is this going to be used? We only need this during boot.


>
> - state = update_schedule(state);
> + balloon_state = update_schedule(balloon_state);


Now that balloon_state has whole file scope it can probably be updated inside update_schedule().


> + while ((credit = current_credit()) < 0) {
> + if (credit != last_credit) {
> + last_changed = jiffies;
> + last_credit = credit;
> + }
> + if (balloon_state == BP_ECANCELED) {


What about other states? We are really waiting for BP_DONE, aren't we?


-boris


> + pr_warn_once("Initial ballooning failed, %ld pages need to be freed.\n",
> + -credit);
> + if (jiffies - last_changed >=
> + HZ * balloon_stats.boot_timeout)
> + panic("Initial ballooning failed!\n");
> + }
> +

Subject: Re: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done

On Fri, Oct 29, 2021 at 05:46:18PM -0400, Boris Ostrovsky wrote:
>
> On 10/29/21 10:20 AM, Juergen Gross wrote:
> > --- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
> > +++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
> > @@ -84,3 +84,13 @@ Description:
> > Control scrubbing pages before returning them to Xen for others domains
> > use. Can be set with xen_scrub_pages cmdline
> > parameter. Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
> > +
> > +What: /sys/devices/system/xen_memory/xen_memory0/boot_timeout
> > +Date: November 2021
> > +KernelVersion: 5.16
> > +Contact: [email protected]
> > +Description:
> > + The time (in seconds) to wait before giving up to boot in case
> > + initial ballooning fails to free enough memory. Applies only
> > + when running as HVM or PVH guest and started with less memory
> > + configured than allowed at max.
>
>
> How is this going to be used? We only need this during boot.
>
>
> > - state = update_schedule(state);
> > + balloon_state = update_schedule(balloon_state);
>
>
> Now that balloon_state has whole file scope it can probably be updated inside update_schedule().
>
>
> > + while ((credit = current_credit()) < 0) {
> > + if (credit != last_credit) {
> > + last_changed = jiffies;
> > + last_credit = credit;
> > + }
> > + if (balloon_state == BP_ECANCELED) {
>
>
> What about other states? We are really waiting for BP_DONE, aren't we?

BP_DONE is set also as an intermediate step:

balloon_state = decrease_reservation(n_pages,
GFP_BALLOON);
if (balloon_state == BP_DONE && n_pages != -credit &&
n_pages < totalreserve_pages)
balloon_state = BP_EAGAIN;

It would be bad to finish waiting in this case.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


Attachments:
(No filename) (2.00 kB)
signature.asc (499.00 B)
Download all attachments

2021-10-29 23:51:19

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done


On 10/29/21 6:18 PM, Marek Marczykowski-Górecki wrote:
> On Fri, Oct 29, 2021 at 05:46:18PM -0400, Boris Ostrovsky wrote:
>> On 10/29/21 10:20 AM, Juergen Gross wrote:
>>> --- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>>> +++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>>> @@ -84,3 +84,13 @@ Description:
>>> Control scrubbing pages before returning them to Xen for others domains
>>> use. Can be set with xen_scrub_pages cmdline
>>> parameter. Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
>>> +
>>> +What: /sys/devices/system/xen_memory/xen_memory0/boot_timeout
>>> +Date: November 2021
>>> +KernelVersion: 5.16
>>> +Contact: [email protected]
>>> +Description:
>>> + The time (in seconds) to wait before giving up to boot in case
>>> + initial ballooning fails to free enough memory. Applies only
>>> + when running as HVM or PVH guest and started with less memory
>>> + configured than allowed at max.
>>
>> How is this going to be used? We only need this during boot.
>>
>>
>>> - state = update_schedule(state);
>>> + balloon_state = update_schedule(balloon_state);
>>
>> Now that balloon_state has whole file scope it can probably be updated inside update_schedule().
>>
>>
>>> + while ((credit = current_credit()) < 0) {
>>> + if (credit != last_credit) {
>>> + last_changed = jiffies;
>>> + last_credit = credit;
>>> + }
>>> + if (balloon_state == BP_ECANCELED) {
>>
>> What about other states? We are really waiting for BP_DONE, aren't we?
> BP_DONE is set also as an intermediate step:
>
> balloon_state = decrease_reservation(n_pages,
> GFP_BALLOON);
> if (balloon_state == BP_DONE && n_pages != -credit &&
> n_pages < totalreserve_pages)
> balloon_state = BP_EAGAIN;
>
> It would be bad to finish waiting in this case.


RIght, but if we were to say 'if (balloon_state != BP_DONE)' the worst that can happen is that we will continue on to the next iteration without warning and/or panicing. Of course, there is a chance thaton the next iteration the same thing will happen but I think chances of hitting this race every time are infinitely low. We can also check for current_credit() again.


The question is whether we do want to continue waiting if we are in BP_AGAIN. I don't think BP_WAIT is possible in this case although this may change in the future and we will forget to update this code.


-boris

2021-11-01 07:17:36

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done

On 29.10.21 23:46, Boris Ostrovsky wrote:
>
> On 10/29/21 10:20 AM, Juergen Gross wrote:
>> --- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>> +++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>> @@ -84,3 +84,13 @@ Description:
>>           Control scrubbing pages before returning them to Xen for
>> others domains
>>           use. Can be set with xen_scrub_pages cmdline
>>           parameter. Default value controlled with
>> CONFIG_XEN_SCRUB_PAGES_DEFAULT.
>> +
>> +What:        /sys/devices/system/xen_memory/xen_memory0/boot_timeout
>> +Date:        November 2021
>> +KernelVersion:    5.16
>> +Contact:    [email protected]
>> +Description:
>> +        The time (in seconds) to wait before giving up to boot in case
>> +        initial ballooning fails to free enough memory. Applies only
>> +        when running as HVM or PVH guest and started with less memory
>> +        configured than allowed at max.
>
>
> How is this going to be used? We only need this during boot.

Of course. Will switch to module_param().

>> -        state = update_schedule(state);
>> +        balloon_state = update_schedule(balloon_state);
>
>
> Now that balloon_state has whole file scope it can probably be updated
> inside update_schedule().

I can do that.

>> +    while ((credit = current_credit()) < 0) {
>> +        if (credit != last_credit) {
>> +            last_changed = jiffies;
>> +            last_credit = credit;
>> +        }
>> +        if (balloon_state == BP_ECANCELED) {
>
>
> What about other states? We are really waiting for BP_DONE, aren't we?

Nearly. We are waiting for credit not being negative.

And in case of cancelled we know this won't happen without Xen admin
intervention.


Juergen


Attachments:
OpenPGP_0xB0DE9DD628BF132F.asc (3.06 kB)
OpenPGP public key
OpenPGP_signature (505.00 B)
OpenPGP digital signature
Download all attachments

2021-11-01 07:22:26

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH v3] xen/balloon: add late_initcall_sync() for initial ballooning done

On 30.10.21 01:44, Boris Ostrovsky wrote:
>
> On 10/29/21 6:18 PM, Marek Marczykowski-Górecki wrote:
>> On Fri, Oct 29, 2021 at 05:46:18PM -0400, Boris Ostrovsky wrote:
>>> On 10/29/21 10:20 AM, Juergen Gross wrote:
>>>> --- a/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>>>> +++ b/Documentation/ABI/stable/sysfs-devices-system-xen_memory
>>>> @@ -84,3 +84,13 @@ Description:
>>>>            Control scrubbing pages before returning them to Xen for
>>>> others domains
>>>>            use. Can be set with xen_scrub_pages cmdline
>>>>            parameter. Default value controlled with
>>>> CONFIG_XEN_SCRUB_PAGES_DEFAULT.
>>>> +
>>>> +What:        /sys/devices/system/xen_memory/xen_memory0/boot_timeout
>>>> +Date:        November 2021
>>>> +KernelVersion:    5.16
>>>> +Contact:    [email protected]
>>>> +Description:
>>>> +        The time (in seconds) to wait before giving up to boot in case
>>>> +        initial ballooning fails to free enough memory. Applies only
>>>> +        when running as HVM or PVH guest and started with less memory
>>>> +        configured than allowed at max.
>>>
>>> How is this going to be used? We only need this during boot.
>>>
>>>
>>>> -        state = update_schedule(state);
>>>> +        balloon_state = update_schedule(balloon_state);
>>>
>>> Now that balloon_state has whole file scope it can probably be
>>> updated inside update_schedule().
>>>
>>>
>>>> +    while ((credit = current_credit()) < 0) {
>>>> +        if (credit != last_credit) {
>>>> +            last_changed = jiffies;
>>>> +            last_credit = credit;
>>>> +        }
>>>> +        if (balloon_state == BP_ECANCELED) {
>>>
>>> What about other states? We are really waiting for BP_DONE, aren't we?
>> BP_DONE is set also as an intermediate step:
>>
>>                         balloon_state = decrease_reservation(n_pages,
>>
>> GFP_BALLOON);
>>                         if (balloon_state == BP_DONE && n_pages !=
>> -credit &&
>>                              n_pages < totalreserve_pages)
>>                                 balloon_state = BP_EAGAIN;
>>
>> It would be bad to finish waiting in this case.
>
>
> RIght, but if we were to say 'if (balloon_state != BP_DONE)' the worst
> that can happen is that we will continue on to the next iteration
> without warning and/or panicing. Of course, there is a chance thaton the
> next iteration the same thing will happen but I think chances of hitting
> this race every time are infinitely low. We can also check for
> current_credit() again.
>
>
> The question is whether we do want to continue waiting if we are in
> BP_AGAIN. I don't think BP_WAIT is possible in this case although this
> may change in the future and we will forget to update this code.

BP_EAGAIN should not stop waiting, as it might be intermediate in case
some caches or buffers are freed.


Juergen


Attachments:
OpenPGP_0xB0DE9DD628BF132F.asc (3.06 kB)
OpenPGP public key
OpenPGP_signature (505.00 B)
OpenPGP digital signature
Download all attachments