2022-10-05 10:03:39

by Alexander Atanasov

Subject: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Add counters to be updated by the balloon drivers.
Create balloon notifier to propagate changes.

Signed-off-by: Alexander Atanasov <[email protected]>
---
include/linux/balloon.h | 18 ++++++++++++++++++
mm/balloon.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)

diff --git a/include/linux/balloon.h b/include/linux/balloon.h
index 46ac8f61f607..59657da77d95 100644
--- a/include/linux/balloon.h
+++ b/include/linux/balloon.h
@@ -57,6 +57,24 @@ struct balloon_dev_info {
struct page *page, enum migrate_mode mode);
};

+extern atomic_long_t mem_balloon_inflated_total_kb;
+extern atomic_long_t mem_balloon_inflated_free_kb;
+
+void balloon_set_inflated_total(long inflated_kb);
+void balloon_set_inflated_free(long inflated_kb);
+
+#define BALLOON_CHANGED_TOTAL 0
+#define BALLOON_CHANGED_FREE 1
+
+int register_balloon_notifier(struct notifier_block *nb);
+void unregister_balloon_notifier(struct notifier_block *nb);
+
+#define balloon_notifier(fn, pri) ({ \
+ static struct notifier_block fn##_mem_nb __meminitdata =\
+ { .notifier_call = fn, .priority = pri }; \
+ register_balloon_notifier(&fn##_mem_nb); \
+})
+
struct page *balloon_page_alloc(void);
void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
struct page *page);
diff --git a/mm/balloon.c b/mm/balloon.c
index 22b3e876bc78..8e1d5855fef8 100644
--- a/mm/balloon.c
+++ b/mm/balloon.c
@@ -7,8 +7,44 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/export.h>
+#include <linux/notifier.h>
#include <linux/balloon.h>

+atomic_long_t mem_balloon_inflated_total_kb = ATOMIC_LONG_INIT(0);
+atomic_long_t mem_balloon_inflated_free_kb = ATOMIC_LONG_INIT(0);
+SRCU_NOTIFIER_HEAD_STATIC(balloon_chain);
+
+int register_balloon_notifier(struct notifier_block *nb)
+{
+ return srcu_notifier_chain_register(&balloon_chain, nb);
+}
+EXPORT_SYMBOL(register_balloon_notifier);
+
+void unregister_balloon_notifier(struct notifier_block *nb)
+{
+ srcu_notifier_chain_unregister(&balloon_chain, nb);
+}
+EXPORT_SYMBOL(unregister_balloon_notifier);
+
+static int balloon_notify(unsigned long val)
+{
+ return srcu_notifier_call_chain(&balloon_chain, val, NULL);
+}
+
+void balloon_set_inflated_total(long inflated_kb)
+{
+ atomic_long_set(&mem_balloon_inflated_total_kb, inflated_kb);
+ balloon_notify(BALLOON_CHANGED_TOTAL);
+}
+EXPORT_SYMBOL(balloon_set_inflated_total);
+
+void balloon_set_inflated_free(long inflated_kb)
+{
+ atomic_long_set(&mem_balloon_inflated_free_kb, inflated_kb);
+ balloon_notify(BALLOON_CHANGED_FREE);
+}
+EXPORT_SYMBOL(balloon_set_inflated_free);
+
static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
struct page *page)
{
--
2.31.1


2022-10-05 17:46:29

by Nadav Amit

Subject: Re: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On Oct 5, 2022, at 2:01 AM, Alexander Atanasov <[email protected]> wrote:

> Add counters to be updated by the balloon drivers.
> Create balloon notifier to propagate changes.

I missed the other patches before (including this one). Sorry, but next
time, please cc me.

I was looking through the series and I did not see actual users of the
notifier. Usually, it is not great to build an API without users.

[snip]

> +
> +static int balloon_notify(unsigned long val)
> +{
> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);

Since you know the inflated_kb value here, why not to use it as an argument
to the callback? I think casting to (void *) and back is best. But you can
also provide pointer to the value. Doesn’t it sound better than having
potentially different notifiers reading different values?

Anyhow, without users (actual notifiers) it’s kind of hard to know how
reasonable it all is. For instance, is balloon_notify() supposed to
prevent further balloon inflating/deflating until the notifier completes?
Accordingly, are callers to balloon_notify() expected to relinquish locks
before calling balloon_notify() to prevent deadlocks and high latency?

2022-10-06 08:11:55

by Alexander Atanasov

Subject: Re: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Hello,


On 5.10.22 20:25, Nadav Amit wrote:
> On Oct 5, 2022, at 2:01 AM, Alexander Atanasov <[email protected]> wrote:
>
>> Add counters to be updated by the balloon drivers.
>> Create balloon notifier to propagate changes.
>
> I missed the other patches before (including this one). Sorry, but next
> time, please cc me.

You have been CCed in the cover letter since the first version. I will CC you
on the individual patches if you want.

>
> I was looking through the series and I did not see actual users of the
> notifier. Usually, it is not great to build an API without users.


You are right. I hope to get some feedback/interest from the potential users
that I mentioned in the cover letter. I will probably split the notifier
into a separate series. To make it useful it will require more changes.
See below for more about them.


> [snip]
>
>> +
>> +static int balloon_notify(unsigned long val)
>> +{
>> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);
>
> Since you know the inflated_kb value here, why not to use it as an argument
> to the callback? I think casting to (void *) and back is best. But you can
> also provide pointer to the value. Doesn’t it sound better than having
> potentially different notifiers reading different values?

My current idea is to have a struct with the current and previous values,
and maybe the change in percent. The actual value does not matter to anyone,
but the size of the change does. When a user gets notified it can act upon
the change - if it is small it can ignore it; if it is above some
threshold it can act; if it makes sense for some receiver, it can
accumulate changes from several notifications. Another option/addition is
to have si_meminfo_current(..) and totalram_pages_current(..) that
return values adjusted with the balloon values.

Going further - there are a few places that calculate something based on
available memory that do not have a sysfs/proc interface for setting
limits. Most of them work in percentages, so they can be converted to redo
the calculations when they get a notification.

The ones that have an interface for configuration but use memory values can
be handled in two ways - convert them to use percentages of what is
available, or extend the notifier to notify userspace, which in turn does
the calculations and updates the configuration.

> Anyhow, without users (actual notifiers) it’s kind of hard to know how
> reasonable it all is. For instance, is it balloon_notify() supposed to
> prevent further balloon inflating/deflating until the notifier completes?

No, we must avoid that at any cost.

> Accordingly, are callers to balloon_notify() expected to relinquish locks
> before calling balloon_notify() to prevent deadlocks and high latency?

My goal is to avoid any possible impact on performance. Drivers are free
to delay notifications if they get in the way. (I see that I need to
move the notification after the semaphore in the vmw driver - I missed
that - will fix in the next iteration.)
Deadlocks - depends on the users, but few to none will likely have to
deal with common locks.


--
Regards,
Alexander Atanasov

2022-10-06 21:53:14

by Nadav Amit

Subject: Re: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On Oct 6, 2022, at 12:34 AM, Alexander Atanasov <[email protected]> wrote:

> Hello,
>
>
> On 5.10.22 20:25, Nadav Amit wrote:
>> On Oct 5, 2022, at 2:01 AM, Alexander Atanasov <[email protected]> wrote:
>>> Add counters to be updated by the balloon drivers.
>>> Create balloon notifier to propagate changes.
>> I missed the other patches before (including this one). Sorry, but next
>> time, please cc me.
>
> You are CCed in the cover letter since the version. I will add CC to you
> in the individual patches if you want so.

Thanks.

Just to clarify - I am not attacking you. It’s more of me making an excuse
for not addressing some issues in earlier versions.

>> I was looking through the series and I did not see actual users of the
>> notifier. Usually, it is not great to build an API without users.
>
>
> You are right. I hope to get some feedback/interest from potential users that i mentioned in the cover letter. I will probably split the notifier
> in separate series. To make it usefull it will require more changes.
> See bellow more about them.

Fair, but this is something that is more suitable for RFC. Otherwise, more
likely than not - your patches would go in as is.

>> [snip]
>>> +
>>> +static int balloon_notify(unsigned long val)
>>> +{
>>> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);
>> Since you know the inflated_kb value here, why not to use it as an argument
>> to the callback? I think casting to (void *) and back is best. But you can
>> also provide pointer to the value. Doesn’t it sound better than having
>> potentially different notifiers reading different values?
>
> My current idea is to have a struct with current and previous value,
> may be change in percents. The actual value does not matter to anyone
> but the size of change does. When a user gets notified it can act upon
> the change - if it is small it can ignore it , if it is above some threshold it can act - if it makes sense for some receiver is can accumulate changes from several notification. Other option/addition is to have si_meminfo_current(..) and totalram_pages_current(..) that return values adjusted with the balloon values.
>
> Going further - there are few places that calculate something based on available memory that do not have sysfs/proc interface for setting limits. Most of them work in percents so they can be converted to do calculations when they get notification.
>
> The one that have interface for configuration but use memory values can be handled in two ways - convert to use percents of what is available or extend the notifier to notify userspace which in turn to do calculations and update configuration.

I really need to see code to fully understand what you have in mind.
Division, as you know, is not something that we really want to do very
frequently.

>> Anyhow, without users (actual notifiers) it’s kind of hard to know how
>> reasonable it all is. For instance, is it balloon_notify() supposed to
>> prevent further balloon inflating/deflating until the notifier completes?
>
> No, we must avoid that at any cost.
>
>> Accordingly, are callers to balloon_notify() expected to relinquish locks
>> before calling balloon_notify() to prevent deadlocks and high latency?
>
> My goal is to avoid any possible impact on performance. Drivers are free to delay notifications if they get in the way. (I see that i need to move the notification after the semaphore in the vmw driver - i missed that - will fix in the next iterration.)
> Deadlocks - depends on the users but a few to none will possibly have to deal with common locks.

I will need to see the next version to give better feedback. One more thing
that comes to mind though is whether saving the balloon size in multiple
places (both mem_balloon_inflated_total_kb and each balloon’s accounting) is
the right way. It does not sound very clean.

Two other options are to move *all* the accounting to your new
mem_balloon_inflated_total_kb-like interface, or to expose some per-balloon
function to get the balloon size (an indirect function call would likely
have some overhead though).

Anyhow, I am not crazy about having the same data replicated. Even from
a code-reading standpoint it is not intuitive.

2022-10-07 11:36:13

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On 7.10.22 0:07, Nadav Amit wrote:
> On Oct 6, 2022, at 12:34 AM, Alexander Atanasov <[email protected]> wrote:
>
>> Hello,
>>
>>
>> On 5.10.22 20:25, Nadav Amit wrote:
>>> On Oct 5, 2022, at 2:01 AM, Alexander Atanasov <[email protected]> wrote:
>>>> Add counters to be updated by the balloon drivers.
>>>> Create balloon notifier to propagate changes.
>>> I missed the other patches before (including this one). Sorry, but next
>>> time, please cc me.
>>
>> You are CCed in the cover letter since the version. I will add CC to you
>> in the individual patches if you want so.
>
> Thanks.
>
> Just to clarify - I am not attacking you. It’s more of me making an excuse
> for not addressing some issues in earlier versions.

OK, I am glad that you did it now, but I do not think you need to apologize.

>>> I was looking through the series and I did not see actual users of the
>>> notifier. Usually, it is not great to build an API without users.
>>
>>
>> You are right. I hope to get some feedback/interest from potential users that i mentioned in the cover letter. I will probably split the notifier
>> in separate series. To make it usefull it will require more changes.
>> See bellow more about them.
>
> Fair, but this is something that is more suitable for RFC. Otherwise, more
> likely than not - your patches would go in as is.

Yes, I will remove the notifier and resend both as RFC. I think that
every patch is an RFC; the RFC tag is used for more general changes that
could affect unexpected areas, change functionality, change the design,
and in general lead to a bigger impact. In this case the series adds
functionality that is missing, and it could hardly affect anything else.
In essence it provides information that you cannot get without it.
But I will take your advice and push everything through RFC from now on.

>>> [snip]
>>>> +
>>>> +static int balloon_notify(unsigned long val)
>>>> +{
>>>> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);
>>> Since you know the inflated_kb value here, why not to use it as an argument
>>> to the callback? I think casting to (void *) and back is best. But you can
>>> also provide pointer to the value. Doesn’t it sound better than having
>>> potentially different notifiers reading different values?
>>
>> My current idea is to have a struct with current and previous value,
>> may be change in percents. The actual value does not matter to anyone
>> but the size of change does. When a user gets notified it can act upon
>> the change - if it is small it can ignore it , if it is above some threshold it can act - if it makes sense for some receiver is can accumulate changes from several notification. Other option/addition is to have si_meminfo_current(..) and totalram_pages_current(..) that return values adjusted with the balloon values.
>>
>> Going further - there are few places that calculate something based on available memory that do not have sysfs/proc interface for setting limits. Most of them work in percents so they can be converted to do calculations when they get notification.
>>
>> The one that have interface for configuration but use memory values can be handled in two ways - convert to use percents of what is available or extend the notifier to notify userspace which in turn to do calculations and update configuration.
>
> I really need to see code to fully understand what you have in mind.

Sure - you can check some of the users with git grep totalram_pages -
it shows self-explanatory results of usage like:
fs/f2fs/node.c:bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) - calculations in percents - one good example
fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
fs/fuse/inode.c: *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / 392;
fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
fs/nfsd/nfscache.c: unsigned long low_pages = totalram_pages() - totalhigh_pages()
mm/oom_kill.c: oc->totalpages = totalram_pages() + total_swap_pages;

So all balloon setups give the guest a large amount of RAM on boot, then
inflate the balloon. But these places have already been initialized, and
they know that the system has a given amount of total RAM, which is no
longer true the moment they start to operate. The result is that too much
space gets used and it degrades userspace performance.
Example - fs/eventpoll.c: static int __init eventpoll_init(void) reserves
4% of RAM for the eventpoll - when you inflate half of the RAM it becomes
8% of the remaining RAM - do you really need 8% of your RAM to be used for
eventpoll?

To solve this you need to register and, when notified, update the cache
sizes, the limits, and whatever else is calculated from the amount of memory.

The difference is here:

mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100

This uses percentages, so you can recalculate easily with:

+static inline unsigned long totalram_pages_current(void)
+{
+ unsigned long inflated = 0;
+#ifdef CONFIG_MEMORY_BALLOON
+ extern atomic_long_t mem_balloon_inflated_free_kb;
+ inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
+ inflated >>= (PAGE_SHIFT - 10);
+#endif
+ return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
+}

And you are good when you switch to the _current version -
si_meminfo_current() is alike.

On init (probably) they all use some kind of fraction to calculate, but
when there is a value set via /proc/sys/net/ipv4/tcp_wmem, for example, it
is just a value and you cannot recalculate it. And here, please, share
your ideas on how to solve this.


> Division, as you know, is not something that we really want to do very
> frequently.

Yes, that's true, but in the actual implementation there are a lot of ways
to avoid it. It is just easier to explain with division.

Even if you do have to do a division, you can limit the recalculations:
struct balloon_notify {
	unsigned long last_inflated_free_kb;
	unsigned long last_inflated_used_kb;
	unsigned long inflated_free_kb;
	unsigned long inflated_used_kb;
};

So you can do it only if the change is more than 1GB, and do nothing if the
change is 1MB.

>
>>> Anyhow, without users (actual notifiers) it’s kind of hard to know how
>>> reasonable it all is. For instance, is it balloon_notify() supposed to
>>> prevent further balloon inflating/deflating until the notifier completes?
>>
>> No, we must avoid that at any cost.
>>
>>> Accordingly, are callers to balloon_notify() expected to relinquish locks
>>> before calling balloon_notify() to prevent deadlocks and high latency?
>>
>> My goal is to avoid any possible impact on performance. Drivers are free to delay notifications if they get in the way. (I see that i need to move the notification after the semaphore in the vmw driver - i missed that - will fix in the next iterration.)
>> Deadlocks - depends on the users but a few to none will possibly have to deal with common locks.
>
> I will need to see the next version to give better feedback. One more thing
> that comes to mind though is whether saving the balloon size in multiple
> places (both mem_balloon_inflated_total_kb and each balloon’s accounting) is
> the right way. It does not sounds very clean.
>
> Two other options is to move *all* the accounting to your new
> mem_balloon_inflated_total_kb-like interface or expose some per-balloon
> function to get the balloon size (indirect-function-call would likely have
> some overhead though).
>
> Anyhow, I am not crazy about having the same data replicated. Even from
> reading the code stand-of-view it is not intuitive.

If such an interface had existed before the drivers, it would ideally be
like that - all in one place. But keeping the internal (driver)
representation, which may differ from the system one, separate from the
external (system) representation is a good option. If a driver can convert
and use only the system counters, it can do so.

--
Regards,
Alexander Atanasov

2022-10-10 06:31:32

by Nadav Amit

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <[email protected]> wrote:

> On 7.10.22 0:07, Nadav Amit wrote:
>>>>
>>>> I was looking through the series and I did not see actual users of the
>>>> notifier. Usually, it is not great to build an API without users.
>>>
>>>
>>> You are right. I hope to get some feedback/interest from potential users that i mentioned in the cover letter. I will probably split the notifier
>>> in separate series. To make it usefull it will require more changes.
>>> See bellow more about them.
>> Fair, but this is something that is more suitable for RFC. Otherwise, more
>> likely than not - your patches would go in as is.
>
> Yes, i will remove the notifier and resend both as RFC. I think that every patch is an RFC and RFC tag is used for more general changes that could affect unexpected areas, change functionality, change design and in general can lead to bigger impact. In the case with this it adds functionality that is missing and it could hardly affect anything else.
> In essence it provides information that you can not get without it.
> But i will take your advice and push everything thru RFC from now on.

Just keep the version numbers as you had them before. That's fine and
better for preventing confusion.

>>>> [snip]
>>>>> +
>>>>> +static int balloon_notify(unsigned long val)
>>>>> +{
>>>>> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);
>>>> Since you know the inflated_kb value here, why not to use it as an argument
>>>> to the callback? I think casting to (void *) and back is best. But you can
>>>> also provide pointer to the value. Doesn’t it sound better than having
>>>> potentially different notifiers reading different values?
>>>
>>> My current idea is to have a struct with current and previous value,
>>> may be change in percents. The actual value does not matter to anyone
>>> but the size of change does. When a user gets notified it can act upon
>>> the change - if it is small it can ignore it , if it is above some threshold it can act - if it makes sense for some receiver is can accumulate changes from several notification. Other option/addition is to have si_meminfo_current(..) and totalram_pages_current(..) that return values adjusted with the balloon values.
>>>
>>> Going further - there are few places that calculate something based on available memory that do not have sysfs/proc interface for setting limits. Most of them work in percents so they can be converted to do calculations when they get notification.
>>>
>>> The one that have interface for configuration but use memory values can be handled in two ways - convert to use percents of what is available or extend the notifier to notify userspace which in turn to do calculations and update configuration.
>> I really need to see code to fully understand what you have in mind.
>
> Sure - you can check some of the users with git grep totalram_pages - shows self explanatory results of usage like:
> fs/f2fs/node.c:bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) - calculations in percents - one good example
> fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
> fs/fuse/inode.c: *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / 392;
> fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
> fs/nfsd/nfscache.c: unsigned long low_pages = totalram_pages() - totalhigh_pages()
> mm/oom_kill.c: oc->totalpages = totalram_pages() + total_swap_pages;
>
>
> So all balloon drivers give large amount of RAM on boot , then inflate the balloon. But this places have already been initiallized and they know that the system have given amount of totalram which is not true the moment they start to operate. the result is that too much space gets used and it degrades the userspace performance.
> example - fs/eventpoll.c:static int __init eventpoll_init(void) - 4% of ram for eventpool - when you inflate half of the ram it becomes 8% of the ram - do you really need 8% of your ram to be used for eventpool?
>
> To solve this you need to register and when notified update - cache size, limits and for whatever is the calculated amount of memory used.

Hmm.. Not sure about all of that. Most balloon drivers are manually managed,
and call adjust_managed_page_count(), and as a result one might want to redo
all the calculations that are based on totalram_pages().

Side-note: That’s not the case for VMware balloon. I actually considered
calling adjust_managed_page_count() just to conform with other balloon
drivers. But since we use totalram_pages() to communicate to the hypervisor
the total-ram, this would create endless (and wrong) feedback loop. I am not
claiming it is not possible to VMware balloon driver to call
adjust_managed_page_count(), but the chances are that it would create more
harm than good.

Back to the matter at hand. It seems that you wish that the notifiers would
be called following any changes that would be reflected in totalram_pages().
So, doesn't it make more sense to call it from adjust_managed_page_count() ?

> The difference is here:
>
> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
> uses percents and you can recalculate easy with
>
> +static inline unsigned long totalram_pages_current(void)
> +{
> + unsigned long inflated = 0;
> +#ifdef CONFIG_MEMORY_BALLOON
> + extern atomic_long_t mem_balloon_inflated_free_kb;
> + inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
> + inflated >>= (PAGE_SHIFT - 10);
> +#endif
> + return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
> +}
>
> And you are good when you switch to _current version - si_meminfo_current is alike .
>
> On init (probably) all use some kind of fractions to calculate but when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it is just a value and you can not recalculate it. And here, please, share your ideas how to solve this.

I don’t get all of that. Now that you provided some more explanations, it
sounds that what you want is adjust_managed_page_count(), which we already
have and affects the output of totalram_pages(). Therefore, totalram_pages()
anyhow accounts for the balloon memory (excluding VMware’s). So why do we
need to take mem_balloon_inflated_free_kb into account?

Sounds to me that all you want is some notifier to be called from
adjust_managed_page_count(). What am I missing?

2022-10-10 08:41:38

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Hello,

On 10.10.22 9:18, Nadav Amit wrote:
> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <[email protected]> wrote:
>
>> On 7.10.22 0:07, Nadav Amit wrote:
>>>>>
>>>>> I was looking through the series and I did not see actual users of the
>>>>> notifier. Usually, it is not great to build an API without users.
>>>>
>>>>
>>>> You are right. I hope to get some feedback/interest from potential users that i mentioned in the cover letter. I will probably split the notifier
>>>> in separate series. To make it usefull it will require more changes.
>>>> See bellow more about them.
>>> Fair, but this is something that is more suitable for RFC. Otherwise, more
>>> likely than not - your patches would go in as is.
>>
>> Yes, i will remove the notifier and resend both as RFC. I think that every patch is an RFC and RFC tag is used for more general changes that could affect unexpected areas, change functionality, change design and in general can lead to bigger impact. In the case with this it adds functionality that is missing and it could hardly affect anything else.
>> In essence it provides information that you can not get without it.
>> But i will take your advice and push everything thru RFC from now on.
>
> Jus keep the version numbers as you had before. That’s fine and better to
> prevent confusion.

Sure, i will.

>>>>> [snip]
>>>>>> +
>>>>>> +static int balloon_notify(unsigned long val)
>>>>>> +{
>>>>>> + return srcu_notifier_call_chain(&balloon_chain, val, NULL);
>>>>> Since you know the inflated_kb value here, why not to use it as an argument
>>>>> to the callback? I think casting to (void *) and back is best. But you can
>>>>> also provide pointer to the value. Doesn’t it sound better than having
>>>>> potentially different notifiers reading different values?
>>>>
>>>> My current idea is to have a struct with current and previous value,
>>>> may be change in percents. The actual value does not matter to anyone
>>>> but the size of change does. When a user gets notified it can act upon
>>>> the change - if it is small it can ignore it , if it is above some threshold it can act - if it makes sense for some receiver is can accumulate changes from several notification. Other option/addition is to have si_meminfo_current(..) and totalram_pages_current(..) that return values adjusted with the balloon values.
>>>>
>>>> Going further - there are few places that calculate something based on available memory that do not have sysfs/proc interface for setting limits. Most of them work in percents so they can be converted to do calculations when they get notification.
>>>>
>>>> The one that have interface for configuration but use memory values can be handled in two ways - convert to use percents of what is available or extend the notifier to notify userspace which in turn to do calculations and update configuration.
>>> I really need to see code to fully understand what you have in mind.
>>
>> Sure - you can check some of the users with git grep totalram_pages - shows self explanatory results of usage like:
>> fs/f2fs/node.c:bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) - calculations in percents - one good example
>> fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
>> fs/fuse/inode.c: *limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / 392;
>> fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
>> fs/nfsd/nfscache.c: unsigned long low_pages = totalram_pages() - totalhigh_pages()
>> mm/oom_kill.c: oc->totalpages = totalram_pages() + total_swap_pages;
>>
>>
>> So all balloon drivers give large amount of RAM on boot , then inflate the balloon. But this places have already been initiallized and they know that the system have given amount of totalram which is not true the moment they start to operate. the result is that too much space gets used and it degrades the userspace performance.
>> example - fs/eventpoll.c:static int __init eventpoll_init(void) - 4% of ram for eventpool - when you inflate half of the ram it becomes 8% of the ram - do you really need 8% of your ram to be used for eventpool?
>>
>> To solve this you need to register and when notified update - cache size, limits and for whatever is the calculated amount of memory used.
>
> Hmm.. Not sure about all of that. Most balloon drivers are manually managed,
> and call adjust_managed_page_count(), and tas a result might want to redo
> all the calculations that are based on totalram_pages().

Yes, I will say that it looks mixed - manual for large changes and
automatic for small changes. VMware and Hyper-V have automatic and
manual /not sure exactly what you can change on a running VM, but I guess
you can/ - virtio is only manual. I do not know about dlpar/xen.

The scenario is like this: start a VM with 4GB of RAM, reduce it to 2GB
with the balloon - the VM can be upgraded later.

All we are talking about relates to memory hotplug/unplug /where unplug
is close to nonexistent, hence the balloons are used/.

All values should be recalculated on memory hotplug too, so you can use
the newly available RAM.

RAM is the most valuable resource of all, so I consider using it
optimally to be of great importance.

> Side-note: That’s not the case for VMware balloon. I actually considered
> calling adjust_managed_page_count() just to conform with other balloon
> drivers. But since we use totalram_pages() to communicate to the hypervisor
> the total-ram, this would create endless (and wrong) feedback loop. I am not
> claiming it is not possible to VMware balloon driver to call
> adjust_managed_page_count(), but the chances are that it would create more
> harm than good.

Virtio does both, depending on the deflate-on-OOM option. I already
suggested unifying all drivers to inflate the used memory, since that
seems more logical to me - nobody expects totalram_pages() to change -
but the current state is that both ways are accepted, and changing that
could break existing users.
See the discussion here:
https://lore.kernel.org/lkml/[email protected]/.


> Back to the matter at hand. It seems that you wish that the notifiers would
> be called following any changes that would be reflected in totalram_pages().
> So, doesn't it make more sense to call it from adjust_managed_page_count() ?

It will hurt performance - all drivers work page by page, i.e. they
update by +1/-1, and they do so under locks, which, as you already noted,
can lead to bad things. The notifier will accumulate the change and let
its users know how much changed, so they can decide whether they have to
recalculate - it can even do so asynchronously in order not to disturb
the drivers.

>> The difference is here:
>>
>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
>> uses percents and you can recalculate easy with
>>
>> +static inline unsigned long totalram_pages_current(void)
>> +{
>> + unsigned long inflated = 0;
>> +#ifdef CONFIG_MEMORY_BALLOON
>> + extern atomic_long_t mem_balloon_inflated_free_kb;
>> + inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>> + inflated >>= (PAGE_SHIFT - 10);
>> +#endif
>> + return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>> +}
>>
>> And you are good when you switch to _current version - si_meminfo_current is alike .
>>
>> On init (probably) all use some kind of fractions to calculate but when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it is just a value and you can not recalculate it. And here, please, share your ideas how to solve this.
>
> I don’t get all of that. Now that you provided some more explanations, it
> sounds that what you want is adjust_managed_page_count(), which we already
> have and affects the output of totalram_pages(). Therefore, totalram_pages()
> anyhow accounts for the balloon memory (excluding VMware’s). So why do we
> need to take mem_balloon_inflated_free_kb into account?
Ok, you have this:
/ totalram
|----used----|b1|----free------|b2|

Drivers can inflate both b1 and b2 - with b1, free gets smaller; with
b2, totalram pages get smaller. So when you need totalram_pages() to do
a calculation, you need to adjust it by the pages that are inflated in
free/used (b1). VMware is not an exception; virtio does the same.
And according to mst and davidh it is okay like this.
So I am proposing a way to handle both cases.

> Sounds to me that all you want is some notifier to be called from
> adjust_managed_page_count(). What am I missing?

The notifier will act as an accumulator that reports the size of the
change, and it will make things easier for the drivers and users wrt
locking.
The notifier is similar to the memory hotplug notifier.

--
Regards,
Alexander Atanasov

2022-10-10 15:16:25

by Nadav Amit

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On Oct 10, 2022, at 12:24 AM, Alexander Atanasov <[email protected]> wrote:

> Hello,
>
> On 10.10.22 9:18, Nadav Amit wrote:
>> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <[email protected]> wrote:
>>> So all balloon drivers give large amount of RAM on boot , then inflate the balloon. But this places have already been initiallized and they know that the system have given amount of totalram which is not true the moment they start to operate. the result is that too much space gets used and it degrades the userspace performance.
>>> example - fs/eventpoll.c:static int __init eventpoll_init(void) - 4% of ram for eventpool - when you inflate half of the ram it becomes 8% of the ram - do you really need 8% of your ram to be used for eventpool?
>>>
>>> To solve this you need to register and when notified update - cache size, limits and for whatever is the calculated amount of memory used.
>> Hmm.. Not sure about all of that. Most balloon drivers are manually managed,
>> and call adjust_managed_page_count(), and tas a result might want to redo
>> all the calculations that are based on totalram_pages().
>
> Yes, i will say that it looks like mixed manual - for large changes and automatic for small changes. VMWare and HyperV have automatic and manual/not sure exactly what you can change on a running VM but i guess you can/ - Virtio is only manual. I do not know about dlpar / xen.
>
> Scenario is like this start a VM with 4GB ram, reduce to 2GB with balloon - vm can be upgraded.
>
> All we are talking about relates to memory hotplug/unplug /where unplug is close to nonexistant hence the balloons are used./
>
> All values should be recalculated on memory hotplug too, so you can use the newly available RAM.
>
> RAM is the most valuable resource of all so i consider using it optimally to be of a great importance.
>
>> Side-note: That’s not the case for VMware balloon. I actually considered
>> calling adjust_managed_page_count() just to conform with other balloon
>> drivers. But since we use totalram_pages() to communicate to the hypervisor
>> the total-ram, this would create endless (and wrong) feedback loop. I am not
>> claiming it is not possible to VMware balloon driver to call
>> adjust_managed_page_count(), but the chances are that it would create more
>> harm than good.
>
> Virtio does both - depending on the deflate on OOM option. I suggested already to unify all drivers to inflate the used memory as it seems more logical to me since no body expects the totalram_pages() to change but the current state is that both ways are accepted and if changed can break existing users.
> See discussion here https://lore.kernel.org/lkml/[email protected]/.

Thanks for the reminder. I wish you could somehow summarize all of that into the
cover-letter and/or the commit messages for these patches.

>
>
>> Back to the matter at hand. It seems that you wish that the notifiers would
>> be called following any changes that would be reflected in totalram_pages().
>> So, doesn't it make more sense to call it from adjust_managed_page_count() ?
>
> It will hurt performance - all drivers work page by page , i.e. they update by +1/-1 and they do so under locks which as you already noted can lead to bad things. The notifier will accumulate the change and let its user know how much changed, so the can decide if they have to recalculate - it even can do so async in order to not disturb the drivers.

So updating the counters by 1 is ok (using atomic operation, which is not
free)? And the reason it is (relatively) cheap is because nobody actually
looks at the value (i.e., nobody actually acts on the value)?

If nobody considers the value, then doesn’t it make sense just to update it
less frequently, and then call the notifiers?

>>> The difference is here:
>>>
>>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
>>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
>>> uses percents and you can recalculate easy with
>>>
>>> +static inline unsigned long totalram_pages_current(void)
>>> +{
>>> + unsigned long inflated = 0;
>>> +#ifdef CONFIG_MEMORY_BALLOON
>>> + extern atomic_long_t mem_balloon_inflated_free_kb;
>>> + inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>>> + inflated >>= (PAGE_SHIFT - 10);
>>> +#endif
>>> + return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>>> +}

So we have here two values, and there appears to be a hidden assumption
that they are both updated atomically. Otherwise, it appears, inflated
could theoretically be greater than _totalram_pages, and we would get a
negative value and all hell breaks loose.

But _totalram_pages and mem_balloon_inflated_free_kb are not updated
atomically together (each one is, but not together).

>>> And you are good when you switch to _current version - si_meminfo_current is alike .
>>>
>>> On init (probably) all use some kind of fractions to calculate but when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it is just a value and you can not recalculate it. And here, please, share your ideas how to solve this.
>> I don’t get all of that. Now that you provided some more explanations, it
>> sounds that what you want is adjust_managed_page_count(), which we already
>> have and affects the output of totalram_pages(). Therefore, totalram_pages()
>> anyhow accounts for the balloon memory (excluding VMware’s). So why do we
>> need to take mem_balloon_inflated_free_kb into account?
> Ok, you have this:
> / totalram
> |----used----|b1|----free------|b2|
>
> drivers can inflate both b1 and b2 - b1 free gets smaller, b2 totalram pages get smaller. so when you need totalram_pages() to do a calculation you need to adjust it with the pages that are inflated in free/used (b1). VMWare is not exception , Virtio does the same.
> And according to to mst and davidh it is okay like this.
> So i am proposing a way to handle both cases.

Ugh. What about BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can’t this
information be used instead of yet another counter? Unless, of course, you
get the atomicity that I mentioned before.

>> Sounds to me that all you want is some notifier to be called from
>> adjust_managed_page_count(). What am I missing?
>
> Notifier will act as an accumulator to report size of change and it will make things easier for the drivers and users wrt locking.
> Notifier is similar to the memory hotplug notifier.

Overall, I am not convinced that there is any value of separating the value
and the notifier. You can batch both or not batch both. In addition, as I
mentioned, having two values seems racy.

2022-10-11 09:22:26

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Hello,

On 10.10.22 17:47, Nadav Amit wrote:
> On Oct 10, 2022, at 12:24 AM, Alexander Atanasov <[email protected]> wrote:
>
>> Hello,
>>
>> On 10.10.22 9:18, Nadav Amit wrote:
>>> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov <[email protected]> wrote:

[snip]

>>> Side-note: That’s not the case for VMware balloon. I actually considered
>>> calling adjust_managed_page_count() just to conform with other balloon
>>> drivers. But since we use totalram_pages() to communicate to the hypervisor
>>> the total-ram, this would create endless (and wrong) feedback loop. I am not
>>> claiming it is not possible to VMware balloon driver to call
>>> adjust_managed_page_count(), but the chances are that it would create more
>>> harm than good.
>>
>> Virtio does both - depending on the deflate on OOM option. I suggested already to unify all drivers to inflate the used memory as it seems more logical to me since no body expects the totalram_pages() to change but the current state is that both ways are accepted and if changed can break existing users.
>> See discussion here https://lore.kernel.org/lkml/[email protected]/.
>
> Thanks for the reminder. I wish you can somehow summarize all of that into the
> cover-letter and/or the commit messages for these patches.


I will put excerpts and the relevant links in the next versions. I see
that the more I dig into it, the deeper it becomes, so it needs more
explanation.


>>
>>
>>> Back to the matter at hand. It seems that you wish that the notifiers would
>>> be called following any changes that would be reflected in totalram_pages().
>>> So, doesn't it make more sense to call it from adjust_managed_page_count() ?
>>
>> It will hurt performance - all drivers work page by page , i.e. they update by +1/-1 and they do so under locks which as you already noted can lead to bad things. The notifier will accumulate the change and let its user know how much changed, so the can decide if they have to recalculate - it even can do so async in order to not disturb the drivers.
>
> So updating the counters by 1 is ok (using atomic operation, which is not
> free)? And the reason it is (relatively) cheap is because nobody actually
> looks at the value (i.e., nobody actually acts on the value)?
>
> If nobody considers the value, then doesn’t it make sense just to update it
> less frequently, and then call the notifiers?

That's my point too.
The drivers update the managed page count by 1 at a time.
My goal is to fire the notifier once they are done.

All drivers are similar and work like this:
  HV sends an inflate/deflate request
  driver inflates/deflates:
    lock
    get_page()/put_page()
    optionally adjust_managed_page_count(..., +-1)
    unlock
  update_core and notify_balloon_changed

>>>> The difference is here:
>>>>
>>>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
>>>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
>>>> uses percents and you can recalculate easy with
>>>>
>>>> +static inline unsigned long totalram_pages_current(void)
>>>> +{
>>>> + unsigned long inflated = 0;
>>>> +#ifdef CONFIG_MEMORY_BALLOON
>>>> + extern atomic_long_t mem_balloon_inflated_free_kb;
>>>> + inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>>>> + inflated >>= (PAGE_SHIFT - 10);
>>>> +#endif
>>>> + return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>>>> +}
>
> So we have here two values and it appears there is a hidden assumption that
> they are both updated atomically. Otherwise, it appears, inflated
> theoretically might be greater that _totalram_pages dn we get negative value
> and all hell breaks loose.
>
> But _totalram_pages and mem_balloon_inflated_free_kb are not updated
> atomically together (each one is, but not together).
>

I do not think that can happen - in that case totalram_pages() is not
adjusted, and you can never inflate more than total RAM.

Yes, they are not set atomically, but see the use cases:

- a driver that does calculations on init.
It will use the notifier to redo the calculations.
The notifier will bring the values and the size of the change to help
the driver decide if it needs to recalculate.

- a user of totalram_pages() that does calculations at run time -
I have to research whether there are any users that could be affected by
the two values not being set atomically, assuming there can be a slight
difference. I.e., do we need precise calculations, or are they
calculating fractions?


>>>> And you are good when you switch to _current version - si_meminfo_current is alike .
>>>>
>>>> On init (probably) all use some kind of fractions to calculate but when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it is just a value and you can not recalculate it. And here, please, share your ideas how to solve this.
>>> I don’t get all of that. Now that you provided some more explanations, it
>>> sounds that what you want is adjust_managed_page_count(), which we already
>>> have and affects the output of totalram_pages(). Therefore, totalram_pages()
>>> anyhow accounts for the balloon memory (excluding VMware’s). So why do we
>>> need to take mem_balloon_inflated_free_kb into account?
>> Ok, you have this:
>> / totalram
>> |----used----|b1|----free------|b2|
>>
>> drivers can inflate both b1 and b2 - b1 free gets smaller, b2 totalram pages get smaller. so when you need totalram_pages() to do a calculation you need to adjust it with the pages that are inflated in free/used (b1). VMWare is not exception , Virtio does the same.
>> And according to to mst and davidh it is okay like this.
>> So i am proposing a way to handle both cases.
>
> Ugh. What about BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can’t this
> information be used instead of yet another counter? Unless, of course, you
> get the atomicity that I mentioned before.

What do you mean by vm-events?


>>> Sounds to me that all you want is some notifier to be called from
>>> adjust_managed_page_count(). What am I missing?
>>
>> Notifier will act as an accumulator to report size of change and it will make things easier for the drivers and users wrt locking.
>> Notifier is similar to the memory hotplug notifier.
>
> Overall, I am not convinced that there is any value of separating the value
> and the notifier. You can batch both or not batch both. In addition, as I
> mentioned, having two values seems racy.

I have identified two users so far above - there may be more to come.
One type needs the value to adjust. Also, having the value is necessary
to report it to users and to the OOM code. There are options with
callbacks and so on, but they would complicate things with no real gain.
You are right about the atomicity, but I guess if that's a problem for
some user, it could find a way to ensure it; I am yet to find such a place.

--
Regards,
Alexander Atanasov

2022-10-11 09:39:44

by David Hildenbrand

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

>>>> Sounds to me that all you want is some notifier to be called from
>>>> adjust_managed_page_count(). What am I missing?
>>>
>>> Notifier will act as an accumulator to report size of change and it will make things easier for the drivers and users wrt locking.
>>> Notifier is similar to the memory hotplug notifier.
>>
>> Overall, I am not convinced that there is any value of separating the value
>> and the notifier. You can batch both or not batch both. In addition, as I
>> mentioned, having two values seems racy.
>
> I have identified two users so far above - may be more to come.
> One type needs the value to adjust. Also having the value is necessary
> to report it to users and oom. There are options with callbacks and so
> on but it will complicate things with no real gain. You are right about
> the atomicity but i guess if that's a problem for some user it could
> find a way to ensure it. i am yet to find such place.
>

I haven't followed the whole discussion, but I just wanted to raise that
having a generic mechanism to notify on such changes could be valuable.

For example, virtio-mem also uses adjust_managed_page_count() and might
sometimes not trigger memory hotplug notifiers when adding more memory
(essentially, when it fake-adds memory part of an already added Linux
memory block).

What might make sense is schedule some kind of deferred notification on
adjust_managed_page_count() changes. This way, we could notify without
caring about locking and would naturally batch notifications.

adjust_managed_page_count() users would not require changes.

--
Thanks,

David / dhildenb

2022-10-14 12:56:31

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Hello,

On 11.10.22 12:23, David Hildenbrand wrote:
>>>>> Sounds to me that all you want is some notifier to be called from
>>>>> adjust_managed_page_count(). What am I missing?
>>>>
>>>> Notifier will act as an accumulator to report size of change and it
>>>> will make things easier for the drivers and users wrt locking.
>>>> Notifier is similar to the memory hotplug notifier.
>>>
>>> Overall, I am not convinced that there is any value of separating the
>>> value
>>> and the notifier. You can batch both or not batch both. In addition,
>>> as I
>>> mentioned, having two values seems racy.
>>
>> I have identified two users so far above - may be more to come.
>> One type needs the value to adjust. Also having the value is necessary
>> to report it to users and oom. There are options with callbacks and so
>> on but it will complicate things with no real gain. You are right about
>> the atomicity but i guess if that's a problem for some user it could
>> find a way to ensure it. i am yet to find such place.
>>
>
> I haven't followed the whole discussion, but I just wanted to raise that
> having a generic mechanism to notify on such changes could be valuable.
>
> For example, virtio-mem also uses adjust_managed_page_count() and might
> sometimes not trigger memory hotplug notifiers when adding more memory
> (essentially, when it fake-adds memory part of an already added Linux
> memory block).
>
> What might make sense is schedule some kind of deferred notification on
> adjust_managed_page_count() changes. This way, we could notify without
> caring about locking and would naturally batch notifications.
>
> adjust_managed_page_count() users would not require changes.

Making it deferred will bring issues for both the users of
adjust_managed_page_count() and the receivers of the notification -
locking first of all. And it is hard to know when the adjustment will
finish; some of the drivers wait and retry in blocks. It will bring
complexity, and it will not be possible to convert users in small steps.

The other problem is that there are drivers that do not use
adjust_managed_page_count().

Extending the current memory hotplug notifier is not an option - too
many things can break.

I am thinking of a new option, CONFIG_DYN_MEM_CONSTRAINTS or a better
name, plus a dynmem_notifier that brings size changes to the subscribers.
CONFIG_DYN_MEM_CONSTRAINTS would make it possible to convert the current
__init functions without growing the kernel in size when it is not used.

One idea about the absolute values in sysfs/proc is to check whether the
option was set by the user or is still the value calculated at init
time - if it was not set by the user, it can be adjusted up/down by the
percentage of the change in memory; it may require min/max value boundaries.

Does anyone know whether it is possible to tell if a sysfs/proc file has
been written to since boot?


--
Regards,
Alexander Atanasov

2022-10-14 13:20:05

by David Hildenbrand

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On 14.10.22 14:50, Alexander Atanasov wrote:
> Hello,
>
> On 11.10.22 12:23, David Hildenbrand wrote:
>>>>>> Sounds to me that all you want is some notifier to be called from
>>>>>> adjust_managed_page_count(). What am I missing?
>>>>>
>>>>> Notifier will act as an accumulator to report size of change and it
>>>>> will make things easier for the drivers and users wrt locking.
>>>>> Notifier is similar to the memory hotplug notifier.
>>>>
>>>> Overall, I am not convinced that there is any value of separating the
>>>> value
>>>> and the notifier. You can batch both or not batch both. In addition,
>>>> as I
>>>> mentioned, having two values seems racy.
>>>
>>> I have identified two users so far above - may be more to come.
>>> One type needs the value to adjust. Also having the value is necessary
>>> to report it to users and oom. There are options with callbacks and so
>>> on but it will complicate things with no real gain. You are right about
>>> the atomicity but i guess if that's a problem for some user it could
>>> find a way to ensure it. i am yet to find such place.
>>>
>>
>> I haven't followed the whole discussion, but I just wanted to raise that
>> having a generic mechanism to notify on such changes could be valuable.
>>
>> For example, virtio-mem also uses adjust_managed_page_count() and might
>> sometimes not trigger memory hotplug notifiers when adding more memory
>> (essentially, when it fake-adds memory part of an already added Linux
>> memory block).
>>
>> What might make sense is schedule some kind of deferred notification on
>> adjust_managed_page_count() changes. This way, we could notify without
>> caring about locking and would naturally batch notifications.
>>
>> adjust_managed_page_count() users would not require changes.
>
> Making it deferred will bring issues for both the users of the
> adjust_managed_page_count and the receivers of the notification -
> locking as first. And it is hard to know when the adjustment will
> finish, some of the drivers wait and retry in blocks. It will bring
> complexity and it will not be possible to convert users in small steps.

What exactly is the issue about handling that deferred? Who needs an
immediate, 100% precise notification?

Locking from a separate workqueue shouldn't be too hard, or what am I
missing?

>
> Other problem is that there are drivers that do not use
> adjust_managed_page_count().

Which ones? Do we care?

--
Thanks,

David / dhildenb

2022-10-14 13:52:52

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

On 14.10.22 16:01, David Hildenbrand wrote:
> On 14.10.22 14:50, Alexander Atanasov wrote:
>> Hello,
>>
>> On 11.10.22 12:23, David Hildenbrand wrote:
>>>>>>> Sounds to me that all you want is some notifier to be called from
>>>>>>> adjust_managed_page_count(). What am I missing?
>>>>>>
>>>>>> Notifier will act as an accumulator to report size of change and it
>>>>>> will make things easier for the drivers and users wrt locking.
>>>>>> Notifier is similar to the memory hotplug notifier.
>>>>>
>>>>> Overall, I am not convinced that there is any value of separating the
>>>>> value
>>>>> and the notifier. You can batch both or not batch both. In addition,
>>>>> as I
>>>>> mentioned, having two values seems racy.
>>>>
>>>> I have identified two users so far above - may be more to come.
>>>> One type needs the value to adjust. Also having the value is necessary
>>>> to report it to users and oom. There are options with callbacks and so
>>>> on but it will complicate things with no real gain. You are right about
>>>> the atomicity but i guess if that's a problem for some user it could
>>>> find a way to ensure it. i am yet to find such place.
>>>>
>>>
>>> I haven't followed the whole discussion, but I just wanted to raise that
>>> having a generic mechanism to notify on such changes could be valuable.
>>>
>>> For example, virtio-mem also uses adjust_managed_page_count() and might
>>> sometimes not trigger memory hotplug notifiers when adding more memory
>>> (essentially, when it fake-adds memory part of an already added Linux
>>> memory block).
>>>
>>> What might make sense is schedule some kind of deferred notification on
>>> adjust_managed_page_count() changes. This way, we could notify without
>>> caring about locking and would naturally batch notifications.
>>>
>>> adjust_managed_page_count() users would not require changes.
>>
>> Making it deferred will bring issues for both the users of the
>> adjust_managed_page_count and the receivers of the notification -
>> locking as first. And it is hard to know when the adjustment will
>> finish, some of the drivers wait and retry in blocks. It will bring
>> complexity and it will not be possible to convert users in small steps.
>
> What exactly is the issue about handling that deferred? Who needs an
> immediate, 100% precise notification?
>
> Locking from a separate workqueue shouldn't be too hard, or what am i
> missing?
>

We do not need it to be immediate, but most of the current callers of
adjust_managed_page_count() work in +1/-1 updates - so we want to defer
the notification until they are done with their changes. Deferring to a
workqueue is not the problem; it would most likely need to be done anyway.


>>
>> Other problem is that there are drivers that do not use
>> adjust_managed_page_count().
>
> Which ones? Do we care?

The VMware and virtio balloon drivers. I recently proposed unifying
them, and the objection was that it would break existing users - which
is valid, so we must care, I guess.

--
Regards,
Alexander Atanasov

2022-10-14 14:08:58

by David Hildenbrand

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

>>>
>>> Other problem is that there are drivers that do not use
>>> adjust_managed_page_count().
>>
>> Which ones? Do we care?
>
> VMWare and Virtio balloon drivers. I recently proposed to unify them and
> the objection was that it would break existing users - which is valid so
> we must care i guess.

I'm confused, I think we care about actual adjustment of the total pages
available here, that we want to notify the system about. These
approaches (vmware, virtio-balloon with deflate-on-oom) don't adjust
totalpages, because the assumption is that we can get back the inflated
memory any time we really need it automatically.

--
Thanks,

David / dhildenb

2022-10-14 14:18:59

by Alexander Atanasov

Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

Hello,

On 14.10.22 16:40, David Hildenbrand wrote:
>>>>
>>>> Other problem is that there are drivers that do not use
>>>> adjust_managed_page_count().
>>>
>>> Which ones? Do we care?
>>
>> VMWare and Virtio balloon drivers. I recently proposed to unify them and
>> the objection was that it would break existing users - which is valid so
>> we must care i guess.
>
> I'm confused, I think we care about actual adjustment of the total pages
> available here, that we want to notify the system about. These
> approaches (vmware, virtio-balloon with deflate-on-oom) don't adjust
> totalpages, because the assumption is that we can get back the inflated
> memory any time we really need it automatically.

We want to notify about the actual adjustment of available pages, no
matter whether they are accounted as free or total. Users don't care
where the RAM came from or has gone. They need the total change, so they
can decide whether they need to recalculate.

The example I wrote earlier:
Kernel boots with 4GB.
Balloon takes back 2GB.
eventpoll allocated 4% of the total memory at boot.
For simpler math: after total RAM is reduced to 2GB, that 4% effectively
becomes 8% of the total RAM.
We want to tell eventpoll that there is a 2GB change in total RAM, so it
can update to really use 4%.

The reverse direction is true too - you hotplug 4GB and the 4% becomes
just 2%, so you are not using the newly available RAM optimally.

And it is not only about eventpoll.

When these allocations/limits/etc. are not recalculated/updated, the
balance of memory usage between userspace and the kernel gets a bit
off - and I think not a bit, but way off.

About the assumption that drivers can get back the RAM at any time - if
they use the oom_notifier, they can update totalpages without a problem.
The OOM killer doesn't check totalram_pages() first; it tries to free
memory via the notifier, and only if that fails is totalram_pages() consulted.

--
Regards,
Alexander Atanasov