2020-05-29 01:10:03

by Feng Tang

Subject: [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy

When checking a performance change for the will-it-scale scalability
mmap test [1], we found very high lock contention on the spinlock of
the percpu counter 'vm_committed_as':

94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch
number for the percpu counter.

So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy,
and enlarge it for not-so-strict OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS
policies.
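
For reference, the batching this series tunes works roughly as below
(a simplified sketch based on percpu_counter_add_batch() in
lib/percpu_counter.c; details differ across kernel versions). The
global spinlock is only taken when a CPU's local delta crosses the
+/-batch threshold, which is why a larger batch directly cuts the
_raw_spin_lock_irqsave contention shown in the profile above:

	/* Simplified sketch, not the verbatim kernel code */
	void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
				      s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (count >= batch || count <= -batch) {
			unsigned long flags;

			/* slow path: fold local delta into global count */
			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			/* fast path: stay lock-free on this CPU */
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}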

Benchmarking with the same testcase as [1] shows a 53% improvement on
an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. For that
testcase, whether a platform shows improvement depends on whether the
test's mmap size is bigger than the computed batch number.

We tested 10+ platforms in 0day (server, desktop and laptop). If we
lift it to 64X, 80%+ of the platforms show improvements; with a 16X
lift, only 1/3 of the platforms do.
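
As a worked example (a hypothetical 16GB, 16-CPU machine, not one of
the 0day boxes): with the 64X lift, the computed batch would be about

	totalram_pages = 16GB / 4KB = 4,194,304 pages
	batch = 4,194,304 / 16 cpus / 4 = 65,536 pages (~256MB)

so per-thread mmap/munmap of regions below roughly that size stays on
the lock-free percpu fast path.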

And generally it should help mmap/munmap usage, as Michal Hocko
mentioned:
"
I believe that there are non-synthetic workloads which would benefit
from a larger batch. E.g. large in memory databases which do large
mmaps during startups from multiple threads.
"

Note: there are some checkpatch style complaints about patch 4, as
the sysctl handler declaration follows the same format as its sibling
functions.

patch1: a cleanup for /proc/meminfo
patch2: a preparation patch which also improves the accuracy of
vm_memory_committed()
patch3: remove the VM_WARN_ONCE for vm_committed_as underflow check
patch4: main change

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

This is against today's linux-mm git tree on github.

Please help to review, thanks!

- Feng

----------------------------------------------------------------
Changelog:

v4:
* Remove the VM_WARN_ONCE check for vm_committed_as underflow,
thanks to Qian Cai for finding and testing the warning

v3:
* refine the commit log and clean up code, per comments
from Michal Hocko and Matthew Wilcox
* change the lift from 16X to 64X after testing

v2:
* add the sysctl handler to cover runtime overcommit policy
change, as suggested by Andrew Morton
* address the accuracy concern of vm_memory_committed()
from Andi Kleen


Feng Tang (4):
proc/meminfo: avoid open coded reading of vm_committed_as
mm/util.c: make vm_memory_committed() more accurate
mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
mm: adjust vm_committed_as_batch according to vm overcommit policy

fs/proc/meminfo.c | 2 +-
include/linux/mm.h | 2 ++
include/linux/mman.h | 4 ++++
kernel/sysctl.c | 2 +-
mm/mm_init.c | 18 ++++++++++++++----
mm/util.c | 22 +++++++++++++---------
6 files changed, 35 insertions(+), 15 deletions(-)

--
2.7.4


2020-05-29 01:10:57

by Feng Tang

Subject: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate

percpu_counter_sum_positive() will provide more accurate info.

With percpu_counter_read_positive(), the worst-case deviation could
be 'batch * nr_cpus', which is totalram_pages/256 for now, and will
grow as the batch gets enlarged.
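
The 'totalram_pages/256' figure follows from the current batch
formula in mm_compute_batch():

	batch ~= totalram_pages / nr_cpus / 256

and since each CPU can hold up to about 'batch' pages in its local
counter before folding them into the global count, the worst-case
deviation is

	batch * nr_cpus ~= totalram_pages / 256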

Its time cost is about 800 nanoseconds on a 2C/4T platform and
2~3 microseconds on a 2S/36C/72T server in the normal case; in the
worst case, where vm_committed_as's spinlock is under severe
contention, it costs 30~40 microseconds on the 2S/36C/72T server.
This should be fine for its only two users: /proc/meminfo and the
HyperV balloon driver's once-per-second status trace.

Signed-off-by: Feng Tang <[email protected]>
---
mm/util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 9b3be03..3c7a08c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -790,7 +790,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
*/
unsigned long vm_memory_committed(void)
{
- return percpu_counter_read_positive(&vm_committed_as);
+ return percpu_counter_sum_positive(&vm_committed_as);
}
EXPORT_SYMBOL_GPL(vm_memory_committed);

--
2.7.4

2020-05-29 01:11:53

by Feng Tang

Subject: [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy

When checking a performance change for the will-it-scale scalability
mmap test [1], we found very high lock contention on the spinlock of
the percpu counter 'vm_committed_as':

94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.

So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
add a sysctl handler to adjust it when the policy is reconfigured.
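
With the handler wired up, a runtime policy change such as

	echo 2 > /proc/sys/vm/overcommit_memory

(i.e. switching to OVERCOMMIT_NEVER) immediately recomputes the batch
back to the small, precise value, and switching away enlarges it
again.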

Benchmarking with the same testcase as [1] shows a 53% improvement on
an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested
with 0day's test platforms (server, desktop and laptop), and 80%+ of
them show improvements with that test. Whether a platform shows
improvement depends on whether the test's mmap size is bigger than
the computed batch number.

With a 16X lift, only 1/3 of the platforms show improvements, though
it should help mmap/munmap usage generally, as Michal Hocko
mentioned:

: I believe that there are non-synthetic workloads which would benefit from
: a larger batch. E.g. large in memory databases which do large mmaps
: during startups from multiple threads.

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/mm.h | 2 ++
include/linux/mman.h | 4 ++++
kernel/sysctl.c | 2 +-
mm/mm_init.c | 18 ++++++++++++++----
mm/util.c | 12 ++++++++++++
5 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 573947c..c2efea6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -206,6 +206,8 @@ int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
+int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
+ loff_t *);

#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c..91c93c1 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;

#ifdef CONFIG_SMP
extern s32 vm_committed_as_batch;
+extern void mm_compute_batch(void);
#else
#define vm_committed_as_batch 0
+static inline void mm_compute_batch(void)
+{
+}
#endif

unsigned long vm_memory_committed(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index db1ce7a..9456c86 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2650,7 +2650,7 @@ static struct ctl_table vm_table[] = {
.data = &sysctl_overcommit_memory,
.maxlen = sizeof(sysctl_overcommit_memory),
.mode = 0644,
- .proc_handler = proc_dointvec_minmax,
+ .proc_handler = overcommit_policy_handler,
.extra1 = SYSCTL_ZERO,
.extra2 = &two,
},
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 435e5f7..c5a6fb1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -13,6 +13,7 @@
#include <linux/memory.h>
#include <linux/notifier.h>
#include <linux/sched.h>
+#include <linux/mman.h>
#include "internal.h"

#ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -144,14 +145,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
#ifdef CONFIG_SMP
s32 vm_committed_as_batch = 32;

-static void __meminit mm_compute_batch(void)
+void mm_compute_batch(void)
{
u64 memsized_batch;
s32 nr = num_present_cpus();
s32 batch = max_t(s32, nr*2, 32);
-
- /* batch size set to 0.4% of (total memory/#cpus), or max int32 */
- memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
+ unsigned long ram_pages = totalram_pages();
+
+ /*
+ * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
+ * of (total memory/#cpus), and lift it to 25% for other
+ * policies to ease the possible lock contention for percpu_counter
+ * vm_committed_as, while the max limit is INT_MAX
+ */
+ if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+ memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
+ else
+ memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);

vm_committed_as_batch = max_t(s32, memsized_batch, batch);
}
diff --git a/mm/util.c b/mm/util.c
index fe63271..580d268 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,6 +746,18 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
return ret;
}

+int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (ret == 0 && write)
+ mm_compute_batch();
+
+ return ret;
+}
+
int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos)
{
--
2.7.4

2020-06-03 13:39:53

by Michal Hocko

Subject: Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate

On Fri 29-05-20 09:06:08, Feng Tang wrote:
> percpu_counter_sum_positive() will provide more accurate info.
>
> With percpu_counter_read_positive(), the worst-case deviation could
> be 'batch * nr_cpus', which is totalram_pages/256 for now, and will
> grow as the batch gets enlarged.
>
> Its time cost is about 800 nanoseconds on a 2C/4T platform and
> 2~3 microseconds on a 2S/36C/72T server in the normal case; in the
> worst case, where vm_committed_as's spinlock is under severe
> contention, it costs 30~40 microseconds on the 2S/36C/72T server.
> This should be fine for its only two users: /proc/meminfo and the
> HyperV balloon driver's once-per-second status trace.
>
> Signed-off-by: Feng Tang <[email protected]>

I cannot speak for the HyperV part (Cc'ing maintainers), but this
shouldn't be a problem for meminfo.

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/util.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 9b3be03..3c7a08c 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -790,7 +790,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
> */
> unsigned long vm_memory_committed(void)
> {
> - return percpu_counter_read_positive(&vm_committed_as);
> + return percpu_counter_sum_positive(&vm_committed_as);
> }
> EXPORT_SYMBOL_GPL(vm_memory_committed);
>
> --
> 2.7.4
>

--
Michal Hocko
SUSE Labs

2020-06-03 13:41:02

by Michal Hocko

Subject: Re: [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy

On Fri 29-05-20 09:06:10, Feng Tang wrote:
> When checking a performance change for the will-it-scale scalability
> mmap test [1], we found very high lock contention on the spinlock of
> the percpu counter 'vm_committed_as':
>
> 94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> 45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
>
> Actually this heavy lock contention is not always necessary. The
> 'vm_committed_as' needs to be very precise when the strict
> OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
> for the percpu counter.
>
> So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
> lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
> add a sysctl handler to adjust it when the policy is reconfigured.
>
> Benchmarking with the same testcase as [1] shows a 53% improvement on
> an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested
> with 0day's test platforms (server, desktop and laptop), and 80%+ of
> them show improvements with that test. Whether a platform shows
> improvement depends on whether the test's mmap size is bigger than
> the computed batch number.
>
> With a 16X lift, only 1/3 of the platforms show improvements, though
> it should help mmap/munmap usage generally, as Michal Hocko
> mentioned:
>
> : I believe that there are non-synthetic workloads which would benefit from
> : a larger batch. E.g. large in memory databases which do large mmaps
> : during startups from multiple threads.
>
> [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Feng Tang <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/mm.h | 2 ++
> include/linux/mman.h | 4 ++++
> kernel/sysctl.c | 2 +-
> mm/mm_init.c | 18 ++++++++++++++----
> mm/util.c | 12 ++++++++++++
> 5 files changed, 33 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 573947c..c2efea6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -206,6 +206,8 @@ int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *,
> loff_t *);
> int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
> loff_t *);
> +int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
> + loff_t *);
>
> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c..91c93c1 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
>
> #ifdef CONFIG_SMP
> extern s32 vm_committed_as_batch;
> +extern void mm_compute_batch(void);
> #else
> #define vm_committed_as_batch 0
> +static inline void mm_compute_batch(void)
> +{
> +}
> #endif
>
> unsigned long vm_memory_committed(void);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index db1ce7a..9456c86 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2650,7 +2650,7 @@ static struct ctl_table vm_table[] = {
> .data = &sysctl_overcommit_memory,
> .maxlen = sizeof(sysctl_overcommit_memory),
> .mode = 0644,
> - .proc_handler = proc_dointvec_minmax,
> + .proc_handler = overcommit_policy_handler,
> .extra1 = SYSCTL_ZERO,
> .extra2 = &two,
> },
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 435e5f7..c5a6fb1 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -13,6 +13,7 @@
> #include <linux/memory.h>
> #include <linux/notifier.h>
> #include <linux/sched.h>
> +#include <linux/mman.h>
> #include "internal.h"
>
> #ifdef CONFIG_DEBUG_MEMORY_INIT
> @@ -144,14 +145,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
> #ifdef CONFIG_SMP
> s32 vm_committed_as_batch = 32;
>
> -static void __meminit mm_compute_batch(void)
> +void mm_compute_batch(void)
> {
> u64 memsized_batch;
> s32 nr = num_present_cpus();
> s32 batch = max_t(s32, nr*2, 32);
> -
> - /* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> - memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> + unsigned long ram_pages = totalram_pages();
> +
> + /*
> + * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> + * of (total memory/#cpus), and lift it to 25% for other
> + * policies to ease the possible lock contention for percpu_counter
> + * vm_committed_as, while the max limit is INT_MAX
> + */
> + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> + memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> + else
> + memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);
>
> vm_committed_as_batch = max_t(s32, memsized_batch, batch);
> }
> diff --git a/mm/util.c b/mm/util.c
> index fe63271..580d268 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -746,6 +746,18 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
> return ret;
> }
>
> +int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
> + size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> +
> + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> + if (ret == 0 && write)
> + mm_compute_batch();
> +
> + return ret;
> +}
> +
> int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
> size_t *lenp, loff_t *ppos)
> {
> --
> 2.7.4
>

--
Michal Hocko
SUSE Labs

2020-06-03 14:31:07

by Andi Kleen

Subject: Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate

> Its time cost is about 800 nanoseconds on a 2C/4T platform and
> 2~3 microseconds on a 2S/36C/72T server in the normal case; in the
> worst case, where vm_committed_as's spinlock is under severe
> contention, it costs 30~40 microseconds on the 2S/36C/72T server.

This will likely be 40-80us on larger systems, although the overhead
is often non-linear, so it might get worse.

> This should be fine for its only two users: /proc/meminfo and the
> HyperV balloon driver's once-per-second status trace.

There are some setups that do frequent sampling of /proc/meminfo
in the background. Increased overhead could be a problem for them.
Not proposing a change now; if someone complains we'll have to
revisit, perhaps adding a rate limit of some sort.
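
A minimal sketch of what such a rate limit could look like (purely
hypothetical, not part of this series; the function name and the
HZ/10 interval are made up for illustration):

	/*
	 * Hypothetical rate-limited reader: refresh the expensive
	 * percpu sum at most ~10 times a second and serve a cached
	 * value otherwise. Races on the statics are benign for a
	 * monitoring-only number.
	 */
	static unsigned long vm_memory_committed_cached(void)
	{
		static unsigned long cached;
		static unsigned long last_update;

		if (!cached || time_after(jiffies, last_update + HZ / 10)) {
			cached = percpu_counter_sum_positive(&vm_committed_as);
			last_update = jiffies;
		}
		return cached;
	}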

-Andi

2020-06-04 01:55:22

by Feng Tang

Subject: Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate

On Wed, Jun 03, 2020 at 07:28:53AM -0700, Andi Kleen wrote:
> > Its time cost is about 800 nanoseconds on a 2C/4T platform and
> > 2~3 microseconds on a 2S/36C/72T server in the normal case; in the
> > worst case, where vm_committed_as's spinlock is under severe
> > contention, it costs 30~40 microseconds on the 2S/36C/72T server.
>
> This will likely be 40-80us on larger systems, although the overhead
> is often non-linear, so it might get worse.
>
> > This should be fine for its only two users: /proc/meminfo and the
> > HyperV balloon driver's once-per-second status trace.
>
> There are some setups that do frequent sampling of /proc/meminfo
> in the background. Increased overhead could be a problem for them.
> Not proposing a change now; if someone complains we'll have to
> revisit, perhaps adding a rate limit of some sort.

Agreed. Maybe I should also put the time cost info into a code
comment, in case someone notices the slowdown.

Thanks,
Feng

>
> -Andi