Hi,
On Thu, Jul 02, 2020 at 02:32:01PM +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 4e2c82a40911c19419349918e675aa202b113b4d ("[PATCH v5 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy")
> url: https://github.com/0day-ci/linux/commits/Feng-Tang/make-vm_committed_as_batch-aware-of-vm-overcommit-policy/20200621-153906
>
>
> in testcase: ltp
> with following parameters:
>
> disk: 1HDD
> test: mm-01
>
> test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
> test-url: http://linux-test-project.github.io/
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
>
> <<<test_start>>>
> tag=overcommit_memory01 stime=1593425044
> cmdline="overcommit_memory"
> contacts=""
> analysis=exit
> <<<test_output>>>
> tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
> overcommit_memory.c:116: INFO: MemTotal is 16394252 kB
> overcommit_memory.c:118: INFO: SwapTotal is 268435452 kB
> overcommit_memory.c:122: INFO: CommitLimit is 276632576 kB
> mem.c:817: INFO: set overcommit_ratio to 50
> mem.c:817: INFO: set overcommit_memory to 2
> overcommit_memory.c:187: INFO: malloc 551061440 kB failed
> overcommit_memory.c:208: PASS: alloc failed as expected
> overcommit_memory.c:183: INFO: malloc 276632576 kB successfully
> overcommit_memory.c:210: FAIL: alloc passed, expected to fail
Thanks for the report!
I took a rough look, and it all happens after changing the
overcommit policy from a looser one to OVERCOMMIT_NEVER. I suspect
it is due to the same cause as the previous warning message reported
by Qian Cai: https://lore.kernel.org/lkml/[email protected]/

Will check it further.
Thanks,
Feng
> overcommit_memory.c:183: INFO: malloc 137765294 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> mem.c:817: INFO: set overcommit_memory to 0
> overcommit_memory.c:183: INFO: malloc 140770308 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:187: INFO: malloc 569659408 kB failed
> overcommit_memory.c:208: PASS: alloc failed as expected
> mem.c:817: INFO: set overcommit_memory to 1
> overcommit_memory.c:183: INFO: malloc 142414852 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:183: INFO: malloc 284829704 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:183: INFO: malloc 569659408 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> mem.c:817: INFO: set overcommit_memory to 0
> mem.c:817: INFO: set overcommit_ratio to 50
>
> Summary:
> passed 7
> failed 1
> skipped 0
> warnings 0
> <<<execution_status>>>
> initiation_status="ok"
> duration=0 termination_type=exited termination_id=1 corefile=no
> cutime=0 cstime=1
> <<<test_end>>>
> <<<test_start>>>
> tag=overcommit_memory02 stime=1593425044
> cmdline="overcommit_memory -R 0"
> contacts=""
> analysis=exit
> <<<test_output>>>
> tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
> overcommit_memory.c:116: INFO: MemTotal is 16394252 kB
> overcommit_memory.c:118: INFO: SwapTotal is 268435452 kB
> overcommit_memory.c:122: INFO: CommitLimit is 276632576 kB
> mem.c:817: INFO: set overcommit_ratio to 0
> mem.c:817: INFO: set overcommit_memory to 2
> overcommit_memory.c:187: INFO: malloc 534667184 kB failed
> overcommit_memory.c:208: PASS: alloc failed as expected
> overcommit_memory.c:183: INFO: malloc 268435452 kB successfully
> overcommit_memory.c:210: FAIL: alloc passed, expected to fail
> overcommit_memory.c:183: INFO: malloc 133666730 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> mem.c:817: INFO: set overcommit_memory to 0
> overcommit_memory.c:183: INFO: malloc 140770304 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:187: INFO: malloc 569659408 kB failed
> overcommit_memory.c:208: PASS: alloc failed as expected
> mem.c:817: INFO: set overcommit_memory to 1
> overcommit_memory.c:183: INFO: malloc 142414852 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:183: INFO: malloc 284829704 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> overcommit_memory.c:183: INFO: malloc 569659408 kB successfully
> overcommit_memory.c:202: PASS: alloc passed as expected
> mem.c:817: INFO: set overcommit_memory to 0
> mem.c:817: INFO: set overcommit_ratio to 50
>
On Thu, Jul 02, 2020 at 03:12:30PM +0800, Feng Tang wrote:
> Hi,
>
> On Thu, Jul 02, 2020 at 02:32:01PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 4e2c82a40911c19419349918e675aa202b113b4d ("[PATCH v5 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy")
> > url: https://github.com/0day-ci/linux/commits/Feng-Tang/make-vm_committed_as_batch-aware-of-vm-overcommit-policy/20200621-153906
> >
> >
> > in testcase: ltp
> > with following parameters:
> >
> > disk: 1HDD
> > test: mm-01
> >
> > test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
> > test-url: http://linux-test-project.github.io/
> >
> >
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> >
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot <[email protected]>
> >
> >
> >
> > <<<test_start>>>
> > tag=overcommit_memory01 stime=1593425044
> > cmdline="overcommit_memory"
> > contacts=""
> > analysis=exit
> > <<<test_output>>>
> > tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
> > overcommit_memory.c:116: INFO: MemTotal is 16394252 kB
> > overcommit_memory.c:118: INFO: SwapTotal is 268435452 kB
> > overcommit_memory.c:122: INFO: CommitLimit is 276632576 kB
> > mem.c:817: INFO: set overcommit_ratio to 50
> > mem.c:817: INFO: set overcommit_memory to 2
> > overcommit_memory.c:187: INFO: malloc 551061440 kB failed
> > overcommit_memory.c:208: PASS: alloc failed as expected
> > overcommit_memory.c:183: INFO: malloc 276632576 kB successfully
> > overcommit_memory.c:210: FAIL: alloc passed, expected to fail
>
> Thanks for the report!
>
> I took a rough look, and it all happens after changing the
> overcommit policy from a looser one to OVERCOMMIT_NEVER. I suspect
> it is due to the same cause as the previous warning message reported
> by Qian Cai: https://lore.kernel.org/lkml/[email protected]/
Hmm, the LTP test [1] looks like a faithful implementation of
Documentation/vm/overcommit-accounting.rst, which is now failing because
of this patchset.

Also, it was a mistake to merge c571686a92ff ("mm/util.c: remove the
VM_WARN_ONCE for vm_committed_as underflow check") separately (I take
the blame for ACKing it separately and forgetting to run those tests to
double-check earlier), which now makes me wonder whether that
VM_WARN_ONCE was actually legitimate to catch signs of brokenness in
the first place.
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
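
For reference, the removed check (in __vm_enough_memory() in mm/util.c,
if I recall correctly) was:

	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
		     -(s64)vm_committed_as_batch * num_online_cpus(),
		     "memory commitment underflow");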
>
> Will check it further.
>
> Thanks,
> Feng
>
> > overcommit_memory.c:183: INFO: malloc 137765294 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > mem.c:817: INFO: set overcommit_memory to 0
> > overcommit_memory.c:183: INFO: malloc 140770308 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:187: INFO: malloc 569659408 kB failed
> > overcommit_memory.c:208: PASS: alloc failed as expected
> > mem.c:817: INFO: set overcommit_memory to 1
> > overcommit_memory.c:183: INFO: malloc 142414852 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:183: INFO: malloc 284829704 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:183: INFO: malloc 569659408 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > mem.c:817: INFO: set overcommit_memory to 0
> > mem.c:817: INFO: set overcommit_ratio to 50
> >
> > Summary:
> > passed 7
> > failed 1
> > skipped 0
> > warnings 0
> > <<<execution_status>>>
> > initiation_status="ok"
> > duration=0 termination_type=exited termination_id=1 corefile=no
> > cutime=0 cstime=1
> > <<<test_end>>>
> > <<<test_start>>>
> > tag=overcommit_memory02 stime=1593425044
> > cmdline="overcommit_memory -R 0"
> > contacts=""
> > analysis=exit
> > <<<test_output>>>
> > tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
> > overcommit_memory.c:116: INFO: MemTotal is 16394252 kB
> > overcommit_memory.c:118: INFO: SwapTotal is 268435452 kB
> > overcommit_memory.c:122: INFO: CommitLimit is 276632576 kB
> > mem.c:817: INFO: set overcommit_ratio to 0
> > mem.c:817: INFO: set overcommit_memory to 2
> > overcommit_memory.c:187: INFO: malloc 534667184 kB failed
> > overcommit_memory.c:208: PASS: alloc failed as expected
> > overcommit_memory.c:183: INFO: malloc 268435452 kB successfully
> > overcommit_memory.c:210: FAIL: alloc passed, expected to fail
> > overcommit_memory.c:183: INFO: malloc 133666730 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > mem.c:817: INFO: set overcommit_memory to 0
> > overcommit_memory.c:183: INFO: malloc 140770304 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:187: INFO: malloc 569659408 kB failed
> > overcommit_memory.c:208: PASS: alloc failed as expected
> > mem.c:817: INFO: set overcommit_memory to 1
> > overcommit_memory.c:183: INFO: malloc 142414852 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:183: INFO: malloc 284829704 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > overcommit_memory.c:183: INFO: malloc 569659408 kB successfully
> > overcommit_memory.c:202: PASS: alloc passed as expected
> > mem.c:817: INFO: set overcommit_memory to 0
> > mem.c:817: INFO: set overcommit_ratio to 50
> >
>
On Thu, Jul 02, 2020 at 03:12:30PM +0800, Feng Tang wrote:
> > <<<test_start>>>
> > tag=overcommit_memory01 stime=1593425044
> > cmdline="overcommit_memory"
> > contacts=""
> > analysis=exit
> > <<<test_output>>>
> > tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
> > overcommit_memory.c:116: INFO: MemTotal is 16394252 kB
> > overcommit_memory.c:118: INFO: SwapTotal is 268435452 kB
> > overcommit_memory.c:122: INFO: CommitLimit is 276632576 kB
> > mem.c:817: INFO: set overcommit_ratio to 50
> > mem.c:817: INFO: set overcommit_memory to 2
> > overcommit_memory.c:187: INFO: malloc 551061440 kB failed
> > overcommit_memory.c:208: PASS: alloc failed as expected
> > overcommit_memory.c:183: INFO: malloc 276632576 kB successfully
> > overcommit_memory.c:210: FAIL: alloc passed, expected to fail
>
> Thanks for the report!
>
> I took a rough look, and it all happens after changing the
> overcommit policy from a looser one to OVERCOMMIT_NEVER. I suspect
> it is due to the same cause as the previous warning message reported
> by Qian Cai: https://lore.kernel.org/lkml/[email protected]/
>
> Will check it further.
I did reproduce the problem, and from the debugging, this should
be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
enlarging the batch loosens the counting accuracy, and the solution of
adding some sync is still needed, as discussed in that thread.
The first thing I tried was a simple patch using percpu_counter_sum(),
and with it the problem can't be reproduced:
--- a/mm/util.c
+++ b/mm/util.c
@@ -845,7 +845,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
 	}
 
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+	if (percpu_counter_sum(&vm_committed_as) < allowed)
 		return 0;
 error:
 	vm_unacct_memory(pages);
Then I tried the sync patch we discussed one month ago:
https://lore.kernel.org/lkml/[email protected]/

With it, I ran the case 200 times and the problem was not reproduced.
Can we consider taking this patch?
Thanks,
Feng
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 0a4f54d..01861ee 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -44,6 +44,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_sync(struct percpu_counter *fbc);
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
 {
@@ -172,6 +173,9 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 	return true;
 }
 
+static inline void percpu_counter_sync(struct percpu_counter *fbc)
+{
+}
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index a66595b..d025137 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -98,6 +98,20 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 }
 EXPORT_SYMBOL(percpu_counter_add_batch);
 
+void percpu_counter_sync(struct percpu_counter *fbc)
+{
+	unsigned long flags;
+	s64 count;
+
+	raw_spin_lock_irqsave(&fbc->lock, flags);
+	count = __this_cpu_read(*fbc->counters);
+	fbc->count += count;
+	__this_cpu_sub(*fbc->counters, count);
+	raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_sync);
+
+
 /*
  * Add up all the per-cpu counts, return the result. This is a more accurate
  * but much slower version of percpu_counter_read_positive()
diff --git a/mm/util.c b/mm/util.c
index 98813da..8b9664e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,14 +746,23 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
 	return ret;
 }
 
+static void sync_overcommit_as(struct work_struct *dummy)
+{
+	percpu_counter_sync(&vm_committed_as);
+}
+
 int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
 	int ret;
 
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
-	if (ret == 0 && write)
+	if (ret == 0 && write) {
+		if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+			schedule_on_each_cpu(sync_overcommit_as);
+
 		mm_compute_batch();
+	}
 
 	pr_info("ocommit=%lld, real=%lld policy[%d] ratio=%d\n\n\n",
 		percpu_counter_read_positive(&vm_committed_as),
> On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
>
> I did reproduce the problem, and from the debugging, this should
> be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> enlarging the batch loosens the counting accuracy, and the solution of
> adding some sync is still needed, as discussed in that thread.
Well, before taking any of those patches to fix the regression, we
will need some performance data first. If it turns out the original
performance gain is no longer relevant due to this regression fix on
top, it is best to drop this patchset and restore that VM_WARN_ONCE,
so you can retry later once you have found a better way to optimize.
On Sun, Jul 05, 2020 at 08:15:03AM -0400, Qian Cai wrote:
>
>
> > On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
> >
> > I did reproduce the problem, and from the debugging, this should
> > be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> > enlarging the batch loosens the counting accuracy, and the solution of
> > adding some sync is still needed, as discussed in that thread.
>
> Well, before taking any of those patches to fix the regression, we
> will need some performance data first. If it turns out the original
> performance gain is no longer relevant due to this regression fix on
> top, it is best to drop this patchset and restore that VM_WARN_ONCE,
> so you can retry later once you have found a better way to optimize.
The fix of adding a sync only happens when the memory policy is being
changed to OVERCOMMIT_NEVER, which is not a frequent operation in
normal cases.

The performance improvement data, both in the commit log and in the
0day report https://lore.kernel.org/lkml/20200622132548.GS5535@shao2-debian/,
is for will-it-scale's mmap testcase, which does not change the memory
overcommit policy at runtime, so the data should still be valid with
this fix.
Thanks,
Feng
On Sun, Jul 05, 2020 at 08:58:54PM +0800, Feng Tang wrote:
> On Sun, Jul 05, 2020 at 08:15:03AM -0400, Qian Cai wrote:
> >
> >
> > > On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
> > >
> > > I did reproduce the problem, and from the debugging, this should
> > > be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> > > enlarging the batch loosens the counting accuracy, and the solution of
> > > adding some sync is still needed, as discussed in that thread.
> >
> > Well, before taking any of those patches to fix the regression, we
> > will need some performance data first. If it turns out the original
> > performance gain is no longer relevant due to this regression fix on
> > top, it is best to drop this patchset and restore that VM_WARN_ONCE,
> > so you can retry later once you have found a better way to optimize.
>
> The fix of adding a sync only happens when the memory policy is being
> changed to OVERCOMMIT_NEVER, which is not a frequent operation in
> normal cases.
>
> The performance improvement data, both in the commit log and in the
> 0day report https://lore.kernel.org/lkml/20200622132548.GS5535@shao2-debian/,
> is for will-it-scale's mmap testcase, which does not change the memory
> overcommit policy at runtime, so the data should still be valid with
> this fix.
Well, I would expect it is perfectly reasonable for people to use
OVERCOMMIT_NEVER for some workloads, making it a more frequent operation.

The question now is whether any of those regression fixes would regress
the performance of OVERCOMMIT_NEVER workloads, or stay on par with the
data from before the patchset.

Given how much churn this patchset has had recently, I would think
"should be still valid" is not really the answer we are looking for.
>
> Thanks,
> Feng
>
>
On Sun, Jul 05, 2020 at 11:52:32AM -0400, Qian Cai wrote:
> On Sun, Jul 05, 2020 at 08:58:54PM +0800, Feng Tang wrote:
> > On Sun, Jul 05, 2020 at 08:15:03AM -0400, Qian Cai wrote:
> > >
> > >
> > > > On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
> > > >
> > > > I did reproduce the problem, and from the debugging, this should
> > > > be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> > > > enlarging the batch loosens the counting accuracy, and the solution of
> > > > adding some sync is still needed, as discussed in that thread.
> > >
> > > Well, before taking any of those patches to fix the regression, we
> > > will need some performance data first. If it turns out the original
> > > performance gain is no longer relevant due to this regression fix on
> > > top, it is best to drop this patchset and restore that VM_WARN_ONCE,
> > > so you can retry later once you have found a better way to optimize.
> >
> > The fix of adding a sync only happens when the memory policy is being
> > changed to OVERCOMMIT_NEVER, which is not a frequent operation in
> > normal cases.
> >
> > The performance improvement data, both in the commit log and in the
> > 0day report https://lore.kernel.org/lkml/20200622132548.GS5535@shao2-debian/,
> > is for will-it-scale's mmap testcase, which does not change the memory
> > overcommit policy at runtime, so the data should still be valid with
> > this fix.
>
> Well, I would expect it is perfectly reasonable for people to use
> OVERCOMMIT_NEVER for some workloads, making it a more frequent operation.
In my last email, I was not saying OVERCOMMIT_NEVER is not a normal
case; I just don't think users will change the overcommit policy at
runtime too frequently. And the fix patch syncing 'vm_committed_as' is
only called when the user runs 'sysctl -w vm.overcommit_memory=2'.
> The question now is whether any of those regression fixes would regress
> the performance of OVERCOMMIT_NEVER workloads, or stay on par with the
> data from before the patchset.
For the original patchset, it keeps vm_committed_as's batch unchanged
for the OVERCOMMIT_NEVER policy and enlarges it for the other two loose
policies, OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS, so I don't expect the
performance of "OVERCOMMIT_NEVER workloads" to be impacted. If you have
suggestions for this kind of benchmark, I can test them to better
verify the patchset, thanks!
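
To be concrete, the batch sizing idea is roughly the following (a
sketch, with details simplified; the exact divisors and signature in
the actual patch may differ):

	/*
	 * Sketch of the policy-aware batch sizing: keep the conservative
	 * 0.4% of (total memory / #cpus) for OVERCOMMIT_NEVER, and use a
	 * 64x larger batch for the loose policies, which can tolerate a
	 * much bigger deviation of vm_committed_as.
	 */
	static void mm_compute_batch_sketch(int overcommit_policy)
	{
		u64 memsized_batch;
		s32 nr = num_present_cpus();
		s32 batch = max_t(s32, nr * 2, 32);

		if (overcommit_policy == OVERCOMMIT_NEVER)
			memsized_batch = min_t(u64, totalram_pages() / nr / 256, INT_MAX);
		else
			memsized_batch = min_t(u64, totalram_pages() / nr / 4, INT_MAX);

		vm_committed_as_batch = max_t(s32, memsized_batch, batch);
	}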
- Feng
>
> Given how much churn this patchset has had recently, I would think
> "should be still valid" is not really the answer we are looking for.
>
> >
> > Thanks,
> > Feng
> >
> >
On Mon, Jul 06, 2020 at 09:43:13AM +0800, Feng Tang wrote:
> On Sun, Jul 05, 2020 at 11:52:32AM -0400, Qian Cai wrote:
> > On Sun, Jul 05, 2020 at 08:58:54PM +0800, Feng Tang wrote:
> > > On Sun, Jul 05, 2020 at 08:15:03AM -0400, Qian Cai wrote:
> > > >
> > > >
> > > > > On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
> > > > >
> > > > > I did reproduce the problem, and from the debugging, this should
> > > > > be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> > > > > enlarging the batch loosens the counting accuracy, and the solution of
> > > > > adding some sync is still needed, as discussed in that thread.
> > > >
> > > > Well, before taking any of those patches to fix the regression, we
> > > > will need some performance data first. If it turns out the original
> > > > performance gain is no longer relevant due to this regression fix on
> > > > top, it is best to drop this patchset and restore that VM_WARN_ONCE,
> > > > so you can retry later once you have found a better way to optimize.
> > >
> > > The fix of adding a sync only happens when the memory policy is being
> > > changed to OVERCOMMIT_NEVER, which is not a frequent operation in
> > > normal cases.
> > >
> > > The performance improvement data, both in the commit log and in the
> > > 0day report https://lore.kernel.org/lkml/20200622132548.GS5535@shao2-debian/,
> > > is for will-it-scale's mmap testcase, which does not change the memory
> > > overcommit policy at runtime, so the data should still be valid with
> > > this fix.
> >
> > Well, I would expect it is perfectly reasonable for people to use
> > OVERCOMMIT_NEVER for some workloads, making it a more frequent operation.
>
> In my last email, I was not saying OVERCOMMIT_NEVER is not a normal
> case; I just don't think users will change the overcommit policy at
> runtime too frequently. And the fix patch syncing 'vm_committed_as' is
> only called when the user runs 'sysctl -w vm.overcommit_memory=2'.
>
> > The question now is whether any of those regression fixes would regress
> > the performance of OVERCOMMIT_NEVER workloads, or stay on par with the
> > data from before the patchset.
>
> For the original patchset, it keeps vm_committed_as's batch unchanged
> for the OVERCOMMIT_NEVER policy and enlarges it for the other two loose
> policies, OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS, so I don't expect the
> performance of "OVERCOMMIT_NEVER workloads" to be impacted. If you have
> suggestions for this kind of benchmark, I can test them to better
> verify the patchset, thanks!
Then, please capture that information in a proper commit log when you
submit the regression fix on top of the patchset, and CC the PER-CPU
MEMORY ALLOCATOR maintainers, so they may be able to review it properly.
>
> - Feng
>
> >
> > Given how much churn this patchset has had recently, I would think
> > "should be still valid" is not really the answer we are looking for.
> >
> > >
> > > Thanks,
> > > Feng
> > >
> > >
Hi All,
Please help to review this fix patch, thanks!
It is against today's linux-mm tree. For easy review, I put the fix
into one patch; I could split it into two parts (percpu-counter and
mm/util.c) if that's preferred.
From 593f9dc139181a7c3bb1705aacd1f625f400e458 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Mon, 6 Jul 2020 14:48:29 +0800
Subject: [PATCH] mm/util.c: sync vm_committed_as when changing memory policy
to OVERCOMMIT_NEVER
With the patch to improve the scalability of vm_committed_as [1], 0day
reported that the ltp overcommit_memory test case could fail (the fail
rate is about 5/50) [2].

The root cause is that when the system is running with a loose memory
overcommit policy like OVERCOMMIT_GUESS/ALWAYS, the deviation of
vm_committed_as can be big, and once the policy is changed at runtime
to OVERCOMMIT_NEVER, vm_committed_as's batch is decreased to 1/64 of
the original one, but the deviation is not compensated accordingly, so
the following __vm_enough_memory() check for vm overcommit can be wrong
due to this deviation, which breaks the ltp overcommit_memory case.

Fix it by forcing a sync of the percpu counter vm_committed_as when the
overcommit policy is changed to OVERCOMMIT_NEVER (sysctl -w
vm.overcommit_memory=2). The sync itself is not a fast operation, but
it is tolerable given that the user is not expected to change the
policy to OVERCOMMIT_NEVER frequently.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://marc.info/?l=linux-mm&m=159367156428286 (can't find a link in lore.kernel.org)
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/percpu_counter.h | 4 ++++
lib/percpu_counter.c | 14 ++++++++++++++
mm/util.c | 11 ++++++++++-
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 0a4f54d..01861ee 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -44,6 +44,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_sync(struct percpu_counter *fbc);
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
 {
@@ -172,6 +173,9 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 	return true;
 }
 
+static inline void percpu_counter_sync(struct percpu_counter *fbc)
+{
+}
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index a66595b..02d87fc 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -98,6 +98,20 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 }
 EXPORT_SYMBOL(percpu_counter_add_batch);
 
+void percpu_counter_sync(struct percpu_counter *fbc)
+{
+	unsigned long flags;
+	s64 count;
+
+	raw_spin_lock_irqsave(&fbc->lock, flags);
+	count = __this_cpu_read(*fbc->counters);
+	fbc->count += count;
+	__this_cpu_sub(*fbc->counters, count);
+	raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_sync);
+
+
 /*
  * Add up all the per-cpu counts, return the result. This is a more accurate
  * but much slower version of percpu_counter_read_positive()
diff --git a/mm/util.c b/mm/util.c
index 52ed9c1..5fb62c0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,14 +746,23 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
 	return ret;
 }
 
+static void sync_overcommit_as(struct work_struct *dummy)
+{
+	percpu_counter_sync(&vm_committed_as);
+}
+
 int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
 	int ret;
 
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
-	if (ret == 0 && write)
+	if (ret == 0 && write) {
+		if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+			schedule_on_each_cpu(sync_overcommit_as);
+
 		mm_compute_batch();
+	}
 
 	return ret;
 }
--
2.7.4
On Sun, Jul 05, 2020 at 10:36:14PM -0400, Qian Cai wrote:
> > In my last email, I was not saying OVERCOMMIT_NEVER is not a normal
> > case; I just don't think users will change the overcommit policy at
> > runtime too frequently. And the fix patch syncing 'vm_committed_as' is
> > only called when the user runs 'sysctl -w vm.overcommit_memory=2'.
> >
> > > The question now is whether any of those regression fixes would regress
> > > the performance of OVERCOMMIT_NEVER workloads, or stay on par with the
> > > data from before the patchset.
> >
> > For the original patchset, it keeps vm_committed_as's batch unchanged
> > for the OVERCOMMIT_NEVER policy and enlarges it for the other two loose
> > policies, OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS, so I don't expect the
> > performance of "OVERCOMMIT_NEVER workloads" to be impacted. If you have
> > suggestions for this kind of benchmark, I can test them to better
> > verify the patchset, thanks!
>
> Then, please capture that information in a proper commit log when you
> submit the regression fix on top of the patchset, and CC the PER-CPU
> MEMORY ALLOCATOR maintainers, so they may be able to review it properly.
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> - if (ret == 0 && write)
> + if (ret == 0 && write) {
> + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> + schedule_on_each_cpu(sync_overcommit_as);
The schedule_on_each_cpu is not atomic, so the problem could still happen
in that window.
I think it may be ok if it eventually resolves, but certainly needs
a comment explaining it. Can you do some stress testing toggling the
policy all the time on different CPUs and running the test on
other CPUs and see if the test fails?
The other alternative would be to define some intermediate state
for the sysctl variable and only switch to never once the schedule_on_each_cpu
returned. But that's more complexity.
-Andi
On Mon, 6 Jul 2020 06:34:34 -0700 Andi Kleen <[email protected]> wrote:
> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > - if (ret == 0 && write)
> > + if (ret == 0 && write) {
> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > + schedule_on_each_cpu(sync_overcommit_as);
>
> The schedule_on_each_cpu is not atomic, so the problem could still happen
> in that window.
>
> I think it may be ok if it eventually resolves, but certainly needs
> a comment explaining it.
It sure does.
The new exported-to-everything percpu_counter_sync() should have full
formal documentation as well, please.
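
Something along these lines, perhaps (a rough draft; wording to taste):

	/**
	 * percpu_counter_sync - fold the current CPU's delta into fbc->count
	 * @fbc: the percpu counter to be synced
	 *
	 * Fold the calling CPU's batched local count into the global
	 * fbc->count and zero the local counter. To sync the whole
	 * counter, this needs to run on every CPU, e.g. via
	 * schedule_on_each_cpu().
	 */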
> Can you do some stress testing toggling the
> policy all the time on different CPUs and running the test on
> other CPUs and see if the test fails?
>
> The other alternative would be to define some intermediate state
> for the sysctl variable and only switch to never once the schedule_on_each_cpu
> returned. But that's more complexity.
>
>
> -Andi
On Mon, Jul 06, 2020 at 09:24:43PM +0800, Feng Tang wrote:
> Hi All,
>
> Please help to review this fix patch, thanks!
>
> It is against today's linux-mm tree. For easy review, I put the fix
> into one patch; I could split it into two parts (percpu-counter and
> mm/util.c) if that's preferred.
>
> From 593f9dc139181a7c3bb1705aacd1f625f400e458 Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Mon, 6 Jul 2020 14:48:29 +0800
> Subject: [PATCH] mm/util.c: sync vm_committed_as when changing memory policy
> to OVERCOMMIT_NEVER
>
> With the patch to improve the scalability of vm_committed_as [1], 0day
> reported that the ltp overcommit_memory test case could fail (the fail
> rate is about 5/50) [2].
>
> The root cause is that when the system is running with a loose memory
> overcommit policy like OVERCOMMIT_GUESS/ALWAYS, the deviation of
> vm_committed_as can be big, and once the policy is changed at runtime
> to OVERCOMMIT_NEVER, vm_committed_as's batch is decreased to 1/64 of
> the original one, but the deviation is not compensated accordingly, so
> the following __vm_enough_memory() check for vm overcommit can be wrong
> due to this deviation, which breaks the ltp overcommit_memory case.
>
> Fix it by forcing a sync of the percpu counter vm_committed_as when the
> overcommit policy is changed to OVERCOMMIT_NEVER (sysctl -w
> vm.overcommit_memory=2). The sync itself is not a fast operation, but
> it is tolerable given that the user is not expected to change the
> policy to OVERCOMMIT_NEVER frequently.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://marc.info/?l=linux-mm&m=159367156428286 (can't find a link in lore.kernel.org)
>
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/percpu_counter.h | 4 ++++
> lib/percpu_counter.c | 14 ++++++++++++++
> mm/util.c | 11 ++++++++++-
> 3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 0a4f54d..01861ee 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -44,6 +44,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
> s32 batch);
> s64 __percpu_counter_sum(struct percpu_counter *fbc);
> int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
> +void percpu_counter_sync(struct percpu_counter *fbc);
>
> static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
> {
> @@ -172,6 +173,9 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
> return true;
> }
>
> +static inline void percpu_counter_sync(struct percpu_counter *fbc)
> +{
> +}
> #endif /* CONFIG_SMP */
>
> static inline void percpu_counter_inc(struct percpu_counter *fbc)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index a66595b..02d87fc 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -98,6 +98,20 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
> }
> EXPORT_SYMBOL(percpu_counter_add_batch);
>
> +void percpu_counter_sync(struct percpu_counter *fbc)
> +{
> + unsigned long flags;
> + s64 count;
> +
> + raw_spin_lock_irqsave(&fbc->lock, flags);
> + count = __this_cpu_read(*fbc->counters);
> + fbc->count += count;
> + __this_cpu_sub(*fbc->counters, count);
> + raw_spin_unlock_irqrestore(&fbc->lock, flags);
> +}
> +EXPORT_SYMBOL(percpu_counter_sync);
> +
> +
> /*
> * Add up all the per-cpu counts, return the result. This is a more accurate
> * but much slower version of percpu_counter_read_positive()
> diff --git a/mm/util.c b/mm/util.c
> index 52ed9c1..5fb62c0 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -746,14 +746,23 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
> return ret;
> }
>
> +static void sync_overcommit_as(struct work_struct *dummy)
> +{
> + percpu_counter_sync(&vm_committed_as);
> +}
> +
This seems like a rather niche use case as it's currently coupled with a
schedule_on_each_cpu(). I can't imagine a use case where you'd want to
do this without being called by schedule_on_each_cpu().
Would it be better to modify or introduce something akin to
percpu_counter_sum() which sums and folds in the counter state? I'd be
curious to see what the cost of always folding would be as this is
already considered the cold path and would help with the next batch too.
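
Roughly something like this, I imagine (a hypothetical sketch;
percpu_counter_sum_and_fold is a made-up name, not an existing API):

	/* Hypothetical "sum and fold": add every CPU's local delta into
	 * fbc->count and zero the local counters in one pass. */
	static s64 percpu_counter_sum_and_fold(struct percpu_counter *fbc)
	{
		unsigned long flags;
		s64 ret;
		int cpu;

		raw_spin_lock_irqsave(&fbc->lock, flags);
		for_each_online_cpu(cpu) {
			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);

			fbc->count += *pcount;
			*pcount = 0;
		}
		ret = fbc->count;
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
		return ret;
	}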
> int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
> size_t *lenp, loff_t *ppos)
> {
> int ret;
>
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> - if (ret == 0 && write)
> + if (ret == 0 && write) {
> + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> + schedule_on_each_cpu(sync_overcommit_as);
> +
> mm_compute_batch();
> + }
>
> return ret;
> }
> --
> 2.7.4
>
>
> On Sun, Jul 05, 2020 at 10:36:14PM -0400, Qian Cai wrote:
> > > In my last email, I was not saying OVERCOMMIT_NEVER is not a normal
> > > case; I just don't think users will change the overcommit policy at
> > > runtime too frequently. And the fix patch syncing 'vm_committed_as' is
> > > only called when the user runs 'sysctl -w vm.overcommit_memory=2'.
> > >
> > > > The question now is whether any of those regression fixes would regress
> > > > the performance of OVERCOMMIT_NEVER workloads, or stay on par with the
> > > > data from before the patchset.
> > >
> > > For the original patchset, it keeps vm_committed_as's batch unchanged
> > > for the OVERCOMMIT_NEVER policy and enlarges it for the other two loose
> > > policies, OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS, so I don't expect the
> > > performance of "OVERCOMMIT_NEVER workloads" to be impacted. If you have
> > > suggestions for this kind of benchmark, I can test them to better
> > > verify the patchset, thanks!
> >
> > Then, please capture that information in a proper commit log when you
> > submit the regression fix on top of the patchset, and CC the PER-CPU
> > MEMORY ALLOCATOR maintainers, so they may be able to review it properly.
>
>
>
Thanks,
Dennis
On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > - if (ret == 0 && write)
> > + if (ret == 0 && write) {
> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > + schedule_on_each_cpu(sync_overcommit_as);
>
> The schedule_on_each_cpu is not atomic, so the problem could still happen
> in that window.
>
> I think it may be ok if it eventually resolves, but certainly needs
> a comment explaining it. Can you do some stress testing toggling the
> policy all the time on different CPUs and running the test on
> other CPUs and see if the test fails?
For the raw test case reported by 0day, this patch passed 200 runs.
And I will read the ltp code and try the stress testing you suggested.
> The other alternative would be to define some intermediate state
> for the sysctl variable and only switch to never once the schedule_on_each_cpu
> returned. But that's more complexity.
One thought I had is to put this schedule_on_each_cpu() before
the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
is really changed. But the window still exists, as the batch is
still the larger one.
Thanks,
Feng
>
>
> -Andi
On Tue, Jul 07, 2020 at 01:06:51AM +0000, Dennis Zhou wrote:
> On Mon, Jul 06, 2020 at 09:24:43PM +0800, Feng Tang wrote:
> > Hi All,
> >
> > Please help to review this fix patch, thanks!
> >
> > It is against today's linux-mm tree. For easy review, I put the fix
> > into one patch; I could split it into two parts (percpu-counter and
> > mm/util.c) if that's preferred.
> >
> > From 593f9dc139181a7c3bb1705aacd1f625f400e458 Mon Sep 17 00:00:00 2001
> > From: Feng Tang <[email protected]>
> > Date: Mon, 6 Jul 2020 14:48:29 +0800
> > Subject: [PATCH] mm/util.c: sync vm_committed_as when changing memory policy
> > to OVERCOMMIT_NEVER
> >
> > With the patch to improve the scalability of vm_committed_as [1], 0day
> > reported that the ltp overcommit_memory test case could fail (the fail
> > rate is about 5/50) [2].
> >
> > The root cause is that when the system is running with a loose memory
> > overcommit policy like OVERCOMMIT_GUESS/ALWAYS, the deviation of
> > vm_committed_as can be big, and once the policy is changed at runtime
> > to OVERCOMMIT_NEVER, vm_committed_as's batch is decreased to 1/64 of
> > the original one, but the deviation is not compensated accordingly, so
> > the following __vm_enough_memory() check for vm overcommit can be wrong
> > due to this deviation, which breaks the ltp overcommit_memory case.
> >
> > Fix it by forcing a sync of the percpu counter vm_committed_as when the
> > overcommit policy is changed to OVERCOMMIT_NEVER (sysctl -w
> > vm.overcommit_memory=2). The sync itself is not a fast operation, but
> > it is tolerable given that the user is not expected to change the
> > policy to OVERCOMMIT_NEVER frequently.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://marc.info/?l=linux-mm&m=159367156428286 (can't find a link in lore.kernel.org)
> >
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > include/linux/percpu_counter.h | 4 ++++
> > lib/percpu_counter.c | 14 ++++++++++++++
> > mm/util.c | 11 ++++++++++-
> > 3 files changed, 28 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> > index 0a4f54d..01861ee 100644
> > --- a/include/linux/percpu_counter.h
> > +++ b/include/linux/percpu_counter.h
> > @@ -44,6 +44,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
> > s32 batch);
> > s64 __percpu_counter_sum(struct percpu_counter *fbc);
> > int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
> > +void percpu_counter_sync(struct percpu_counter *fbc);
> >
> > static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
> > {
> > @@ -172,6 +173,9 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
> > return true;
> > }
> >
> > +static inline void percpu_counter_sync(struct percpu_counter *fbc)
> > +{
> > +}
> > #endif /* CONFIG_SMP */
> >
> > static inline void percpu_counter_inc(struct percpu_counter *fbc)
> > diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> > index a66595b..02d87fc 100644
> > --- a/lib/percpu_counter.c
> > +++ b/lib/percpu_counter.c
> > @@ -98,6 +98,20 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
> > }
> > EXPORT_SYMBOL(percpu_counter_add_batch);
> >
> > +void percpu_counter_sync(struct percpu_counter *fbc)
> > +{
> > + unsigned long flags;
> > + s64 count;
> > +
> > + raw_spin_lock_irqsave(&fbc->lock, flags);
> > + count = __this_cpu_read(*fbc->counters);
> > + fbc->count += count;
> > + __this_cpu_sub(*fbc->counters, count);
> > + raw_spin_unlock_irqrestore(&fbc->lock, flags);
> > +}
> > +EXPORT_SYMBOL(percpu_counter_sync);
> > +
> > +
> > /*
> > * Add up all the per-cpu counts, return the result. This is a more accurate
> > * but much slower version of percpu_counter_read_positive()
> > diff --git a/mm/util.c b/mm/util.c
> > index 52ed9c1..5fb62c0 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -746,14 +746,23 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
> > return ret;
> > }
> >
> > +static void sync_overcommit_as(struct work_struct *dummy)
> > +{
> > + percpu_counter_sync(&vm_committed_as);
> > +}
> > +
>
> This seems like a rather niche use case as it's currently coupled with a
> schedule_on_each_cpu(). I can't imagine a use case where you'd want to
> do this without being called by schedule_on_each_cpu().
Yes!
>
> Would it be better to modify or introduce something akin to
> percpu_counter_sum() which sums and folds in the counter state? I'd be
> curious to see what the cost of always folding would be as this is
> already considered the cold path and would help with the next batch too.
Initially, I also thought about doing the sync just like
percpu_counter_sum():

	raw_spin_lock_irqsave
	for_each_online_cpu(cpu) {
		do-the-sync
	}
	raw_spin_unlock_irqrestore
One problem is that per_cpu_ptr(fbc->counters, cpu) could still be
updated on other CPUs, as the fast-path update is not protected by
fbc->lock.
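
For reference, the fast path of percpu_counter_add_batch() in
lib/percpu_counter.c looks roughly like this (quoted from memory;
details may differ), and the final this_cpu_add() runs without taking
fbc->lock:

	void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (count >= batch || count <= -batch) {
			unsigned long flags;

			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}

So zeroing a remote CPU's local counter under fbc->lock could still
race with that unlocked this_cpu_add() and lose an update.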
As for the cost, it is about 800 nanoseconds on a 2C/4T platform and
2~3 microseconds on a 2S/36C/72T Skylake server in the normal case; in
the worst case, where vm_committed_as's spinlock is under severe
contention, it costs 30~40 microseconds on the 2S/36C/72T Skylake
server.
Thanks,
Feng
> > int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
> > size_t *lenp, loff_t *ppos)
> > {
> > int ret;
> >
> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > - if (ret == 0 && write)
> > + if (ret == 0 && write) {
> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > + schedule_on_each_cpu(sync_overcommit_as);
> > +
> > mm_compute_batch();
> > + }
> >
> > return ret;
> > }
> > --
> > 2.7.4
Feng Tang <[email protected]> writes:
> On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
>> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> > - if (ret == 0 && write)
>> > + if (ret == 0 && write) {
>> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
>> > + schedule_on_each_cpu(sync_overcommit_as);
>>
>> The schedule_on_each_cpu is not atomic, so the problem could still happen
>> in that window.
>>
>> I think it may be ok if it eventually resolves, but certainly needs
>> a comment explaining it. Can you do some stress testing toggling the
>> policy all the time on different CPUs and running the test on
>> other CPUs and see if the test fails?
>
> For the raw test case reported by 0day, this patch passed 200 runs.
> And I will read the ltp code and try the stress testing you suggested.
>
>
>> The other alternative would be to define some intermediate state
>> for the sysctl variable and only switch to never once the schedule_on_each_cpu
>> returned. But that's more complexity.
>
> One thought I had is to put this schedule_on_each_cpu() before
> the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
> is really changed. But the window still exists, as the batch is
> still the larger one.
Can we change the batch first, then sync the global counter, and
finally change the overcommit policy?
Best Regards,
Huang, Ying
On Tue, Jul 07, 2020 at 12:00:09PM +0800, Huang, Ying wrote:
> Feng Tang <[email protected]> writes:
>
> > On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
> >> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> >> > - if (ret == 0 && write)
> >> > + if (ret == 0 && write) {
> >> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> >> > + schedule_on_each_cpu(sync_overcommit_as);
> >>
> >> The schedule_on_each_cpu is not atomic, so the problem could still happen
> >> in that window.
> >>
> >> I think it may be ok if it eventually resolves, but certainly needs
> >> a comment explaining it. Can you do some stress testing toggling the
> >> policy all the time on different CPUs and running the test on
> >> other CPUs and see if the test fails?
> >
> > For the raw test case reported by 0day, this patch passed 200 runs.
> > And I will read the ltp code and try the stress testing you suggested.
> >
> >
> >> The other alternative would be to define some intermediate state
> >> for the sysctl variable and only switch to never once the schedule_on_each_cpu
> >> returned. But that's more complexity.
> >
> > One thought I had is to put this schedule_on_each_cpu() before
> > the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
> > is really changed. But the window still exists, as the batch is
> > still the larger one.
>
> Can we change the batch first, then sync the global counter, and
> finally change the overcommit policy?
These reorderings are really head-scratching :)

I've thought about this before, when Qian Cai first reported the
warning message, as the kernel had this check:

	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
		     -(s64)vm_committed_as_batch * num_online_cpus(),
		     "memory commitment underflow");

If the batch is decreased first, the warning will be triggered more
easily and earlier, so I didn't bring this up when handling the warning
message.

But it might work now, as the warning has been removed.
Thanks,
Feng
On Sun 05-07-20 11:52:32, Qian Cai wrote:
> On Sun, Jul 05, 2020 at 08:58:54PM +0800, Feng Tang wrote:
> > On Sun, Jul 05, 2020 at 08:15:03AM -0400, Qian Cai wrote:
> > >
> > >
> > > > On Jul 5, 2020, at 12:45 AM, Feng Tang <[email protected]> wrote:
> > > >
> > > > I did reproduce the problem, and from the debugging, this should
> > > > be the same root cause as https://lore.kernel.org/lkml/[email protected]/:
> > > > enlarging the batch loosens the counting accuracy, and the solution of
> > > > adding some sync is still needed, as discussed in that thread.
> > >
> > > Well, before taking any of those patches to fix the regression, we
> > > will need some performance data first. If it turns out the original
> > > performance gain is no longer relevant due to this regression fix on
> > > top, it is best to drop this patchset and restore that VM_WARN_ONCE,
> > > so you can retry later once you have found a better way to optimize.
> >
> > The fix of adding a sync only happens when the memory policy is being
> > changed to OVERCOMMIT_NEVER, which is not a frequent operation in
> > normal cases.
> >
> > The performance improvement data, both in the commit log and in the
> > 0day report https://lore.kernel.org/lkml/20200622132548.GS5535@shao2-debian/,
> > is for will-it-scale's mmap testcase, which does not change the memory
> > overcommit policy at runtime, so the data should still be valid with
> > this fix.
>
> Well, I would expect it is perfectly reasonable for people to use
> OVERCOMMIT_NEVER for some workloads, making it a more frequent operation.
Would you have any examples? Because I find this highly unlikely.
OVERCOMMIT_NEVER only works when virtual memory is not largely
overcommitted with respect to the real memory demand, and that tends to
be the exception rather than the rule. "Modern" userspace (whatever
that means) tends to be really hungry for virtual memory that is only
used very sparsely.

I would argue that either somebody is running "OVERCOMMIT_NEVER"-friendly
software and this is a permanent setting, or it is not used at all.
At least this is my experience.

So I strongly suspect that the LTP test failure is not something we
should really lose sleep over. It would be nice to find a way to flush
the existing batches, but I would rather see a real workload that
suffers from this imprecision.

On the other hand, the perf boost from larger batches with the default
overcommit setting sounds like a nice improvement to have.
--
Michal Hocko
SUSE Labs
On Tue, Jul 07, 2020 at 01:41:20PM +0800, Feng Tang wrote:
> On Tue, Jul 07, 2020 at 12:00:09PM +0800, Huang, Ying wrote:
> > Feng Tang <[email protected]> writes:
> >
> > > On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
> > >> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > >> > - if (ret == 0 && write)
> > >> > + if (ret == 0 && write) {
> > >> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > >> > + schedule_on_each_cpu(sync_overcommit_as);
> > >>
> > >> The schedule_on_each_cpu is not atomic, so the problem could still happen
> > >> in that window.
> > >>
> > >> I think it may be ok if it eventually resolves, but certainly needs
> > >> a comment explaining it. Can you do some stress testing toggling the
> > >> policy all the time on different CPUs and running the test on
> > >> other CPUs and see if the test fails?
> > >
> > > For the raw test case reported by 0day, this patch passed 200 runs.
> > > And I will read the ltp code and try the stress testing you suggested.
> > >
> > >
> > >> The other alternative would be to define some intermediate state
> > >> for the sysctl variable and only switch to never once the schedule_on_each_cpu
> > >> returned. But that's more complexity.
> > >
> > > One thought I had is to put this schedule_on_each_cpu() before
> > > the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
> > > is really changed. But the window still exists, as the batch is
> > > still the larger one.
> >
> > Can we change the batch first, then sync the global counter, and
> > finally change the overcommit policy?
>
> These reorderings are really head-scratching :)
>
> I've thought about this before, when Qian Cai first reported the
> warning message, as the kernel had this check:
>
> 	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> 		     -(s64)vm_committed_as_batch * num_online_cpus(),
> 		     "memory commitment underflow");
>
> If the batch is decreased first, the warning will be triggered more
> easily and earlier, so I didn't bring this up when handling the warning
> message.
>
> But it might work now, as the warning has been removed.
I tested the reordering, and the test passed 100 runs. The new order
when changing the policy to OVERCOMMIT_NEVER is:
1. re-compute the batch (to the smaller one)
2. do the on_each_cpu sync
3. really change the policy to NEVER.

It solves one of the previous concerns: after the sync is done on CPU X,
but before the sync on all CPUs is done, there is a window in which the
percpu counter could be enlarged again.

IIRC, Andi had a concern about the read-side cost of doing the sync; my
understanding is that most of the readers (malloc/free/map/unmap) use
percpu_counter_read_positive, which is a fast path that takes no lock.

As for the problem itself, I agree with Michal's point that usually
there is no normal case that changes the overcommit policy too
frequently.

The code logic is mainly in overcommit_policy_handler(), based on the
previous sync fix. Please help to review, thanks!
int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
		size_t *lenp, loff_t *ppos)
{
	int ret;

	if (write) {
		int new_policy;
		struct ctl_table t;

		t = *table;
		t.data = &new_policy;
		ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
		if (ret)
			return ret;

		mm_compute_batch(new_policy);
		if (new_policy == OVERCOMMIT_NEVER)
			schedule_on_each_cpu(sync_overcommit_as);
		sysctl_overcommit_memory = new_policy;
	} else {
		ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	}

	return ret;
}
- Feng
On Thu, Jul 09, 2020 at 12:55:54PM +0800, Feng Tang wrote:
> On Tue, Jul 07, 2020 at 01:41:20PM +0800, Feng Tang wrote:
> > On Tue, Jul 07, 2020 at 12:00:09PM +0800, Huang, Ying wrote:
> > > Feng Tang <[email protected]> writes:
> > >
> > > > On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
> > > >> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > > >> > - if (ret == 0 && write)
> > > >> > + if (ret == 0 && write) {
> > > >> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > > >> > + schedule_on_each_cpu(sync_overcommit_as);
> > > >>
> > > >> The schedule_on_each_cpu is not atomic, so the problem could still happen
> > > >> in that window.
> > > >>
> > > >> I think it may be ok if it eventually resolves, but certainly needs
> > > >> a comment explaining it. Can you do some stress testing toggling the
> > > >> policy all the time on different CPUs and running the test on
> > > >> other CPUs and see if the test fails?
> > > >
> > > > For the raw test case reported by 0day, this patch passed 200 runs.
> > > > And I will read the ltp code and try the stress testing you suggested.
> > > >
> > > >
> > > >> The other alternative would be to define some intermediate state
> > > >> for the sysctl variable and only switch to never once the schedule_on_each_cpu
> > > >> returned. But that's more complexity.
> > > >
> > > > One thought I had is to put this schedule_on_each_cpu() before
> > > > the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
> > > > is really changed. But the window still exists, as the batch is
> > > > still the larger one.
> > >
> > > Can we change the batch first, then sync the global counter, and
> > > finally change the overcommit policy?
> >
> > These reorderings are really head-scratching :)
> >
> > I've thought about this before, when Qian Cai first reported the
> > warning message, as the kernel had this check:
> >
> > 	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> > 		     -(s64)vm_committed_as_batch * num_online_cpus(),
> > 		     "memory commitment underflow");
> >
> > If the batch is decreased first, the warning will be triggered more
> > easily and earlier, so I didn't bring this up when handling the warning
> > message.
> >
> > But it might work now, as the warning has been removed.
>
> I tested the reordering, and the test passed 100 runs. The new order
> when changing the policy to OVERCOMMIT_NEVER is:
> 1. re-compute the batch (to the smaller one)
> 2. do the on_each_cpu sync
> 3. really change the policy to NEVER.
>
> It solves one of the previous concerns: after the sync is done on CPU X,
> but before the sync on all CPUs is done, there is a window in which the
> percpu counter could be enlarged again.
>
> IIRC, Andi had a concern about the read-side cost of doing the sync; my
> understanding is that most of the readers (malloc/free/map/unmap) use
> percpu_counter_read_positive, which is a fast path that takes no lock.
>
> As for the problem itself, I agree with Michal's point that usually
> there is no normal case that changes the overcommit policy too
> frequently.
>
> The code logic is mainly in overcommit_policy_handler(), based on the
> previous sync fix. Please help to review, thanks!
>
> int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
> 		size_t *lenp, loff_t *ppos)
> {
> 	int ret;
>
> 	if (write) {
> 		int new_policy;
> 		struct ctl_table t;
>
> 		t = *table;
> 		t.data = &new_policy;
> 		ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
> 		if (ret)
> 			return ret;
>
> 		mm_compute_batch(new_policy);
> 		if (new_policy == OVERCOMMIT_NEVER)
> 			schedule_on_each_cpu(sync_overcommit_as);
> 		sysctl_overcommit_memory = new_policy;
> 	} else {
> 		ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> 	}
>
> 	return ret;
> }
Rather than indenting that many lines, how about this?
t = *table;
t.data = &new_policy;
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (ret || !write)
return ret;
mm_compute_batch(new_policy);
if (new_policy == OVERCOMMIT_NEVER)
schedule_on_each_cpu(sync_overcommit_as);
sysctl_overcommit_memory = new_policy;
return ret;
Hi Qian Cai,
On Thu, Jul 09, 2020 at 09:40:40AM -0400, Qian Cai wrote:
> > [ ...full quote of the earlier thread snipped... ]
>
> Rather than indenting that many lines, how about this?
Thanks for the cleanup suggestion.
> t = *table;
> t.data = &new_policy;
The input table->data is actually &sysctl_overcommit_memory, so there is
a problem for the "read" case: it would return the 'new_policy' value
instead of the real sysctl_overcommit_memory.
It should work after adding a check:
if (write)
t.data = &new_policy;
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
--> &t
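Putting the two fixes together, the combined handler might look like this
(a sketch; for reads, t.data keeps pointing at sysctl_overcommit_memory):

	int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
				      size_t *lenp, loff_t *ppos)
	{
		struct ctl_table t;
		int new_policy;
		int ret;

		/* Redirect the parsed value only for writes, so reads
		 * still report the real sysctl_overcommit_memory.
		 */
		t = *table;
		if (write)
			t.data = &new_policy;
		ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
		if (ret || !write)
			return ret;

		/* Shrink the batch and drain the percpu deltas before the
		 * stricter policy becomes visible to readers.
		 */
		mm_compute_batch(new_policy);
		if (new_policy == OVERCOMMIT_NEVER)
			schedule_on_each_cpu(sync_overcommit_as);
		sysctl_overcommit_memory = new_policy;

		return ret;
	}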
Thanks,
Feng
> if (ret || !write)
> return ret;
> mm_compute_batch(new_policy);
> if (new_policy == OVERCOMMIT_NEVER)
> schedule_on_each_cpu(sync_overcommit_as);
>
> sysctl_overcommit_memory = new_policy;
> return ret;
On Thu, Jul 09, 2020 at 10:15:19PM +0800, Feng Tang wrote:
> Hi Qian Cai,
>
> On Thu, Jul 09, 2020 at 09:40:40AM -0400, Qian Cai wrote:
> > > [ ...full quote of the earlier thread snipped... ]
> >
> > Rather than indenting that many lines, how about this?
>
> Thanks for the cleanup suggestion.
>
> > t = *table;
> > t.data = &new_policy;
>
> The input table->data is actually &sysctl_overcommit_memory, so there is
> a problem for the "read" case: it would return the 'new_policy' value
> instead of the real sysctl_overcommit_memory.
>
> It should work after adding a check:
> if (write)
> t.data = &new_policy;
>
> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> --> &t
On second thought, my previous version has more indentation and more
lines, but it makes it easier to see that we have special handling for
the 'write' case. So I would prefer to keep it.
Thoughts?
Thanks,
Feng
> On Jul 9, 2020, at 9:38 PM, Feng Tang <[email protected]> wrote:
>
> On second thought, my previous version has more indentation and more
> lines, but it makes it easier to see that we have special handling for
> the 'write' case. So I would prefer to keep it.
>
> Thoughts?
I don’t feel it is easier to understand. I generally prefer to bail out early when possible; it also makes the code a bit more solid for future extensions (once the indentation reaches 3+ levels, we will need to rework it).
But I realize that I have spent more time debugging than actually writing code these days, so my taste is probably not all that good. Thus, feel free to submit whichever style you prefer, so that other people with more coding experience can review it.