2019-12-11 14:34:26

by Zhang Yi

Subject: Re: Patch "mm, vmstat: make quiet_vmstat lighter" has been added to the 4.4-stable tree

Hi, all

We observed a performance degradation in the lmbench af_unix[1] test case after
merging this patch on my x86 qemu 4.4 machine. The test results are basically
stable across repeated runs.

Host machine: CPU: Intel(R) Xeon(R) CPU E5-2690 v3
CPU(s): 48
MEM: 193047 MB

Guest machine: CPU: QEMU Virtual CPU version 2.5+
CPU(s): 8
MEM: 26065 MB

Before this patch:
[root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
AF_UNIX sock stream latency: 133.7073 microseconds

After this patch:
[root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
AF_UNIX sock stream latency: 156.4722 microseconds

If we pin the task to a fixed CPU, the degradation does not appear.

Before this patch:
[root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
AF_UNIX sock stream latency: 17.9296 microseconds

After this patch:
[root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
AF_UNIX sock stream latency: 17.7500 microseconds

We also tested it on an aarch64 hi1215 machine with 8 CPU cores.

Before this patch:
[root@localhost ~]# ./lat_unix -P 1
AF_UNIX sock stream latency: 30.7 microseconds

After this patch:
[root@localhost ~]# ./lat_unix -P 1
AF_UNIX sock stream latency: 37.5 microseconds

Attached is my reproduce config for the x86 qemu setup. Any thoughts?

Thanks,
Yi.

[1] http://sourceforge.mirrorservice.org/l/lm/lmbench/development/lmbench-3.0-a9/

On 2019/7/30 2:02, [email protected] wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> mm, vmstat: make quiet_vmstat lighter
>
> to the 4.4-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
>
> The filename of the patch is:
> mm-vmstat-make-quiet_vmstat-lighter.patch
> and it can be found in the queue-4.4 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <[email protected]> know about it.
>
>
> From f01f17d3705bb6081c9e5728078f64067982be36 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 5 Feb 2016 15:36:24 -0800
> Subject: mm, vmstat: make quiet_vmstat lighter
>
> From: Michal Hocko <[email protected]>
>
> commit f01f17d3705bb6081c9e5728078f64067982be36 upstream.
>
> Mike has reported a considerable overhead of refresh_cpu_vm_stats from
> the idle entry during pipe test:
>
> 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
> 4.75% [kernel] [k] __schedule
> 4.70% [kernel] [k] mutex_unlock
> 3.14% [kernel] [k] __switch_to
>
> This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
> deferrable again and shut down on idle") which has placed quiet_vmstat
> into cpu_idle_loop. The main reason here seems to be that the idle
> entry has to get over all zones and perform atomic operations for each
> vmstat entry even though there might be no per cpu diffs. This is a
> pointless overhead for _each_ idle entry.
>
> Make sure that quiet_vmstat is as light as possible.
>
> First of all it doesn't make any sense to do any local sync if the
> current cpu is already set in oncpu_stat_off because vmstat_update puts
> itself there only if there is nothing to do.
>
> Then we can check need_update which should be a cheap way to check for
> potential per-cpu diffs and only then do refresh_cpu_vm_stats.
>
> The original patch also did cancel_delayed_work which we are not doing
> here. There are two reasons for that. Firstly cancel_delayed_work from
> idle context will blow up on RT kernels (reported by Mike):
>
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
> Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> Call Trace:
> dump_stack+0x49/0x67
> ___might_sleep+0xf5/0x180
> rt_spin_lock+0x20/0x50
> try_to_grab_pending+0x69/0x240
> cancel_delayed_work+0x26/0xe0
> quiet_vmstat+0x75/0xa0
> cpu_idle_loop+0x38/0x3e0
> cpu_startup_entry+0x13/0x20
> start_secondary+0x114/0x140
>
> And secondly, even on !RT kernels it might add some non trivial overhead
> which is not necessary. Even if the vmstat worker wakes up and preempts
> idle then it will be most likely a single shot noop because the stats
> were already synced and so it would end up on the oncpu_stat_off anyway.
> We just need to teach both vmstat_shepherd and vmstat_update to stop
> scheduling the worker if there is nothing to do.
>
> [[email protected]: cancel pending work of the cpu_stat_off CPU]
> Signed-off-by: Michal Hocko <[email protected]>
> Reported-by: Mike Galbraith <[email protected]>
> Acked-by: Christoph Lameter <[email protected]>
> Signed-off-by: Mike Galbraith <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Daniel Wagner <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> mm/vmstat.c | 68 ++++++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 46 insertions(+), 22 deletions(-)
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1395,10 +1395,15 @@ static void vmstat_update(struct work_st
> * Counters were updated so we expect more updates
> * to occur in the future. Keep on running the
> * update worker thread.
> + * If we were marked on cpu_stat_off clear the flag
> + * so that vmstat_shepherd doesn't schedule us again.
> */
> - queue_delayed_work_on(smp_processor_id(), vmstat_wq,
> - this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + if (!cpumask_test_and_clear_cpu(smp_processor_id(),
> + cpu_stat_off)) {
> + queue_delayed_work_on(smp_processor_id(), vmstat_wq,
> + this_cpu_ptr(&vmstat_work),
> + round_jiffies_relative(sysctl_stat_interval));
> + }
> } else {
> /*
> * We did not update any counters so the app may be in
> @@ -1416,18 +1421,6 @@ static void vmstat_update(struct work_st
> * until the diffs stay at zero. The function is used by NOHZ and can only be
> * invoked when tick processing is not active.
> */
> -void quiet_vmstat(void)
> -{
> - if (system_state != SYSTEM_RUNNING)
> - return;
> -
> - do {
> - if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> - cancel_delayed_work(this_cpu_ptr(&vmstat_work));
> -
> - } while (refresh_cpu_vm_stats(false));
> -}
> -
> /*
> * Check if the diffs for a certain cpu indicate that
> * an update is needed.
> @@ -1451,6 +1444,30 @@ static bool need_update(int cpu)
> return false;
> }
>
> +void quiet_vmstat(void)
> +{
> + if (system_state != SYSTEM_RUNNING)
> + return;
> +
> + /*
> + * If we are already in hands of the shepherd then there
> + * is nothing for us to do here.
> + */
> + if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> + return;
> +
> + if (!need_update(smp_processor_id()))
> + return;
> +
> + /*
> + * Just refresh counters and do not care about the pending delayed
> + * vmstat_update. It doesn't fire that often to matter and canceling
> + * it would be too expensive from this path.
> + * vmstat_shepherd will take care about that for us.
> + */
> + refresh_cpu_vm_stats(false);
> +}
> +
>
> /*
> * Shepherd worker thread that checks the
> @@ -1468,18 +1485,25 @@ static void vmstat_shepherd(struct work_
>
> get_online_cpus();
> /* Check processors whose vmstat worker threads have been disabled */
> - for_each_cpu(cpu, cpu_stat_off)
> - if (need_update(cpu) &&
> - cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
> -
> - queue_delayed_work_on(cpu, vmstat_wq,
> - &per_cpu(vmstat_work, cpu), 0);
> + for_each_cpu(cpu, cpu_stat_off) {
> + struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> + if (need_update(cpu)) {
> + if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
> + queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
> + } else {
> + /*
> + * Cancel the work if quiet_vmstat has put this
> + * cpu on cpu_stat_off because the work item might
> + * be still scheduled
> + */
> + cancel_delayed_work(dw);
> + }
> + }
> put_online_cpus();
>
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> -
> }
>
> static void __init start_shepherd_timer(void)
>
>
> Patches currently in stable-queue which might be from [email protected] are
>
> queue-4.4/mm-mmu_notifier-use-hlist_add_head_rcu.patch
> queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch
> queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch
>
> .
>


Attachments:
reproduce.config (69.15 kB)

2019-12-11 14:42:21

by Michal Hocko

Subject: Re: Patch "mm, vmstat: make quiet_vmstat lighter" has been added to the 4.4-stable tree

On Wed 11-12-19 22:32:49, zhangyi (F) wrote:
> Hi, all
>
> We observed a performance degradation in the lmbench af_unix[1] test case after
> merging this patch on my x86 qemu 4.4 machine. The test results are basically
> stable across repeated runs.
>
> Host machine: CPU: Intel(R) Xeon(R) CPU E5-2690 v3
> CPU(s): 48
> MEM: 193047 MB
>
> Guest machine: CPU: QEMU Virtual CPU version 2.5+
> CPU(s): 8
> MEM: 26065 MB

Does the same happen on bare metal? Also, what are the numbers for
the current vanilla kernel?
--
Michal Hocko
SUSE Labs

2019-12-11 14:47:00

by Greg Kroah-Hartman

Subject: Re: Patch "mm, vmstat: make quiet_vmstat lighter" has been added to the 4.4-stable tree

On Wed, Dec 11, 2019 at 10:32:49PM +0800, zhangyi (F) wrote:
> Hi, all
>
> We observed a performance degradation in the lmbench af_unix[1] test case after
> merging this patch on my x86 qemu 4.4 machine. The test results are basically
> stable across repeated runs.
>
> Host machine: CPU: Intel(R) Xeon(R) CPU E5-2690 v3
> CPU(s): 48
> MEM: 193047 MB
>
> Guest machine: CPU: QEMU Virtual CPU version 2.5+
> CPU(s): 8
> MEM: 26065 MB
>
> Before this patch:
> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
> AF_UNIX sock stream latency: 133.7073 microseconds
>
> After this patch:
> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
> AF_UNIX sock stream latency: 156.4722 microseconds
>
> If we pin the task to a fixed CPU, the degradation does not appear.
>
> Before this patch:
> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
> AF_UNIX sock stream latency: 17.9296 microseconds
>
> After this patch:
> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
> AF_UNIX sock stream latency: 17.7500 microseconds
>
> We also tested it on an aarch64 hi1215 machine with 8 CPU cores.
>
> Before this patch:
> [root@localhost ~]# ./lat_unix -P 1
> AF_UNIX sock stream latency: 30.7 microseconds
>
> After this patch:
> [root@localhost ~]# ./lat_unix -P 1
> AF_UNIX sock stream latency: 37.5 microseconds
>
> Attached is my reproduce config for the x86 qemu setup. Any thoughts?

This fixes a bug, as reported by Daniel Wagner. So it's probably better
to have a stable system instead of a broken one, right? :)

Daniel can provide more information if needed.

What about when you run your tests on a 4.9 or newer kernel that already
has this integrated?

thanks,

greg k-h

2019-12-11 15:55:21

by Daniel Wagner

Subject: Re: Patch "mm, vmstat: make quiet_vmstat lighter" has been added to the 4.4-stable tree

Hi,

On 2019-12-11 15:46, Greg KH wrote:
> On Wed, Dec 11, 2019 at 10:32:49PM +0800, zhangyi (F) wrote:
>> We observed a performance degradation in the lmbench af_unix[1] test case after
>> merging this patch on my x86 qemu 4.4 machine. The test results are basically
>> stable across repeated runs.
>>
>> Host machine: CPU: Intel(R) Xeon(R) CPU E5-2690 v3
>> CPU(s): 48
>> MEM: 193047 MB
>>
>> Guest machine: CPU: QEMU Virtual CPU version 2.5+
>> CPU(s): 8
>> MEM: 26065 MB
>>
>> Before this patch:
>> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
>> AF_UNIX sock stream latency: 133.7073 microseconds
>>
>> After this patch:
>> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
>> AF_UNIX sock stream latency: 156.4722 microseconds
>>
>> If we pin the task to a fixed CPU, the degradation does not appear.
>>
>> Before this patch:
>> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
>> AF_UNIX sock stream latency: 17.9296 microseconds
>>
>> After this patch:
>> [root@localhost ~]# lmbench-3.0-a9/bin/x86_64-linux-gnu/lat_unix -P 1
>> AF_UNIX sock stream latency: 17.7500 microseconds
>>
>> We also tested it on an aarch64 hi1215 machine with 8 CPU cores.
>>
>> Before this patch:
>> [root@localhost ~]# ./lat_unix -P 1
>> AF_UNIX sock stream latency: 30.7 microseconds
>>
>> After this patch:
>> [root@localhost ~]# ./lat_unix -P 1
>> AF_UNIX sock stream latency: 37.5 microseconds
>>
>> Attached is my reproduce config for the x86 qemu setup. Any thoughts?
>
> This fixes a bug, as reported by Daniel Wagner. So it's probably better
> to have a stable system instead of a broken one, right? :)
>
> Daniel can provide more information if needed.

IIRC, this patch became necessary because bdf3c006b9a2 ("vmstat: make
vmstat_updater deferrable again and shut down on idle") was added to
stable. Without it, v4.4-rt does not work at all.

> What about when you run your tests on a 4.9 or newer kernel that
> already has this integrated?

That said, I went through all changes to vmstat.c upstream and we
backported almost all of them to v4.4 at that point. So if this is
really giving you a performance hit, I suspect you would see it upstream
as well.

Thanks,
Daniel