On Sun, Sep 5, 2021 at 5:27 AM kernel test robot <[email protected]> wrote:
>
>
>
> Greetings,
>
> FYI, we noticed a -14.0% regression of aim7.jobs-per-min due to commit:
>
>
> commit: 45208c9105bd96015b98cdf31fbd6a3bcff234b6 ("[patch 097/212] memcg: infrastructure to flush memcg stats")
> url: https://github.com/0day-ci/linux/commits/Andrew-Morton/ia64-fix-typo-in-a-comment/20210903-065028
>
>
> in testcase: aim7
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> with following parameters:
>
> disk: 1BRD_48G
> fs: xfs
> test: disk_rr
> load: 3000
> cpufreq_governor: performance
> ucode: 0xd000280
>
> test-description: AIM7 is a traditional UNIX system-level benchmark suite used to test and measure the performance of multiuser systems.
> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>
> In addition to that, the commit also has a significant impact on the following tests:
>
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | reaim: reaim.jobs_per_min -10.0% regression |
> | test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | nr_task=100% |
> | | runtime=300s |
> | | test=compute |
> | | ucode=0x5003006 |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -29.8% regression |
> | test machine | 104 threads 2 sockets Skylake with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=process |
> | | nr_task=100% |
> | | test=fallocate2 |
> | | ucode=0x2006a0a |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | fio-basic: fio.read_iops -20.5% regression |
> | test machine | 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 256G memory |
> | test parameters | bs=4k |
> | | cpufreq_governor=performance |
> | | disk=2pmem |
> | | fs=xfs |
> | | ioengine=mmap |
> | | nr_task=50% |
> | | runtime=200s |
> | | rw=read |
> | | test_size=200G |
> | | time_based=tb |
> | | ucode=0x5003006 |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -9.2% regression |
> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=process |
> | | nr_task=50% |
> | | test=page_fault3 |
> | | ucode=0x42e |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -2.3% regression |
> | test machine | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=thread |
> | | nr_task=100% |
> | | test=tlb_flush1 |
> | | ucode=0x700001e |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -8.9% regression |
> | test machine | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=thread |
> | | nr_task=16 |
> | | test=tlb_flush1 |
> | | ucode=0x700001e |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -6.4% regression |
> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=process |
> | | nr_task=50% |
> | | test=page_fault2 |
> | | ucode=0x42e |
> +------------------+-------------------------------------------------------------------------------------+
> | testcase: change | netperf: netperf.Throughput_Mbps 89.8% improvement |
> | test machine | 192 threads 4 sockets Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters | cluster=cs-localhost |
> | | cpufreq_governor=performance |
> | | ip=ipv4 |
> | | nr_threads=200% |
> | | runtime=900s |
> | | test=TCP_MAERTS |
> | | ucode=0x5003006 |
> +------------------+-------------------------------------------------------------------------------------+
>
>
> If you fix the issue, kindly add the following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> bin/lkp run generated-yaml-file
>
> =========================================================================================
> compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase/ucode:
> gcc-9/performance/1BRD_48G/xfs/x86_64-rhel-8.3/3000/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/disk_rr/aim7/0xd000280
>
> commit:
> 3c28c7680e ("memcg: switch lruvec stats to rstat")
> 45208c9105 ("memcg: infrastructure to flush memcg stats")
I am looking into this. I was hoping we would have a resolution for [1],
as these patches touch similar data structures.
[1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
Hi Shakeel,
On Sun, Sep 05, 2021 at 03:15:46PM -0700, Shakeel Butt wrote:
> On Sun, Sep 5, 2021 at 5:27 AM kernel test robot <[email protected]> wrote:
[...]
> > =========================================================================================
> > compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase/ucode:
> > gcc-9/performance/1BRD_48G/xfs/x86_64-rhel-8.3/3000/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/disk_rr/aim7/0xd000280
> >
> > commit:
> > 3c28c7680e ("memcg: switch lruvec stats to rstat")
> > 45208c9105 ("memcg: infrastructure to flush memcg stats")
>
> I am looking into this. I was hoping we have resolution for [1] as
> these patches touch similar data structures.
>
> [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
I tried 2 debug methods for that 36.4% vm-scalability regression:

1. Disabling the HW cache prefetcher: no effect on this case.
2. Re-laying out 'struct cgroup_subsys_state' and adding padding: this
   reduced the regression to 3.1% (the kind of change is sketched below).
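For reference, the rough shape of that padding experiment is something like
the following. This is illustrative only, not the exact debug patch; which
members end up separated was found by trial and error:

/*
 * Illustrative sketch: use ____cacheline_aligned_in_smp (or explicit
 * padding) so that groups of members in 'struct cgroup_subsys_state'
 * with different access patterns stop sharing a cache line.
 */
struct cgroup_subsys_state {
        struct cgroup *cgroup;
        struct cgroup_subsys *ss;
        struct percpu_ref refcnt;

        /* start the next group of members on a new cache line */
        struct list_head sibling ____cacheline_aligned_in_smp;
        struct list_head children;

        /* ... remaining fields unchanged ... */
};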
Thanks,
Feng
On Mon, Sep 6, 2021 at 8:30 PM Feng Tang <[email protected]> wrote:
>
> Hi Shakeel,
>
> On Sun, Sep 05, 2021 at 03:15:46PM -0700, Shakeel Butt wrote:
> > On Sun, Sep 5, 2021 at 5:27 AM kernel test robot <[email protected]> wrote:
> [...]
> > > =========================================================================================
> > > compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase/ucode:
> > > gcc-9/performance/1BRD_48G/xfs/x86_64-rhel-8.3/3000/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/disk_rr/aim7/0xd000280
> > >
> > > commit:
> > > 3c28c7680e ("memcg: switch lruvec stats to rstat")
> > > 45208c9105 ("memcg: infrastructure to flush memcg stats")
> >
> > I am looking into this. I was hoping we have resolution for [1] as
> > these patches touch similar data structures.
> >
> > [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
>
> I tried 2 debug methods for that 36.4% vm-scalability regression:
>
> 1. Disable the HW cache prefetcher, no effect on this case
> 2. relayout and add padding to 'struct cgroup_subsys_state', reduce
> the regression to 3.1%
>
Thanks Feng, but it seems like the issue for this commit is different.
Rearranging the layout didn't help. Actually, the cause of the slowdown
is the call to queue_work() inside __mod_memcg_lruvec_state().
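For reference, the update path in question looks roughly like this as of
this commit (the per-cpu threshold counter and the queue_work() call are
the parts this commit adds):

        /* Update lruvec */
        __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
        /* added by this commit: flush via workqueue after every MEMCG_CHARGE_BATCH (32) updates on this cpu */
        if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
                queue_work(system_unbound_wq, &stats_flush_work);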
At the moment, queue_work() is called after every 32 updates. I changed
it to 128, and the slowdown of will-it-scale:page_fault[1|2|3] was halved
(from around 10% to 5%). I am unable to run reaim or
will-it-scale:fallocate2, as I was getting weird errors.

Feng, is it possible for you to run these benchmarks with the change
(basically changing MEMCG_CHARGE_BATCH to 128 in the if condition before
queue_work() inside __mod_memcg_lruvec_state())?

For the formal patch/fix, I will write down a better explanation of what
the batch size should be.
thanks,
Shakeel
On Thu, Sep 09, 2021 at 05:43:40PM -0700, Shakeel Butt wrote:
> On Mon, Sep 6, 2021 at 8:30 PM Feng Tang <[email protected]> wrote:
> >
> > Hi Shakeel,
> >
> > On Sun, Sep 05, 2021 at 03:15:46PM -0700, Shakeel Butt wrote:
> > > On Sun, Sep 5, 2021 at 5:27 AM kernel test robot <[email protected]> wrote:
> > [...]
> > > > =========================================================================================
> > > > compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase/ucode:
> > > > gcc-9/performance/1BRD_48G/xfs/x86_64-rhel-8.3/3000/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/disk_rr/aim7/0xd000280
> > > >
> > > > commit:
> > > > 3c28c7680e ("memcg: switch lruvec stats to rstat")
> > > > 45208c9105 ("memcg: infrastructure to flush memcg stats")
> > >
> > > I am looking into this. I was hoping we have resolution for [1] as
> > > these patches touch similar data structures.
> > >
> > > [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
> >
> > I tried 2 debug methods for that 36.4% vm-scalability regression:
> >
> > 1. Disable the HW cache prefetcher, no effect on this case
> > 2. relayout and add padding to 'struct cgroup_subsys_state', reduce
> > the regression to 3.1%
> >
>
> Thanks Feng but it seems like the issue for this commit is different.
> Rearranging the layout didn't help. Actually the cause of slowdown is
> the call to queue_work() inside __mod_memcg_lruvec_state().
>
> At the moment, queue_work() is called after 32 updates. I changed it
> to 128 and the slowdown of will-it-scale:page_fault[1|2|3] halved
> (from around 10% to 5%). I am unable to run reaim or
> will-it-scale:fallocate2 as I was getting weird errors.
>
> Feng, is it possible for you to run these benchmarks with the change
> (basically changing MEMCG_CHARGE_BATCH to 128 in the if condition
> before queue_work() inside __mod_memcg_lruvec_state())?
When I checked this, I tried different changes, including this batch
number change :), but it didn't recover the regression (the regression
was only slightly reduced, to about 12%).

Please check whether my patch below is what you want to test:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4d8c9af..a50a69a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -682,7 +682,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
-	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
+//	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
+	if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
 		queue_work(system_unbound_wq, &stats_flush_work);
 }
Thanks,
Feng
> For the formal patch/fix, I will write down a better explanation on
> what should be the batch size.
>
> thanks,
> Shakeel
On Thu, Sep 9, 2021 at 6:08 PM Feng Tang <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 05:43:40PM -0700, Shakeel Butt wrote:
> > On Mon, Sep 6, 2021 at 8:30 PM Feng Tang <[email protected]> wrote:
> > >
> > > Hi Shakeel,
> > >
> > > On Sun, Sep 05, 2021 at 03:15:46PM -0700, Shakeel Butt wrote:
> > > > On Sun, Sep 5, 2021 at 5:27 AM kernel test robot <[email protected]> wrote:
> > > [...]
> > > > > =========================================================================================
> > > > > compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase/ucode:
> > > > > gcc-9/performance/1BRD_48G/xfs/x86_64-rhel-8.3/3000/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/disk_rr/aim7/0xd000280
> > > > >
> > > > > commit:
> > > > > 3c28c7680e ("memcg: switch lruvec stats to rstat")
> > > > > 45208c9105 ("memcg: infrastructure to flush memcg stats")
> > > >
> > > > I am looking into this. I was hoping we have resolution for [1] as
> > > > these patches touch similar data structures.
> > > >
> > > > [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
> > >
> > > I tried 2 debug methods for that 36.4% vm-scalability regression:
> > >
> > > 1. Disable the HW cache prefetcher, no effect on this case
> > > 2. relayout and add padding to 'struct cgroup_subsys_state', reduce
> > > the regression to 3.1%
> > >
> >
> > Thanks Feng but it seems like the issue for this commit is different.
> > Rearranging the layout didn't help. Actually the cause of slowdown is
> > the call to queue_work() inside __mod_memcg_lruvec_state().
> >
> > At the moment, queue_work() is called after 32 updates. I changed it
> > to 128 and the slowdown of will-it-scale:page_fault[1|2|3] halved
> > (from around 10% to 5%). I am unable to run reaim or
> > will-it-scale:fallocate2 as I was getting weird errors.
> >
> > Feng, is it possible for you to run these benchmarks with the change
> > (basically changing MEMCG_CHARGE_BATCH to 128 in the if condition
> > before queue_work() inside __mod_memcg_lruvec_state())?
>
> When I checked this, I tried different changes, including this batch
> number change :), but it didn't recover the regression (the regression
> is slightly reduced to about 12%)
>
> Please check if my patch is what you want to test:
Yes, the following patch is what I want to test.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4d8c9af..a50a69a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -682,7 +682,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>
> /* Update lruvec */
> __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> - if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> +// if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> + if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> queue_work(system_unbound_wq, &stats_flush_work);
> }
>
Another change we can try is to remove this specific queue_work()
altogether, because this is the only significant change for the
workload. That will give us the base performance number. If that also
shows a regression, then there are more issues to debug. Thanks a lot
for your help.
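Something like the following (an untested sketch on top of the same
function as your diff, with hunk numbers elided; just to get the baseline
number, not a real fix):

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ ... @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
-	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
-		queue_work(system_unbound_wq, &stats_flush_work);
 }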
On Thu, Sep 09, 2021 at 06:19:06PM -0700, Shakeel Butt wrote:
[...]
> > > > > I am looking into this. I was hoping we have resolution for [1] as
> > > > > these patches touch similar data structures.
> > > > >
> > > > > [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
> > > >
> > > > I tried 2 debug methods for that 36.4% vm-scalability regression:
> > > >
> > > > 1. Disable the HW cache prefetcher, no effect on this case
> > > > 2. relayout and add padding to 'struct cgroup_subsys_state', reduce
> > > > the regression to 3.1%
> > > >
> > >
> > > Thanks Feng but it seems like the issue for this commit is different.
> > > Rearranging the layout didn't help. Actually the cause of slowdown is
> > > the call to queue_work() inside __mod_memcg_lruvec_state().
> > >
> > > At the moment, queue_work() is called after 32 updates. I changed it
> > > to 128 and the slowdown of will-it-scale:page_fault[1|2|3] halved
> > > (from around 10% to 5%). I am unable to run reaim or
> > > will-it-scale:fallocate2 as I was getting weird errors.
> > >
> > > Feng, is it possible for you to run these benchmarks with the change
> > > (basically changing MEMCG_CHARGE_BATCH to 128 in the if condition
> > > before queue_work() inside __mod_memcg_lruvec_state())?
> >
> > When I checked this, I tried different changes, including this batch
> > number change :), but it didn't recover the regression (the regression
> > is slightly reduced to about 12%)
[...]
>
> Another change we can try is to remove this specific queue_work()
> altogether because this is the only significant change for the
> workload. That will give us the base performance number. If that also
> has regression then there are more issues to debug. Thanks a lot for
> your help.
I just tested with a patch removing the queue_work() in
__mod_memcg_lruvec_state(), and the regression is gone.

Also, to avoid some duplication of debugging, here are some other things
I tried (both sketched below):

* adding padding in 'struct lruvec' to isolate 'lru_lock': no effect
* adding padding in 'mem_cgroup_per_node' between 'lruvec_stats_percpu'
  and 'lruvec_stats': no effect
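For reference, the rough shape of those two experiments (illustrative
only, not the exact debug patches; both just push the named field onto
its own cache line):

/* 1) isolate 'lru_lock' inside 'struct lruvec' */
struct lruvec {
        struct list_head        lists[NR_LRU_LISTS];
        /* keep the lock away from the list heads above */
        spinlock_t              lru_lock ____cacheline_aligned_in_smp;
        /* ... remaining fields unchanged ... */
};

/* 2) pad between the per-cpu stats pointer and the aggregated stats */
struct mem_cgroup_per_node {
        struct lruvec                           lruvec;
        struct lruvec_stats_percpu __percpu     *lruvec_stats_percpu;
        /* start the aggregated stats on a new cache line */
        struct lruvec_stats                     lruvec_stats ____cacheline_aligned_in_smp;
        /* ... remaining fields unchanged ... */
};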
Thanks,
Feng
On Thu, Sep 9, 2021 at 7:34 PM Feng Tang <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 06:19:06PM -0700, Shakeel Butt wrote:
> [...]
> > > > > > I am looking into this. I was hoping we have resolution for [1] as
> > > > > > these patches touch similar data structures.
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20210811031734.GA5193@xsang-OptiPlex-9020/T/#u
> > > > >
> > > > > I tried 2 debug methods for that 36.4% vm-scalability regression:
> > > > >
> > > > > 1. Disable the HW cache prefetcher, no effect on this case
> > > > > 2. relayout and add padding to 'struct cgroup_subsys_state', reduce
> > > > > the regression to 3.1%
> > > > >
> > > >
> > > > Thanks Feng but it seems like the issue for this commit is different.
> > > > Rearranging the layout didn't help. Actually the cause of slowdown is
> > > > the call to queue_work() inside __mod_memcg_lruvec_state().
> > > >
> > > > At the moment, queue_work() is called after 32 updates. I changed it
> > > > to 128 and the slowdown of will-it-scale:page_fault[1|2|3] halved
> > > > (from around 10% to 5%). I am unable to run reaim or
> > > > will-it-scale:fallocate2 as I was getting weird errors.
> > > >
> > > > Feng, is it possible for you to run these benchmarks with the change
> > > > (basically changing MEMCG_CHARGE_BATCH to 128 in the if condition
> > > > before queue_work() inside __mod_memcg_lruvec_state())?
> > >
> > > When I checked this, I tried different changes, including this batch
> > > number change :), but it didn't recover the regression (the regression
> > > is slightly reduced to about 12%)
> [...]
> >
> > Another change we can try is to remove this specific queue_work()
> > altogether because this is the only significant change for the
> > workload. That will give us the base performance number. If that also
> > has regression then there are more issues to debug. Thanks a lot for
> > your help.
>
> I just tested with patch removing the queue_work() in __mod_memcg_lruvec_state(),
> and the regression is gone.
Thanks again for confirming this. I will follow this lead and see how
to improve this.