On Sun, Sep 12, 2021 at 07:17:56PM +0800, Hillf Danton wrote:
[...]
> > +// if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> > + if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> > queue_work(system_unbound_wq, &stats_flush_work);
> > }
>
> Hi Feng,
>
> Would you please check if it helps fix the regression to avoid queuing a
> queued work by adding and checking an atomic counter.
Hi Hillf,
I just tested your patch, and it didn't recover the regression, but
just reduced it from -14% to around -13%, similar to the patch
increasing the batch charge number.
Thanks,
Feng
> Hillf
>
> --- x/mm/memcontrol.c
> +++ y/mm/memcontrol.c
> @@ -108,6 +108,7 @@ static void flush_memcg_stats_dwork(stru
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> static void flush_memcg_stats_work(struct work_struct *w);
> static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work);
> +static atomic_t sfwork_queued;
> static DEFINE_PER_CPU(unsigned int, stats_flush_threshold);
> static DEFINE_SPINLOCK(stats_flush_lock);
>
> @@ -660,8 +661,13 @@ void __mod_memcg_lruvec_state(struct lru
>
> /* Update lruvec */
> __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> - if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> - queue_work(system_unbound_wq, &stats_flush_work);
> + if (!(__this_cpu_inc_return(stats_flush_threshold) %
> + MEMCG_CHARGE_BATCH)) {
> + int queued = atomic_read(&sfwork_queued);
> +
> + if (!queued && atomic_try_cmpxchg(&sfwork_queued, &queued, 1))
> + queue_work(system_unbound_wq, &stats_flush_work);
> + }
> }
>
> /**
> @@ -5376,6 +5382,7 @@ static void flush_memcg_stats_dwork(stru
> static void flush_memcg_stats_work(struct work_struct *w)
> {
> mem_cgroup_flush_stats();
> + atomic_dec(&sfwork_queued);
> }
>
> static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
On Sun, Sep 12, 2021 at 6:29 AM Feng Tang <[email protected]> wrote:
>
> On Sun, Sep 12, 2021 at 07:17:56PM +0800, Hillf Danton wrote:
> [...]
> > > +// if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> > > + if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> > > queue_work(system_unbound_wq, &stats_flush_work);
> > > }
> >
> > Hi Feng,
> >
> > Would you please check if it helps fix the regression to avoid queuing a
> > queued work by adding and checking an atomic counter.
>
> Hi Hillf,
>
> I just tested your patch, and it didn't recover the regression, but
> just reduced it from -14% to around -13%, similar to the patch
> increasing the batch charge number.
>
Thanks Hillf for taking a look and Feng for running the test.
This shows that parallel calls to queue_work() is not the issue (there
is already a test and set at the start of queue_work()) but the actual
work done by queue_work() is costly for this code path.
I wrote a simple anon page fault nohuge c program, profiled it and it
seems like queue_work() is significant enough.
- 51.00% do_anonymous_page
+ 16.68% alloc_pages_vma
11.61% _raw_spin_lock
+ 10.26% mem_cgroup_charge
- 5.25% lru_cache_add_inactive_or_unevictable
- 4.48% __pagevec_lru_add
- 3.71% __pagevec_lru_add_fn
- 1.74% __mod_lruvec_state
- 1.60% __mod_memcg_lruvec_state
- 1.35% queue_work_on
- __queue_work
- 0.93% wake_up_process
- try_to_wake_up
- 0.82% ttwu_queue
0.61% ttwu_do_activate
- 2.97% page_add_new_anon_rmap
- 2.68% __mod_lruvec_page_state
- 2.48% __mod_memcg_lruvec_state
- 1.67% queue_work_on
- 1.53% __queue_work
- 1.25% wake_up_process
- try_to_wake_up
- 0.94% ttwu_queue
+ 0.70% ttwu_do_activate
0.61% cgroup_rstat_updated
2.10% rcu_read_unlock_strict
1.40% cgroup_throttle_swaprate
However when I switch the batch size to 128, it goes away.
Feng, do you see any change with 128 batch size for aim7 benchmark?
+Tejun Heo
[ threads start at
https://lore.kernel.org/all/20210905124439.GA15026@xsang-OptiPlex-9020/T/#ma938a101f415ad784ac08612c7ef31f260a2b678]
On Mon, Sep 13, 2021 at 9:41 AM Shakeel Butt <[email protected]> wrote:
>
> On Sun, Sep 12, 2021 at 6:29 AM Feng Tang <[email protected]> wrote:
> >
> > On Sun, Sep 12, 2021 at 07:17:56PM +0800, Hillf Danton wrote:
> > [...]
> > > > +// if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> > > > + if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> > > > queue_work(system_unbound_wq, &stats_flush_work);
> > > > }
> > >
> > > Hi Feng,
> > >
> > > Would you please check if it helps fix the regression to avoid queuing a
> > > queued work by adding and checking an atomic counter.
> >
> > Hi Hillf,
> >
> > I just tested your patch, and it didn't recover the regression, but
> > just reduced it from -14% to around -13%, similar to the patch
> > increasing the batch charge number.
> >
>
> Thanks Hillf for taking a look and Feng for running the test.
>
> This shows that parallel calls to queue_work() is not the issue (there
> is already a test and set at the start of queue_work()) but the actual
> work done by queue_work() is costly for this code path.
>
> I wrote a simple anon page fault nohuge c program, profiled it and it
> seems like queue_work() is significant enough.
>
> - 51.00% do_anonymous_page
> + 16.68% alloc_pages_vma
> 11.61% _raw_spin_lock
> + 10.26% mem_cgroup_charge
> - 5.25% lru_cache_add_inactive_or_unevictable
> - 4.48% __pagevec_lru_add
> - 3.71% __pagevec_lru_add_fn
> - 1.74% __mod_lruvec_state
> - 1.60% __mod_memcg_lruvec_state
> - 1.35% queue_work_on
> - __queue_work
> - 0.93% wake_up_process
> - try_to_wake_up
> - 0.82% ttwu_queue
> 0.61% ttwu_do_activate
> - 2.97% page_add_new_anon_rmap
> - 2.68% __mod_lruvec_page_state
> - 2.48% __mod_memcg_lruvec_state
> - 1.67% queue_work_on
> - 1.53% __queue_work
> - 1.25% wake_up_process
> - try_to_wake_up
> - 0.94% ttwu_queue
> + 0.70% ttwu_do_activate
> 0.61% cgroup_rstat_updated
> 2.10% rcu_read_unlock_strict
> 1.40% cgroup_throttle_swaprate
>
> However when I switch the batch size to 128, it goes away.
>
I did one more experiment with same workload but with system_wq
instead system_unbound_wq and there is clear difference in profile:
With system_unbound_wq:
- 4.63% 0.33% mmap [kernel.kallsyms] [k] queue_work_on
4.29% queue_work_on
- __queue_work
- 3.45% wake_up_process
- try_to_wake_up
- 2.46% ttwu_queue
- 1.66% ttwu_do_activate
- 1.14% activate_task
- 0.97% enqueue_task_fair
enqueue_entity
With system_wq:
- 1.36% 0.06% mmap [kernel.kallsyms] [k] queue_work_on
1.30% queue_work_on
- __queue_work
- 1.03% wake_up_process
- try_to_wake_up
- 0.97% ttwu_queue
0.66% ttwu_do_activate
Tejun, is this expected? i.e. queuing work on system_wq has a
different performance impact than on system_unbound_wq?
Hello,
On Mon, Sep 13, 2021 at 12:40:06PM -0700, Shakeel Butt wrote:
> I did one more experiment with same workload but with system_wq
> instead system_unbound_wq and there is clear difference in profile:
>
> With system_unbound_wq:
> - 4.63% 0.33% mmap [kernel.kallsyms] [k] queue_work_on
> 4.29% queue_work_on
> - __queue_work
> - 3.45% wake_up_process
> - try_to_wake_up
> - 2.46% ttwu_queue
> - 1.66% ttwu_do_activate
> - 1.14% activate_task
> - 0.97% enqueue_task_fair
> enqueue_entity
>
> With system_wq:
> - 1.36% 0.06% mmap [kernel.kallsyms] [k] queue_work_on
> 1.30% queue_work_on
> - __queue_work
> - 1.03% wake_up_process
> - try_to_wake_up
> - 0.97% ttwu_queue
> 0.66% ttwu_do_activate
>
> Tejun, is this expected? i.e. queuing work on system_wq has a
> different performance impact than on system_unbound_wq?
Yes, system_unbound_wq is putting the work item on the global shared
workqueue while the system_wq is per-cpu, so on a loaded system, overhead
difference showing up isn't too surprising.
Thanks.
--
tejun
On Mon, Sep 13, 2021 at 12:42 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Sep 13, 2021 at 12:40:06PM -0700, Shakeel Butt wrote:
> > I did one more experiment with same workload but with system_wq
> > instead system_unbound_wq and there is clear difference in profile:
> >
> > With system_unbound_wq:
> > - 4.63% 0.33% mmap [kernel.kallsyms] [k] queue_work_on
> > 4.29% queue_work_on
> > - __queue_work
> > - 3.45% wake_up_process
> > - try_to_wake_up
> > - 2.46% ttwu_queue
> > - 1.66% ttwu_do_activate
> > - 1.14% activate_task
> > - 0.97% enqueue_task_fair
> > enqueue_entity
> >
> > With system_wq:
> > - 1.36% 0.06% mmap [kernel.kallsyms] [k] queue_work_on
> > 1.30% queue_work_on
> > - __queue_work
> > - 1.03% wake_up_process
> > - try_to_wake_up
> > - 0.97% ttwu_queue
> > 0.66% ttwu_do_activate
> >
> > Tejun, is this expected? i.e. queuing work on system_wq has a
> > different performance impact than on system_unbound_wq?
>
> Yes, system_unbound_wq is putting the work item on the global shared
> workqueue while the system_wq is per-cpu, so on a loaded system, overhead
> difference showing up isn't too surprising.
>
Thanks a lot for the explanation. Are there any concerns to call
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup) in system_wq?
This will be called every 2 seconds, so, we can assume the updated
tree would be small most of the time.
Thanks,
Shakeel
On Mon, Sep 13, 2021 at 01:09:11PM -0700, Shakeel Butt wrote:
> Thanks a lot for the explanation. Are there any concerns to call
> cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup) in system_wq?
> This will be called every 2 seconds, so, we can assume the updated
> tree would be small most of the time.
I can't think of a reason why this would be problematic.
Thanks.
--
tejun
On Mon, Sep 13, 2021 at 1:10 PM Tejun Heo <[email protected]> wrote:
>
> On Mon, Sep 13, 2021 at 01:09:11PM -0700, Shakeel Butt wrote:
> > Thanks a lot for the explanation. Are there any concerns to call
> > cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup) in system_wq?
> > This will be called every 2 seconds, so, we can assume the updated
> > tree would be small most of the time.
>
> I can't think of a reason why this would be problematic.
>
Thanks again.
Feng, is it possible to re-run these benchmarks with
queue_work(system_wq) instead of queue_work(system_unbound_wq)?
Hi Shakeel,
On Mon, Sep 13, 2021 at 01:13:57PM -0700, Shakeel Butt wrote:
> On Mon, Sep 13, 2021 at 1:10 PM Tejun Heo <[email protected]> wrote:
> >
> > On Mon, Sep 13, 2021 at 01:09:11PM -0700, Shakeel Butt wrote:
> > > Thanks a lot for the explanation. Are there any concerns to call
> > > cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup) in system_wq?
> > > This will be called every 2 seconds, so, we can assume the updated
> > > tree would be small most of the time.
> >
> > I can't think of a reason why this would be problematic.
> >
>
> Thanks again.
>
> Feng, is it possible to re-run these benchmarks with
> queue_work(system_wq) instead of queue_work(system_unbound_wq)?
I just run the patch twice, and there was no obvious change, the
hotspot is still the spinlock.
Thanks,
Feng
$git-diff aa48e47e
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4d8c9af..fa9cae9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -683,7 +683,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
/* Update lruvec */
__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
- queue_work(system_unbound_wq, &stats_flush_work);
+ queue_work(system_wq, &stats_flush_work);
}
7e1c0d6f58207e7e aa48e47e3906c332eaf1e5d7b58 1638eee6432c1a5175685a7945a
---------------- --------------------------- ---------------------------
\ | \ | \
648.62 +243.0% 2224 ± 3% +246.1% 2244 aim7.cpu
588139 -13.4% 509421 -13.5% 508738 aim7.jobs-per-min
196.05 -13.4% 169.81 -13.5% 169.58 aim7.jobs-per-min-per-task
3.93 ± 3% +62.8 66.70 +63.4 67.37 pp.child.native_queued_spin_lock_slowpath
3.96 ± 4% +62.8 66.76 +63.5 67.46 pp.child._raw_spin_lock_irqsave
3.66 ± 5% +62.9 66.54 +63.6 67.22 pp.child.lock_page_lruvec_irqsave
0.00 +0.1 0.10 ± 5% +0.1 0.09 pp.self.queue_work_on
0.00 +0.2 0.18 ± 5% +0.2 0.22 pp.self.cgroup_rstat_flush_locked
0.00 +0.6 0.60 ± 7% +0.7 0.72 ± 3% pp.self.mem_cgroup_css_rstat_flush
0.19 +0.7 0.86 ± 5% +0.7 0.84 ± 4% pp.self.cgroup_rstat_updated
3.92 ± 3% +62.8 66.70 +63.4 67.37 pp.self.native_queued_spin_lock_slowpath
wi
On Mon, Sep 13, 2021 at 7:13 PM Feng Tang <[email protected]> wrote:
>
> Hi Shakeel,
>
> On Mon, Sep 13, 2021 at 01:13:57PM -0700, Shakeel Butt wrote:
> > On Mon, Sep 13, 2021 at 1:10 PM Tejun Heo <[email protected]> wrote:
> > >
> > > On Mon, Sep 13, 2021 at 01:09:11PM -0700, Shakeel Butt wrote:
> > > > Thanks a lot for the explanation. Are there any concerns to call
> > > > cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup) in system_wq?
> > > > This will be called every 2 seconds, so, we can assume the updated
> > > > tree would be small most of the time.
> > >
> > > I can't think of a reason why this would be problematic.
> > >
> >
> > Thanks again.
> >
> > Feng, is it possible to re-run these benchmarks with
> > queue_work(system_wq) instead of queue_work(system_unbound_wq)?
>
> I just run the patch twice, and there was no obvious change, the
> hotspot is still the spinlock.
>
> Thanks,
> Feng
>
> $git-diff aa48e47e
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4d8c9af..fa9cae9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -683,7 +683,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> /* Update lruvec */
> __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> - queue_work(system_unbound_wq, &stats_flush_work);
> + queue_work(system_wq, &stats_flush_work);
> }
>
>
> 7e1c0d6f58207e7e aa48e47e3906c332eaf1e5d7b58 1638eee6432c1a5175685a7945a
> ---------------- --------------------------- ---------------------------
> \ | \ | \
> 648.62 +243.0% 2224 ± 3% +246.1% 2244 aim7.cpu
> 588139 -13.4% 509421 -13.5% 508738 aim7.jobs-per-min
> 196.05 -13.4% 169.81 -13.5% 169.58 aim7.jobs-per-min-per-task
> 3.93 ± 3% +62.8 66.70 +63.4 67.37 pp.child.native_queued_spin_lock_slowpath
> 3.96 ± 4% +62.8 66.76 +63.5 67.46 pp.child._raw_spin_lock_irqsave
> 3.66 ± 5% +62.9 66.54 +63.6 67.22 pp.child.lock_page_lruvec_irqsave
> 0.00 +0.1 0.10 ± 5% +0.1 0.09 pp.self.queue_work_on
> 0.00 +0.2 0.18 ± 5% +0.2 0.22 pp.self.cgroup_rstat_flush_locked
> 0.00 +0.6 0.60 ± 7% +0.7 0.72 ± 3% pp.self.mem_cgroup_css_rstat_flush
> 0.19 +0.7 0.86 ± 5% +0.7 0.84 ± 4% pp.self.cgroup_rstat_updated
> 3.92 ± 3% +62.8 66.70 +63.4 67.37 pp.self.native_queued_spin_lock_slowpath
>
>
Thanks Feng, the spinlock in lock_page_lruvec_irqsave() is unrelated
and not taken in __mod_memcg_lruvec_state().
If removing queue_work() from __mod_memcg_lruvec_state() removes the
regression then I was expecting some improvement by replacing
system_unbound_wq with system_wq. Anyways thanks for your help. I will
run some benchmarks as well (the page fault ones from will-it-scale).