2018-03-19 10:39:52

by Michal Hocko

Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
>
>
> > -----Original Message-----
> > From: Michal Hocko [mailto:[email protected]]
> > Sent: March 19, 2018 16:54
> > To: Li,Rongqing <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Andrey Ryabinin
> > <[email protected]>
> > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory
> > cgroup
> >
> > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > pages on each iteration; if a memory cgroup has lots of page cache, it
> > > will take many iterations to empty all the page cache, so increase the
> > > number of pages reclaimed per iteration to speed it up, the same as in
> > > mem_cgroup_resize_limit()
> > >
> > > a simple test shows:
> > >
> > > $dd if=aaa of=bbb bs=1k count=3886080
> > > $rm -f bbb
> > > $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > >
> > > Before: 0m0.252s ===> after: 0m0.178s
> >
> > Andrey was proposing something similar [1]. My main objection was that his
> > approach might lead to over-reclaim. Your approach is more conservative
> > because it just increases the batch size. The size is still rather arbitrary. Same
> > as SWAP_CLUSTER_MAX but that one is a commonly used unit of reclaim in
> > the MM code.
> >
> > I would be really curious about a more detailed explanation of why having
> > a larger batch yields better performance, because we are doing
> > SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> >
>
> Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
>
> mem_cgroup_resize_limit
>  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
>   ---> do_try_to_free_pages
>    ---> shrink_zones
>     ---> shrink_node
>      ---> shrink_node_memcg
>       ---> shrink_list  <------- the loop happens here [times = 1024/32]
>        ---> shrink_page_list

Can you actually measure this to be the culprit? Because we should
rethink our call path if it is too complicated/deep to perform well.
Adding arbitrary batch sizes doesn't sound like a good way to go to me.
--
Michal Hocko
SUSE Labs
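
For reference, the change under discussion is roughly the following
one-liner in mem_cgroup_force_empty(); this is a sketch reconstructed from
the thread (the original patch is not quoted in full here):

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
+		/* ask for a bigger batch, as mem_cgroup_resize_limit() does */
+		progress = try_to_free_mem_cgroup_pages(memcg, 1024,
 							GFP_KERNEL, true);

try_to_free_mem_cgroup_pages() clamps its nr_pages argument to at least
SWAP_CLUSTER_MAX (32), which is why the unpatched code frees 32 pages per
iteration even though it passes 1.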


2018-03-19 10:54:25

by Li RongQing

Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup



> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
> Sent: March 19, 2018 18:38
> To: Li,Rongqing <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Andrey Ryabinin
> <[email protected]>
> Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> memory cgroup
>
> On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> >
> >
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:[email protected]]
> > > Sent: March 19, 2018 16:54
> > > To: Li,Rongqing <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; Andrey Ryabinin
> > > <[email protected]>
> > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > >
> > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > > pages on each iteration; if a memory cgroup has lots of page
> > > > cache, it will take many iterations to empty all the page cache, so
> > > > increase the number of pages reclaimed per iteration to speed it up,
> > > > the same as in mem_cgroup_resize_limit()
> > > >
> > > > a simple test shows:
> > > >
> > > > $dd if=aaa of=bbb bs=1k count=3886080
> > > > $rm -f bbb
> > > > $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > >
> > > > Before: 0m0.252s ===> after: 0m0.178s
> > >
> > > Andrey was proposing something similar [1]. My main objection was
> > > that his approach might lead to over-reclaim. Your approach is more
> > > conservative because it just increases the batch size. The size is
> > > still rather arbitrary. Same as SWAP_CLUSTER_MAX but that one is a
> > > commonly used unit of reclaim in the MM code.
> > >
> > > I would be really curious about a more detailed explanation of why
> > > having a larger batch yields better performance, because we are doing
> > > SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> > >
> >
> > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> >
> > mem_cgroup_resize_limit
> >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> >   ---> do_try_to_free_pages
> >    ---> shrink_zones
> >     ---> shrink_node
> >      ---> shrink_node_memcg
> >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> >        ---> shrink_page_list
>
> Can you actually measure this to be the culprit? Because we should rethink
> our call path if it is too complicated/deep to perform well.
> Adding arbitrary batch sizes doesn't sound like a good way to go to me.

Ok, I will try

-RongQing
> --
> Michal Hocko
> SUSE Labs

2018-03-20 00:17:37

by David Rientjes

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Mon, 19 Mar 2018, Li,Rongqing wrote:

> > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > >
> > > mem_cgroup_resize_limit
> > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > >   ---> do_try_to_free_pages
> > >    ---> shrink_zones
> > >     ---> shrink_node
> > >      ---> shrink_node_memcg
> > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > >        ---> shrink_page_list
> >
> > Can you actually measure this to be the culprit? Because we should rethink
> > our call path if it is too complicated/deep to perform well.
> > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
>
> Ok, I will try
>

Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
every iteration and thus contends with lowering limits in other cgroups (on
our systems, thousands), calling try_to_free_mem_cgroup_pages() with less
than SWAP_CLUSTER_MAX is lame. It would probably be best to limit the
nr_pages to the amount that needs to be reclaimed, though, rather than
over reclaiming.

If you wanted to be invasive, you could change page_counter_limit() to
return the count - limit, fix up the callers that look for -EBUSY, and
then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
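
Put into the context of mem_cgroup_resize_limit()'s retry loop, that
suggestion could look roughly like this (a sketch only; the reworked
page_counter_limit() return convention and the overage variable are
hypothetical):

	mutex_lock(&memcg_limit_mutex);
	/* Hypothetical rework: page_counter_limit() returns how many pages
	 * the counter currently is over the new limit (<= 0 means the new
	 * limit already fits) instead of -EBUSY. */
	overage = page_counter_limit(counter, limit);
	mutex_unlock(&memcg_limit_mutex);

	if (overage <= 0)
		break;	/* nothing left to reclaim */

	/* Reclaim roughly the measured excess, but at least one reclaim
	 * batch, instead of a fixed arbitrary 1024 pages. */
	if (!try_to_free_mem_cgroup_pages(memcg,
			max_t(unsigned long, overage, SWAP_CLUSTER_MAX),
			GFP_KERNEL, !memsw))
		ret = -EBUSY;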

2018-03-20 08:41:25

by Michal Hocko

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Mon 19-03-18 10:51:57, David Rientjes wrote:
> On Mon, 19 Mar 2018, Li,Rongqing wrote:
>
> > > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > > >
> > > > mem_cgroup_resize_limit
> > > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > > >   ---> do_try_to_free_pages
> > > >    ---> shrink_zones
> > > >     ---> shrink_node
> > > >      ---> shrink_node_memcg
> > > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > > >        ---> shrink_page_list
> > >
> > > Can you actually measure this to be the culprit? Because we should rethink
> > > our call path if it is too complicated/deep to perform well.
> > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> >
> > Ok, I will try
> >
>
> Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
> every iteration and thus contends with lowering limits in other cgroups (on
> our systems, thousands), calling try_to_free_mem_cgroup_pages() with less
> than SWAP_CLUSTER_MAX is lame.

Well, if the global lock is a bottleneck in your deployments then we
can come up with something more clever. E.g. per hierarchy locking
or even drop the lock for the reclaim altogether. If we reclaim in
SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low when
multiple users are shrinking the same (sub)hierarchy.

> It would probably be best to limit the
> nr_pages to the amount that needs to be reclaimed, though, rather than
> over reclaiming.

How do you achieve that? The charging path is not synchronized with the
shrinking one at all.

> If you wanted to be invasive, you could change page_counter_limit() to
> return the count - limit, fix up the callers that look for -EBUSY, and
> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.

I am not sure I understand

--
Michal Hocko
SUSE Labs
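
The "per hierarchy locking" direction mentioned above could, as a very
rough sketch, replace the global memcg_limit_mutex with a per-memcg lock
(the limit_mutex field is invented here purely for illustration):

	/* Hypothetical: serialize limit updates per memcg rather than
	 * globally, so resizing one cgroup's limit does not contend with
	 * resizing thousands of unrelated ones. */
	mutex_lock(&memcg->limit_mutex);
	ret = page_counter_limit(&memcg->memory, limit);
	mutex_unlock(&memcg->limit_mutex);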

2018-03-20 20:31:26

by David Rientjes

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Tue, 20 Mar 2018, Michal Hocko wrote:

> > > > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > > > >
> > > > > mem_cgroup_resize_limit
> > > > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > > > >   ---> do_try_to_free_pages
> > > > >    ---> shrink_zones
> > > > >     ---> shrink_node
> > > > >      ---> shrink_node_memcg
> > > > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > > > >        ---> shrink_page_list
> > > >
> > > > Can you actually measure this to be the culprit? Because we should rethink
> > > > our call path if it is too complicated/deep to perform well.
> > > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > >
> > > Ok, I will try
> > >
> >
> > Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
> > every iteration and thus contends with lowering limits in other cgroups (on
> > our systems, thousands), calling try_to_free_mem_cgroup_pages() with less
> > than SWAP_CLUSTER_MAX is lame.
>
> Well, if the global lock is a bottleneck in your deployments then we
> can come up with something more clever. E.g. per hierarchy locking
> or even drop the lock for the reclaim altogether. If we reclaim in
> SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low when
> multiple users are shrinking the same (sub)hierarchy.
>

I don't believe this to be a bottleneck if nr_pages is increased in
mem_cgroup_resize_limit().

> > It would probably be best to limit the
> > nr_pages to the amount that needs to be reclaimed, though, rather than
> > over reclaiming.
>
> How do you achieve that? The charging path is not synchronized with the
> shrinking one at all.
>

The point is to get a better guess at how many pages, up to
SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.

> > If you wanted to be invasive, you could change page_counter_limit() to
> > return the count - limit, fix up the callers that look for -EBUSY, and
> > then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
>
> I am not sure I understand
>

Have page_counter_limit() return the number of pages over limit, i.e.
count - limit, since it compares the two anyway. Fix up existing callers
and then clamp that value to SWAP_CLUSTER_MAX in
mem_cgroup_resize_limit(). It's a more accurate guess than either 1 or
1024.

2018-03-20 22:11:08

by Andrey Ryabinin

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup



On 03/20/2018 11:29 PM, David Rientjes wrote:
> On Tue, 20 Mar 2018, Michal Hocko wrote:
>
>>>>>> Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
>>>>>> try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
>>>>>> number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
>>>>>>
>>>>>> mem_cgroup_resize_limit
>>>>>>  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
>>>>>>   ---> do_try_to_free_pages
>>>>>>    ---> shrink_zones
>>>>>>     ---> shrink_node
>>>>>>      ---> shrink_node_memcg
>>>>>>       ---> shrink_list  <------- the loop happens here [times = 1024/32]
>>>>>>        ---> shrink_page_list
>>>>>
>>>>> Can you actually measure this to be the culprit? Because we should rethink
>>>>> our call path if it is too complicated/deep to perform well.
>>>>> Adding arbitrary batch sizes doesn't sound like a good way to go to me.
>>>>
>>>> Ok, I will try
>>>>
>>>
>>> Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
>>> every iteration and thus contends with lowering limits in other cgroups (on
>>> our systems, thousands), calling try_to_free_mem_cgroup_pages() with less
>>> than SWAP_CLUSTER_MAX is lame.
>>
>> Well, if the global lock is a bottleneck in your deployments then we
>> can come up with something more clever. E.g. per hierarchy locking
>> or even drop the lock for the reclaim altogether. If we reclaim in
>> SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low when
>> multiple users are shrinking the same (sub)hierarchy.
>>
>
> I don't believe this to be a bottleneck if nr_pages is increased in
> mem_cgroup_resize_limit().
>
>>> It would probably be best to limit the
>>> nr_pages to the amount that needs to be reclaimed, though, rather than
>>> over reclaiming.
>>
>> How do you achieve that? The charging path is not synchronized with the
>> shrinking one at all.
>>
>
> The point is to get a better guess at how many pages, up to
> SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.
>
>>> If you wanted to be invasive, you could change page_counter_limit() to
>>> return the count - limit, fix up the callers that look for -EBUSY, and
>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
>>
>> I am not sure I understand
>>
>
> Have page_counter_limit() return the number of pages over limit, i.e.
> count - limit, since it compares the two anyway. Fix up existing callers
> and then clamp that value to SWAP_CLUSTER_MAX in
> mem_cgroup_resize_limit(). It's a more accurate guess than either 1 or
> 1024.
>

JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
See try_to_free_mem_cgroup_pages():
....
	struct scan_control sc = {
		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),


2018-03-20 22:16:28

by David Rientjes

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Wed, 21 Mar 2018, Andrey Ryabinin wrote:

> >>> It would probably be best to limit the
> >>> nr_pages to the amount that needs to be reclaimed, though, rather than
> >>> over reclaiming.
> >>
> >> How do you achieve that? The charging path is not synchronized with the
> >> shrinking one at all.
> >>
> >
> > The point is to get a better guess at how many pages, up to
> > SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.
> >
> >>> If you wanted to be invasive, you could change page_counter_limit() to
> >>> return the count - limit, fix up the callers that look for -EBUSY, and
> >>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
> >>
> >> I am not sure I understand
> >>
> >
> > Have page_counter_limit() return the number of pages over limit, i.e.
> > count - limit, since it compares the two anyway. Fix up existing callers
> > and then clamp that value to SWAP_CLUSTER_MAX in
> > mem_cgroup_resize_limit(). It's a more accurate guess than either 1 or
> > 1024.
> >
>
> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
> See try_to_free_mem_cgroup_pages():
> ....
> 	struct scan_control sc = {
> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>

Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?

2018-03-20 22:38:01

by Andrey Ryabinin

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On 03/21/2018 01:15 AM, David Rientjes wrote:
> On Wed, 21 Mar 2018, Andrey Ryabinin wrote:
>
>>>>> It would probably be best to limit the
>>>>> nr_pages to the amount that needs to be reclaimed, though, rather than
>>>>> over reclaiming.
>>>>
>>>> How do you achieve that? The charging path is not synchronized with the
>>>> shrinking one at all.
>>>>
>>>
>>> The point is to get a better guess at how many pages, up to
>>> SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.
>>>
>>>>> If you wanted to be invasive, you could change page_counter_limit() to
>>>>> return the count - limit, fix up the callers that look for -EBUSY, and
>>>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
>>>>
>>>> I am not sure I understand
>>>>
>>>
>>> Have page_counter_limit() return the number of pages over limit, i.e.
>>> count - limit, since it compares the two anyway. Fix up existing callers
>>> and then clamp that value to SWAP_CLUSTER_MAX in
>>> mem_cgroup_resize_limit(). It's a more accurate guess than either 1 or
>>> 1024.
>>>
>>
>> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
>> See try_to_free_mem_cgroup_pages():
>> ....
>> 	struct scan_control sc = {
>> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>>
>
> Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
>

Absolutely not. I'm completely on your side here.
I've tried to fix this recently - http://lkml.kernel.org/r/[email protected]
I guess that Andrew decided not to take my patch, because Michal wasn't
happy about it (see mail archives if you want more details).


2018-03-20 22:46:59

by David Rientjes

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Wed, 21 Mar 2018, Andrey Ryabinin wrote:

> > Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
> >
>
> Absolutely not. I'm completely on your side here.
> I've tried to fix this recently - http://lkml.kernel.org/r/[email protected]
> I guess that Andrew decided not to take my patch, because Michal wasn't
> happy about it (see mail archives if you want more details).
>

I unfortunately didn't see this patch in January; it seems very similar to
what I was suggesting in this thread. You do a page_counter_read()
directly in mem_cgroup_resize_limit() where my suggestion was to have
page_counter_limit() return the difference, but there's nothing
significantly different about what you proposed and what I suggested.

Perhaps the patch would be better off as a compromise between what you, I,
and Li RongQing have proposed/suggested: have page_counter_limit() return
the difference, and clamp it to some value proportional to the new limit.
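
As a sketch, that compromise might look like this (all names are
illustrative and the proportion chosen is arbitrary):

	/* Hypothetical: reclaim the measured overage, bounded below by one
	 * reclaim batch and above by a value that scales with the new
	 * limit, so lowering a limit by 1GB can reclaim more per iteration
	 * than lowering it by 1MB, without unbounded over-reclaim.
	 * Assumes the counter is currently above the new limit. */
	unsigned long overage = page_counter_read(&memcg->memory) - limit;
	unsigned long nr_pages = clamp(overage,
				       (unsigned long)SWAP_CLUSTER_MAX,
				       limit / 8 /* illustrative cap */);

	try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL, true);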

2018-03-21 10:01:40

by Michal Hocko

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Wed 21-03-18 01:35:05, Andrey Ryabinin wrote:
> On 03/21/2018 01:15 AM, David Rientjes wrote:
> > On Wed, 21 Mar 2018, Andrey Ryabinin wrote:
> >
> >>>>> It would probably be best to limit the
> >>>>> nr_pages to the amount that needs to be reclaimed, though, rather than
> >>>>> over reclaiming.
> >>>>
> >>>> How do you achieve that? The charging path is not synchronized with the
> >>>> shrinking one at all.
> >>>>
> >>>
> >>> The point is to get a better guess at how many pages, up to
> >>> SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.
> >>>
> >>>>> If you wanted to be invasive, you could change page_counter_limit() to
> >>>>> return the count - limit, fix up the callers that look for -EBUSY, and
> >>>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
> >>>>
> >>>> I am not sure I understand
> >>>>
> >>>
> >>> Have page_counter_limit() return the number of pages over limit, i.e.
> >>> count - limit, since it compares the two anyway. Fix up existing callers
> >>> and then clamp that value to SWAP_CLUSTER_MAX in
> >>> mem_cgroup_resize_limit(). It's a more accurate guess than either 1 or
> >>> 1024.
> >>>
> >>
> >> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
> >> See try_to_free_mem_cgroup_pages():
> >> ....
> >> 	struct scan_control sc = {
> >> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >>
> >
> > Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
> >
>
> Absolutely not. I'm completely on your side here.
> I've tried to fix this recently - http://lkml.kernel.org/r/[email protected]
> I guess that Andrew decided not to take my patch, because Michal wasn't
> happy about it (see mail archives if you want more details).

I was unhappy about the explanation and justification of the patch. It
is still not clear to me why try_to_free_mem_cgroup_pages with a single
target should be slower than multiple calls of this function with
smaller batches when the real reclaim is still SWAP_CLUSTER_MAX batched.

There is also a theoretical risk of over-reclaim, especially with large
targets.

--
Michal Hocko
SUSE Labs

2018-03-23 03:00:16

by Li RongQing

Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] on behalf of Li,Rongqing
> Sent: March 19, 2018 18:52
> To: Michal Hocko <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Andrey Ryabinin
> <[email protected]>
> Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> memory cgroup
>
>
>
> > -----Original Message-----
> > From: Michal Hocko [mailto:[email protected]]
> > Sent: March 19, 2018 18:38
> > To: Li,Rongqing <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Andrey Ryabinin
> > <[email protected]>
> > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > memory cgroup
> >
> > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Michal Hocko [mailto:[email protected]]
> > > > Sent: March 19, 2018 16:54
> > > > To: Li,Rongqing <[email protected]>
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; [email protected]; Andrey Ryabinin
> > > > <[email protected]>
> > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > memory cgroup
> > > >
> > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > > > > pages on each iteration; if a memory cgroup has lots of page
> > > > > > cache, it will take many iterations to empty all the page cache,
> > > > > > so increase the number of pages reclaimed per iteration to speed
> > > > > > it up, the same as in mem_cgroup_resize_limit()
> > > > > >
> > > > > > a simple test shows:
> > > > > >
> > > > > > $dd if=aaa of=bbb bs=1k count=3886080
> > > > > > $rm -f bbb
> > > > > > $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > >
> > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > >
> > > > Andrey was proposing something similar [1]. My main objection was
> > > > that his approach might lead to over-reclaim. Your approach is
> > > > more conservative because it just increases the batch size. The
> > > > size is still rather arbitrary. Same as SWAP_CLUSTER_MAX but that
> > > > one is a commonly used unit of reclaim in the MM code.
> > > >
> > > > I would be really curious about a more detailed explanation of why
> > > > having a larger batch yields better performance, because we are doing
> > > > SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> > > >
> > >
> > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > >
> > > mem_cgroup_resize_limit
> > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > >   ---> do_try_to_free_pages
> > >    ---> shrink_zones
> > >     ---> shrink_node
> > >      ---> shrink_node_memcg
> > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > >        ---> shrink_page_list
> >
> > Can you actually measure this to be the culprit? Because we should
> > rethink our call path if it is too complicated/deep to perform well.
> > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
>
> Ok, I will try
>
http://pasted.co/4edbcfff

This is the result from the ftrace function graph; it may prove that the deep call path leads to low performance.

And increasing the number of pages reclaimed in try_to_free_mem_cgroup_pages reduces the number of shrink_slab calls, which saves time; in my case, page cache occupies most of the memory and slab is small, but shrink_slab is called every time

mutex_lock                        1 us

try_to_free_mem_cgroup_pages
  do_try_to_free_pages        185.020 us
    shrink_node               116.529 us
      shrink_node_memcg        39.203 us
        shrink_inactive_list   33.960 us
      shrink_slab              72.955 us

    shrink_node                61.502 us
      shrink_node_memcg         3.955 us
      shrink_slab              54.296 us

-RongQing

> -RongQing
> > --
> > Michal Hocko
> > SUSE Labs

2018-03-23 11:20:48

by Michal Hocko

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Fri 23-03-18 02:58:36, Li,Rongqing wrote:
>
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] on behalf of Li,Rongqing
> > Sent: March 19, 2018 18:52
> > To: Michal Hocko <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Andrey Ryabinin
> > <[email protected]>
> > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > memory cgroup
> >
> >
> >
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:[email protected]]
> > > Sent: March 19, 2018 18:38
> > > To: Li,Rongqing <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; Andrey Ryabinin
> > > <[email protected]>
> > > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > >
> > > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko [mailto:[email protected]]
> > > > > Sent: March 19, 2018 16:54
> > > > > To: Li,Rongqing <[email protected]>
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Andrey Ryabinin
> > > > > <[email protected]>
> > > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > > memory cgroup
> > > > >
> > > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > > > > pages on each iteration; if a memory cgroup has lots of page
> > > > > > cache, it will take many iterations to empty all the page cache,
> > > > > > so increase the number of pages reclaimed per iteration to speed
> > > > > > it up, the same as in mem_cgroup_resize_limit()
> > > > > >
> > > > > > a simple test shows:
> > > > > >
> > > > > > $dd if=aaa of=bbb bs=1k count=3886080
> > > > > > $rm -f bbb
> > > > > > $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > > >
> > > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > > >
> > > > > Andrey was proposing something similar [1]. My main objection was
> > > > > that his approach might lead to over-reclaim. Your approach is
> > > > > more conservative because it just increases the batch size. The
> > > > > size is still rather arbitrary. Same as SWAP_CLUSTER_MAX but that
> > > > > one is a commonly used unit of reclaim in the MM code.
> > > > >
> > > > > I would be really curious about a more detailed explanation of why
> > > > > having a larger batch yields better performance, because we are
> > > > > doing SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> > > > >
> > > >
> > > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > > >
> > > > mem_cgroup_resize_limit
> > > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > > >   ---> do_try_to_free_pages
> > > >    ---> shrink_zones
> > > >     ---> shrink_node
> > > >      ---> shrink_node_memcg
> > > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > > >        ---> shrink_page_list
> > >
> > > Can you actually measure this to be the culprit? Because we should
> > > rethink our call path if it is too complicated/deep to perform well.
> > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> >
> > Ok, I will try
> >
> http://pasted.co/4edbcfff
>
> This is the result from the ftrace function graph; it may prove that the
> deep call path leads to low performance.

Does it? Let's have a look at the condensed output:
 6)               |  try_to_free_mem_cgroup_pages() {
 6)               |    mem_cgroup_select_victim_node() {
 6)   0.320 us    |      mem_cgroup_node_nr_lru_pages();
 6)   0.151 us    |      mem_cgroup_node_nr_lru_pages();
 6)   2.190 us    |    }
 6)               |    do_try_to_free_pages() {
 6)               |      shrink_node() {
 6)               |        shrink_node_memcg() {
 6)               |          shrink_inactive_list() {
 6) + 23.131 us   |            shrink_page_list();
 6) + 33.960 us   |          }
 6) + 39.203 us   |        }
 6)               |        shrink_slab() {
 6) + 72.955 us   |        }
 6) ! 116.529 us  |      }
 6)               |      shrink_node() {
 6)   0.050 us    |        mem_cgroup_iter();
 6)   0.035 us    |        mem_cgroup_low();
 6)               |        shrink_node_memcg() {
 6)   3.955 us    |        }
 6)               |        shrink_slab() {
 6) + 54.296 us   |        }
 6) + 61.502 us   |      }
 6) ! 185.020 us  |    }
 6) ! 188.165 us  |  }

try_to_free_mem_cgroup_pages is the full memcg reclaim path, taking
188.165 us. The pure reclaim path is shrink_node and that took 116+61 = 177 us.
So we have about 11 us spent on the way. Is this really making such a difference?
How does the profile look when we do larger batches?

> And increasing the number of pages reclaimed in try_to_free_mem_cgroup_pages
> reduces the number of shrink_slab calls, which saves time; in my case, page
> cache occupies most of the memory and slab is small, but shrink_slab is
> called every time

OK, that makes more sense! shrink_slab is clearly visible here. It is
more expensive than the page reclaim. This is something to look into.

Thanks!
--
Michal Hocko
SUSE Labs

2018-03-23 12:20:52

by Li RongQing

Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup



> -----Original Message-----
> From: Michal Hocko [mailto:[email protected]]
> Sent: March 23, 2018 18:09
> To: Li,Rongqing <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Andrey Ryabinin
> <[email protected]>
> Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty
> a memory cgroup
>
> On Fri 23-03-18 02:58:36, Li,Rongqing wrote:
> >
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] on behalf of Li,Rongqing
> > Sent: March 19, 2018 18:52
> > To: Michal Hocko <[email protected]>
> > Cc: [email protected]; [email protected];
> > [email protected]; [email protected]; Andrey Ryabinin
> > <[email protected]>
> > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > memory cgroup
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:[email protected]]
> > > Sent: March 19, 2018 18:38
> > > To: Li,Rongqing <[email protected]>
> > > Cc: [email protected]; [email protected];
> > > [email protected]; [email protected]; Andrey Ryabinin
> > > <[email protected]>
> > > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > > >
> > > > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko [mailto:[email protected]]
> > > > > Sent: March 19, 2018 16:54
> > > > > To: Li,Rongqing <[email protected]>
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; [email protected]; Andrey Ryabinin
> > > > > <[email protected]>
> > > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > > memory cgroup
> > > > > >
> > > > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > > > > > pages on each iteration; if a memory cgroup has lots of page
> > > > > > > cache, it will take many iterations to empty all the page cache,
> > > > > > > so increase the number of pages reclaimed per iteration to speed
> > > > > > > it up, the same as in mem_cgroup_resize_limit()
> > > > > > >
> > > > > > > a simple test shows:
> > > > > > >
> > > > > > > $dd if=aaa of=bbb bs=1k count=3886080
> > > > > > > $rm -f bbb
> > > > > > > $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > > > >
> > > > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > > > >
> > > > > > Andrey was proposing something similar [1]. My main objection
> > > > > > was that his approach might lead to over-reclaim. Your
> > > > > > approach is more conservative because it just increases the
> > > > > > batch size. The size is still rather arbitrary. Same as
> > > > > > SWAP_CLUSTER_MAX but that one is a commonly used unit of
> reclaim in the MM code.
> > > > > >
> > > > > > I would be really curious about a more detailed explanation of
> > > > > > why having a larger batch yields better performance, because we
> > > > > > are doing SWAP_CLUSTER_MAX batches at the lower reclaim level
> > > > > > anyway.
> > > > > >
> > > > >
> > > > > Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> > > > > try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> > > > > number of calls through [do_try_to_free_pages, shrink_zones, shrink_node]:
> > > > >
> > > > > mem_cgroup_resize_limit
> > > > >  ---> try_to_free_mem_cgroup_pages: .nr_to_reclaim = max(1024, SWAP_CLUSTER_MAX),
> > > > >   ---> do_try_to_free_pages
> > > > >    ---> shrink_zones
> > > > >     ---> shrink_node
> > > > >      ---> shrink_node_memcg
> > > > >       ---> shrink_list  <------- the loop happens here [times = 1024/32]
> > > > >        ---> shrink_page_list
> > > >
> > > > Can you actually measure this to be the culprit? Because we should
> > > > rethink our call path if it is too complicated/deep to perform well.
> > > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > >
> > > Ok, I will try
> > >
> > http://pasted.co/4edbcfff
> >
> > This is the result from the ftrace function graph; it may prove that the
> > deep call path leads to low performance.
>
> Does it? Let's have a look at the condensed output:
>  6)               |  try_to_free_mem_cgroup_pages() {
>  6)               |    mem_cgroup_select_victim_node() {
>  6)   0.320 us    |      mem_cgroup_node_nr_lru_pages();
>  6)   0.151 us    |      mem_cgroup_node_nr_lru_pages();
>  6)   2.190 us    |    }
>  6)               |    do_try_to_free_pages() {
>  6)               |      shrink_node() {
>  6)               |        shrink_node_memcg() {
>  6)               |          shrink_inactive_list() {
>  6) + 23.131 us   |            shrink_page_list();
>  6) + 33.960 us   |          }
>  6) + 39.203 us   |        }
>  6)               |        shrink_slab() {
>  6) + 72.955 us   |        }
>  6) ! 116.529 us  |      }
>  6)               |      shrink_node() {
>  6)   0.050 us    |        mem_cgroup_iter();
>  6)   0.035 us    |        mem_cgroup_low();
>  6)               |        shrink_node_memcg() {
>  6)   3.955 us    |        }
>  6)               |        shrink_slab() {
>  6) + 54.296 us   |        }
>  6) + 61.502 us   |      }
>  6) ! 185.020 us  |    }
>  6) ! 188.165 us  |  }
>
> try_to_free_mem_cgroup_pages is the full memcg reclaim path, taking
> 188.165 us. The pure reclaim path is shrink_node and that took 116+61 =
> 177 us.
> So we have about 11 us spent on the way. Is this really making such a difference?
> How does the profile look when we do larger batches?
>
> > And increasing the number of pages reclaimed in try_to_free_mem_cgroup_pages
> > reduces the number of shrink_slab calls, which saves time; in my case, page
> > cache occupies most of the memory and slab is small, but shrink_slab is
> > called every time
>
> OK, that makes more sense! shrink_slab is clearly visible here. It is more
> expensive than the page reclaim. This is something to look into.
>

 6)               |  shrink_slab() {
 6)   0.175 us    |    down_read_trylock();
 6)               |    super_cache_count() {
 6)   0.642 us    |      list_lru_count_one();
 6)   0.587 us    |      list_lru_count_one();
 6)   3.740 us    |    }
 6)               |    super_cache_count() {
 6)   0.625 us    |      list_lru_count_one();
 6)   0.485 us    |      list_lru_count_one();
 6)   2.145 us    |    }
 6)               |    super_cache_count() {
 6)   0.333 us    |      list_lru_count_one();
 6)   0.334 us    |      list_lru_count_one();
 6)   2.109 us    |    }
 6)               |    super_cache_count() {
 6)   0.062 us    |      list_lru_count_one();
 6)   0.188 us    |      list_lru_count_one();
 6)   1.216 us    |    }
 6)               |    super_cache_count() {
 6)   0.217 us    |      list_lru_count_one();
 6)   0.056 us    |      list_lru_count_one();
 6)   1.282 us    |    }
 6)               |    super_cache_count() {
 6)   0.204 us    |      list_lru_count_one();
 6)   0.205 us    |      list_lru_count_one();
 6)   1.237 us    |    }
 6)               |    super_cache_count() {
 6)   0.596 us    |      list_lru_count_one();
 6)   0.493 us    |      list_lru_count_one();
 6)   2.140 us    |    }
 6)               |    super_cache_count() {
 6)   0.130 us    |      list_lru_count_one();
 6)   0.056 us    |      list_lru_count_one();
 6)   1.260 us    |    }
 6)               |    super_cache_count() {
 6)   0.385 us    |      list_lru_count_one();
 6)   0.054 us    |      list_lru_count_one();
 6)   1.186 us    |    }
 6)               |    super_cache_count() {
 6)   0.304 us    |      list_lru_count_one();
 6)   0.286 us    |      list_lru_count_one();
 6)   1.550 us    |    }
 6)               |    super_cache_count() {
 6)   0.230 us    |      list_lru_count_one();
 6)   0.128 us    |      list_lru_count_one();
 6)   1.408 us    |    }
 6)               |    super_cache_count() {
 6)   0.392 us    |      list_lru_count_one();
 6)   0.132 us    |      list_lru_count_one();
 6)   1.694 us    |    }
 6)               |    super_cache_count() {
 6)   0.257 us    |      list_lru_count_one();
 6)   0.258 us    |      list_lru_count_one();
 6)   1.510 us    |    }
 6)               |    super_cache_count() {
 6)   0.132 us    |      list_lru_count_one();
 6)   0.132 us    |      list_lru_count_one();
 6)   1.361 us    |    }
 6)               |    super_cache_count() {
 6)   0.130 us    |      list_lru_count_one();
 6)   0.130 us    |      list_lru_count_one();
 6)   1.246 us    |    }
 6)               |    count_shadow_nodes() {
 6)   0.203 us    |      list_lru_count_one();
 6)   0.042 us    |      mem_cgroup_node_nr_lru_pages();
 6)   1.131 us    |    }
 6)               |    super_cache_count() {
 6)   0.202 us    |      list_lru_count_one();
 6)   0.056 us    |      list_lru_count_one();
 6)   1.115 us    |    }
 6)               |    super_cache_count() {
 6)   0.055 us    |      list_lru_count_one();
 6)   0.107 us    |      list_lru_count_one();
 6)   0.958 us    |    }
 6)               |    super_cache_count() {
 6)   0.147 us    |      list_lru_count_one();
 6)   0.150 us    |      list_lru_count_one();
 6)   1.474 us    |    }
 6)               |    super_cache_count() {
 6)   0.491 us    |      list_lru_count_one();
 6)   0.485 us    |      list_lru_count_one();
 6)   2.569 us    |    }
 6)               |    super_cache_count() {
 6)   0.605 us    |      list_lru_count_one();
 6)   0.590 us    |      list_lru_count_one();
 6)   2.136 us    |    }
 6)               |    super_cache_count() {
 6)   0.572 us    |      list_lru_count_one();
 6)   0.418 us    |      list_lru_count_one();
 6)   1.914 us    |    }
 6)               |    super_cache_count() {
 6)   0.428 us    |      list_lru_count_one();
 6)   0.358 us    |      list_lru_count_one();
 6)   2.073 us    |    } /* super_cache_count */
 6)               |    super_cache_count() {
 6)   0.422 us    |      list_lru_count_one();
 6)   0.433 us    |      list_lru_count_one();
 6)   1.604 us    |    }
 6)               |    super_cache_count() {
 6)   0.532 us    |      list_lru_count_one();
 6)   0.280 us    |      list_lru_count_one();
 6)   1.523 us    |    }
 6)               |    super_cache_count() {
 6)   0.422 us    |      list_lru_count_one();
 6)   0.574 us    |      list_lru_count_one();
 6)   1.554 us    |    }
 6)               |    super_cache_count() {
 6)   0.565 us    |      list_lru_count_one();
 6)   0.587 us    |      list_lru_count_one();
 6)   1.878 us    |    }
 6)               |    super_cache_count() {
 6)   0.563 us    |      list_lru_count_one();
 6)   0.558 us    |      list_lru_count_one();
 6)   1.949 us    |    }
 6)               |    super_cache_count() {
 6)   0.468 us    |      list_lru_count_one();
 6)   0.476 us    |      list_lru_count_one();
 6)   2.149 us    |    }
 6)               |    super_cache_count() {
 6)   0.443 us    |      list_lru_count_one();
 6)   0.483 us    |      list_lru_count_one();
 6)   2.283 us    |    }
 6)               |    super_cache_count() {
 6)   0.332 us    |      list_lru_count_one();
 6)   0.228 us    |      list_lru_count_one();
 6)   1.307 us    |    }
 6)               |    super_cache_count() {
 6)   0.532 us    |      list_lru_count_one();
 6)   0.367 us    |      list_lru_count_one();
 6)   1.956 us    |    }
 6)   0.036 us    |    up_read();
 6)   0.038 us    |    _cond_resched();
 6) + 72.955 us   |  }

shrink_slab does not reclaim any memory here, but it takes lots of time to
count the LRU lists.

Maybe we can use the return value of shrink_slab to decide whether the next
shrink_slab call should be made?

Or define a lightweight list_lru_empty() helper to check whether
sb->s_dentry_lru is empty before calling list_lru_shrink_count(), like below
(a sketch of the helper itself follows the diff):

diff --git a/fs/super.c b/fs/super.c
index 672538ca9831..954c22338833 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -130,8 +130,10 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
-	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
-	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
+	if (!list_lru_empty(&sb->s_dentry_lru))
+		total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+	if (!list_lru_empty(&sb->s_inode_lru))
+		total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	return total_objects;
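
A minimal sketch of the list_lru_empty() helper this assumes (hypothetical:
no such helper exists in the tree at this point, and the per-node nr_items
field is an implementation detail of list_lru that may change):

static inline bool list_lru_empty(struct list_lru *lru)
{
	int i;

	/* nr_items is maintained per node; if every node's count is zero
	 * there is nothing to shrink, so the more expensive memcg-aware
	 * list_lru_shrink_count() walk can be skipped entirely. */
	for_each_node(i)
		if (lru->node[i].nr_items)
			return false;
	return true;
}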

-RongQing


> Thanks!
> --
> Michal Hocko
> SUSE Labs

2018-03-23 12:30:08

by Michal Hocko

Subject: Re: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup

On Fri 23-03-18 12:04:16, Li,Rongqing wrote:
[...]
> shrink_slab does not reclaim any memory here, but it takes lots of time
> to count the LRU lists.
>
> Maybe we can use the return value of shrink_slab to decide whether the
> next shrink_slab call should be made?

How? Different memcgs might have different amounts of shrinkable memory.

> Or define a lightweight list_lru_empty() helper to check whether
> sb->s_dentry_lru is empty before calling list_lru_shrink_count(), like below

Does it really help to improve the numbers?

> diff --git a/fs/super.c b/fs/super.c
> index 672538ca9831..954c22338833 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -130,8 +130,10 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	if (sb->s_op && sb->s_op->nr_cached_objects)
>  		total_objects = sb->s_op->nr_cached_objects(sb, sc);
>  
> -	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> -	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
> +	if (!list_lru_empty(&sb->s_dentry_lru))
> +		total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> +	if (!list_lru_empty(&sb->s_inode_lru))
> +		total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
>  
>  	total_objects = vfs_pressure_ratio(total_objects);
>  	return total_objects;

--
Michal Hocko
SUSE Labs