2020-09-08 14:34:39

by Chunxin Zang

[permalink] [raw]
Subject: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

From: Chunxin Zang <[email protected]>

On our server, there are about 10k memcg in one machine. They use memory
very frequently. When I tigger drop caches,the process will infinite loop
in drop_slab_node.
There are two reasons:
1.We have too many memcgs, even though one object freed in one memcg, the
sum of object is bigger than 10.
2.We spend a lot of time in traverse memcg once. So, the memcg who
traversed at the first have been freed many objects. Traverse memcg next
time, the freed count bigger than 10 again.

We can get the following info through 'ps':

root:~# ps -aux | grep drop
root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches

Use bpftrace follow 'freed' value in drop_slab_node:

root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
Attaching 1 probe...
^B^C

@ret:
[64, 128) 1 | |
[128, 256) 28 | |
[256, 512) 107 |@ |
[512, 1K) 298 |@@@ |
[1K, 2K) 613 |@@@@@@@ |
[2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 442 |@@@@@ |
[8K, 16K) 299 |@@@ |
[16K, 32K) 100 |@ |
[32K, 64K) 139 |@ |
[64K, 128K) 56 | |
[128K, 256K) 26 | |
[256K, 512K) 2 | |

In one drop caches action, only traverse memcg once maybe is better.
If user need more memory, they can do drop caches again.

Signed-off-by: Chunxin Zang <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
---
mm/vmscan.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..9d8ee2ae5824 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -699,17 +699,12 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,

void drop_slab_node(int nid)
{
- unsigned long freed;
+ struct mem_cgroup *memcg = NULL;

+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- struct mem_cgroup *memcg = NULL;
-
- freed = 0;
- memcg = mem_cgroup_iter(NULL, NULL, NULL);
- do {
- freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
- } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
- } while (freed > 10);
+ shrink_slab(GFP_KERNEL, nid, memcg, 0);
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
}

void drop_slab(void)
--
2.11.0


2020-09-08 17:57:04

by Chris Down

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

Vlastimil Babka writes:
>On 9/8/20 5:09 PM, Chris Down wrote:
>> drop_caches by its very nature can be extremely performance intensive -- if
>> someone wants to abort after trying too long, they can just send a
>> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
>> reliably work when doing that, then _that's_ something to improve, but this
>> looks premature to me until that's demonstrated not to work.
>
>Hm there might be existings scripts (even though I dislike those) running
>drop_caches periodically, and they are currently not set up to be killed, so one
>day it might surprise someone. Dropping should be a one-time event, not a
>continual reclaim.

Sure, but these scripts can send it to their child themselves instead of using
WCE, no? As far as I know, that's the already supported way to abort
drop_caches-induced reclaim.

2020-09-08 19:43:45

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

On 9/8/20 5:09 PM, Chris Down wrote:
> drop_caches by its very nature can be extremely performance intensive -- if
> someone wants to abort after trying too long, they can just send a
> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
> reliably work when doing that, then _that's_ something to improve, but this
> looks premature to me until that's demonstrated not to work.

Hm there might be existings scripts (even though I dislike those) running
drop_caches periodically, and they are currently not set up to be killed, so one
day it might surprise someone. Dropping should be a one-time event, not a
continual reclaim.

Maybe we could be a bit smarter and e.g. double the threshold currently
hardcoded as "10" with each iteration?

> [email protected] writes:
>>In one drop caches action, only traverse memcg once maybe is better.
>>If user need more memory, they can do drop caches again.
>
> Can you please provide some measurements of the difference in reclamation in
> practice?
>

2020-09-08 20:15:31

by Chris Down

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

drop_caches by its very nature can be extremely performance intensive -- if
someone wants to abort after trying too long, they can just send a
TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
reliably work when doing that, then _that's_ something to improve, but this
looks premature to me until that's demonstrated not to work.

[email protected] writes:
>In one drop caches action, only traverse memcg once maybe is better.
>If user need more memory, they can do drop caches again.

Can you please provide some measurements of the difference in reclamation in
practice?

2020-09-09 04:23:40

by Muchun Song

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

Hi Chris,

On Tue, Sep 8, 2020 at 11:09 PM Chris Down <[email protected]> wrote:
>
> drop_caches by its very nature can be extremely performance intensive -- if
> someone wants to abort after trying too long, they can just send a
> TASK_KILLABLE signal, no? If exiting the loop and returning to usermode doesn't
> reliably work when doing that, then _that's_ something to improve, but this
> looks premature to me until that's demonstrated not to work.

Sending a TASK_KILLABLE signal? It didn't work now. Because the the
current task has no chance to handle the signal. So I think we may
need to do any of the following things to avoid this case happening.

1. Double the threshold currently hard coded as "10" with each iteration
suggested by Vlastimil. It is also a good idea.

2. In the while loop, we can check whether the TASK_KILLABLE
signal is set, if so, we should break the loop. like the following code
snippe. Thanks.

@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
do {
struct mem_cgroup *memcg = NULL;

+ if (fatal_signal_pending(current))
+ return;
+
freed = 0;
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {

>
> [email protected] writes:
> >In one drop caches action, only traverse memcg once maybe is better.
> >If user need more memory, they can do drop caches again.
>
> Can you please provide some measurements of the difference in reclamation in
> practice?



--
Yours,
Muchun

2020-09-09 10:01:21

by Chris Down

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

Muchun Song writes:
>1. Double the threshold currently hard coded as "10" with each iteration
> suggested by Vlastimil. It is also a good idea.

I think this sounds reasonable, although I'd like to see what the difference in
reclaim looks like in practice.

>2. In the while loop, we can check whether the TASK_KILLABLE
> signal is set, if so, we should break the loop. like the following code
> snippe. Thanks.
>
>@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> do {
> struct mem_cgroup *memcg = NULL;
>
>+ if (fatal_signal_pending(current))
>+ return;
>+
> freed = 0;
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {

Regardless of anything, I think this is probably a good idea. Could you send it
as a patch? :-)

Thanks,

Chris

2020-09-09 15:08:02

by Muchun Song

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm/vmscan: fix infinite loop in drop_slab_node

On Wed, Sep 9, 2020 at 5:59 PM Chris Down <[email protected]> wrote:
>
> Muchun Song writes:
> >1. Double the threshold currently hard coded as "10" with each iteration
> > suggested by Vlastimil. It is also a good idea.
>
> I think this sounds reasonable, although I'd like to see what the difference in
> reclaim looks like in practice.
>
> >2. In the while loop, we can check whether the TASK_KILLABLE
> > signal is set, if so, we should break the loop. like the following code
> > snippe. Thanks.
> >
> >@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> > do {
> > struct mem_cgroup *memcg = NULL;
> >
> >+ if (fatal_signal_pending(current))
> >+ return;
> >+
> > freed = 0;
> > memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > do {
>
> Regardless of anything, I think this is probably a good idea. Could you send it
> as a patch? :-)

OK, Will do that thanks.

>
> Thanks,
>
> Chris



--
Yours,
Muchun