From: Chunxin Zang <[email protected]>
On our servers there are about 10k memcgs on one machine, and they use memory
very frequently. When I trigger drop caches, the process loops seemingly
forever in drop_slab_node.
There are two reasons:
1. We have too many memcgs: even if each memcg frees only one object, the
total freed is bigger than 10.
2. A single traversal of all memcgs takes a long time, so the memcgs visited
at the start of the pass have freed many objects again by the time it ends.
On the next traversal the freed count is bigger than 10 once more.
We can get the following info through 'ps':
root:~# ps -aux | grep drop
root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
Using bpftrace to follow the 'freed' value in drop_slab_node:
root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
Attaching 1 probe...
^B^C
@ret:
[64, 128) 1 | |
[128, 256) 28 | |
[256, 512) 107 |@ |
[512, 1K) 298 |@@@ |
[1K, 2K) 613 |@@@@@@@ |
[2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 442 |@@@@@ |
[8K, 16K) 299 |@@@ |
[16K, 32K) 100 |@ |
[32K, 64K) 139 |@ |
[64K, 128K) 56 | |
[128K, 256K) 26 | |
[256K, 512K) 2 | |
In the while loop we can check whether a fatal (TASK_KILLABLE) signal is
pending and, if so, break out of the loop.
Signed-off-by: Chunxin Zang <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
---
Changelog in v2:
1) Break out of the loop when a fatal (TASK_KILLABLE) signal is pending.
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..c3ed8b45d264 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -704,6 +704,9 @@ void drop_slab_node(int nid)
do {
struct mem_cgroup *memcg = NULL;
+ if (fatal_signal_pending(current))
+ return;
+
freed = 0;
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
--
2.11.0
On 9/9/20 5:20 PM, [email protected] wrote:
> From: Chunxin Zang <[email protected]>
>
> On our servers there are about 10k memcgs on one machine, and they use memory
> very frequently. When I trigger drop caches, the process loops seemingly
> forever in drop_slab_node.
>
> There are two reasons:
> 1. We have too many memcgs: even if each memcg frees only one object, the
> total freed is bigger than 10.
>
> 2. A single traversal of all memcgs takes a long time, so the memcgs visited
> at the start of the pass have freed many objects again by the time it ends.
> On the next traversal the freed count is bigger than 10 once more.
>
> We can get the following info through 'ps':
>
> root:~# ps -aux | grep drop
> root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
> root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
> root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
> root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
> root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
> root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
> root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
> root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
>
> Using bpftrace to follow the 'freed' value in drop_slab_node:
>
> root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
> Attaching 1 probe...
> ^B^C
>
> @ret:
> [64, 128) 1 | |
> [128, 256) 28 | |
> [256, 512) 107 |@ |
> [512, 1K) 298 |@@@ |
> [1K, 2K) 613 |@@@@@@@ |
> [2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K) 442 |@@@@@ |
> [8K, 16K) 299 |@@@ |
> [16K, 32K) 100 |@ |
> [32K, 64K) 139 |@ |
> [64K, 128K) 56 | |
> [128K, 256K) 26 | |
> [256K, 512K) 2 | |
>
> In the while loop we can check whether a fatal (TASK_KILLABLE) signal is
> pending and, if so, break out of the loop.
That's definitely a good change, thanks. I would just maybe consider:
- Test in the memcg iteration loop? If you have 10k memcgs as you mention, it
can still take a long time until the test happens.
- Exit also on other signals such as SIGABRT or SIGTERM? If I write to
drop_caches and think it's taking too long, I would prefer to kill it with
ctrl-c and not just kill -9. I don't know whether the canonical way of testing
for this is if (signal_pending(current)) or something else.
- IMHO it's still worth bailing out in your scenario even without a signal,
e.g. by doubling the threshold. But that can be a separate patch.
Thanks!
> Signed-off-by: Chunxin Zang <[email protected]>
> Signed-off-by: Muchun Song <[email protected]>
> ---
> Changelog in v2:
> 1) Break out of the loop when a fatal (TASK_KILLABLE) signal is pending.
>
> mm/vmscan.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..c3ed8b45d264 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> do {
> struct mem_cgroup *memcg = NULL;
>
> + if (fatal_signal_pending(current))
> + return;
> +
> freed = 0;
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
>
Vlastimil Babka writes:
>- Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>and think it's too long, I would prefer to kill it by ctrl-c and not just kill
Oh dear, fatal_signal_pending() doesn't consider cases with no more userspace
instructions due to SIG_DFL on TERM/INT etc, that seems misleading :-( I had
(naively) believed it internally checks the same set as TASK_KILLABLE.
Chunxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
way to how schedule_timeout_killable and friends do it instead, so that other
signals will be caught?
On Wed, Sep 09, 2020 at 07:59:44PM +0200, Vlastimil Babka wrote:
> - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
> and think it's too long, I would prefer to kill it by ctrl-c and not just kill
> -9. Dunno if the canonical way of testing for this is if
> (signal_pending(current)) or differently.
fatal_signal_pending() is the canonical way to do it. If your task has
installed a signal handler for ABRT or TERM, that's its prerogative,
but it's chosen not to get killed, and it's not allowed to see short
reads/writes, so we can't break out early.
On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
> Vlastimil Babka writes:
> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
>
> Oh dear, fatal_signal_pending() doesn't consider cases with no more
> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
> :-( I had (naively) believed it internally checks the same set as
> TASK_KILLABLE.
>
> Chunxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
> way to how schedule_timeout_killable and friends do it instead, so that
> other signals will be caught?
You're mistaken.
if (sig_fatal(p, sig) &&
!(signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
...
sigaddset(&t->pending.signal, SIGKILL);
static inline int __fatal_signal_pending(struct task_struct *p)
{
return unlikely(sigismember(&p->pending.signal, SIGKILL));
}
Matthew Wilcox writes:
>On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
>> Vlastimil Babka writes:
>> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
>>
>> Oh dear, fatal_signal_pending() doesn't consider cases with no more
>> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
>> :-( I had (naively) believed it internally checks the same set as
>> TASK_KILLABLE.
>>
>> Chunxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
>> way to how schedule_timeout_killable and friends do it instead, so that
>> other signals will be caught?
>
>You're mistaken.
>
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> (sig == SIGKILL || !p->ptrace)) {
>...
> sigaddset(&t->pending.signal, SIGKILL);
>
>static inline int __fatal_signal_pending(struct task_struct *p)
>{
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
>}
Oh great, that makes things way easier. Thanks!
On 9/9/20 11:52 PM, Matthew Wilcox wrote:
> On Wed, Sep 09, 2020 at 10:47:24PM +0100, Chris Down wrote:
>> Vlastimil Babka writes:
>> > - Exit also on other signals such as SIGABRT, SIGTERM? If I write to drop_caches
>> > and think it's too long, I would prefer to kill it by ctrl-c and not just kill
>>
>> Oh dear, fatal_signal_pending() doesn't consider cases with no more
>> userspace instructions due to SIG_DFL on TERM/INT etc, that seems misleading
>> :-( I had (naively) believed it internally checks the same set as
>> TASK_KILLABLE.
>>
>> Chunxin, Muchun, can you please make it work using TASK_KILLABLE in a similar
>> way to how schedule_timeout_killable and friends do it instead, so that
>> other signals will be caught?
>
> You're mistaken.
Ah actually it was me who thought fatal_signal_pending() was only for SIGKILL,
OOM and whatnot. Sorry for the noise.
> if (sig_fatal(p, sig) &&
> !(signal->flags & SIGNAL_GROUP_EXIT) &&
> !sigismember(&t->real_blocked, sig) &&
> (sig == SIGKILL || !p->ptrace)) {
> ...
> sigaddset(&t->pending.signal, SIGKILL);
>
> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }
>
The subject is misleading because this patch doesn't fix an infinite
loop, right? It just allows userspace to interrupt the operation.
On Wed 09-09-20 23:20:47, [email protected] wrote:
> From: Chunxin Zang <[email protected]>
>
> On our servers there are about 10k memcgs on one machine, and they use memory
> very frequently. When I trigger drop caches, the process loops seemingly
> forever in drop_slab_node.
Is this really an infinite loop, or does it just take a lot of time to
process all the metadata in that setup? If this is really an infinite
loop then we should look into it. My current understanding is that the
operation would finish at some point; it just takes painfully long to get
there.
> There are two reasons:
> 1. We have too many memcgs: even if each memcg frees only one object, the
> total freed is bigger than 10.
>
> 2. A single traversal of all memcgs takes a long time, so the memcgs visited
> at the start of the pass have freed many objects again by the time it ends.
> On the next traversal the freed count is bigger than 10 once more.
>
> We can get the following info through 'ps':
>
> root:~# ps -aux | grep drop
> root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
> root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
> root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
> root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
> root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
> root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
> root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
> root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches
>
> Using bpftrace to follow the 'freed' value in drop_slab_node:
>
> root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
> Attaching 1 probe...
> ^B^C
>
> @ret:
> [64, 128) 1 | |
> [128, 256) 28 | |
> [256, 512) 107 |@ |
> [512, 1K) 298 |@@@ |
> [1K, 2K) 613 |@@@@@@@ |
> [2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K) 442 |@@@@@ |
> [8K, 16K) 299 |@@@ |
> [16K, 32K) 100 |@ |
> [32K, 64K) 139 |@ |
> [64K, 128K) 56 | |
> [128K, 256K) 26 | |
> [256K, 512K) 2 | |
>
> In the while loop we can check whether a fatal (TASK_KILLABLE) signal is
> pending and, if so, break out of the loop.
I would make it explicit that this is not fixing the above scenario. It
just helps to cancel the operation, which is a good thing in general.
> Signed-off-by: Chunxin Zang <[email protected]>
> Signed-off-by: Muchun Song <[email protected]>
With updated changelog
Acked-by: Michal Hocko <[email protected]>
> ---
> Changelog in v2:
> 1) Break out of the loop when a fatal (TASK_KILLABLE) signal is pending.
>
> mm/vmscan.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..c3ed8b45d264 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -704,6 +704,9 @@ void drop_slab_node(int nid)
> do {
> struct mem_cgroup *memcg = NULL;
>
> + if (fatal_signal_pending(current))
> + return;
> +
> freed = 0;
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> --
> 2.11.0
>
--
Michal Hocko
SUSE Labs