2020-01-08 19:55:03

by Shakeel Butt

[permalink] [raw]
Subject: [bug report] resctrl high memory comsumption

Hi,

Recently we had a bug in the system software writing the same pids to
the tasks file of resctrl group multiple times. The resctrl code
allocates "struct task_move_callback" for each such write and call
task_work_add() for that task to handle it on return to user-space
without checking if such request already exist for that particular
task. The issue arises for long sleeping tasks which has thousands for
such request queued to be handled. On our production, we notice
thousands of tasks having thousands of such requests and taking GiBs
of memory for "struct task_move_callback". I am not very familiar with
the code to judge if task_work_cancel() is the right approach or just
checking closid/rmid before doing task_work_add().

==repro==
# mkdir /sys/fs/resctrl/test
# cat /proc/slabinfo | grep kmalloc-32
kmalloc-32 57219 57288 32 124 1 : tunables 120 60
8 : slabdata 462 462 0
# sleep 600&
[1] 17611
# for i in {1..200000}; do echo 17611 > /sys/fs/resctrl/test/tasks ; done
# cat /proc/slabinfo | grep kmalloc-32
kmalloc-32 257466 257548 32 124 1 : tunables 120 60
8 : slabdata 2077 2077 5
# kill 17611
[1]+ Terminated sleep 600
# cat /proc/slabinfo | grep kmalloc-32
kmalloc-32 57924 60636 32 124 1 : tunables 120 60
8 : slabdata 470 489 385

thanks,
Shakeel


2020-01-08 20:13:26

by Fenghua Yu

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
> Hi,
>
> Recently we had a bug in the system software writing the same pids to
> the tasks file of resctrl group multiple times. The resctrl code
> allocates "struct task_move_callback" for each such write and call
> task_work_add() for that task to handle it on return to user-space
> without checking if such request already exist for that particular
> task. The issue arises for long sleeping tasks which has thousands for
> such request queued to be handled. On our production, we notice
> thousands of tasks having thousands of such requests and taking GiBs
> of memory for "struct task_move_callback". I am not very familiar with
> the code to judge if task_work_cancel() is the right approach or just
> checking closid/rmid before doing task_work_add().
>

Thank you for reporting the issue, Shakeel!

Could you please check if the following patch fixes the issue?
From 3c23c39b6a44fdfbbbe0083d074dcc114d7d7f1c Mon Sep 17 00:00:00 2001
From: Fenghua Yu <[email protected]>
Date: Wed, 8 Jan 2020 19:53:33 +0000
Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements

Currently a task can be moved to a rdtgroup multiple times.
But, this can cause multiple task works are added, waste memory
and degrade performance.

To fix the issue, only move the task to a rdtgroup when the task
is not in the rdgroup. Don't try to move the task to the rdtgroup
again when the task is already in the rdtgroup.

Reported-by: Shakeel Butt <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2e3b06d6bbc6..75300c4a5969 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -546,6 +546,17 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
struct task_move_callback *callback;
int ret;

+ /* If the task is already in rdtgrp, don't move the task. */
+ if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
+ tsk->rmid == rdtgrp->mon.rmid) ||
+ (rdtgrp->type == RDTMON_GROUP &&
+ rdtgrp->mon.parent->closid == tsk->closid &&
+ tsk->rmid == rdtgrp->mon.rmid)) {
+ rdt_last_cmd_puts("Task is already in the rdgroup\n");
+
+ return -EINVAL;
+ }
+
callback = kzalloc(sizeof(*callback), GFP_KERNEL);
if (!callback)
return -ENOMEM;
--
2.19.1

2020-01-08 20:43:39

by Reinette Chatre

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

Hi Fenghua,

On 1/8/2020 12:23 PM, Fenghua Yu wrote:
> On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
>> Hi,
>>
>> Recently we had a bug in the system software writing the same pids to
>> the tasks file of resctrl group multiple times. The resctrl code
>> allocates "struct task_move_callback" for each such write and call
>> task_work_add() for that task to handle it on return to user-space
>> without checking if such request already exist for that particular
>> task. The issue arises for long sleeping tasks which has thousands for
>> such request queued to be handled. On our production, we notice
>> thousands of tasks having thousands of such requests and taking GiBs
>> of memory for "struct task_move_callback". I am not very familiar with
>> the code to judge if task_work_cancel() is the right approach or just
>> checking closid/rmid before doing task_work_add().
>>
>
> Thank you for reporting the issue, Shakeel!
>
> Could you please check if the following patch fixes the issue?
> From 3c23c39b6a44fdfbbbe0083d074dcc114d7d7f1c Mon Sep 17 00:00:00 2001
> From: Fenghua Yu <[email protected]>
> Date: Wed, 8 Jan 2020 19:53:33 +0000
> Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements
>
> Currently a task can be moved to a rdtgroup multiple times.
> But, this can cause multiple task works are added, waste memory
> and degrade performance.
>
> To fix the issue, only move the task to a rdtgroup when the task
> is not in the rdgroup. Don't try to move the task to the rdtgroup
> again when the task is already in the rdtgroup.
>
> Reported-by: Shakeel Butt <[email protected]>
> Signed-off-by: Fenghua Yu <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 2e3b06d6bbc6..75300c4a5969 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -546,6 +546,17 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
> struct task_move_callback *callback;
> int ret;
>
> + /* If the task is already in rdtgrp, don't move the task. */
> + if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
> + tsk->rmid == rdtgrp->mon.rmid) ||
> + (rdtgrp->type == RDTMON_GROUP &&
> + rdtgrp->mon.parent->closid == tsk->closid &&
> + tsk->rmid == rdtgrp->mon.rmid)) {
> + rdt_last_cmd_puts("Task is already in the rdgroup\n");
> +
> + return -EINVAL;
> + }
> +
> callback = kzalloc(sizeof(*callback), GFP_KERNEL);
> if (!callback)
> return -ENOMEM;
>

I think your fix would address this specific use case but a slightly
different use case will still encounter the problem of high memory
consumption. If for example, sleeping tasks are moved (many times)
between resource or monitoring groups then their task_works queue would
just keep growing. It seems that a call to task_work_cancel() before
adding a new work item should address all these cases?

Reinette

2020-01-08 21:21:34

by Shakeel Butt

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

On Wed, Jan 8, 2020 at 12:12 PM Fenghua Yu <[email protected]> wrote:
>
> On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
> > Hi,
> >
> > Recently we had a bug in the system software writing the same pids to
> > the tasks file of resctrl group multiple times. The resctrl code
> > allocates "struct task_move_callback" for each such write and call
> > task_work_add() for that task to handle it on return to user-space
> > without checking if such request already exist for that particular
> > task. The issue arises for long sleeping tasks which has thousands for
> > such request queued to be handled. On our production, we notice
> > thousands of tasks having thousands of such requests and taking GiBs
> > of memory for "struct task_move_callback". I am not very familiar with
> > the code to judge if task_work_cancel() is the right approach or just
> > checking closid/rmid before doing task_work_add().
> >
>
> Thank you for reporting the issue, Shakeel!
>
> Could you please check if the following patch fixes the issue?
> From 3c23c39b6a44fdfbbbe0083d074dcc114d7d7f1c Mon Sep 17 00:00:00 2001
> From: Fenghua Yu <[email protected]>
> Date: Wed, 8 Jan 2020 19:53:33 +0000
> Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements
>
> Currently a task can be moved to a rdtgroup multiple times.
> But, this can cause multiple task works are added, waste memory
> and degrade performance.
>
> To fix the issue, only move the task to a rdtgroup when the task
> is not in the rdgroup. Don't try to move the task to the rdtgroup
> again when the task is already in the rdtgroup.
>
> Reported-by: Shakeel Butt <[email protected]>
> Signed-off-by: Fenghua Yu <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 2e3b06d6bbc6..75300c4a5969 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -546,6 +546,17 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
> struct task_move_callback *callback;
> int ret;
>
> + /* If the task is already in rdtgrp, don't move the task. */
> + if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid &&
> + tsk->rmid == rdtgrp->mon.rmid) ||
> + (rdtgrp->type == RDTMON_GROUP &&
> + rdtgrp->mon.parent->closid == tsk->closid &&
> + tsk->rmid == rdtgrp->mon.rmid)) {
> + rdt_last_cmd_puts("Task is already in the rdgroup\n");
> +
> + return -EINVAL;

Why not just return success if the task is already in that group (i.e.
just follow the cgroup behavior).

> + }
> +
> callback = kzalloc(sizeof(*callback), GFP_KERNEL);
> if (!callback)
> return -ENOMEM;
> --
> 2.19.1
>

2020-01-08 21:34:07

by Fenghua Yu

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

On Wed, Jan 08, 2020 at 12:42:17PM -0800, Reinette Chatre wrote:
> Hi Fenghua,
> On 1/8/2020 12:23 PM, Fenghua Yu wrote:
> > On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
> >> Recently we had a bug in the system software writing the same pids to
> >> the tasks file of resctrl group multiple times. The resctrl code
> > Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements
> I think your fix would address this specific use case but a slightly
> different use case will still encounter the problem of high memory
> consumption. If for example, sleeping tasks are moved (many times)
> between resource or monitoring groups then their task_works queue would
> just keep growing. It seems that a call to task_work_cancel() before
> adding a new work item should address all these cases?

The checking code in this patch is also helpful to avoid redundant
task move preparation (kzalloc(), task_work_add(), etc) in the same
rdtgroup.

How about adding both the checking code and task_work_cancel()?

Thanks.

-Fenghua

2020-01-08 21:55:58

by Reinette Chatre

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

Hi Fenghua,

On 1/8/2020 1:42 PM, Fenghua Yu wrote:
> On Wed, Jan 08, 2020 at 12:42:17PM -0800, Reinette Chatre wrote:
>> Hi Fenghua,
>> On 1/8/2020 12:23 PM, Fenghua Yu wrote:
>>> On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
>>>> Recently we had a bug in the system software writing the same pids to
>>>> the tasks file of resctrl group multiple times. The resctrl code
>>> Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements
>> I think your fix would address this specific use case but a slightly
>> different use case will still encounter the problem of high memory
>> consumption. If for example, sleeping tasks are moved (many times)
>> between resource or monitoring groups then their task_works queue would
>> just keep growing. It seems that a call to task_work_cancel() before
>> adding a new work item should address all these cases?
>
> The checking code in this patch is also helpful to avoid redundant
> task move preparation (kzalloc(), task_work_add(), etc) in the same
> rdtgroup.

Indeed.

>
> How about adding both the checking code and task_work_cancel()?

That does sound good to me.

There is something in the current implementation that I would appreciate
your feedback on: Currently the task's closid and rmid are initialized
_after_ the call to task_work_add() succeeds. Should these not be
initialized before the call to task_work_add()?

Thank you

Reinette

2020-01-13 18:41:14

by Shakeel Butt

[permalink] [raw]
Subject: Re: [bug report] resctrl high memory comsumption

On Wed, Jan 8, 2020 at 1:54 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Fenghua,
>
> On 1/8/2020 1:42 PM, Fenghua Yu wrote:
> > On Wed, Jan 08, 2020 at 12:42:17PM -0800, Reinette Chatre wrote:
> >> Hi Fenghua,
> >> On 1/8/2020 12:23 PM, Fenghua Yu wrote:
> >>> On Wed, Jan 08, 2020 at 09:07:41AM -0800, Shakeel Butt wrote:
> >>>> Recently we had a bug in the system software writing the same pids to
> >>>> the tasks file of resctrl group multiple times. The resctrl code
> >>> Subject: [RFC PATCH] x86/resctrl: Fix redundant task movements
> >> I think your fix would address this specific use case but a slightly
> >> different use case will still encounter the problem of high memory
> >> consumption. If for example, sleeping tasks are moved (many times)
> >> between resource or monitoring groups then their task_works queue would
> >> just keep growing. It seems that a call to task_work_cancel() before
> >> adding a new work item should address all these cases?
> >
> > The checking code in this patch is also helpful to avoid redundant
> > task move preparation (kzalloc(), task_work_add(), etc) in the same
> > rdtgroup.
>
> Indeed.
>
> >
> > How about adding both the checking code and task_work_cancel()?
>
> That does sound good to me.
>

Hi Fenghua, any updates here?

> There is something in the current implementation that I would appreciate
> your feedback on: Currently the task's closid and rmid are initialized
> _after_ the call to task_work_add() succeeds. Should these not be
> initialized before the call to task_work_add()?
>

This seems like a potential race.

thanks,
Shakeel