tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().
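Roughly, the call chain under that lock is (simplified sketch, reconstructed
from the backtrace below and the code being patched):

  tg_set_rt_bandwidth()
    read_lock(&tasklist_lock)
    __rt_schedulable()
      walk_tg_tree_from()               /* visits every task group         */
        tg_rt_schedulable()
          tg_has_rt_tasks()
            for_each_process_thread()   /* visits every task in the system */

so the time spent under tasklist_lock grows roughly as
O(nr_task_groups * nr_tasks).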
PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
#0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
#1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
#2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
#3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
[exception RIP: tg_rt_schedulable+463]
RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
#5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
#6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
#7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
#8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
#9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
The patch reworks tg_has_rt_tasks() and makes it check
rt_rq::rt_nr_running instead of iterating over the task list.
This makes the function scale well: its execution time no longer
depends on the number of processes in the system.

Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce any new races compared
to those we had before. Also, rt_rq::rt_nr_running counts queued
child cfs_rq in addition to queued tasks. Since tg_has_rt_tasks()
is only used in the !runtime case:
if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
return -EBUSY;
the behaviour won't change. The only difference is that walk_tg_tree()
calling tg_rt_schedulable() will break its iteration at a parent cfs_rq,
i.e. earlier.
Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/sched/rt.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..601151bb9322 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
*/
static DEFINE_MUTEX(rt_constraints_mutex);
-/* Must be called with tasklist_lock held */
static inline int tg_has_rt_tasks(struct task_group *tg)
{
- struct task_struct *g, *p;
+ struct rt_rq *rt_rq;
+ int cpu, ret = 0;
/*
* Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
if (task_group_is_autogroup(tg))
return 0;
- for_each_process_thread(g, p) {
- if (rt_task(p) && task_group(p) == tg)
- return 1;
+ preempt_disable();
+
+ for_each_online_cpu(cpu) {
+ rt_rq = tg->rt_rq[cpu];
+ if (READ_ONCE(rt_rq->rt_nr_running)) {
+ ret = 1;
+ break;
+ }
}
- return 0;
+ preempt_enable();
+ return ret;
}
struct rt_schedulable_data {
@@ -2510,7 +2516,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
return -EINVAL;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
err = __rt_schedulable(tg, rt_period, rt_runtime);
if (err)
goto unlock;
@@ -2528,7 +2533,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
}
raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
unlock:
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return err;
@@ -2582,9 +2586,7 @@ static int sched_rt_global_constraints(void)
int ret = 0;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
ret = __rt_schedulable(NULL, 0, 0);
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return ret;
Hi Kirill,
On 19/04/18 20:29, Kirill Tkhai wrote:
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
>
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
>
> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> [exception RIP: tg_rt_schedulable+463]
> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>
> The patch reworks tg_has_rt_tasks() and makes it to check
> for rt_rq::rt_nr_running instead of iteration over task list.
> This makes the function to scale well, and its execution time
> does not depend on number of processes in the system.
>
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before. Also, rt_rq::rt_nr_running contains queued
> child cfs_rq in additional to queued task. Since tg_has_rt_tasks()
s/cfs_/rt_/ , right?
> is used in case of !runtime case:
>
> if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
> return -EBUSY;
>
> the behaviour won't change. The only change is that walk_tg_tree()
> calling tg_rt_schedulable() will break its iteration on parent cfs_rq,
Ditto.
> i.e. earlier.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> kernel/sched/rt.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..601151bb9322 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
> */
> static DEFINE_MUTEX(rt_constraints_mutex);
>
> -/* Must be called with tasklist_lock held */
> static inline int tg_has_rt_tasks(struct task_group *tg)
> {
> - struct task_struct *g, *p;
> + struct rt_rq *rt_rq;
> + int cpu, ret = 0;
>
> /*
> * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> if (task_group_is_autogroup(tg))
> return 0;
>
> - for_each_process_thread(g, p) {
> - if (rt_task(p) && task_group(p) == tg)
> - return 1;
> + preempt_disable();
> +
> + for_each_online_cpu(cpu) {
> + rt_rq = tg->rt_rq[cpu];
> + if (READ_ONCE(rt_rq->rt_nr_running)) {
Isn't this however checking against the current (dynamic) number of
runnable tasks/groups instead of the "static" group membership (which
shouldn't be affected by a task running state)?
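To make that concrete, a hypothetical scenario (sketch only, not from the
patch): a group tg whose single SCHED_FIFO task happens to be sleeping.

	/*
	 * The sleeping RT task is not enqueued on any rt_rq, so the per-cpu
	 * counters the new check looks at are all zero:
	 */
	for_each_online_cpu(cpu) {
		struct rt_rq *rt_rq = tg->rt_rq[cpu];

		WARN_ON(READ_ONCE(rt_rq->rt_nr_running) != 0);
	}
	/*
	 * The reworked tg_has_rt_tasks() then returns 0, and a write of
	 * cpu.rt_runtime_us = 0 succeeds even though tg still contains an RT
	 * task.  The old for_each_process_thread() scan matched on
	 * rt_task(p) && task_group(p) == tg regardless of the task's state.
	 */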
Best,
- Juri
Hi, Juri,
On 20.04.2018 12:25, Juri Lelli wrote:
> Hi Kirill,
>
> On 19/04/18 20:29, Kirill Tkhai wrote:
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
>> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>> [exception RIP: tg_rt_schedulable+463]
>> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
>> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
>> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
>> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
>> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
>> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to check
>> for rt_rq::rt_nr_running instead of iteration over task list.
>> This makes the function to scale well, and its execution time
>> does not depend on number of processes in the system.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before. Also, rt_rq::rt_nr_running contains queued
>> child cfs_rq in additional to queued task. Since tg_has_rt_tasks()
>
> s/cfs_/rt_/ , right?
>
>> is used in case of !runtime case:
>>
>> if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
>> return -EBUSY;
>>
>> the behaviour won't change. The only change is that walk_tg_tree()
>> calling tg_rt_schedulable() will break its iteration on parent cfs_rq,
>
> Ditto.
>
>> i.e. earlier.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> kernel/sched/rt.c | 22 ++++++++++++----------
>> 1 file changed, 12 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 7aef6b4e885a..601151bb9322 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
>> */
>> static DEFINE_MUTEX(rt_constraints_mutex);
>>
>> -/* Must be called with tasklist_lock held */
>> static inline int tg_has_rt_tasks(struct task_group *tg)
>> {
>> - struct task_struct *g, *p;
>> + struct rt_rq *rt_rq;
>> + int cpu, ret = 0;
>>
>> /*
>> * Autogroups do not have RT tasks; see autogroup_create().
>> @@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>> if (task_group_is_autogroup(tg))
>> return 0;
>>
>> - for_each_process_thread(g, p) {
>> - if (rt_task(p) && task_group(p) == tg)
>> - return 1;
>> + preempt_disable();
>> +
>> + for_each_online_cpu(cpu) {
>> + rt_rq = tg->rt_rq[cpu];
>> + if (READ_ONCE(rt_rq->rt_nr_running)) {
>
> Isn't this however checking against the current (dynamic) number of
> runnable tasks/groups instead of the "static" group membership (which
> shouldn't be affected by a task running state)?
Ah, you are right. I forgot that rt_nr_running does not count sleeping tasks.
We need to check something else here. I'll try to find another way.
Thanks,
Kirill
From: Kirill Tkhai <[email protected]>
tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().
PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
#0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
#1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
#2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
#3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
[exception RIP: tg_rt_schedulable+463]
RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
#5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
#6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
#7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
#8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
#9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
The patch reworks tg_has_rt_tasks() and makes it iterate over
the task group's own task list instead of the global tasks list.
This makes the function scale well and reduces its execution time.
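The iteration uses the cgroup css_task_iter API (declared in
include/linux/cgroup.h), which only visits tasks attached to the given css,
so the cost is proportional to the size of the group rather than to the
number of tasks in the system. A standalone sketch of the pattern:

	struct css_task_iter it;
	struct task_struct *task;

	css_task_iter_start(&tg->css, 0, &it);	/* flags == 0: every thread in the group */
	while ((task = css_task_iter_next(&it))) {
		/* inspect "task"; the iterator handles its own locking/refcounting */
	}
	css_task_iter_end(&it);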
Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce any new races compared
to those we had before.
Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/sched/rt.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..20be796d4e63 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
*/
static DEFINE_MUTEX(rt_constraints_mutex);
-/* Must be called with tasklist_lock held */
static inline int tg_has_rt_tasks(struct task_group *tg)
{
- struct task_struct *g, *p;
+ struct css_task_iter it;
+ struct task_struct *task;
+ int ret = 0;
/*
* Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
if (task_group_is_autogroup(tg))
return 0;
- for_each_process_thread(g, p) {
- if (rt_task(p) && task_group(p) == tg)
- return 1;
- }
+ css_task_iter_start(&tg->css, 0, &it);
+ while ((task = css_task_iter_next(&it)))
+ if (rt_task(task)) {
+ ret = 1;
+ break;
+ }
+ css_task_iter_end(&it);
- return 0;
+ return ret;
}
struct rt_schedulable_data {
@@ -2510,7 +2514,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
return -EINVAL;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
err = __rt_schedulable(tg, rt_period, rt_runtime);
if (err)
goto unlock;
@@ -2528,7 +2531,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
}
raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
unlock:
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return err;
@@ -2582,9 +2584,7 @@ static int sched_rt_global_constraints(void)
int ret = 0;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
ret = __rt_schedulable(NULL, 0, 0);
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return ret;
On 20/04/18 12:43, Kirill Tkhai wrote:
> On 20.04.2018 12:25, Juri Lelli wrote:
[...]
> > Isn't this however checking against the current (dynamic) number of
> > runnable tasks/groups instead of the "static" group membership (which
> > shouldn't be affected by a task running state)?
>
> Ah, you are sure. I forgot that rt_nr_running does not contain sleeping tasks.
> We need to check something else here. I'll try to find another way.
n/p. Maybe a per-rt_rq flag linked to "static" membership (I didn't
really think this through, though :).
BTW, since you faced this problem, I guess this is on RT_GROUP_SCHED
enabled boxes, so I'd have a couple of questions (not strictly related
to the problem at hand):
- do such boxes rely on RT throttling being performed at root level?
- is RT_RUNTIME_SHARE enabled?
Thanks,
- Juri
On 20.04.2018 13:58, Juri Lelli wrote:
> On 20/04/18 12:43, Kirill Tkhai wrote:
>> On 20.04.2018 12:25, Juri Lelli wrote:
>
> [...]
>
>>> Isn't this however checking against the current (dynamic) number of
>>> runnable tasks/groups instead of the "static" group membership (which
>>> shouldn't be affected by a task running state)?
>>
>> Ah, you are sure. I forgot that rt_nr_running does not contain sleeping tasks.
>> We need to check something else here. I'll try to find another way.
>
> n/p. Maybe a per rt_rq flag linked to "static" membership (I didn't
> really thought this through though :).
sched_move_task() does not change any rt_rq fields when moving a dequeued
(sleeping) task, so we definitely can't use rt_rq alone to detect all of the
group's tasks.
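For reference, a heavily abridged sketch of sched_move_task() from
kernel/sched/core.c of that time (several steps and flags omitted); for a
!queued task only the group pointer changes and no rt_rq counters are touched:

	void sched_move_task(struct task_struct *tsk)
	{
		struct rq_flags rf;
		struct rq *rq = task_rq_lock(tsk, &rf);
		int queued = task_on_rq_queued(tsk);

		if (queued)
			dequeue_task(rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE);

		/* updates tsk->sched_task_group / task_group(tsk) */
		sched_change_group(tsk, TASK_MOVE_GROUP);

		if (queued)
			enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);

		task_rq_unlock(rq, tsk, &rf);
	}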
> BTW, since you faced this problem, I guess this is on RT_GROUP_SCHED
> enabled boxes, so I'd have a couple of questions (not strictly related
> to the problem at hand):
>
> - do such boxes rely on RT throttling being performed at root level?
> - is RT_RUNTIME_SHARE enabled?
This is a machine with many fair_sched_class tasks and almost no RT tasks,
and there is no "real" real-time use of throttling, so these boxes are not
interesting from that point of view. Only a very small number of task groups
may have rt_runtime != 0 there, and where they do, the rt_runtime value is
small. The problem I'm solving happened when the rt_period of one of such
groups was being restored, and it hung for ages because of the enormous
number of child cgroups and (fair_sched_class) tasks in the system.

RT_RUNTIME_SHARE is not enabled, as it's an RH7-based 3.10 kernel.

Honestly, it would be very difficult to provide real-time guarantees on a
machine with such a big number of tasks anyway, when the tasks allocate
resources (not all memory is pinned against going to slab, there is some IO,
etc.), since there are problems to solve there even in the !rt case.
Kirill
On 20/04/18 13:06, Kirill Tkhai wrote:
> From: Kirill Tkhai <[email protected]>
>
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
>
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
>
> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> [exception RIP: tg_rt_schedulable+463]
> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
>
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.
This seems to be true. However, I wonder why we are OK with the current racy
code (w.r.t. tasks moving between groups). :/

Can't a task join the group while we are iterating, so that we miss it?
On 20.04.2018 17:11, Juri Lelli wrote:
> On 20/04/18 13:06, Kirill Tkhai wrote:
>> From: Kirill Tkhai <[email protected]>
>>
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
>> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>> [exception RIP: tg_rt_schedulable+463]
>> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
>> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
>> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
>> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
>> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
>> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to iterate over
>> task group process list instead of iteration over all tasks list.
>> This makes the function to scale well, and reduces its execution
>> time.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before.
>
> This seems to be true. However, I wonder why we are OK with current racy
> code (against tasks moving between groups). :/
>
> Can't a task join the group while we are iterating and we miss that?
Yes, it can, but I'm not sure whether this should be considered a problem,
given the racy design we already have. It's not real protection; this place
is just there to warn a person that he is doing something wrong. We check for
a written zero there, but an actually written "1" raises the same problems.

There are cgroup_threadgroup_change_begin(task)/cgroup_threadgroup_change_end(task)
to protect against cgroup changes, but it seems we can't use them, as they take a
task argument and are aimed at protecting a single task's thread group in the
future (even though today they take a global percpu_rwsem).
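For reference, a simplified sketch of those helpers as they appear in
include/linux/cgroup-defs.h (the task argument is currently unused):

	static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
	{
		percpu_down_read(&cgroup_threadgroup_rwsem);
	}

	static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
	{
		percpu_up_read(&cgroup_threadgroup_rwsem);
	}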
Kirill
On 20/04/18 17:30, Kirill Tkhai wrote:
> On 20.04.2018 17:11, Juri Lelli wrote:
> > On 20/04/18 13:06, Kirill Tkhai wrote:
> >> From: Kirill Tkhai <[email protected]>
> >>
> >> tg_rt_schedulable() iterates over all child task groups,
> >> while tg_has_rt_tasks() iterates over all linked tasks.
> >> In case of systems with big number of tasks, this may
> >> take a lot of time.
> >>
> >> I observed hard LOCKUP on machine with 20000+ processes
> >> after write to "cpu.rt_period_us" of cpu cgroup with
> >> 39 children. The problem occurred because of tasklist_lock
> >> is held for a long time and other processes can't do fork().
> >>
> >> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> >> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> >> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> >> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> >> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> >> [exception RIP: tg_rt_schedulable+463]
> >> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> >> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> >> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> >> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> >> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> >> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> >> --- <NMI exception stack> ---
> >> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> >> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> >> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> >> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> >> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> >> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> >> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> >> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> >>
> >> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> >> task group process list instead of iteration over all tasks list.
> >> This makes the function to scale well, and reduces its execution
> >> time.
> >>
> >> Note, that since tasklist_lock doesn't protect a task against
> >> sched_class changing, we don't introduce new races in comparison
> >> to that we had before.
> >
> > This seems to be true. However, I wonder why we are OK with current racy
> > code (against tasks moving between groups). :/
> >
> > Can't a task join the group while we are iterating and we miss that?
>
> Yes, it can, but I'm not sure either this should be considered as problem,
> seeing the race design we already have. It's not a real protection, this
> place is to warn a person, he does something wrong. We check for zero written
> there, but really written "1" will invent the same problems.
Mmm, right. :/
Peter,

could you please comment on this?
Kirill
On 20.04.2018 13:06, Kirill Tkhai wrote:
> From: Kirill Tkhai <[email protected]>
>
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
>
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
>
> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> [exception RIP: tg_rt_schedulable+463]
> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
>
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> kernel/sched/rt.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..20be796d4e63 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
> */
> static DEFINE_MUTEX(rt_constraints_mutex);
>
> -/* Must be called with tasklist_lock held */
> static inline int tg_has_rt_tasks(struct task_group *tg)
> {
> - struct task_struct *g, *p;
> + struct css_task_iter it;
> + struct task_struct *task;
> + int ret = 0;
>
> /*
> * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> if (task_group_is_autogroup(tg))
> return 0;
>
> - for_each_process_thread(g, p) {
> - if (rt_task(p) && task_group(p) == tg)
> - return 1;
> - }
> + css_task_iter_start(&tg->css, 0, &it);
> + while ((task = css_task_iter_next(&it)))
> + if (rt_task(task)) {
> + ret = 1;
> + break;
> + }
> + css_task_iter_end(&it);
>
> - return 0;
> + return ret;
> }
>
> struct rt_schedulable_data {
> @@ -2510,7 +2514,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> return -EINVAL;
>
> mutex_lock(&rt_constraints_mutex);
> - read_lock(&tasklist_lock);
> err = __rt_schedulable(tg, rt_period, rt_runtime);
> if (err)
> goto unlock;
> @@ -2528,7 +2531,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> }
> raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
> unlock:
> - read_unlock(&tasklist_lock);
> mutex_unlock(&rt_constraints_mutex);
>
> return err;
> @@ -2582,9 +2584,7 @@ static int sched_rt_global_constraints(void)
> int ret = 0;
>
> mutex_lock(&rt_constraints_mutex);
> - read_lock(&tasklist_lock);
> ret = __rt_schedulable(NULL, 0, 0);
> - read_unlock(&tasklist_lock);
> mutex_unlock(&rt_constraints_mutex);
>
> return ret;
>
On Thu, Apr 19, 2018 at 08:29:01PM +0300, Kirill Tkhai wrote:
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
So you're actually using RT cgroups?
Some of us recently considered removing that thing entirely because most
distros don't actually use it at all -- and it's broken from an RT POV.
That would then clear the slate to try and implement something new.
But if you're actually using this stuff, that would complicate matters.
On Fri, Apr 20, 2018 at 01:06:31PM +0300, Kirill Tkhai wrote:
> @@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> if (task_group_is_autogroup(tg))
> return 0;
>
> - for_each_process_thread(g, p) {
> - if (rt_task(p) && task_group(p) == tg)
> - return 1;
> - }
> + css_task_iter_start(&tg->css, 0, &it);
> + while ((task = css_task_iter_next(&it)))
> + if (rt_task(task)) {
> + ret = 1;
> + break;
> + }
Aside from the missing {} there, the patch looks OK, I suppose.
The races found seem somewhat dodgy, but as you argue, they're not in
fact new :/
> + css_task_iter_end(&it);
>
> - return 0;
> + return ret;
> }
>
On 25.04.2018 20:55, Peter Zijlstra wrote:
> On Thu, Apr 19, 2018 at 08:29:01PM +0300, Kirill Tkhai wrote:
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>
> So you're actually using RT cgroups?
I myself don't use them, but this comes from the CRIU project. We try to dump
and restore every configuration that people may use. Sometimes they configure
the RT parameters of a cpu cgroup, and criu has to restore them as they were at
the moment of dump. That is where we hit the hard lockup. I can't say what
forces people to use nested cpu cgroups for RT...
> Some of us recently considered removing that thing entirely because most
> distro's don't actually use it at all -- and it's broken from a RT POV.
>
> That would then clear the slate to try and implement something new.
>
> But if you're actually using this stuff, that would complicate matters.
From the point of view of the criu project, if you delete them, this will
decrease the number of entities we need to support, so I'm not against that :D

Seriously speaking, it's a surprise to me that they are broken. It seemed
everything was fine from the RT POV: throttling, boosting, PI...
Kirill
From: Kirill Tkhai <[email protected]>
tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().
PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
#0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
#1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
#2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
#3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
[exception RIP: tg_rt_schedulable+463]
RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
#5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
#6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
#7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
#8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
#9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
The patch reworks tg_has_rt_tasks() and makes it iterate over
the task group's own task list instead of the global tasks list.
This makes the function scale well and reduces its execution time.

Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce any new races compared
to those we had before.
Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/sched/rt.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..a40535c604b8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
*/
static DEFINE_MUTEX(rt_constraints_mutex);
-/* Must be called with tasklist_lock held */
static inline int tg_has_rt_tasks(struct task_group *tg)
{
- struct task_struct *g, *p;
+ struct task_struct *task;
+ struct css_task_iter it;
+ int ret = 0;
/*
* Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
if (task_group_is_autogroup(tg))
return 0;
- for_each_process_thread(g, p) {
- if (rt_task(p) && task_group(p) == tg)
- return 1;
+ css_task_iter_start(&tg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (rt_task(task)) {
+ ret = 1;
+ break;
+ }
}
+ css_task_iter_end(&it);
- return 0;
+ return ret;
}
struct rt_schedulable_data {
@@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
return -EINVAL;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
err = __rt_schedulable(tg, rt_period, rt_runtime);
if (err)
goto unlock;
@@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
}
raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
unlock:
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return err;
@@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
int ret = 0;
mutex_lock(&rt_constraints_mutex);
- read_lock(&tasklist_lock);
ret = __rt_schedulable(NULL, 0, 0);
- read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
return ret;
Hi,
On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
> From: Kirill Tkhai <[email protected]>
>
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
>
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
>
> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> [exception RIP: tg_rt_schedulable+463]
> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
>
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> kernel/sched/rt.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..a40535c604b8 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
> */
> static DEFINE_MUTEX(rt_constraints_mutex);
>
> -/* Must be called with tasklist_lock held */
> static inline int tg_has_rt_tasks(struct task_group *tg)
> {
> - struct task_struct *g, *p;
> + struct task_struct *task;
> + struct css_task_iter it;
> + int ret = 0;
>
> /*
> * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> if (task_group_is_autogroup(tg))
> return 0;
>
> - for_each_process_thread(g, p) {
> - if (rt_task(p) && task_group(p) == tg)
> - return 1;
> + css_task_iter_start(&tg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (rt_task(task)) {
> + ret = 1;
> + break;
> + }
> }
> + css_task_iter_end(&it);
>
> - return 0;
> + return ret;
> }
>
> struct rt_schedulable_data {
> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> return -EINVAL;
>
> mutex_lock(&rt_constraints_mutex);
> - read_lock(&tasklist_lock);
> err = __rt_schedulable(tg, rt_period, rt_runtime);
> if (err)
> goto unlock;
> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> }
> raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
> unlock:
> - read_unlock(&tasklist_lock);
> mutex_unlock(&rt_constraints_mutex);
>
> return err;
> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
> int ret = 0;
>
> mutex_lock(&rt_constraints_mutex);
> - read_lock(&tasklist_lock);
> ret = __rt_schedulable(NULL, 0, 0);
> - read_unlock(&tasklist_lock);
> mutex_unlock(&rt_constraints_mutex);
>
> return ret;
Sorry to necro this...
I have a customer case that has hit the issue here. It looks like this
fix didn't end up going in.
The only way I could reproduce it myself was to add a udelay(2) to
the tg_has_rt_tasks() loop. With this patch applied, nothing bad happens
even with the udelay.
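Roughly, the hack looks like this (illustrative sketch, not a posted patch;
it just inflates the per-task cost of the old loop so the long tasklist_lock
hold time becomes easy to hit):

	/* needs <linux/delay.h> */
	for_each_process_thread(g, p) {
		udelay(2);		/* exaggerate per-task cost */
		if (rt_task(p) && task_group(p) == tg)
			return 1;
	}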
Kirill, did you have a good way to reproduce it without adding
delays?
Peter, is there any chance of taking something like this?
Is there a better fix?
Thanks,
Phil
--
On 24.01.2020 00:56, Phil Auld wrote:
> Hi,
>
> On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
>> From: Kirill Tkhai <[email protected]>
>>
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
>> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>> [exception RIP: tg_rt_schedulable+463]
>> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
>> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
>> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
>> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
>> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
>> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to iterate over
>> task group process list instead of iteration over all tasks list.
>> This makes the function to scale well, and reduces its execution
>> time.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> kernel/sched/rt.c | 21 +++++++++++----------
>> 1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 7aef6b4e885a..a40535c604b8 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
>> */
>> static DEFINE_MUTEX(rt_constraints_mutex);
>>
>> -/* Must be called with tasklist_lock held */
>> static inline int tg_has_rt_tasks(struct task_group *tg)
>> {
>> - struct task_struct *g, *p;
>> + struct task_struct *task;
>> + struct css_task_iter it;
>> + int ret = 0;
>>
>> /*
>> * Autogroups do not have RT tasks; see autogroup_create().
>> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>> if (task_group_is_autogroup(tg))
>> return 0;
>>
>> - for_each_process_thread(g, p) {
>> - if (rt_task(p) && task_group(p) == tg)
>> - return 1;
>> + css_task_iter_start(&tg->css, 0, &it);
>> + while ((task = css_task_iter_next(&it))) {
>> + if (rt_task(task)) {
>> + ret = 1;
>> + break;
>> + }
>> }
>> + css_task_iter_end(&it);
>>
>> - return 0;
>> + return ret;
>> }
>>
>> struct rt_schedulable_data {
>> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>> return -EINVAL;
>>
>> mutex_lock(&rt_constraints_mutex);
>> - read_lock(&tasklist_lock);
>> err = __rt_schedulable(tg, rt_period, rt_runtime);
>> if (err)
>> goto unlock;
>> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>> }
>> raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
>> unlock:
>> - read_unlock(&tasklist_lock);
>> mutex_unlock(&rt_constraints_mutex);
>>
>> return err;
>> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
>> int ret = 0;
>>
>> mutex_lock(&rt_constraints_mutex);
>> - read_lock(&tasklist_lock);
>> ret = __rt_schedulable(NULL, 0, 0);
>> - read_unlock(&tasklist_lock);
>> mutex_unlock(&rt_constraints_mutex);
>>
>> return ret;
>
> Sorry to necro this...
>
> I have a customer case that has hit the issue here. It looks like this
> fix didn't end up going in.
>
> The only way I could reproduce it myself was to add a udelay(2) to
> the tg_has_rt_tasks loop. With this patch applied, even with the
> udelay nothing bad happens.
>
> Kirill, did you have a good way to reproduce it without adding
> delays?
I have no reproducer; it just used to fire on some of our customer nodes.
It depends on many parameters, and the main one is the workload. Also, I think
interrupt affinity may be involved (whether interrupts are bound to a small
subset of cpus or distributed over all cores). If something is possible
theoretically, it occurs in practice with some probability.

We fixed this with an out-of-tree patch in our vzkernel tree, so we haven't
seen a reproduction since.
> Peter, is there any chance of taking something like this?
>
> Is there a better fix?
On Fri, Jan 24, 2020 at 12:09:20PM +0300 Kirill Tkhai wrote:
> On 24.01.2020 00:56, Phil Auld wrote:
> > Hi,
> >
> > On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
> >> From: Kirill Tkhai <[email protected]>
> >>
> >> tg_rt_schedulable() iterates over all child task groups,
> >> while tg_has_rt_tasks() iterates over all linked tasks.
> >> In case of systems with big number of tasks, this may
> >> take a lot of time.
> >>
> >> I observed hard LOCKUP on machine with 20000+ processes
> >> after write to "cpu.rt_period_us" of cpu cgroup with
> >> 39 children. The problem occurred because of tasklist_lock
> >> is held for a long time and other processes can't do fork().
> >>
> >> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
> >> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> >> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> >> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> >> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> >> [exception RIP: tg_rt_schedulable+463]
> >> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
> >> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
> >> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
> >> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
> >> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
> >> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
> >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> >> --- <NMI exception stack> ---
> >> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> >> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> >> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> >> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> >> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> >> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> >> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> >> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> >>
> >> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> >> task group process list instead of iteration over all tasks list.
> >> This makes the function to scale well, and reduces its execution
> >> time.
> >>
> >> Note, that since tasklist_lock doesn't protect a task against
> >> sched_class changing, we don't introduce new races in comparison
> >> to that we had before.
> >>
> >> Signed-off-by: Kirill Tkhai <[email protected]>
> >> ---
> >> kernel/sched/rt.c | 21 +++++++++++----------
> >> 1 file changed, 11 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index 7aef6b4e885a..a40535c604b8 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
> >> */
> >> static DEFINE_MUTEX(rt_constraints_mutex);
> >>
> >> -/* Must be called with tasklist_lock held */
> >> static inline int tg_has_rt_tasks(struct task_group *tg)
> >> {
> >> - struct task_struct *g, *p;
> >> + struct task_struct *task;
> >> + struct css_task_iter it;
> >> + int ret = 0;
> >>
> >> /*
> >> * Autogroups do not have RT tasks; see autogroup_create().
> >> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> >> if (task_group_is_autogroup(tg))
> >> return 0;
> >>
> >> - for_each_process_thread(g, p) {
> >> - if (rt_task(p) && task_group(p) == tg)
> >> - return 1;
> >> + css_task_iter_start(&tg->css, 0, &it);
> >> + while ((task = css_task_iter_next(&it))) {
> >> + if (rt_task(task)) {
> >> + ret = 1;
> >> + break;
> >> + }
> >> }
> >> + css_task_iter_end(&it);
> >>
> >> - return 0;
> >> + return ret;
> >> }
> >>
> >> struct rt_schedulable_data {
> >> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> >> return -EINVAL;
> >>
> >> mutex_lock(&rt_constraints_mutex);
> >> - read_lock(&tasklist_lock);
> >> err = __rt_schedulable(tg, rt_period, rt_runtime);
> >> if (err)
> >> goto unlock;
> >> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> >> }
> >> raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
> >> unlock:
> >> - read_unlock(&tasklist_lock);
> >> mutex_unlock(&rt_constraints_mutex);
> >>
> >> return err;
> >> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
> >> int ret = 0;
> >>
> >> mutex_lock(&rt_constraints_mutex);
> >> - read_lock(&tasklist_lock);
> >> ret = __rt_schedulable(NULL, 0, 0);
> >> - read_unlock(&tasklist_lock);
> >> mutex_unlock(&rt_constraints_mutex);
> >>
> >> return ret;
> >
> > Sorry to necro this...
> >
> > I have a customer case that has hit the issue here. It looks like this
> > fix didn't end up going in.
> >
> > The only way I could reproduce it myself was to add a udelay(2) to
> > the tg_has_rt_tasks loop. With this patch applied, even with the
> > udelay nothing bad happens.
> >
> > Kirill, did you have a good way to reproduce it without adding
> > delays?
>
> I have no a reproducer, it just used to fire on some of our customer nodes.
> It depends on many parameters, and main is workload. Also, I think, interrupts
> affinity may be involved (whether they bound to small subset of cpus, or they
> distributed over all cores). If this is possible theoretically, it occurs
> practically with some probability.
>
> We fixed this with out-of-tree patch in our vzkernel tree, so we haven't seen
> a reproduction anymore.
>
Thanks Kirill.
Fwiw, it looks like Konstantin Khlebnikov has picked up this change in
a newer patch, so if that can get in then this would be cleared up too.
https://lkml.org/lkml/2020/1/25/167
Cheers,
Phil
> > Peter, is there any chance of taking something like this?
> >
> > Is there a better fix?
>
--
On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> Peter, is there any chance of taking something like this?
Whoopsy, looks like this fell on the floor. Can do, I suppose.
Thanks!
On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > Peter, is there any chance of taking something like this?
>
> Whoopsy, looks like this fell on the floor. Can do I suppose.
>
> Thanks!
>
Thanks. It probably makes sense at this point to use Konstantin's new version?
But they both do the trick.
Cheers,
Phil
--
On Mon, Jan 27, 2020 at 11:56:38AM -0500, Phil Auld wrote:
> On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> > On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > > Peter, is there any chance of taking something like this?
> >
> > Whoopsy, looks like this fell on the floor. Can do I suppose.
> >
> > Thanks!
> >
>
> Thanks. Probably make sense at this point to use Konstantin's new version?
> But they both do the trick.
Care to send it along?
On Mon, Jan 27, 2020 at 06:00:10PM +0100 Peter Zijlstra wrote:
> On Mon, Jan 27, 2020 at 11:56:38AM -0500, Phil Auld wrote:
> > On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> > > On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > > > Peter, is there any chance of taking something like this?
> > >
> > > Whoopsy, looks like this fell on the floor. Can do I suppose.
> > >
> > > Thanks!
> > >
> >
> > Thanks. Probably make sense at this point to use Konstantin's new version?
> > But they both do the trick.
>
> Care to send it along?
>
I think you were on the CC, but here's a link:
https://lkml.org/lkml/2020/1/25/167
It's called:
[PATCH] sched/rt: optimize checking group rt scheduler constraints
It does a little bit more and needs an "|=" changed to "=". And actually,
looking at it again, it should probably just have a break in the loop.
I'll make that comment in that patch's thread.
Thanks,
Phil
--