Commit 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
allow find_lock_lowest_rq to pick a task with migration disabled.
This commit is intended to push the current running task on this CPU
away.
There is a race scenario, which allows a migration disabled task to
be migrated to another CPU.
When there is a RT task with higher priority, rt sched class was
intended to migrate higher priority task to lowest rq via push_rt_tasks,
this WARNING will happen here.
With the system running on PREEMPT_RT, rt_spin_lock will disable
migration, this will make the problem easier to reproduce.
I have seen this WARNING on PREEMPT_RT, from the logs, there is a race
when trying to migrate higher priority tasks to the lowest rq.
Please refer to the following scenarios.
CPU0 CPU1
------------------------------------------------------------------
push_rt_task
check is_migration_disabled(next_task)
task not running and
migration_disabled == 0
find_lock_lowest_rq(next_task, rq);
_double_lock_balance(this_rq, busiest);
raw_spin_rq_unlock(this_rq);
double_rq_lock(this_rq, busiest);
<<wait for busiest rq>>
<wakeup>
task become running
migrate_disable();
<context out>
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
WARN_ON_ONCE(is_migration_disabled(p));
---------OOPS-------------
Crash logs are as follows:
[123671.996430] WARNING: CPU: 2 PID: 13470 at kernel/sched/core.c:2485
set_task_cpu+0x8c/0x108
[123671.996800] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[123671.996811] pc : set_task_cpu+0x8c/0x108
[123671.996820] lr : set_task_cpu+0x7c/0x108
[123671.996828] sp : ffff80001268bd30
[123671.996832] pmr_save: 00000060
[123671.996835] x29: ffff80001268bd30 x28: ffff0001a3d68e80
[123671.996844] x27: ffff80001225f4a8 x26: ffff800010ab62cb
[123671.996854] x25: ffff80026d95e000 x24: 0000000000000005
[123671.996864] x23: ffff00019746c1b0 x22: 0000000000000000
[123671.996873] x21: ffff00027ee33a80 x20: 0000000000000000
[123671.996882] x19: ffff00019746ba00 x18: 0000000000000000
[123671.996890] x17: 0000000000000000 x16: 0000000000000000
[123671.996899] x15: 000000000000000a x14: 000000000000349e
[123671.996908] x13: ffff800012f4503d x12: 0000000000000001
[123671.996916] x11: 0000000000000000 x10: 0000000000000000
[123671.996925] x9 : 00000000000c0000 x8 : ffff00027ee58700
[123671.996933] x7 : ffff00027ee8da80 x6 : ffff00027ee8e580
[123671.996942] x5 : ffff00027ee8dcc0 x4 : 0000000000000005
[123671.996951] x3 : ffff00027ee8e338 x2 : 0000000000000000
[123671.996959] x1 : 00000000000000ff x0 : 0000000000000002
[123671.996969] Call trace:
[123671.996975] set_task_cpu+0x8c/0x108
[123671.996984] push_rt_task.part.0+0x144/0x184
[123671.996995] push_rt_tasks+0x28/0x3c
[123671.997002] task_woken_rt+0x58/0x68
[123671.997009] ttwu_do_wakeup+0x5c/0xd0
[123671.997019] ttwu_do_activate+0xc0/0xd4
[123671.997028] try_to_wake_up+0x244/0x288
[123671.997036] wake_up_process+0x18/0x24
[123671.997045] __irq_wake_thread+0x64/0x80
[123671.997056] __handle_irq_event_percpu+0x110/0x124
[123671.997064] handle_irq_event_percpu+0x50/0xac
[123671.997072] handle_irq_event+0x84/0xfc
To fix it, we need to check migration_disabled flag again to avoid
bad migration.
Fixes: 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
CC: Valentin Schneider <[email protected]>
Signed-off-by: Schspa Shi <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
--
Changelog:
v1 -> v2:
- Modify commit message to add fixed commit information.
- Going to retry to push the current running task on this CPU
away, instead doing nothing for this migrate disabled task.
v2 -> v3:
- Change migration disabled check to the correct position
v3 -> v4:
- Check migrate disabled in find_lock_lowest_rq to avoid not
necessary check when task rq is not released as Steven advised.
v4 -> v5:
- Adjust the comment as Steve advised to make it clear.
v5 -> v6:
- Adjust the comment again as Steve advised.
v6 -> v7:
- Add missing put_task_struct && add this task migration
disable check to deadline scheduler too as Dietmar advised.
v7 -> v8:
- Change the BUG on comments to WARN to avoid misunderstanding.
- Change the comments on DL case for resched_curr rather than
push the current task.
---
kernel/sched/deadline.c | 1 +
kernel/sched/rt.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0ab79d819a0d6..e7eea6cde5cb9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2243,6 +2243,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
!cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
task_running(rq, task) ||
!dl_task(task) ||
+ is_migration_disabled(task) ||
!task_on_rq_queued(task))) {
double_unlock_balance(rq, later_rq);
later_rq = NULL;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55f39c8f42032..57e8cd5c9c267 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2001,11 +2001,15 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
* the mean time, task could have
* migrated already or had its affinity changed.
* Also make sure that it wasn't scheduled on its rq.
+ * It is possible the task was scheduled, set
+ * "migrate_disabled" and then got preempted, so we must
+ * check the task migration disable flag here too.
*/
if (unlikely(task_rq(task) != rq ||
!cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
task_running(rq, task) ||
!rt_task(task) ||
+ is_migration_disabled(task) ||
!task_on_rq_queued(task))) {
double_unlock_balance(rq, lowest_rq);
--
2.37.2
When the task to push disable migration, retry to push the current
running task on this CPU away, instead doing nothing for this migrate
disabled task.
CC: Valentin Schneider <[email protected]>
Signed-off-by: Schspa Shi <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 13 ++++++++++++-
kernel/sched/deadline.c | 9 +++++++++
kernel/sched/rt.c | 8 ++++++++
3 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0c..056b336c29e70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2503,8 +2503,19 @@ int push_cpu_stop(void *arg)
if (p->sched_class->find_lock_rq)
lowest_rq = p->sched_class->find_lock_rq(p, rq);
- if (!lowest_rq)
+ if (!lowest_rq) {
+ /*
+ * The find_lock_rq function above could have released the rq
+ * lock and allow p to schedule and be preempted again, and
+ * that lowest_rq could be NULL because p now has the
+ * migrate_disable flag set and not because it could not find
+ * the lowest rq. So we must check task migration flag again.
+ */
+ if (unlikely(is_migration_disabled(p)))
+ p->migration_flags |= MDF_PUSH;
+
goto out_unlock;
+ }
// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e7eea6cde5cb9..c8055b978dbc3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2340,6 +2340,15 @@ static int push_dl_task(struct rq *rq)
*/
task = pick_next_pushable_dl_task(rq);
if (task == next_task) {
+ /*
+ * If next task has now disabled migrating, see if we
+ * can do resched_curr().
+ */
+ if (unlikely(is_migration_disabled(task))) {
+ put_task_struct(next_task);
+ goto retry;
+ }
+
/*
* The task is still there. We don't try
* again, some other CPU will pull it when ready.
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 57e8cd5c9c267..381ec05eb2701 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2139,6 +2139,14 @@ static int push_rt_task(struct rq *rq, bool pull)
*/
task = pick_next_pushable_task(rq);
if (task == next_task) {
+ /*
+ * If next task has now disabled migrating, see if we
+ * can push the current task.
+ */
+ if (unlikely(is_migration_disabled(task))) {
+ put_task_struct(next_task);
+ goto retry;
+ }
/*
* The task hasn't migrated, and is still the next
* eligible task, but we failed to find a run-queue
--
2.37.2
On 29/08/22 01:03, Schspa Shi wrote:
> When the task to push disable migration, retry to push the current
> running task on this CPU away, instead doing nothing for this migrate
> disabled task.
>
> CC: Valentin Schneider <[email protected]>
> Signed-off-by: Schspa Shi <[email protected]>
> Reviewed-by: Steven Rostedt (Google) <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/core.c | 13 ++++++++++++-
> kernel/sched/deadline.c | 9 +++++++++
> kernel/sched/rt.c | 8 ++++++++
> 3 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee28253c9ac0c..056b336c29e70 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2503,8 +2503,19 @@ int push_cpu_stop(void *arg)
> if (p->sched_class->find_lock_rq)
> lowest_rq = p->sched_class->find_lock_rq(p, rq);
>
> - if (!lowest_rq)
> + if (!lowest_rq) {
> + /*
> + * The find_lock_rq function above could have released the rq
> + * lock and allow p to schedule and be preempted again, and
> + * that lowest_rq could be NULL because p now has the
> + * migrate_disable flag set and not because it could not find
> + * the lowest rq. So we must check task migration flag again.
> + */
> + if (unlikely(is_migration_disabled(p)))
> + p->migration_flags |= MDF_PUSH;
> +
Given p has to be on this rq initially, this implies p being migrated away
to become migration_disabled() (it *can't* be scheduled while the stopper
is running), in which case it's not on this rq anymore, so do we care?
> goto out_unlock;
> + }
>
> // XXX validate p is still the highest prio task
> if (task_rq(p) == rq) {
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index e7eea6cde5cb9..c8055b978dbc3 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2340,6 +2340,15 @@ static int push_dl_task(struct rq *rq)
> */
> task = pick_next_pushable_dl_task(rq);
> if (task == next_task) {
> + /*
> + * If next task has now disabled migrating, see if we
> + * can do resched_curr().
> + */
> + if (unlikely(is_migration_disabled(task))) {
> + put_task_struct(next_task);
> + goto retry;
> + }
> +
> /*
> * The task is still there. We don't try
> * again, some other CPU will pull it when ready.
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 57e8cd5c9c267..381ec05eb2701 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2139,6 +2139,14 @@ static int push_rt_task(struct rq *rq, bool pull)
> */
> task = pick_next_pushable_task(rq);
> if (task == next_task) {
> + /*
> + * If next task has now disabled migrating, see if we
> + * can push the current task.
> + */
> + if (unlikely(is_migration_disabled(task))) {
> + put_task_struct(next_task);
> + goto retry;
> + }
Similarly here, if the task has been through a switch-in / switch-out
cycle, then at least for RT we'd have
set_next_task_rt()
`\
rt_queue_push_tasks()
which will take care of it.
If the task is preempted by e.g. a DL task, then the retry would fail on
(next_task->prio < rq->curr->prio)
and I'm thinking the same logic applies to the deadline.c. IOW, it looks
like we're already doing the right thing here when the task gets scheduled
out, so I don't think we need any of this.
> /*
> * The task hasn't migrated, and is still the next
> * eligible task, but we failed to find a run-queue
> --
> 2.37.2
On 29/08/22 01:03, Schspa Shi wrote:
> Commit 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
> allow find_lock_lowest_rq to pick a task with migration disabled.
> This commit is intended to push the current running task on this CPU
> away.
>
> There is a race scenario, which allows a migration disabled task to
> be migrated to another CPU.
>
> When there is a RT task with higher priority, rt sched class was
> intended to migrate higher priority task to lowest rq via push_rt_tasks,
> this WARNING will happen here.
>
> With the system running on PREEMPT_RT, rt_spin_lock will disable
> migration, this will make the problem easier to reproduce.
>
> I have seen this WARNING on PREEMPT_RT, from the logs, there is a race
> when trying to migrate higher priority tasks to the lowest rq.
>
> Please refer to the following scenarios.
>
> CPU0 CPU1
> ------------------------------------------------------------------
> push_rt_task
> check is_migration_disabled(next_task)
> task not running and
> migration_disabled == 0
> find_lock_lowest_rq(next_task, rq);
> _double_lock_balance(this_rq, busiest);
> raw_spin_rq_unlock(this_rq);
> double_rq_lock(this_rq, busiest);
> <<wait for busiest rq>>
> <wakeup>
> task become running
> migrate_disable();
> <context out>
> deactivate_task(rq, next_task, 0);
> set_task_cpu(next_task, lowest_rq->cpu);
> WARN_ON_ONCE(is_migration_disabled(p));
> ---------OOPS-------------
>
> Crash logs are as follows:
> [123671.996430] WARNING: CPU: 2 PID: 13470 at kernel/sched/core.c:2485
> set_task_cpu+0x8c/0x108
> [123671.996800] pstate: 20400009 (nzCv daif +PAN -UAO -TCO BTYPE=--)
> [123671.996811] pc : set_task_cpu+0x8c/0x108
> [123671.996820] lr : set_task_cpu+0x7c/0x108
> [123671.996828] sp : ffff80001268bd30
> [123671.996832] pmr_save: 00000060
> [123671.996835] x29: ffff80001268bd30 x28: ffff0001a3d68e80
> [123671.996844] x27: ffff80001225f4a8 x26: ffff800010ab62cb
> [123671.996854] x25: ffff80026d95e000 x24: 0000000000000005
> [123671.996864] x23: ffff00019746c1b0 x22: 0000000000000000
> [123671.996873] x21: ffff00027ee33a80 x20: 0000000000000000
> [123671.996882] x19: ffff00019746ba00 x18: 0000000000000000
> [123671.996890] x17: 0000000000000000 x16: 0000000000000000
> [123671.996899] x15: 000000000000000a x14: 000000000000349e
> [123671.996908] x13: ffff800012f4503d x12: 0000000000000001
> [123671.996916] x11: 0000000000000000 x10: 0000000000000000
> [123671.996925] x9 : 00000000000c0000 x8 : ffff00027ee58700
> [123671.996933] x7 : ffff00027ee8da80 x6 : ffff00027ee8e580
> [123671.996942] x5 : ffff00027ee8dcc0 x4 : 0000000000000005
> [123671.996951] x3 : ffff00027ee8e338 x2 : 0000000000000000
> [123671.996959] x1 : 00000000000000ff x0 : 0000000000000002
> [123671.996969] Call trace:
> [123671.996975] set_task_cpu+0x8c/0x108
> [123671.996984] push_rt_task.part.0+0x144/0x184
> [123671.996995] push_rt_tasks+0x28/0x3c
> [123671.997002] task_woken_rt+0x58/0x68
> [123671.997009] ttwu_do_wakeup+0x5c/0xd0
> [123671.997019] ttwu_do_activate+0xc0/0xd4
> [123671.997028] try_to_wake_up+0x244/0x288
> [123671.997036] wake_up_process+0x18/0x24
> [123671.997045] __irq_wake_thread+0x64/0x80
> [123671.997056] __handle_irq_event_percpu+0x110/0x124
> [123671.997064] handle_irq_event_percpu+0x50/0xac
> [123671.997072] handle_irq_event+0x84/0xfc
>
> To fix it, we need to check migration_disabled flag again to avoid
> bad migration.
>
> Fixes: 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
>
> CC: Valentin Schneider <[email protected]>
> Signed-off-by: Schspa Shi <[email protected]>
> Reviewed-by: Steven Rostedt (Google) <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
>
Sorry, I lost track of that one, and ironically it looks like we've hit
this bug internally. I'm going to test whether this is the cure we need,
but even if this isn't the same issue, that patch looks good:
Reviewed-by: Valentin Schneider <[email protected]>
I have some more comments on 2/2, but IMO this can go in on its own.
On 06/04/23 12:55, Valentin Schneider wrote:
>
> Sorry, I lost track of that one, and ironically it looks like we've hit
> this bug internally. I'm going to test whether this is the cure we need,
> but even if this isn't the same issue, that patch looks good:
>
> Reviewed-by: Valentin Schneider <[email protected]>
>
> I have some more comments on 2/2, but IMO this can go in on its own.
Seems like we were indeed hitting that bug, as we used to have a MTTF of
about 2 hours, and we haven't had any with that patch in 3 days.
Tested-By: Dwaine Gonyier <[email protected]>
Valentin Schneider <[email protected]> writes:
> On 29/08/22 01:03, Schspa Shi wrote:
>> When the task to push disable migration, retry to push the current
>> running task on this CPU away, instead doing nothing for this migrate
>> disabled task.
>>
>> CC: Valentin Schneider <[email protected]>
>> Signed-off-by: Schspa Shi <[email protected]>
>> Reviewed-by: Steven Rostedt (Google) <[email protected]>
>> Reviewed-by: Dietmar Eggemann <[email protected]>
>> ---
>> kernel/sched/core.c | 13 ++++++++++++-
>> kernel/sched/deadline.c | 9 +++++++++
>> kernel/sched/rt.c | 8 ++++++++
>> 3 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ee28253c9ac0c..056b336c29e70 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2503,8 +2503,19 @@ int push_cpu_stop(void *arg)
>> if (p->sched_class->find_lock_rq)
>> lowest_rq = p->sched_class->find_lock_rq(p, rq);
>>
>> - if (!lowest_rq)
>> + if (!lowest_rq) {
>> + /*
>> + * The find_lock_rq function above could have released the rq
>> + * lock and allow p to schedule and be preempted again, and
>> + * that lowest_rq could be NULL because p now has the
>> + * migrate_disable flag set and not because it could not find
>> + * the lowest rq. So we must check task migration flag again.
>> + */
>> + if (unlikely(is_migration_disabled(p)))
>> + p->migration_flags |= MDF_PUSH;
>> +
>
> Given p has to be on this rq initially, this implies p being migrated away
> to become migration_disabled() (it *can't* be scheduled while the stopper
> is running), in which case it's not on this rq anymore, so do we care?
>
Yes, you are right, we have already have a correct handle for this.
>> goto out_unlock;
>> + }
>>
>> // XXX validate p is still the highest prio task
>> if (task_rq(p) == rq) {
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index e7eea6cde5cb9..c8055b978dbc3 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2340,6 +2340,15 @@ static int push_dl_task(struct rq *rq)
>> */
>> task = pick_next_pushable_dl_task(rq);
>> if (task == next_task) {
>> + /*
>> + * If next task has now disabled migrating, see if we
>> + * can do resched_curr().
>> + */
>> + if (unlikely(is_migration_disabled(task))) {
>> + put_task_struct(next_task);
>> + goto retry;
>> + }
>> +
>> /*
>> * The task is still there. We don't try
>> * again, some other CPU will pull it when ready.
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 57e8cd5c9c267..381ec05eb2701 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2139,6 +2139,14 @@ static int push_rt_task(struct rq *rq, bool pull)
>> */
>> task = pick_next_pushable_task(rq);
>> if (task == next_task) {
>> + /*
>> + * If next task has now disabled migrating, see if we
>> + * can push the current task.
>> + */
>> + if (unlikely(is_migration_disabled(task))) {
>> + put_task_struct(next_task);
>> + goto retry;
>> + }
>
> Similarly here, if the task has been through a switch-in / switch-out
> cycle, then at least for RT we'd have
>
> set_next_task_rt()
> `\
> rt_queue_push_tasks()
>
> which will take care of it.
>
Yes, it will take care of this.
> If the task is preempted by e.g. a DL task, then the retry would fail on
>
> (next_task->prio < rq->curr->prio)
>
It may fail most of the time, but push_rt_task can run on a different
CPU (the rq != this_rq()), and the rq->curr can be changed. the retry
won't fail in this case. It is the same with the deadline.c.
> and I'm thinking the same logic applies to the deadline.c. IOW, it looks
> like we're already doing the right thing here when the task gets scheduled
> out, so I don't think we need any of this.
>
>> /*
>> * The task hasn't migrated, and is still the next
>> * eligible task, but we failed to find a run-queue
>> --
>> 2.37.2
--
BRs
Schspa Shi