2020-06-22 10:19:55

by Peter Zijlstra

Subject: [PATCH -v2 1/5] sched: Fix ttwu() race

Paul reported rcutorture occasionally hitting a NULL deref:

sched_ttwu_pending()
ttwu_do_wakeup()
check_preempt_curr() := check_preempt_wakeup()
find_matching_se()
is_same_group()
if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*
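For context, the CFS helpers at the bottom of that chain look roughly
like this (a simplified sketch of kernel/sched/fair.c from that era;
the depth-matching in find_matching_se() is elided):

        static inline struct cfs_rq *
        is_same_group(struct sched_entity *se, struct sched_entity *pse)
        {
                if (se->cfs_rq == pse->cfs_rq) /* derefs se/pse; a NULL parent => *BOOM* */
                        return se->cfs_rq;

                return NULL;
        }

        static void
        find_matching_se(struct sched_entity **se, struct sched_entity **pse)
        {
                /* walk both hierarchies up until the entities share a cfs_rq */
                while (!is_same_group(*se, *pse)) {
                        *se = parent_entity(*se);   /* NULL once the hierarchy runs out */
                        *pse = parent_entity(*pse);
                }
        }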

Debugging showed that this only appears to happen when we take the new
code-path from commit:

2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

smp_cond_load_acquire(&p->on_cpu, !VAL);

and if 'cpu == smp_processor_id() && p->on_cpu' were possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.
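For reference, smp_cond_load_acquire() is essentially this busy-wait (a
simplified sketch of the generic fallback in asm-generic/barrier.h):

        /*
         * Spin until cond_expr holds for *ptr, then imply ACQUIRE
         * ordering. With IRQs disabled, spinning on our _own_
         * p->on_cpu can never terminate: the task can only clear it
         * by scheduling, which this CPU is busy preventing.
         */
        #define smp_cond_load_acquire(ptr, cond_expr) ({        \
                typeof(*(ptr)) VAL;                             \
                for (;;) {                                      \
                        VAL = READ_ONCE(*(ptr));                \
                        if (cond_expr)                          \
                                break;                          \
                        cpu_relax();                            \
                }                                               \
                smp_acquire__after_ctrl_dep();                  \
                VAL;                                            \
        })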

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1: the task really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner, because p->se.cfs_rq != rq->cfs_rq: p's cfs_rq is from
the wrong CPU, therefore we'll iterate into the non-existent parents
and NULL deref.
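The per-CPU linkage that goes stale here is (re)pointed by set_task_rq(),
roughly (a sketch for CONFIG_FAIR_GROUP_SCHED; the other classes and the
PELT hand-off are elided):

        static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
        {
                struct task_group *tg = task_group(p);

                /* repoint p's sched_entity at the new CPU's per-group cfs_rq */
                p->se.cfs_rq = tg->cfs_rq[cpu];
                p->se.parent = tg->se[cpu];
        }

so enqueueing on the local rq while p->se.cfs_rq still belongs to the
remote CPU produces exactly the mismatch above.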

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):


X->cpu = 1
rq(1)->curr = X


CPU0                            CPU1                            CPU2

                                // switch away from X
                                LOCK rq(1)->lock
                                smp_mb__after_spinlock
                                dequeue_task(X)
                                X->on_rq = 0
                                switch_to(Z)
                                X->on_cpu = 0
                                UNLOCK rq(1)->lock


                                                                // migrate X to cpu 0
                                                                LOCK rq(1)->lock
                                                                dequeue_task(X)
                                                                set_task_cpu(X, 0)
                                                                X->cpu = 0
                                                                UNLOCK rq(1)->lock

                                                                LOCK rq(0)->lock
                                                                enqueue_task(X)
                                                                X->on_rq = 1
                                                                UNLOCK rq(0)->lock

// switch to X
LOCK rq(0)->lock
smp_mb__after_spinlock
switch_to(X)
X->on_cpu = 1
UNLOCK rq(0)->lock

// X goes sleep
X->state = TASK_UNINTERRUPTIBLE
smp_mb();                       // wake X
                                ttwu()
                                LOCK X->pi_lock
                                smp_mb__after_spinlock

                                if (p->state)

                                cpu = X->cpu; // =? 1

                                smp_rmb()

// X calls schedule()
LOCK rq(0)->lock
smp_mb__after_spinlock
dequeue_task(X)
X->on_rq = 0

                                if (p->on_rq)

                                smp_rmb();

                                if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                smp_cond_load_acquire(&p->on_cpu, !VAL)

                                cpu = select_task_rq(X, X->wake_cpu, ...)
                                if (X->cpu != cpu)
switch_to(Y)
X->on_cpu = 0
UNLOCK rq(0)->lock


However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.
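In message-passing form, the x86 argument is roughly this (a sketch,
not a formal litmus test):

        /* writer: the migrate + sleep sequence, chained via rq locks */
        WRITE_ONCE(X->cpu, 0);                  /* set_task_cpu()       */
        smp_mb();                               /* any LOCK op on x86   */
        WRITE_ONCE(X->state, TASK_UNINTERRUPTIBLE);

        /* reader: ttwu(); x86-TSO keeps these loads in program order  */
        state = READ_ONCE(p->state);            /* observes !RUNNING    */
        cpu   = READ_ONCE(p->cpu);              /* must then observe 0  */

On a weaker memory model nothing orders the reader's two loads, which
is the stale task_cpu(p) window the fix below closes.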

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
---
kernel/sched/core.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,8 +2293,15 @@ void sched_ttwu_pending(void *arg)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

- llist_for_each_entry_safe(p, t, llist, wake_entry)
+ llist_for_each_entry_safe(p, t, llist, wake_entry) {
+ if (WARN_ON_ONCE(p->on_cpu))
+ smp_cond_load_acquire(p->on_cpu, !VAL);
+
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ set_task_cpu(p, cpu_of(rq));
+
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ }

rq_unlock_irqrestore(rq, &rf);
}
@@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (WARN_ON_ONCE(cpu == smp_processor_id()))
+ return false;
+
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_wakelist(p, cpu, wake_flags);
return true;
@@ -2550,7 +2560,6 @@ try_to_wake_up(struct task_struct *p, un

/* We're going to change ->state: */
success = 1;
- cpu = task_cpu(p);

/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
@@ -2614,8 +2623,21 @@ try_to_wake_up(struct task_struct *p, un
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
+ *
+ * Ensure we load task_cpu(p) after p->on_cpu:
+ *
+ *  set_task_cpu(p, cpu);
+ *    STORE p->cpu = @cpu
+ *  __schedule() (switch to task 'p')
+ *    LOCK rq->lock
+ *    smp_mb__after_spinlock()          smp_cond_load_acquire(&p->on_cpu)
+ *    STORE p->on_cpu = 1               LOAD p->cpu
+ *
+ * to ensure we observe the correct CPU on which the task is currently
+ * scheduling.
*/
- if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
+ if (smp_load_acquire(&p->on_cpu) &&
+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
goto unlock;

/*
@@ -2635,6 +2657,8 @@ try_to_wake_up(struct task_struct *p, un
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
+#else
+ cpu = task_cpu(p);
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);



2020-06-22 13:00:29

by Peter Zijlstra

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race


*sigh*, this one should actually build and I got a smatch report that
there's an uninitialized usage of @cpu, so I shuffled that around a bit.

---
Subject: sched: Fix ttwu() race
From: Peter Zijlstra <[email protected]>
Date: Mon, 22 Jun 2020 12:01:23 +0200

Paul reported rcutorture occasionally hitting a NULL deref:

sched_ttwu_pending()
ttwu_do_wakeup()
check_preempt_curr() := check_preempt_wakeup()
find_matching_se()
is_same_group()
if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

smp_cond_load_acquire(&p->on_cpu, !VAL);

and if 'cpu == smp_processor_id() && p->on_cpu' were possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1: the task really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner, because p->se.cfs_rq != rq->cfs_rq: p's cfs_rq is from
the wrong CPU, therefore we'll iterate into the non-existent parents
and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):


X->cpu = 1
rq(1)->curr = X


CPU0                            CPU1                            CPU2

                                // switch away from X
                                LOCK rq(1)->lock
                                smp_mb__after_spinlock
                                dequeue_task(X)
                                X->on_rq = 0
                                switch_to(Z)
                                X->on_cpu = 0
                                UNLOCK rq(1)->lock


                                                                // migrate X to cpu 0
                                                                LOCK rq(1)->lock
                                                                dequeue_task(X)
                                                                set_task_cpu(X, 0)
                                                                X->cpu = 0
                                                                UNLOCK rq(1)->lock

                                                                LOCK rq(0)->lock
                                                                enqueue_task(X)
                                                                X->on_rq = 1
                                                                UNLOCK rq(0)->lock

// switch to X
LOCK rq(0)->lock
smp_mb__after_spinlock
switch_to(X)
X->on_cpu = 1
UNLOCK rq(0)->lock

// X goes sleep
X->state = TASK_UNINTERRUPTIBLE
smp_mb();                       // wake X
                                ttwu()
                                LOCK X->pi_lock
                                smp_mb__after_spinlock

                                if (p->state)

                                cpu = X->cpu; // =? 1

                                smp_rmb()

// X calls schedule()
LOCK rq(0)->lock
smp_mb__after_spinlock
dequeue_task(X)
X->on_rq = 0

                                if (p->on_rq)

                                smp_rmb();

                                if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                smp_cond_load_acquire(&p->on_cpu, !VAL)

                                cpu = select_task_rq(X, X->wake_cpu, ...)
                                if (X->cpu != cpu)
switch_to(Y)
X->on_cpu = 0
UNLOCK rq(0)->lock


However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
---
kernel/sched/core.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,8 +2293,15 @@ void sched_ttwu_pending(void *arg)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

- llist_for_each_entry_safe(p, t, llist, wake_entry)
+ llist_for_each_entry_safe(p, t, llist, wake_entry) {
+ if (WARN_ON_ONCE(p->on_cpu))
+ smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ set_task_cpu(p, cpu_of(rq));
+
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ }

rq_unlock_irqrestore(rq, &rf);
}
@@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (WARN_ON_ONCE(cpu == smp_processor_id()))
+ return false;
+
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_wakelist(p, cpu, wake_flags);
return true;
@@ -2528,7 +2538,6 @@ try_to_wake_up(struct task_struct *p, un
goto out;

success = 1;
- cpu = task_cpu(p);
trace_sched_waking(p);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2550,7 +2559,6 @@ try_to_wake_up(struct task_struct *p, un

/* We're going to change ->state: */
success = 1;
- cpu = task_cpu(p);

/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
@@ -2614,8 +2622,21 @@ try_to_wake_up(struct task_struct *p, un
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
+ *
+ * Ensure we load task_cpu(p) after p->on_cpu:
+ *
+ *  set_task_cpu(p, cpu);
+ *    STORE p->cpu = @cpu
+ *  __schedule() (switch to task 'p')
+ *    LOCK rq->lock
+ *    smp_mb__after_spinlock()          smp_cond_load_acquire(&p->on_cpu)
+ *    STORE p->on_cpu = 1               LOAD p->cpu
+ *
+ * to ensure we observe the correct CPU on which the task is currently
+ * scheduling.
*/
- if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
+ if (smp_load_acquire(&p->on_cpu) &&
+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
goto unlock;

/*
@@ -2635,6 +2656,8 @@ try_to_wake_up(struct task_struct *p, un
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
+#else
+ cpu = task_cpu(p);
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);
@@ -2642,7 +2665,7 @@ try_to_wake_up(struct task_struct *p, un
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
out:
if (success)
- ttwu_stat(p, cpu, wake_flags);
+ ttwu_stat(p, task_cpu(p), wake_flags);
preempt_enable();

return success;

2020-06-23 07:25:06

by tip-bot2 for Peter Zijlstra

Subject: [tip: sched/urgent] sched: Fix ttwu() race

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: dcb623b51fb9d424471da9acbd2cfa618ecf9a09
Gitweb: https://git.kernel.org/tip/dcb623b51fb9d424471da9acbd2cfa618ecf9a09
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 22 Jun 2020 12:01:23 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 22 Jun 2020 20:51:06 +02:00

sched: Fix ttwu() race

Paul reported rcutorture occasionally hitting a NULL deref:

sched_ttwu_pending()
ttwu_do_wakeup()
check_preempt_curr() := check_preempt_wakeup()
find_matching_se()
is_same_group()
if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

smp_cond_load_acquire(&p->on_cpu, !VAL);

and if 'cpu == smp_processor_id() && p->on_cpu' were possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1: the task really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner, because p->se.cfs_rq != rq->cfs_rq: p's cfs_rq is from
the wrong CPU, therefore we'll iterate into the non-existent parents
and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

X->cpu = 1
rq(1)->curr = X

CPU0                            CPU1                            CPU2

                                // switch away from X
                                LOCK rq(1)->lock
                                smp_mb__after_spinlock
                                dequeue_task(X)
                                X->on_rq = 0
                                switch_to(Z)
                                X->on_cpu = 0
                                UNLOCK rq(1)->lock


                                                                // migrate X to cpu 0
                                                                LOCK rq(1)->lock
                                                                dequeue_task(X)
                                                                set_task_cpu(X, 0)
                                                                X->cpu = 0
                                                                UNLOCK rq(1)->lock

                                                                LOCK rq(0)->lock
                                                                enqueue_task(X)
                                                                X->on_rq = 1
                                                                UNLOCK rq(0)->lock

// switch to X
LOCK rq(0)->lock
smp_mb__after_spinlock
switch_to(X)
X->on_cpu = 1
UNLOCK rq(0)->lock

// X goes sleep
X->state = TASK_UNINTERRUPTIBLE
smp_mb();                       // wake X
                                ttwu()
                                LOCK X->pi_lock
                                smp_mb__after_spinlock

                                if (p->state)

                                cpu = X->cpu; // =? 1

                                smp_rmb()

// X calls schedule()
LOCK rq(0)->lock
smp_mb__after_spinlock
dequeue_task(X)
X->on_rq = 0

                                if (p->on_rq)

                                smp_rmb();

                                if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                smp_cond_load_acquire(&p->on_cpu, !VAL)

                                cpu = select_task_rq(X, X->wake_cpu, ...)
                                if (X->cpu != cpu)
switch_to(Y)
X->on_cpu = 0
UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f79d76..3328c29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,8 +2293,15 @@ void sched_ttwu_pending(void *arg)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

- llist_for_each_entry_safe(p, t, llist, wake_entry)
+ llist_for_each_entry_safe(p, t, llist, wake_entry) {
+ if (WARN_ON_ONCE(p->on_cpu))
+ smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ set_task_cpu(p, cpu_of(rq));
+
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ }

rq_unlock_irqrestore(rq, &rf);
}
@@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (WARN_ON_ONCE(cpu == smp_processor_id()))
+ return false;
+
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_wakelist(p, cpu, wake_flags);
return true;
@@ -2528,7 +2538,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
goto out;

success = 1;
- cpu = task_cpu(p);
trace_sched_waking(p);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2550,7 +2559,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

/* We're going to change ->state: */
success = 1;
- cpu = task_cpu(p);

/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
@@ -2614,8 +2622,21 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
+ *
+ * Ensure we load task_cpu(p) after p->on_cpu:
+ *
+ *  set_task_cpu(p, cpu);
+ *    STORE p->cpu = @cpu
+ *  __schedule() (switch to task 'p')
+ *    LOCK rq->lock
+ *    smp_mb__after_spinlock()          smp_cond_load_acquire(&p->on_cpu)
+ *    STORE p->on_cpu = 1               LOAD p->cpu
+ *
+ * to ensure we observe the correct CPU on which the task is currently
+ * scheduling.
*/
- if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
+ if (smp_load_acquire(&p->on_cpu) &&
+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
goto unlock;

/*
@@ -2635,6 +2656,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
+#else
+ cpu = task_cpu(p);
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);
@@ -2642,7 +2665,7 @@ unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
out:
if (success)
- ttwu_stat(p, cpu, wake_flags);
+ ttwu_stat(p, task_cpu(p), wake_flags);
preempt_enable();

return success;

2020-06-23 08:54:09

by tip-bot2 for Peter Zijlstra

Subject: [tip: sched/urgent] sched/core: Fix ttwu() race

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 964ed98b075263faabe416eeebac99a9bef3f06c
Gitweb: https://git.kernel.org/tip/964ed98b075263faabe416eeebac99a9bef3f06c
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 22 Jun 2020 12:01:23 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 23 Jun 2020 10:42:39 +02:00

sched/core: Fix ttwu() race

Paul reported rcutorture occasionally hitting a NULL deref:

sched_ttwu_pending()
ttwu_do_wakeup()
check_preempt_curr() := check_preempt_wakeup()
find_matching_se()
is_same_group()
if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

smp_cond_load_acquire(&p->on_cpu, !VAL);

and if 'cpu == smp_processor_id() && p->on_cpu' were possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1: the task really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner, because p->se.cfs_rq != rq->cfs_rq: p's cfs_rq is from
the wrong CPU, therefore we'll iterate into the non-existent parents
and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

X->cpu = 1
rq(1)->curr = X

CPU0                            CPU1                            CPU2

                                // switch away from X
                                LOCK rq(1)->lock
                                smp_mb__after_spinlock
                                dequeue_task(X)
                                X->on_rq = 0
                                switch_to(Z)
                                X->on_cpu = 0
                                UNLOCK rq(1)->lock


                                                                // migrate X to cpu 0
                                                                LOCK rq(1)->lock
                                                                dequeue_task(X)
                                                                set_task_cpu(X, 0)
                                                                X->cpu = 0
                                                                UNLOCK rq(1)->lock

                                                                LOCK rq(0)->lock
                                                                enqueue_task(X)
                                                                X->on_rq = 1
                                                                UNLOCK rq(0)->lock

// switch to X
LOCK rq(0)->lock
smp_mb__after_spinlock
switch_to(X)
X->on_cpu = 1
UNLOCK rq(0)->lock

// X goes sleep
X->state = TASK_UNINTERRUPTIBLE
smp_mb();                       // wake X
                                ttwu()
                                LOCK X->pi_lock
                                smp_mb__after_spinlock

                                if (p->state)

                                cpu = X->cpu; // =? 1

                                smp_rmb()

// X calls schedule()
LOCK rq(0)->lock
smp_mb__after_spinlock
dequeue_task(X)
X->on_rq = 0

                                if (p->on_rq)

                                smp_rmb();

                                if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                smp_cond_load_acquire(&p->on_cpu, !VAL)

                                cpu = select_task_rq(X, X->wake_cpu, ...)
                                if (X->cpu != cpu)
switch_to(Y)
X->on_cpu = 0
UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c1ba2e5..60791b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,8 +2293,15 @@ void sched_ttwu_pending(void *arg)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);

- llist_for_each_entry_safe(p, t, llist, wake_entry)
+ llist_for_each_entry_safe(p, t, llist, wake_entry) {
+ if (WARN_ON_ONCE(p->on_cpu))
+ smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ set_task_cpu(p, cpu_of(rq));
+
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+ }

rq_unlock_irqrestore(rq, &rf);
}
@@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (WARN_ON_ONCE(cpu == smp_processor_id()))
+ return false;
+
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_wakelist(p, cpu, wake_flags);
return true;
@@ -2528,7 +2538,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
goto out;

success = 1;
- cpu = task_cpu(p);
trace_sched_waking(p);
p->state = TASK_RUNNING;
trace_sched_wakeup(p);
@@ -2550,7 +2559,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

/* We're going to change ->state: */
success = 1;
- cpu = task_cpu(p);

/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
@@ -2614,8 +2622,21 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
+ *
+ * Ensure we load task_cpu(p) after p->on_cpu:
+ *
+ *  set_task_cpu(p, cpu);
+ *    STORE p->cpu = @cpu
+ *  __schedule() (switch to task 'p')
+ *    LOCK rq->lock
+ *    smp_mb__after_spinlock()          smp_cond_load_acquire(&p->on_cpu)
+ *    STORE p->on_cpu = 1               LOAD p->cpu
+ *
+ * to ensure we observe the correct CPU on which the task is currently
+ * scheduling.
*/
- if (READ_ONCE(p->on_cpu) && ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_RQ))
+ if (smp_load_acquire(&p->on_cpu) &&
+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_RQ))
goto unlock;

/*
@@ -2635,6 +2656,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
+#else
+ cpu = task_cpu(p);
#endif /* CONFIG_SMP */

ttwu_queue(p, cpu, wake_flags);
@@ -2642,7 +2665,7 @@ unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
out:
if (success)
- ttwu_stat(p, cpu, wake_flags);
+ ttwu_stat(p, task_cpu(p), wake_flags);
preempt_enable();

return success;

2020-07-21 10:50:49

by Chris Wilson

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

Quoting Peter Zijlstra (2020-06-22 11:01:23)
> @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
> static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> {
> if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> + if (WARN_ON_ONCE(cpu == smp_processor_id()))
> + return false;
> +
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> __ttwu_queue_wakelist(p, cpu, wake_flags);
> return true;

We've been hitting this warning frequently, but have never seen the
rcu-torture-esque oops ourselves.

<4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
<4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
<4> [181.766726] RSP: 0018:ffffc90000003e08 EFLAGS: 00010046
<4> [181.766733] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffff888276a00000
<4> [181.766740] RDX: 000000000003ad00 RSI: ffffffff8232045b RDI: ffffffff8233103e
<4> [181.766747] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
<4> [181.766754] R10: 00000000d3fa25c3 R11: 0000000053712267 R12: ffff88825b912940
<4> [181.766761] R13: 0000000000000000 R14: 0000000000000087 R15: 000000000003ad00
<4> [181.766769] FS: 0000000000000000(0000) GS:ffff888276a00000(0000) knlGS:0000000000000000
<4> [181.766777] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [181.766783] CR2: 000055b8245814e0 CR3: 0000000005610003 CR4: 00000000003606f0
<4> [181.766790] Call Trace:
<4> [181.766794] <IRQ>
<4> [181.766798] try_to_wake_up+0x21b/0x690
<4> [181.766805] autoremove_wake_function+0xc/0x50
<4> [181.766858] __i915_sw_fence_complete+0x1ee/0x250 [i915]
<4> [181.766912] dma_i915_sw_fence_wake+0x2d/0x40 [i915]

We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
warning is cleared up by

- if (WARN_ON_ONCE(cpu == smp_processor_id()))
+ if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))

which would appear to restore the old behaviour for ttwu_queue() and
seem to be consistent with the intent of this patch. Hopefully this
helps identify the problem correctly.
-Chris

2020-07-21 11:41:19

by Peter Zijlstra

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

On Tue, Jul 21, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> Quoting Peter Zijlstra (2020-06-22 11:01:23)
> > @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
> > static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> > {
> > if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> > + if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > + return false;
> > +
> > sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> > __ttwu_queue_wakelist(p, cpu, wake_flags);
> > return true;
>
> We've been hitting this warning frequently, but have never seen the
> rcu-torture-esque oops ourselves.

How easy is it to hit this? What, if anything, can I do to make my own
computer go bang?

> <4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
> <4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
> <4> [181.766726] RSP: 0018:ffffc90000003e08 EFLAGS: 00010046
> <4> [181.766733] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffff888276a00000
> <4> [181.766740] RDX: 000000000003ad00 RSI: ffffffff8232045b RDI: ffffffff8233103e
> <4> [181.766747] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> <4> [181.766754] R10: 00000000d3fa25c3 R11: 0000000053712267 R12: ffff88825b912940
> <4> [181.766761] R13: 0000000000000000 R14: 0000000000000087 R15: 000000000003ad00
> <4> [181.766769] FS: 0000000000000000(0000) GS:ffff888276a00000(0000) knlGS:0000000000000000
> <4> [181.766777] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [181.766783] CR2: 000055b8245814e0 CR3: 0000000005610003 CR4: 00000000003606f0
> <4> [181.766790] Call Trace:
> <4> [181.766794] <IRQ>
> <4> [181.766798] try_to_wake_up+0x21b/0x690
> <4> [181.766805] autoremove_wake_function+0xc/0x50
> <4> [181.766858] __i915_sw_fence_complete+0x1ee/0x250 [i915]
> <4> [181.766912] dma_i915_sw_fence_wake+0x2d/0x40 [i915]

Please, don't trim oopses..

> We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
> warning is cleared up by
>
> - if (WARN_ON_ONCE(cpu == smp_processor_id()))
> + if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))
>
> which would appear to restore the old behaviour for ttwu_queue() and
> seem to be consistent with the intent of this patch. Hopefully this
> helps identify the problem correctly.

Hurmph, that's actively wrong. We should never queue to self, as that
would result in self-IPI, which is not possible on a bunch of archs. It
works for you because x86 can in fact do that.

So ttwu_queue_cond() will only return true when:

- target-cpu and current-cpu do not share cache;
so it cannot be this condition, because you _always_
share cache with yourself.

- when WF_ON_CPU and target-cpu has nr_running <= 1;
which means p->on_cpu == true.
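For reference, ttwu_queue_cond() at that point read roughly as follows
(reconstructed from the diff context later in this thread):

        static inline bool ttwu_queue_cond(int cpu, int wake_flags)
        {
                /* queue remotely when target and current CPU do not share cache */
                if (!cpus_share_cache(smp_processor_id(), cpu))
                        return true;

                /* wakee still descheduling: queue instead of spinning on ->on_cpu */
                if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
                        return true;

                return false;
        }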

So now you have cpu == smp_processor_id() && p->on_cpu == 1; however,
your modified WARN contradicts that.

*puzzle*


2020-07-22 09:59:30

by Chris Wilson

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

Quoting [email protected] (2020-07-21 12:37:19)
> On Tue, Jul 21, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> > Quoting Peter Zijlstra (2020-06-22 11:01:23)
> > > @@ -2378,6 +2385,9 @@ static inline bool ttwu_queue_cond(int c
> > > static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> > > {
> > > if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
> > > + if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > > + return false;
> > > +
> > > sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> > > __ttwu_queue_wakelist(p, cpu, wake_flags);
> > > return true;
> >
> > We've been hitting this warning frequently, but have never seen the
> > rcu-torture-esque oops ourselves.
>
> How easy is it to hit this? What, if anything, can I do to make my own
> computer go bang?

I tried reproducing it in a mockup, hrtimer + irq_work + waitqueue, but
it remains elusive. It pops up in an obscure HW test where we are
exercising timeout handling for rogue HW.
>
> > <4> [181.766705] RIP: 0010:ttwu_queue_wakelist+0xbc/0xd0
> > <4> [181.766710] Code: 00 00 00 5b 5d 41 5c 41 5d c3 31 c0 5b 5d 41 5c 41 5d c3 31 c0 f6 c3 08 74 f2 48 c7 c2 00 ad 03 00 83 7c 11 40 01 77 e4 eb 80 <0f> 0b 31 c0 eb dc 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 bf 17
> > <4> [181.766726] RSP: 0018:ffffc90000003e08 EFLAGS: 00010046
> > <4> [181.766733] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffff888276a00000
> > <4> [181.766740] RDX: 000000000003ad00 RSI: ffffffff8232045b RDI: ffffffff8233103e
> > <4> [181.766747] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> > <4> [181.766754] R10: 00000000d3fa25c3 R11: 0000000053712267 R12: ffff88825b912940
> > <4> [181.766761] R13: 0000000000000000 R14: 0000000000000087 R15: 000000000003ad00
> > <4> [181.766769] FS: 0000000000000000(0000) GS:ffff888276a00000(0000) knlGS:0000000000000000
> > <4> [181.766777] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4> [181.766783] CR2: 000055b8245814e0 CR3: 0000000005610003 CR4: 00000000003606f0
> > <4> [181.766790] Call Trace:
> > <4> [181.766794] <IRQ>
> > <4> [181.766798] try_to_wake_up+0x21b/0x690
> > <4> [181.766805] autoremove_wake_function+0xc/0x50
> > <4> [181.766858] __i915_sw_fence_complete+0x1ee/0x250 [i915]
> > <4> [181.766912] dma_i915_sw_fence_wake+0x2d/0x40 [i915]
>
> Please, don't trim oopses..
>
> > We are seeing this on the ttwu_queue() path, so with p->on_cpu=0, and the
> > warning is cleared up by
> >
> > - if (WARN_ON_ONCE(cpu == smp_processor_id()))
> > + if (WARN_ON_ONCE(p->on_cpu && cpu == smp_processor_id()))
> >
> > which would appear to restore the old behaviour for ttwu_queue() and
> > seem to be consistent with the intent of this patch. Hopefully this
> > helps identify the problem correctly.
>
> Hurmph, that's actively wrong. We should never queue to self, as that
> would result in self-IPI, which is not possible on a bunch of archs. It
> works for you because x86 can in fact do that.
>
> So ttwu_queue_cond() will only return true when:
>
> - target-cpu and current-cpu do not share cache;
> so it cannot be this condition, because you _always_
> share cache with yourself.
>
> - when WF_ON_CPU and target-cpu has nr_running <= 1;
> which means p->on_cpu == true.
>
> So now you have cpu == smp_processor_id() && p->on_cpu == 1; however,
> your modified WARN contradicts that.
>
> *puzzle*

Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
suppress the warning:

-static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, int wake_flags)
{
/*
* If the CPU does not share cache, then queue the task on the
@@ -2370,7 +2370,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
* the soon-to-be-idle CPU as the current CPU is likely busy.
* nr_running is checked to avoid unnecessary task stacking.
*/
- if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+ if (p->on_cpu && cpu_rq(cpu)->nr_running <= 1)
return true;

return false;
@@ -2378,7 +2378,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)

static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
- if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu, wake_flags)) {
if (WARN_ON_ONCE(cpu == smp_processor_id()))
return false;

2020-07-23 18:31:57

by Peter Zijlstra

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

On Wed, Jul 22, 2020 at 10:57:56AM +0100, Chris Wilson wrote:

> Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
> suppress the warning:

*argh*, I'm starting to go mad...

Chris, could you please try the below patch?

Can you also confirm that if you do:

$ echo NO_TTWU_QUEUE_ON_CPU > /debug/sched_features

or wherever else system-doofus mounts debugfs these days,
the issue no longer manifests? Because if I don't get a handle on this
soon we might have to disable this thing for now :/


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a537..8218779734288 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2430,13 +2430,15 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

-static inline bool ttwu_queue_cond(int cpu, int wake_flags)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, int wake_flags)
{
+ int this_cpu = smp_processor_id();
+
/*
* If the CPU does not share cache, then queue the task on the
* remote rqs wakelist to avoid accessing remote data.
*/
- if (!cpus_share_cache(smp_processor_id(), cpu))
+ if (!cpus_share_cache(this_cpu, cpu))
return true;

/*
@@ -2445,15 +2447,30 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
* the soon-to-be-idle CPU as the current CPU is likely busy.
* nr_running is checked to avoid unnecessary task stacking.
*/
- if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
- return true;
+ if (wake_flags & WF_ON_CPU) {
+
+ if (unlikely(cpu == this_cpu)) {
+ int on_cpu = READ_ONCE(p->on_cpu);
+ int cpu1 = task_cpu(p);
+
+ smp_rmb();
+ smp_cond_load_acquire(&p->on_cpu, !VAL);
+
+ pr_alert("ttwu-IPI-self: %d==%d, p->on_cpu=%d;0, task_cpu(p)=%d;%d\n",
+ cpu, this_cpu, on_cpu, cpu1, task_cpu(p));
+
+ return false;
+ }
+
+ return cpu_rq(cpu)->nr_running <= 1;
+ }

return false;
}

static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
- if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
+ if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu, wake_flags)) {
if (WARN_ON_ONCE(cpu == smp_processor_id()))
return false;

@@ -2713,7 +2730,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* to ensure we observe the correct CPU on which the task is currently
* scheduling.
*/
- if (smp_load_acquire(&p->on_cpu) &&
+ if (sched_feat(TTWU_QUEUE_ON_CPU) && smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
goto unlock;

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7481cd96f3915..b231a840c3eba 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
* using the scheduler IPI. Reduces rq->lock contention/bounces.
*/
SCHED_FEAT(TTWU_QUEUE, true)
+SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)

/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.

2020-07-23 19:45:49

by Chris Wilson

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

Quoting Peter Zijlstra (2020-07-23 19:28:41)
> On Wed, Jul 22, 2020 at 10:57:56AM +0100, Chris Wilson wrote:
>
> > Perhaps more damning is that I can replace WF_ON_CPU with p->on_cpu to
> > suppress the warning:
>
> *argh*, I'm starting to go mad...
>
> Chris, could you please try the below patch?

ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 0==0, p->on_cpu=0;0, task_cpu(p)=0;0
ttwu-IPI-self: 3==3, p->on_cpu=0;0, task_cpu(p)=3;3
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 1==1, p->on_cpu=0;0, task_cpu(p)=1;1
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2
ttwu-IPI-self: 2==2, p->on_cpu=0;0, task_cpu(p)=2;2

> Can you also confirm that if you do:
>
> $ echo NO_TTWU_QUEUE_ON_CPU > /debug/sched_features

With,

sched_feat_disable(10):TTWU_QUEUE_ON_CPU

the pr_alert is still being hit

ttwu-IPI-self: 3==3, p->on_cpu=0;0, task_cpu(p)=3;3

At which point, it dawns on me. Mea culpa: stray bits being passed into
default_wake_function.

I am very sorry for the wild goose chase.
-Chris

2020-07-23 20:14:28

by Peter Zijlstra

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

On Thu, Jul 23, 2020 at 08:41:03PM +0100, Chris Wilson wrote:

> I am very sorry for the wild goose chase.

*phew*... all good then. I was starting to go a little ga-ga trying to
make sense of things.

Arguably we should probably do something like:


@@ -4555,7 +4572,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
- return try_to_wake_up(curr->private, mode, wake_flags);
+ return try_to_wake_up(curr->private, mode, wake_flags & WF_SYNC);
}
EXPORT_SYMBOL(default_wake_function);


Since I don't think anybody uses anything other than WF_SYNC, ever. And
the rest of the WF_flags are used internally.
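For reference, the wake flags of that era were roughly as below (values
recalled from kernel/sched/sched.h around v5.8; treat them as a sketch):

        #define WF_SYNC         0x01    /* waker goes to sleep after wakeup */
        #define WF_FORK         0x02    /* child wakeup after fork */
        #define WF_MIGRATED     0x04    /* internal use, task got migrated */
        #define WF_ON_CPU       0x08    /* wakee is on_cpu */

which is why a stray bit leaking in through a wait-queue callback can
fake WF_ON_CPU, and why masking with WF_SYNC shuts that door.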

Thanks Chris!

2020-07-24 17:56:21

by Paul E. McKenney

Subject: Re: [PATCH -v2 1/5] sched: Fix ttwu() race

On Thu, Jul 23, 2020 at 10:11:28PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2020 at 08:41:03PM +0100, Chris Wilson wrote:
>
> > I am very sorry for the wild goose chase.
>
> *phew*... all good then. I was starting to go a little ga-ga trying to
> make sense of things.
>
> Arguably we should probably do something like:
>
>
> @@ -4555,7 +4572,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
> int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
> void *key)
> {
> - return try_to_wake_up(curr->private, mode, wake_flags);
> + return try_to_wake_up(curr->private, mode, wake_flags & WF_SYNC);
> }
> EXPORT_SYMBOL(default_wake_function);

If you do:

Tested-by: Paul E. McKenney <[email protected]>

This was about nine hours of each of the default rcutorture scenarios.

Thanx, Paul

> Since I don't think anybody uses anything other than WF_SYNC, ever. And
> the rest of the WF_flags are used internally.
>
> Thanks Chris!