The control CPU thread which initiates hotplug calls kthread_park()
for the hotplug thread, which sets KTHREAD_SHOULD_PARK. After this the
control thread wakes up the hotplug thread. There is a chance that the
wakeup code sees the hotplug thread (running on the AP core) in
TASK_INTERRUPTIBLE state, but only sets its state to TASK_RUNNING after
the hotplug thread has entered kthread_parkme() and changed its state
to TASK_PARKED. This results in a panic later on in kthread_unpark(),
which sees the KTHREAD_IS_PARKED flag set but fails to rebind the
kthread because it is not in TASK_PARKED state. Fix this by serializing
the wakeup state change against the state change done before parking
the kthread.
Below is the possible race:
Control thread                           Hotplug Thread

kthread_park()
  set KTHREAD_SHOULD_PARK
                                         smpboot_thread_fn()
                                           set_current_state(TASK_INTERRUPTIBLE);
                                           kthread_parkme()

wake_up_process()
  raw_spin_lock_irqsave(&p->pi_lock, flags);
  if (!(p->state & state))   -> false, so no early bail-out
          goto out;
                                         __kthread_parkme()
                                           __set_current_state(TASK_PARKED);

  if (p->on_rq && ttwu_remote(p, wake_flags))
          ttwu_remote()
            p->state = TASK_RUNNING;
                                           schedule();
So to avoid this race, take pi_lock to serialize the state changes.
Suggested-by: Pavankumar Kondeti <[email protected]>
Co-developed-by: Neeraj Upadhyay <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>
Signed-off-by: Gaurav Kohli <[email protected]>
---
Changes since V1:
- Add comment to explain need of pi_lock.
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 5043e74..c5c5184 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -122,7 +122,45 @@ static int smpboot_thread_fn(void *data)
}
if (kthread_should_park()) {
+ /*
+ * Serialize against wakeup. If we take the lock first,
+ * the wakeup is skipped. If we run later, we observe the
+ * TASK_RUNNING update from the wakeup path before moving
+ * forward. This avoids the race where the wakeup observes
+ * TASK_INTERRUPTIBLE, and also observes TASK_PARKED (set
+ * in kthread_parkme()) before updating the task state to
+ * TASK_RUNNING. In that case the kthread gets parked in
+ * TASK_RUNNING state, which results in a panic later on
+ * in kthread_unpark(), as it sees the KTHREAD_IS_PARKED
+ * flag set but fails to rebind the kthread because it is
+ * not in TASK_PARKED state.
+ *
+ * Control thread                        Hotplug Thread
+ *
+ * kthread_park()
+ *   set KTHREAD_SHOULD_PARK
+ *                                       smpboot_thread_fn()
+ *                                         set_current_state(
+ *                                             TASK_INTERRUPTIBLE);
+ *                                         kthread_parkme()
+ *
+ * wake_up_process()
+ *
+ * raw_spin_lock_irqsave(&p->pi_lock, flags);
+ * if (!(p->state & state))              __set_current_state(
+ *         goto out;                         TASK_RUNNING);
+ *
+ *                                       __set_current_state(
+ *                                           TASK_PARKED);
+ *
+ * if (p->on_rq && ttwu_remote(p, wake_flags))
+ *         ttwu_remote()
+ *           p->state = TASK_RUNNING;
+ *                                         schedule();
+ */
+ raw_spin_lock(&current->pi_lock);
__set_current_state(TASK_RUNNING);
+ raw_spin_unlock(&current->pi_lock);
preempt_enable();
if (ht->park && td->status == HP_THREAD_ACTIVE) {
BUG_ON(td->cpu != smp_processor_id());
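The serialization works because the TASK_RUNNING store issued on behalf of the
waker is done with the target task's pi_lock held (see the
raw_spin_lock_irqsave(&p->pi_lock, ...) step in the race diagram above). A
simplified sketch, reduced to the steps quoted in this thread and not the full
kernel code:

static int try_to_wake_up_sketch(struct task_struct *p, unsigned int state,
				 int wake_flags)
{
	unsigned long flags;
	int success = 0;

	/* Held across the state check and the possible TASK_RUNNING store. */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (!(p->state & state))
		goto out;				/* target already running: bail */
	if (p->on_rq)
		success = ttwu_remote(p, wake_flags);	/* may store TASK_RUNNING */
out:
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	return success;
}

With the hunk above, the hotplug thread's TASK_RUNNING store happens inside the
same lock, so it is either ordered before the waker's state check (which then
bails out) or after the waker's TASK_RUNNING store (which then cannot clobber
the subsequent TASK_PARKED).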
On Wed, Apr 25, 2018 at 02:03:19PM +0530, Gaurav Kohli wrote:
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index 5043e74..c5c5184 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -122,7 +122,45 @@ static int smpboot_thread_fn(void *data)
> }
>
> if (kthread_should_park()) {
> + /*
> + * Serialize against wakeup.
*
* Prior wakeups must complete and later wakeups
* will observe TASK_RUNNING.
*
* This avoids the case where the TASK_RUNNING
* store from ttwu() competes with the
* TASK_PARKED store from kthread_parkme().
*
* If the TASK_PARKED store loses that
* competition, kthread_unpark() will go wobbly.
> + */
> + raw_spin_lock(&current->pi_lock);
> __set_current_state(TASK_RUNNING);
> + raw_spin_unlock(&current->pi_lock);
> preempt_enable();
> if (ht->park && td->status == HP_THREAD_ACTIVE) {
> BUG_ON(td->cpu != smp_processor_id());
Does that work for you?
But looking at this a bit more; don't we have the exact same problem
with the TASK_RUNNING store in the !ht->thread_should_run() case?
Suppose a ttwu() happens concurrently there, it can end up competing
against the TASK_INTERRUPTIBLE store, no?
Of course, that race is not fatal, we'll just end up going around the
loop once again I suppose. Maybe a comment there too?
/*
* A similar race is possible here, but losing
* the TASK_INTERRUPTIBLE store is harmless and
* will make us go around the loop once more.
*/
And of course, I suspect we actually want to use TASK_IDLE, smpboot
threads don't want signals do they? But that probably ought to be a
separate patch.
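For reference, TASK_IDLE is (TASK_UNINTERRUPTIBLE | TASK_NOLOAD), so smpboot
threads would neither be woken by signals nor count towards the load average.
A hypothetical sketch of that separate patch, assuming the unmodified loop head
of smpboot_thread_fn():

--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ static int smpboot_thread_fn(void *data)
 	while (1) {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_IDLE);
 		preempt_disable();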
On 4/26/2018 1:39 AM, Peter Zijlstra wrote:
> On Wed, Apr 25, 2018 at 02:03:19PM +0530, Gaurav Kohli wrote:
>> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
>> index 5043e74..c5c5184 100644
>> --- a/kernel/smpboot.c
>> +++ b/kernel/smpboot.c
>> @@ -122,7 +122,45 @@ static int smpboot_thread_fn(void *data)
>> }
>>
>> if (kthread_should_park()) {
>> + /*
>> + * Serialize against wakeup.
> *
> * Prior wakeups must complete and later wakeups
> * will observe TASK_RUNNING.
> *
> * This avoids the case where the TASK_RUNNING
> * store from ttwu() competes with the
> * TASK_PARKED store from kthread_parkme().
> *
> * If the TASK_PARKED store looses that
> * competition, kthread_unpark() will go wobbly.
>> + */
>> + raw_spin_lock(&current->pi_lock);
>> __set_current_state(TASK_RUNNING);
>> + raw_spin_unlock(&current->pi_lock);
>> preempt_enable();
>> if (ht->park && td->status == HP_THREAD_ACTIVE) {
>> BUG_ON(td->cpu != smp_processor_id());
> Does that work for you?
We have given the patch for testing; usually it takes around 2-3 days for reproduction (we will update with the result).
>
> But looking at this a bit more; don't we have the exact same problem
> with the TASK_RUNNING store in the !ht->thread_should_run() case?
> Suppose a ttwu() happens concurrently there, it can end up competing
> against the TASK_INTERRUPTIBLE store, no?
>
> Of course, that race is not fatal, we'll just end up going around the
> loop once again I suppose. Maybe a comment there too?
>
> /*
> * A similar race is possible here, but loosing
> * the TASK_INTERRUPTIBLE store is harmless and
> * will make us go around the loop once more.
> */
Actually, instead of a race, I am seeing a missed-wakeup problem, which is very rare. Take the case of the hotplug thread:
Controller                               Hotplug

                                         Loop start
                                         set_current_state(TASK_INTERRUPTIBLE);
                                         if (kthread_should_park()) { -> fails
Set SHOULD_PARK
then wake_up
                                         if (!ht->thread_should_run(td->cpu)) {
                                                 preempt_enable_no_resched();
                                                 schedule();  -> goes back to schedule() again
                                                    (very rare to occur; not sure whether it hits)
>
> And of course, I suspect we actually want to use TASK_IDLE, smpboot
> threads don't want signals do they? But that probably ought to be a
> separate patch.
Yes I agree, we can control the race from here as well. Please suggest whether the change below would be of any help here:
} else {
__set_current_state(TASK_RUNNING);
preempt_enable();
ht->thread_fn(td->cpu);
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
}
>
On Wed, Apr 25, 2018 at 10:09:17PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 25, 2018 at 02:03:19PM +0530, Gaurav Kohli wrote:
> > diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> > index 5043e74..c5c5184 100644
> > --- a/kernel/smpboot.c
> > +++ b/kernel/smpboot.c
> > @@ -122,7 +122,45 @@ static int smpboot_thread_fn(void *data)
> > }
> >
> > if (kthread_should_park()) {
> > + /*
> > + * Serialize against wakeup.
> *
> * Prior wakeups must complete and later wakeups
> * will observe TASK_RUNNING.
> *
> * This avoids the case where the TASK_RUNNING
> * store from ttwu() competes with the
> * TASK_PARKED store from kthread_parkme().
> *
> * If the TASK_PARKED store looses that
> * competition, kthread_unpark() will go wobbly.
> > + */
> > + raw_spin_lock(&current->pi_lock);
> > __set_current_state(TASK_RUNNING);
> > + raw_spin_unlock(&current->pi_lock);
> > preempt_enable();
> > if (ht->park && td->status == HP_THREAD_ACTIVE) {
> > BUG_ON(td->cpu != smp_processor_id());
>
> Does that work for you?
>
> But looking at this a bit more; don't we have the exact same problem
> with the TASK_RUNNING store in the !ht->thread_should_run() case?
> Suppose a ttwu() happens concurrently there, it can end up competing
> against the TASK_INTERRUPTIBLE store, no?
>
> Of course, that race is not fatal, we'll just end up going around the
> loop once again I suppose. Maybe a comment there too?
>
> /*
> * A similar race is possible here, but loosing
> * the TASK_INTERRUPTIBLE store is harmless and
> * will make us go around the loop once more.
> */
>
And with slightly more sleep I realize this is actually the normal
and expected pattern. The comment with __set_current_state() even
mentions this.
Also, I think the above patch is 'wrong'. It is not the TASK_RUNNING
store that is a problem; it is the TASK_PARKED state that is special. And
if you look at do_task_dead() you'll see we do something very similar
for setting TASK_DEAD.
It is a problem specific to blocked states that do not follow the normal
wait pattern:
	for (;;) {
		set_current_state(STATE);
		if (cond)
			break;
		schedule();
	}
	__set_current_state(RUNNING);
The initial store of STATE can _always_ lose against a competing
RUNNING store from a previous wakeup, but the wait-loop and @cond test
will make it harmless.
The special states (DEAD,STOPPED,..) are different though, they
do not have a loop and expect to be honoured.
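To make the contrast concrete, a minimal sketch of the special-state shape,
modeled on do_task_dead() as it appears in the patch below: there is no loop
and no condition to re-check, so a stale TASK_RUNNING store from a previous
wakeup landing after the state store is fatal rather than harmless:

void __noreturn do_task_dead_sketch(void)
{
	/*
	 * No wait loop and no @cond: the state must be honoured as-is.
	 * A TASK_RUNNING store from an earlier wakeup arriving between
	 * the next line and the final schedule destroys TASK_DEAD, which
	 * is why do_task_dead() (below) has to serialize on pi_lock.
	 */
	__set_current_state(TASK_DEAD);
	current->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
	__schedule(false);		/* never returns */
	BUG();
}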
This had me looking at __kthread_parkme() and afaict we actually have a
condition, namely KTHREAD_SHOULD_PARK, which would suggest the following
change:
diff --git a/kernel/kthread.c b/kernel/kthread.c
index cd50e99202b0..4b6503c6a029 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_PARKED);
- while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+ for (;;) {
+ __set_task_state(TASK_PARKED);
+ if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+ break;
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
- __set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
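Applied to the function, the proposed shape is roughly the sketch below (using
set_current_state(TASK_PARKED), the correction posted a few mails further
down). The point is that KTHREAD_SHOULD_PARK now acts as the @cond of a regular
wait-loop, so a TASK_PARKED store lost to a stale TASK_RUNNING wakeup merely
costs one more trip around the loop:

static void __kthread_parkme(struct kthread *self)
{
	for (;;) {
		/* (Re)assert TASK_PARKED on every iteration ... */
		set_current_state(TASK_PARKED);
		/* ... with KTHREAD_SHOULD_PARK as the wait condition. */
		if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
			break;
		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
			complete(&self->parked);
		schedule();
	}
	clear_bit(KTHREAD_IS_PARKED, &self->flags);
	__set_current_state(TASK_RUNNING);
}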
For the others, I think we want to do something like the below. I still
need to look at TASK_TRACED, which I suspect is also special, but ptrace
always hurts my brain.
Opinions?
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d697f3b573..f4098435a882 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -110,19 +110,45 @@ struct task_group;
(task->flags & PF_FROZEN) == 0 && \
(task->state & TASK_NOLOAD) == 0)
+/*
+ * Special states are those that do not use the normal wait-loop pattern. See
+ * the comment with set_special_state().
+ */
+#define is_special_state(state) \
+ ((state) == TASK_DEAD || \
+ (state) == TASK_STOPPED)
+
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+/*
+ * Assert we don't use the regular *set_current_state() helpers for special
+ * states. See the comment with set_special_state().
+ */
+#define assert_special_state(state) WARN_ON_ONCE(is_special_state(state))
+
#define __set_current_state(state_value) \
do { \
+ assert_special_state(state_value); \
current->task_state_change = _THIS_IP_; \
current->state = (state_value); \
} while (0)
+
#define set_current_state(state_value) \
do { \
+ assert_special_state(state_value); \
current->task_state_change = _THIS_IP_; \
smp_store_mb(current->state, (state_value)); \
} while (0)
+#define set_special_state(state_value) \
+ do { \
+ unsigned long flags; /* may shadow */ \
+ WARN_ON_ONCE(!is_special_state(state_value)); \
+ raw_spin_lock_irqsave(&current->pi_lock, flags); \
+ current->task_state_change = _THIS_IP_; \
+ current->state = (state_value); \
+ raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
+ } while (0)
#else
/*
* set_current_state() includes a barrier so that the write of current->state
@@ -154,12 +180,30 @@ struct task_group;
* once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
* TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
*
- * This is obviously fine, since they both store the exact same value.
+ * However, with slightly different timing the wakeup TASK_RUNNING store can
+ * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
+ * a problem either because that will result in one extra go around the loop
+ * and our @cond test will save the day.
*
* Also see the comments of try_to_wake_up().
*/
#define __set_current_state(state_value) do { current->state = (state_value); } while (0)
#define set_current_state(state_value) smp_store_mb(current->state, (state_value))
+
+/*
+ * set_special_state() should be used for those states when the blocking task
+ * can not use the regular condition based wait-loop. In that case we must
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
+ * will not collide with our state change.
+ */
+#define set_special_state(state_value) \
+ do { \
+ unsigned long flags; /* may shadow */ \
+ raw_spin_lock_irqsave(&current->pi_lock, flags); \
+ current->state = (state_value); \
+ raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
+ } while (0)
+
#endif
/* Task command name length: */
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a7ce74c74e49..113d1ad1ced7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -280,7 +280,7 @@ static inline void kernel_signal_stop(void)
{
spin_lock_irq(&current->sighand->siglock);
if (current->jobctl & JOBCTL_STOP_DEQUEUED)
- __set_current_state(TASK_STOPPED);
+ set_special_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);
schedule();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..3898a8047c11 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3498,23 +3498,8 @@ static void __sched notrace __schedule(bool preempt)
void __noreturn do_task_dead(void)
{
- /*
- * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
- * when the following two conditions become true.
- * - There is race condition of mmap_sem (It is acquired by
- * exit_mm()), and
- * - SMI occurs before setting TASK_RUNINNG.
- * (or hypervisor of virtual machine switches to other guest)
- * As a result, we may become TASK_RUNNING after becoming TASK_DEAD
- *
- * To avoid it, we have to wait for releasing tsk->pi_lock which
- * is held by try_to_wake_up()
- */
- raw_spin_lock_irq(&current->pi_lock);
- raw_spin_unlock_irq(&current->pi_lock);
-
/* Causes final put_task_struct in finish_task_switch(): */
- __set_current_state(TASK_DEAD);
+ set_special_state(TASK_DEAD);
/* Tell freezer to ignore us: */
current->flags |= PF_NOFREEZE;
diff --git a/kernel/signal.c b/kernel/signal.c
index d4ccea599692..c9cac52b1369 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2176,7 +2176,7 @@ static bool do_signal_stop(int signr)
if (task_participate_group_stop(current))
notify = CLD_STOPPED;
- __set_current_state(TASK_STOPPED);
+ set_special_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);
/*
On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index cd50e99202b0..4b6503c6a029 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
>
> static void __kthread_parkme(struct kthread *self)
> {
> - __set_current_state(TASK_PARKED);
> - while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
> + for (;;) {
> + __set_task_state(TASK_PARKED);
set_current_state(TASK_PARKED);
of course..
> + if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
> + break;
> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
> complete(&self->parked);
> schedule();
> - __set_current_state(TASK_PARKED);
> }
> clear_bit(KTHREAD_IS_PARKED, &self->flags);
> __set_current_state(TASK_RUNNING);
>
On Thu, Apr 26, 2018 at 09:34:36AM +0530, Kohli, Gaurav wrote:
> On 4/26/2018 1:39 AM, Peter Zijlstra wrote:
>
> > On Wed, Apr 25, 2018 at 02:03:19PM +0530, Gaurav Kohli wrote:
> > > diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> > > index 5043e74..c5c5184 100644
> > > --- a/kernel/smpboot.c
> > > +++ b/kernel/smpboot.c
> > > @@ -122,7 +122,45 @@ static int smpboot_thread_fn(void *data)
> > > }
> > > if (kthread_should_park()) {
> > > + /*
> > > + * Serialize against wakeup.
> > *
> > * Prior wakeups must complete and later wakeups
> > * will observe TASK_RUNNING.
> > *
> > * This avoids the case where the TASK_RUNNING
> > * store from ttwu() competes with the
> > * TASK_PARKED store from kthread_parkme().
> > *
> > * If the TASK_PARKED store looses that
> > * competition, kthread_unpark() will go wobbly.
> > > + */
> > > + raw_spin_lock(&current->pi_lock);
> > > __set_current_state(TASK_RUNNING);
> > > + raw_spin_unlock(&current->pi_lock);
> > > preempt_enable();
> > > if (ht->park && td->status == HP_THREAD_ACTIVE) {
> > > BUG_ON(td->cpu != smp_processor_id());
> > Does that work for you?
>
> We have given patch for testing, usually it takes around 2-3 days for
> reproduction(we will update for the same).
I only changed the comment; surely your compiler doesn't generate
different code for that?
I was asking if the proposed comment was good with you; but see my more
recent email, that actually proposes a different fix.
> > /*
> > * A similar race is possible here, but loosing
> > * the TASK_INTERRUPTIBLE store is harmless and
> > * will make us go around the loop once more.
> > */
>
> Actually instead of race, i am seeing wakeup miss problem which is
> very rare, if we take case of hotplug thread
Yes, triggering these issues is tricky, no doubt about that.
> > And of course, I suspect we actually want to use TASK_IDLE, smpboot
> > threads don't want signals do they? But that probably ought to be a
> > separate patch.
>
> Yes I agree, we can control race from here as well, Please suggest
> would below change be any help here:
That is not what I suggested. I said the thing should use TASK_IDLE
instead of TASK_INTERRUPTIBLE. Not change the location of it.
On 4/26/2018 2:27 PM, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index cd50e99202b0..4b6503c6a029 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
>>
>> static void __kthread_parkme(struct kthread *self)
>> {
>> - __set_current_state(TASK_PARKED);
>> - while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>> + for (;;) {
>> + __set_task_state(TASK_PARKED);
> set_current_state(TASK_PARKED);
>
> of course..
Hi Peter,
Maybe I am missing something, but that race can still occur, as we don't put TASK_PARKED in a special state.
Controller                               Hotplug

                                         Loop
                                         Task_Interruptible
Set SHOULD_PARK
wakeup -> Proceeds
                                         Set Running
                                         kthread_parkme
                                         SET TASK_PARKED
                                         schedule
Set TASK_RUNNING
Can you please correct me if I misunderstood this.
>
>> + if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
>> + break;
>> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
>> complete(&self->parked);
>> schedule();
>> - __set_current_state(TASK_PARKED);
>> }
>> clear_bit(KTHREAD_IS_PARKED, &self->flags);
>> __set_current_state(TASK_RUNNING);
>>
>>
On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:
[...]
> +/*
> + * Special states are those that do not use the normal wait-loop pattern. See
> + * the comment with set_special_state().
> + */
> +#define is_special_state(state) \
> + ((state) == TASK_DEAD || \
> + (state) == TASK_STOPPED)
> +
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>
> +/*
> + * Assert we don't use the regular *set_current_state() helpers for special
> + * states. See the comment with set_special_state().
> + */
> +#define assert_special_state(state) WARN_ON_ONCE(is_special_state(state))
Nitpicking, this name suggests "Shout if the state is NOT special" to me:
maybe,
#define assert_special_state(state) WARN_ON_ONCE(!is_special_state(state))
#define assert_regular_state(state) WARN_ON_ONCE(is_special_state(state))
or just do with the WARN_ON_ONCE()s ?
Andrea
> +
> #define __set_current_state(state_value) \
> do { \
> + assert_special_state(state_value); \
> current->task_state_change = _THIS_IP_; \
> current->state = (state_value); \
> } while (0)
> +
> #define set_current_state(state_value) \
> do { \
> + assert_special_state(state_value); \
> current->task_state_change = _THIS_IP_; \
> smp_store_mb(current->state, (state_value)); \
> } while (0)
>
> +#define set_special_state(state_value) \
> + do { \
> + unsigned long flags; /* may shadow */ \
> + WARN_ON_ONCE(!is_special_state(state_value)); \
> + raw_spin_lock_irqsave(&current->pi_lock, flags); \
> + current->task_state_change = _THIS_IP_; \
> + current->state = (state_value); \
> + raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> + } while (0)
On 04/26, Peter Zijlstra wrote:
>
> For the others, I think we want to do something like the below. I still
> need to look at TASK_TRACED, which I suspect is also special,
Yes, and TASK_STOPPED.
ptrace_freeze_traced() and ptrace_unfreeze_traced() should be fine, but
ptrace_stop() wants set_special_state() too, I think.
> but ptrace always hurts my brain.
mine too ;)
> +/*
> + * set_special_state() should be used for those states when the blocking task
> + * can not use the regular condition based wait-loop. In that case we must
> + * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
> + * will not collide with out state change.
> + */
> +#define set_special_state(state_value) \
> + do { \
> + unsigned long flags; /* may shadow */ \
> + raw_spin_lock_irqsave(&current->pi_lock, flags); \
> + current->state = (state_value); \
> + raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> + } while (0)
> +
Agreed.
I thought that perhaps we could change ttwu_do_wakeup() to use cmpxchg() instead of
the plain p->state = TASK_RUNNING, but this helper looks much more clear and simple.
Oleg.
FYI, we noticed the following commit (built with gcc-6):
commit: cad8e9967526d263d9a4f04ca763b2d76c045750 ("kthread/smpboot: Serialize kthread parking against wakeup")
url: https://github.com/0day-ci/linux/commits/Gaurav-Kohli/kthread-smpboot-Serialize-kthread-parking-against-wakeup/20180426-185404
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+------------------------------------------------------------------+-----------+------------+
| | v4.17-rc2 | cad8e99675 |
+------------------------------------------------------------------+-----------+------------+
| boot_successes | 60 | 0 |
| boot_failures | 34 | 10 |
| invoked_oom-killer:gfp_mask=0x | 20 | 4 |
| Mem-Info | 20 | 4 |
| Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes | 11 | |
| Out_of_memory:Kill_process | 9 | 4 |
| RIP:__clear_user | 1 | 1 |
| WARNING:inconsistent_lock_state | 6 | 10 |
| inconsistent{HARDIRQ-ON-W}->{IN-HARDIRQ-W}usage | 6 | |
| calltrace:neigh_periodic_work | 2 | |
| RIP:e1000_watchdog | 4 | |
| BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h | 6 | |
| calltrace:irq_exit | 2 | |
| IP-Config:Auto-configuration_of_network_failed | 8 | 4 |
| RIP:native_safe_halt | 2 | |
| inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage | 0 | 10 |
+------------------------------------------------------------------+-----------+------------+
[ 21.664248] WARNING: inconsistent lock state
[ 21.664989] 4.17.0-rc2-00001-gcad8e99 #1 Not tainted
[ 21.665824] --------------------------------
[ 21.666510] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 21.667591] cpuhp/0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 21.668441] (ptrval) (&p->pi_lock){?.-.}, at: smpboot_thread_fn+0x183/0x42c
[ 21.669715] {IN-HARDIRQ-W} state was registered at:
[ 21.670739] lock_acquire+0x99/0xc9
[ 21.671341] _raw_spin_lock_irqsave+0x45/0x59
[ 21.672062] try_to_wake_up+0x8f/0x528
[ 21.672733] wake_up_process+0x10/0x12
[ 21.673434] hrtimer_wakeup+0x27/0x2c
[ 21.674068] __hrtimer_run_queues+0x163/0x1d4
[ 21.674785] hrtimer_run_queues+0x175/0x1b4
[ 21.675518] run_local_timers+0x30/0x7f
[ 21.676206] update_process_times+0x22/0x4f
[ 21.686136] tick_periodic+0xc4/0xca
[ 21.689746] tick_handle_periodic+0x29/0x7b
[ 21.690518] smp_apic_timer_interrupt+0x9d/0xb1
[ 21.691318] apic_timer_interrupt+0xf/0x20
[ 21.692022] native_safe_halt+0x6/0x8
[ 21.692635] default_idle+0x13/0x1c
[ 21.693253] arch_cpu_idle+0xa/0xc
[ 21.697213] default_idle_call+0x35/0x38
[ 21.697891] do_idle+0x181/0x2bf
[ 21.698467] cpu_startup_entry+0x1a/0x1c
[ 21.699156] rest_init+0x22c/0x232
[ 21.699752] start_kernel+0x4fc/0x520
[ 21.700309] x86_64_start_reservations+0x24/0x26
[ 21.701909] x86_64_start_kernel+0x6f/0x72
[ 21.702553] secondary_startup_64+0xa5/0xb0
[ 21.703294] irq event stamp: 1401
[ 21.703834] hardirqs last enabled at (1401): [<ffffffffbd3ef4f3>] _raw_spin_unlock_irq+0x27/0x4b
[ 21.705223] hardirqs last disabled at (1400): [<ffffffffbd3e69e3>] __schedule+0x23b/0xccc
[ 21.706490] softirqs last enabled at (1206): [<ffffffffbd60028d>] __do_softirq+0x28d/0x2db
[ 21.707466] softirqs last disabled at (1181): [<ffffffffbcab67cf>] irq_exit+0x67/0x100
[ 21.708244]
[ 21.708244] other info that might help us debug this:
[ 21.709009] Possible unsafe locking scenario:
[ 21.709009]
[ 21.709919] CPU0
[ 21.710300] ----
[ 21.710696] lock(&p->pi_lock);
[ 21.711199] <Interrupt>
[ 21.711614] lock(&p->pi_lock);
[ 21.712146]
[ 21.712146] *** DEADLOCK ***
[ 21.712146]
[ 21.713108] no locks held by cpuhp/0/14.
[ 21.713773]
[ 21.713773] stack backtrace:
[ 21.714502] CPU: 0 PID: 14 Comm: cpuhp/0 Not tainted 4.17.0-rc2-00001-gcad8e99 #1
[ 21.715747] Call Trace:
[ 21.716174] dump_stack+0x9c/0xe9
[ 21.716756] print_usage_bug+0x321/0x330
[ 21.717533] mark_lock+0x4f5/0x779
[ 21.718122] ? finish_task_switch+0x274/0x325
[ 21.718653] ? print_shortest_lock_dependencies+0x1c6/0x1c6
[ 21.719230] __lock_acquire+0x4a0/0x888
[ 21.719606] lock_acquire+0x99/0xc9
[ 21.719992] ? smpboot_thread_fn+0x183/0x42c
[ 21.720457] _raw_spin_lock+0x2d/0x3c
[ 21.720885] ? smpboot_thread_fn+0x183/0x42c
[ 21.721285] smpboot_thread_fn+0x183/0x42c
[ 21.721703] ? sort_range+0x1d/0x1d
[ 21.722105] ? preempt_count_sub+0x13/0xb5
[ 21.722542] ? schedule+0xdd/0xe9
[ 21.722934] kthread+0x1db/0x1eb
[ 21.723245] ? sort_range+0x1d/0x1d
[ 21.723620] ? kthread_create_on_node+0xa1/0xa1
[ 21.724137] ret_from_fork+0x24/0x30
[ 21.730948] numa_remove_cpu cpu 0 node 0: mask now 1
[ 21.733638] CPU 0 is now offline
[ 21.735209] debug: unmapping init [mem 0xffffffffbe187000-0xffffffffbe2affff]
[ 21.736225] Write protecting the kernel read-only data: 20480k
[ 21.739451] debug: unmapping init [mem 0xffff88000ec07000-0xffff88000edfffff]
[ 21.741223] debug: unmapping init [mem 0xffff88000f2dc000-0xffff88000f3fffff]
[ 21.798549] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 21.815648] random: init: uninitialized urandom read (12 bytes read)
[ 21.876844] hwclock (101) used greatest stack depth: 27928 bytes left
[ 21.886110] plymouthd (99) used greatest stack depth: 27656 bytes left
[ 21.906815] sh (104) used greatest stack depth: 27640 bytes left
[ 21.953005] mountall (108) used greatest stack depth: 26776 bytes left
[ 21.976917] random: trinity: uninitialized urandom read (4 bytes read)
mountall: Event failed
[ 22.102893] random: mountall: uninitialized urandom read (12 bytes read)
[ 22.154174] init: Failed to create pty - disabling logging for job
[ 22.155285] init: Temporary process spawn error: No such file or directory
[ 22.191738] init: Failed to create pty - disabling logging for job
[ 22.193058] init: Temporary process spawn error: No such file or directory
[ 22.301682] init: Failed to create pty - disabling logging for job
[ 22.302812] init: Temporary process spawn error: No such file or directory
[ 22.307684] init: Failed to create pty - disabling logging for job
[ 22.308793] init: Temporary process spawn error: No such file or directory
[ 22.427189] init: Failed to create pty - disabling logging for job
[ 22.428994] init: Temporary process spawn error: No such file or directory
[ 22.439238] init: Failed to create pty - disabling logging for job
[ 22.440420] init: Temporary process spawn error: No such file or directory
[ 22.451555] init: Failed to create pty - disabling logging for job
[ 22.452662] init: Temporary process spawn error: No such file or directory
[ 22.477652] init: Failed to create pty - disabling logging for job
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Xiaolong
On Thu, Apr 26, 2018 at 09:23:25PM +0530, Kohli, Gaurav wrote:
> On 4/26/2018 2:27 PM, Peter Zijlstra wrote:
>
> > On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:
> > > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > > index cd50e99202b0..4b6503c6a029 100644
> > > --- a/kernel/kthread.c
> > > +++ b/kernel/kthread.c
> > > @@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
> > > static void __kthread_parkme(struct kthread *self)
> > > {
> > > - __set_current_state(TASK_PARKED);
> > > - while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
> > > + for (;;) {
> > > + __set_task_state(TASK_PARKED);
> > set_current_state(TASK_PARKED);
> >
> > of course..
>
> Hi Peter,
>
> Maybe i am missing something , but still that race can come as we don't put task_parked on special state.
>
> Controller                             Hotplug
>
>                                        Loop
>                                        Task_Interruptible
> Set SHOULD_PARK
> wakeup -> Proceeds
>                                        Set Running
>                                        kthread_parkme
>                                        SET TASK_PARKED
>                                        schedule
> Set TASK_RUNNING
If that could happen, all wait-loops would be broken. However,
AFAICT that cannot happen, because ttwu_remote() and schedule()
serialize on rq->lock. See:
    A                                           B

    for (;;) {
      set_current_state(UNINTERRUPTIBLE);

                                                cond1 = true;
                                                wake_up_process(A)
                                                  lock(A->pi_lock)
                                                  smp_mb__after_spinlock()
                                                  if (A->state & TASK_NORMAL)
                                                    A->on_rq && ttwu_remote()

      if (cond1) // true
        break;

      schedule();
    }
    __set_current_state(RUNNING);

    for (;;) {
      set_current_state(UNINTERRUPTIBLE);
      if (cond2)
        break;

      schedule();
        lock(rq->lock)
        smp_mb__after_spinlock();
        deactivate_task(A);
        <sched-out>
        unlock(rq->lock);

                                                rq = __task_rq_lock(A)
                                                if (A->on_rq) // false
                                                  A->state = TASK_RUNNING;
                                                __task_rq_unlock(rq)
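In code, the rq->lock serialization referred to above lives in the remote
wakeup step; a simplified sketch (not the exact kernel source) of the relevant
part:

/*
 * Sketch of the remote wakeup step: the rq lock taken here is the same
 * lock schedule() holds while it dequeues the sleeping task, so we
 * either see the task still queued (and may store TASK_RUNNING) or see
 * it already dequeued and store nothing.
 */
static int ttwu_remote_sketch(struct task_struct *p, int wake_flags)
{
	struct rq_flags rf;
	struct rq *rq;
	int ret = 0;

	rq = __task_rq_lock(p, &rf);		/* serializes with schedule() */
	if (task_on_rq_queued(p)) {
		ttwu_do_wakeup(rq, p, wake_flags, &rf);	/* p->state = TASK_RUNNING */
		ret = 1;
	}
	__task_rq_unlock(rq, &rf);

	return ret;
}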
On Thu, Apr 26, 2018 at 06:18:20PM +0200, Oleg Nesterov wrote:
> On 04/26, Peter Zijlstra wrote:
> >
> > For the others, I think we want to do something like the below. I still
> > need to look at TASK_TRACED, which I suspect is also special,
>
> Yes, and TASK_STOPPED.
I already did that one, right ;-)
>
> ptrace_freeze_traced() and ptrace_unfreeze_traced() should be fine, but
> ptrace_stop() wants set_special_state() too, I think.
Thanks for going through that, I'll update the patch.
> > +/*
> > + * set_special_state() should be used for those states when the blocking task
> > + * can not use the regular condition based wait-loop. In that case we must
> > + * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
> > + * will not collide with out state change.
> > + */
> > +#define set_special_state(state_value) \
> > + do { \
> > + unsigned long flags; /* may shadow */ \
> > > + raw_spin_lock_irqsave(&current->pi_lock, flags); \
> > > + current->state = (state_value); \
> > > + raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> > + } while (0)
> > +
>
> Agreed.
>
> I thought that perhaps we can change ttwu_do_wakeup() cmpxchg() instead of
> plain p->state = TASK_RUNNING, but this helper looks much more clear and simple.
Aside from cmpxchg() slowing down the common fast path, that also
would not work on architectures where cmpxchg() is not in fact atomic vs
regular stores (and yes, sadly we still have those, PARISC and SPARC32
come to mind).
On Mon, Apr 30, 2018 at 01:20:29PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 06:18:20PM +0200, Oleg Nesterov wrote:
> > On 04/26, Peter Zijlstra wrote:
> > >
> > > For the others, I think we want to do something like the below. I still
> > > need to look at TASK_TRACED, which I suspect is also special,
> >
> > ptrace_freeze_traced() and ptrace_unfreeze_traced() should be fine, but
> > ptrace_stop() wants set_special_state() too, I think.
>
> Thanks for going through that, I'll update the patch.
Something like so then... I'll go write me some Changelog.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d697f3b573..51f46704e566 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -112,17 +112,38 @@ struct task_group;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+/*
+ * Special states are those that do not use the normal wait-loop pattern. See
+ * the comment with set_special_state().
+ */
+#define is_special_task_state(state) \
+ ((state) == TASK_DEAD || \
+ (state) == TASK_STOPPED || \
+ (state) == TASK_TRACED)
+
#define __set_current_state(state_value) \
do { \
+ WARN_ON_ONCE(is_special_task_state(state_value)); \
current->task_state_change = _THIS_IP_; \
current->state = (state_value); \
} while (0)
+
#define set_current_state(state_value) \
do { \
+ WARN_ON_ONCE(is_special_task_state(state_value)); \
current->task_state_change = _THIS_IP_; \
smp_store_mb(current->state, (state_value)); \
} while (0)
+#define set_special_state(state_value) \
+ do { \
+ unsigned long flags; /* may shadow */ \
+ WARN_ON_ONCE(!is_special_task_state(state_value)); \
+ raw_spin_lock_irqsave(&current->pi_lock, flags); \
+ current->task_state_change = _THIS_IP_; \
+ current->state = (state_value); \
+ raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
+ } while (0)
#else
/*
* set_current_state() includes a barrier so that the write of current->state
@@ -144,8 +165,8 @@ struct task_group;
*
* The above is typically ordered against the wakeup, which does:
*
- * need_sleep = false;
- * wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ * need_sleep = false;
+ * wake_up_state(p, TASK_UNINTERRUPTIBLE);
*
* Where wake_up_state() (and all other wakeup primitives) imply enough
* barriers to order the store of the variable against wakeup.
@@ -154,12 +175,33 @@ struct task_group;
* once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
* TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
*
- * This is obviously fine, since they both store the exact same value.
+ * However, with slightly different timing the wakeup TASK_RUNNING store can
+ * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
+ * a problem either because that will result in one extra go around the loop
+ * and our @cond test will save the day.
*
* Also see the comments of try_to_wake_up().
*/
-#define __set_current_state(state_value) do { current->state = (state_value); } while (0)
-#define set_current_state(state_value) smp_store_mb(current->state, (state_value))
+#define __set_current_state(state_value) \
+ current->state = (state_value)
+
+#define set_current_state(state_value) \
+ smp_store_mb(current->state, (state_value))
+
+/*
+ * set_special_state() should be used for those states when the blocking task
+ * can not use the regular condition based wait-loop. In that case we must
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
+ * will not collide with our state change.
+ */
+#define set_special_state(state_value) \
+ do { \
+ unsigned long flags; /* may shadow */ \
+ raw_spin_lock_irqsave(&current->pi_lock, flags); \
+ current->state = (state_value); \
+ raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
+ } while (0)
+
#endif
/* Task command name length: */
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a7ce74c74e49..113d1ad1ced7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -280,7 +280,7 @@ static inline void kernel_signal_stop(void)
{
spin_lock_irq(&current->sighand->siglock);
if (current->jobctl & JOBCTL_STOP_DEQUEUED)
- __set_current_state(TASK_STOPPED);
+ set_special_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);
schedule();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..3898a8047c11 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3498,23 +3498,8 @@ static void __sched notrace __schedule(bool preempt)
void __noreturn do_task_dead(void)
{
- /*
- * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
- * when the following two conditions become true.
- * - There is race condition of mmap_sem (It is acquired by
- * exit_mm()), and
- * - SMI occurs before setting TASK_RUNINNG.
- * (or hypervisor of virtual machine switches to other guest)
- * As a result, we may become TASK_RUNNING after becoming TASK_DEAD
- *
- * To avoid it, we have to wait for releasing tsk->pi_lock which
- * is held by try_to_wake_up()
- */
- raw_spin_lock_irq(&current->pi_lock);
- raw_spin_unlock_irq(&current->pi_lock);
-
/* Causes final put_task_struct in finish_task_switch(): */
- __set_current_state(TASK_DEAD);
+ set_special_state(TASK_DEAD);
/* Tell freezer to ignore us: */
current->flags |= PF_NOFREEZE;
diff --git a/kernel/signal.c b/kernel/signal.c
index d4ccea599692..564ee214e80a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
* atomic with respect to siglock and should be done after the arch
* hook as siglock is released and regrabbed across it.
*/
- set_current_state(TASK_TRACED);
+ set_special_state(TASK_TRACED);
current->last_siginfo = info;
current->exit_code = exit_code;
@@ -2176,7 +2176,7 @@ static bool do_signal_stop(int signr)
if (task_participate_group_stop(current))
notify = CLD_STOPPED;
- __set_current_state(TASK_STOPPED);
+ set_special_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);
/*
Sorry for the spam; adding the list.
On 4/30/2018 4:47 PM, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 09:23:25PM +0530, Kohli, Gaurav wrote:
>> On 4/26/2018 2:27 PM, Peter Zijlstra wrote:
>>
>>> On Thu, Apr 26, 2018 at 10:41:31AM +0200, Peter Zijlstra wrote:
>>>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>>>> index cd50e99202b0..4b6503c6a029 100644
>>>> --- a/kernel/kthread.c
>>>> +++ b/kernel/kthread.c
>>>> @@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_struct *task)
>>>> static void __kthread_parkme(struct kthread *self)
>>>> {
>>>> - __set_current_state(TASK_PARKED);
>>>> - while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>>>> + for (;;) {
>>>> + __set_task_state(TASK_PARKED);
>>> set_current_state(TASK_PARKED);
>>>
>>> of course..
>>
>> Hi Peter,
>>
>> Maybe i am missing something , but still that race can come as we don't put task_parked on special state.
>>
>> Controller Hotplug
>>
>> Loop
>>
>> Task_Interruptible
>>
>> Set SHOULD_PARK
>>
>> wakeup -> Proceeds
>>
>> Set Running
>>
>> kthread_parkme
>>
>> SET TASK_PARKED
>>
>> schedule
>>
>> Set TASK_RUNNING
>>
>> Can you please correct ME, if I misunderstood this.
>
> If that could happen, all wait-loops would be broken. However,
> AFAICT that cannot happen, because ttwu_remote() and schedule()
> serialize on rq->lock. See:
>
>
> A B
>
> for (;;) {
> set_current_state(UNINTERRUPTIBLE);
>
> cond1 = true;
> wake_up_process(A)
> lock(A->pi_lock)
> smp_mb__after_spinlock()
> if (A->state & TASK_NORMAL)
> A->on_rq && ttwu_remote()
> if (cond1) // true
> break;
> schedule();
> }
> __set_current_state(RUNNING);
>
Hi Peter,
Sorry for the late reply; I was on leave.
Thanks for the new patches. We will apply them and test for issue reproduction.
But in our older case, where we have seen the failure, below are the wakeup
path and the ftraces; the wakeup occurred and completed before the schedule() call.
So the final state of the cpuhp thread is running, not parked. I have also pasted
the debug ftraces that we got during issue reproduction.
Here wakeup for cpuhp is below:
takedown_cpu-> kthread_park-> wake_up_process
39,034,311,742,395 apps (10240) Trace Printk
cpuhp/0 (16) [000] 39015.625000: <debug> __kthread_parkme state=512
task=ffffffcc7458e680 flags: 0x5 -> state 5 -> state is parked inside
parkme function
39,034,311,846,510 apps (10240) Trace Printk
cpuhp/0 (16) [000] 39015.625000: <debug> before schedule
__kthread_parkme state=0 task=ffffffcc7458e680 flags: 0xd -> just
before schedule call, state is running
static void __kthread_parkme(struct kthread *self)
{
__set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
schedule();
__set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
}
So my point here is also that if it gets rescheduled it can set
TASK_PARKED, but it seems that after the takedown_cpu() call this thread never
gets a chance to run, so the final state is TASK_RUNNING.
With our current fix as well, can't we observe the same scenario where the
final state is TASK_RUNNING?
Regards
Gaurav
> for (;;) {
> set_current_state(UNINTERRUPTIBLE);
> if (cond2)
> break;
>
> schedule();
> lock(rq->lock)
> smp_mb__after_spinlock();
> deactivate_task(A);
> <sched-out>
> unlock(rq->lock);
> rq = __task_rq_lock(A)
> if (A->on_rq) // false
> A->state = TASK_RUNNING;
> __task_rq_unlock(rq)
>
>
> Either A's schedule() must observe RUNNING (not shown) or B must
> observe !A->on_rq (shown) and not issue the store.
On Tue, May 01, 2018 at 01:20:26PM +0530, Kohli, Gaurav wrote:
> But In our older case, where we have seen failure below is the wake up path
> and ftraces, Wakeup occured and completed before schedule call only.
>
> So final state of CPUHP is running not parked. I have also pasted debug
> ftraces that we got during issue reproduction.
>
> Here wakeup for cpuhp is below:
>
> takedown_cpu-> kthread_park-> wake_up_process
>
>
> 39,034,311,742,395 apps (10240) Trace Printk cpuhp/0 (16) [000]
> 39015.625000: <debug> __kthread_parkme state=512 task=ffffffcc7458e680
> flags: 0x5 -> state 5 -> state is parked inside parkme function
>
> 39,034,311,846,510 apps (10240) Trace Printk cpuhp/0 (16) [000]
> 39015.625000: <debug> before schedule __kthread_parkme state=0
> task=ffffffcc7458e680 flags: 0xd -> just before schedule call, state is
> running
>
> tatic void __kthread_parkme(struct kthread *self)
>
> {
>
> __set_current_state(TASK_PARKED);
>
> while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>
> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
>
> complete(&self->parked);
>
> schedule();
>
> __set_current_state(TASK_PARKED);
>
> }
>
> clear_bit(KTHREAD_IS_PARKED, &self->flags);
>
> __set_current_state(TASK_RUNNING);
>
> }
>
> So my point is here also, if it is reschedule then it can set TASK_PARKED,
> but it seems after takedown_cpu call this thread never get a chance to run,
> So final state is TASK_RUNNING.
>
> In our current fix also can't we observe same scenario where final state is
> TASK_RUNNING.
I'm not sure I understand your concern. Losing the TASK_PARKED store
with the above code is obviously bad. But with the loop as proposed I
don't see a problem.
takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
CPU before any of the threads are parked -- per having the complete()
before hitting schedule().
And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
migrate_tasks() will migrate any still runnable threads off the cpu.
But because at this point the thread must be in the PARKED wait-loop, it
will hit schedule() and go to sleep eventually.
Also note that kthread_unpark() does __kthread_bind() to rebind the
threads.
Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
before we rebind, so if the thread lost the first PARKED store,
does the completion, gets migrated, cycles through the loop and now
observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
will forever wait.
Is that what you had in mind?
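For reference, the ordering in question, reconstructed from the kthread_unpark()
context visible in the diffs below (simplified, current code before any of the
fixes discussed here):

void kthread_unpark(struct task_struct *k)
{
	struct kthread *kthread = to_kthread(k);

	/* 1) Let the thread leave the park wait-loop ... */
	clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);

	/* 2) ... and only afterwards rebind and wake it, keyed on IS_PARKED. */
	if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
		if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
			__kthread_bind(k, kthread->cpu, TASK_PARKED);
		wake_up_state(k, TASK_PARKED);
	}
}

A thread that lost the first TASK_PARKED store can run through the loop between
steps 1 and 2, observe !SHOULD_PARK and leave the wait-loop as a running task,
after which the rebind via __kthread_bind()/wait_task_inactive(TASK_PARKED)
gets into trouble, as discussed below.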
On Tue, May 01, 2018 at 12:18:45PM +0200, Peter Zijlstra wrote:
> Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
> before we rebind, so if the thread lost the first PARKED store,
> does the completion, gets migrated, cycles through the loop and now
> observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
> will forever wait.
>
> Is that what you had in mind?
Another possible problem is concurrent kthread_park(): if both callers observe
!IS_PARKED, we'll end up with 2 threads waiting on the completion, but
we only do a single complete().
Of course, this might not be a supported use of the API, and I don't
think this will ever actually happen. But the whole !test_bit(IS_PARKED)
thing seems to suggest it is fine calling this on an already parked
thread.
Confusing stuff that should be cleared up in any case.
On 5/1/2018 3:48 PM, Peter Zijlstra wrote:
> On Tue, May 01, 2018 at 01:20:26PM +0530, Kohli, Gaurav wrote:
>> But In our older case, where we have seen failure below is the wake up path
>> and ftraces, Wakeup occured and completed before schedule call only.
>>
>> So final state of CPUHP is running not parked. I have also pasted debug
>> ftraces that we got during issue reproduction.
>>
>> Here wakeup for cpuhp is below:
>>
>> takedown_cpu-> kthread_park-> wake_up_process
>>
>>
>> 39,034,311,742,395 apps (10240) Trace Printk cpuhp/0 (16) [000]
>> 39015.625000: <debug> __kthread_parkme state=512 task=ffffffcc7458e680
>> flags: 0x5 -> state 5 -> state is parked inside parkme function
>>
>> 39,034,311,846,510 apps (10240) Trace Printk cpuhp/0 (16) [000]
>> 39015.625000: <debug> before schedule __kthread_parkme state=0
>> task=ffffffcc7458e680 flags: 0xd -> just before schedule call, state is
>> running
>>
>> tatic void __kthread_parkme(struct kthread *self)
>>
>> {
>>
>> __set_current_state(TASK_PARKED);
>>
>> while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>>
>> if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
>>
>> complete(&self->parked);
>>
>> schedule();
>>
>> __set_current_state(TASK_PARKED);
>>
>> }
>>
>> clear_bit(KTHREAD_IS_PARKED, &self->flags);
>>
>> __set_current_state(TASK_RUNNING);
>>
>> }
>>
>> So my point is here also, if it is reschedule then it can set TASK_PARKED,
>> but it seems after takedown_cpu call this thread never get a chance to run,
>> So final state is TASK_RUNNING.
>>
>> In our current fix also can't we observe same scenario where final state is
>> TASK_RUNNING.
>
> I'm not sure I understand your concern. Loosing the TASK_PARKED store
> with the above code is obviously bad. But with the loop as proposed I
> don't see a problem.
Yes, with the loop it will reset TASK_PARKED, but that is not what is happening
in the dumps we have seen. Here, before schedule() the state is RUNNING, and
cpuhp got migrated to some core but never got a chance to run, so its state
stays running.
>
> takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
> CPU before any of the threads are parked -- per having the complete()
> before hitting schedule().
>
> And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
> migrate_tasks() will migrate any still runnable threads off the cpu.
> But because at this point the thread must be in the PARKED wait-loop, it
> will hit schedule() and go to sleep eventually.
>
> Also note that kthread_unpark() does __kthread_bind() to rebind the
> threads.
>
> Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
> before we rebind, so if the thread lost the first PARKED store,
> does the completion, gets migrated, cycles through the loop and now
> observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
> will forever wait.
>
So during the next unpark:
kthread_unpark() -> __kthread_bind() -> wait_task_inactive() (this fails,
as the current state is running, so it fails on the below call):
while (task_running(rq, p)) {
if (match_state && unlikely(p->state !=
match_state))
return 0;
cpu_relax();
}
and gives a warning:
        if (!wait_task_inactive(p, state)) {
                WARN_ON(1);
                return;   -> returns from here, so the bind call that
                             follows this code never happens.
        }
Finally it hits the BUG_ON() here, as we failed to rebind the hotplug thread to our core:
}
kthread_parkme();
/* We might have been woken for stop */
continue;
}
BUG_ON(td->cpu != smp_processor_id());   <- panic occurred here
So it seems we always have to end up in PARKED state, without missing a
single instance.
> Is that what you had in mind?
On Tue, May 01, 2018 at 12:18:45PM +0200, Peter Zijlstra wrote:
> Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
> before we rebind, so if the thread lost the first PARKED store,
> does the completion, gets migrated, cycles through the loop and now
> observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
> will forever wait.
Something like so perhaps...
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -451,6 +451,21 @@ void kthread_unpark(struct task_struct *
{
struct kthread *kthread = to_kthread(k);
+ if (test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
+ /*
+ * Newly created kthread was parked when the CPU was offline.
+ * The binding was lost and we need to set it again.
+ */
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ }
+
+ /*
+ * Ensures the IS_PARKED load precedes the !SHOULD_PARK store.
+ * matched by the smp_mb() from test_and_set_bit() in __kthread_parkme().
+ */
+ smp_mb__before_atomic();
+
clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
/*
* We clear the IS_PARKED bit here as we don't wait
@@ -458,15 +473,8 @@ void kthread_unpark(struct task_struct *
* park before that happens we'd see the IS_PARKED bit
* which might be about to be cleared.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- /*
- * Newly created kthread was parked when the CPU was offline.
- * The binding was lost and we need to set it again.
- */
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu, TASK_PARKED);
+ if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags))
wake_up_state(k, TASK_PARKED);
- }
}
EXPORT_SYMBOL_GPL(kthread_unpark);
On Tue, May 01, 2018 at 04:10:53PM +0530, Kohli, Gaurav wrote:
> Yes with loop, it will reset TASK_PARKED but that is not happening in the
> dumps we have seen.
But was that with or without the fixed wait-loop? I don't care about
stuff you might have seen with the current code, that is clearly broken.
> > takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
> > CPU before any of the threads are parked -- per having the complete()
> > before hitting schedule().
> >
> > And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
> > migrate_tasks() will migrate any still runnable threads off the cpu.
> > But because at this point the thread must be in the PARKED wait-loop, it
> > will hit schedule() and go to sleep eventually.
> >
> > Also note that kthread_unpark() does __kthread_bind() to rebind the
> > threads.
> >
> > Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
> > before we rebind, so if the thread lost the first PARKED store,
> > does the completion, gets migrated, cycles through the loop and now
> > observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
> > will forever wait.
> >
>
> So during next unpark
> __kthread_unpark -> __kthread_bind -> wait_task_inactive (this got failed,
> as current state is running so failed on below call:
Aah, yes, I seem to have mis-remembered how wait_task_inactive() works.
And it is indeed still a problem..
Let me ponder what the best solution is, it's a bit of a mess.
On 5/1/2018 5:01 PM, Peter Zijlstra wrote:
> On Tue, May 01, 2018 at 04:10:53PM +0530, Kohli, Gaurav wrote:
>> Yes with loop, it will reset TASK_PARKED but that is not happening in the
>> dumps we have seen.
>
> But was that with or without the fixed wait-loop? I don't care about
> stuff you might have seen with the current code, that is clearly broken.
>
>>> takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
>>> CPU before any of the threads are parked -- per having the complete()
>>> before hitting schedule().
>>>
>>> And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
>>> migrate_tasks() will migrate any still runnable threads off the cpu.
>>> But because at this point the thread must be in the PARKED wait-loop, it
>>> will hit schedule() and go to sleep eventually.
>>>
>>> Also note that kthread_unpark() does __kthread_bind() to rebind the
>>> threads.
>>>
>>> Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
>>> before we rebind, so if the thread lost the first PARKED store,
>>> does the completion, gets migrated, cycles through the loop and now
>>> observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
>>> will forever wait.
>>>
>>
>> So during next unpark
>> __kthread_unpark -> __kthread_bind -> wait_task_inactive (this got failed,
>> as current state is running so failed on below call:
>
> Aah, yes, I seem to have mis-remembered how wait_task_inactive() works.
> And it is indeed still a problem..
>
> Let me ponder what the best solution is, it's a bit of a mess.
>
Sure , Thanks a lot.
> On 5/1/2018 5:01 PM, Peter Zijlstra wrote:
> > Let me ponder what the best solution is, it's a bit of a mess.
So there's:
- TASK_PARKED, which we can make a special state; this solves the
problem because then wait_task_inactive() is guaranteed to see
TASK_PARKED and DTRT.
- complete(&kthread->parked), which we can do inside schedule(); this
solves the problem because then kthread_park() will not return early
and the task really is blocked.
and I'm fairly sure I thought of a 3rd way to cure things, but now that
I'm writing things down I cannot seem to remember :/ -- could be we muck
with wait_task_inactive().
In any case, I hate all of them, but I think the completion one is the
least horrible because it gives the strongest guarantees and cleans up
most. But promoting TASK_PARKED to special certainly is the easier
patch.
The below boots, but that's about all I did with it. Opinions?
---
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -62,6 +62,7 @@ void *kthread_probe_data(struct task_str
int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
+void kthread_park_complete(struct task_struct *k);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -55,7 +55,6 @@ enum KTHREAD_BITS {
KTHREAD_IS_PER_CPU = 0,
KTHREAD_SHOULD_STOP,
KTHREAD_SHOULD_PARK,
- KTHREAD_IS_PARKED,
};
static inline void set_kthread_struct(void *kthread)
@@ -177,14 +176,12 @@ void *kthread_probe_data(struct task_str
static void __kthread_parkme(struct kthread *self)
{
- __set_current_state(TASK_PARKED);
- while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
- if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
- complete(&self->parked);
+ for (;;) {
+ set_current_state(TASK_PARKED);
+ if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+ break;
schedule();
- __set_current_state(TASK_PARKED);
}
- clear_bit(KTHREAD_IS_PARKED, &self->flags);
__set_current_state(TASK_RUNNING);
}
@@ -194,6 +191,11 @@ void kthread_parkme(void)
}
EXPORT_SYMBOL_GPL(kthread_parkme);
+void kthread_park_complete(struct task_struct *k)
+{
+ complete(&to_kthread(k)->parked);
+}
+
static int kthread(void *_create)
{
/* Copy data: it's on kthread's stack */
@@ -450,22 +452,15 @@ void kthread_unpark(struct task_struct *
{
struct kthread *kthread = to_kthread(k);
- clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
/*
- * We clear the IS_PARKED bit here as we don't wait
- * until the task has left the park code. So if we'd
- * park before that happens we'd see the IS_PARKED bit
- * which might be about to be cleared.
+ * Newly created kthread was parked when the CPU was offline.
+ * The binding was lost and we need to set it again.
*/
- if (test_and_clear_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- /*
- * Newly created kthread was parked when the CPU was offline.
- * The binding was lost and we need to set it again.
- */
- if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
- __kthread_bind(k, kthread->cpu, TASK_PARKED);
- wake_up_state(k, TASK_PARKED);
- }
+ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
+ __kthread_bind(k, kthread->cpu, TASK_PARKED);
+
+ clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ wake_up_state(k, TASK_PARKED);
}
EXPORT_SYMBOL_GPL(kthread_unpark);
@@ -488,12 +483,13 @@ int kthread_park(struct task_struct *k)
if (WARN_ON(k->flags & PF_EXITING))
return -ENOSYS;
- if (!test_bit(KTHREAD_IS_PARKED, &kthread->flags)) {
- set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
- if (k != current) {
- wake_up_process(k);
- wait_for_completion(&kthread->parked);
- }
+ if (WARN_ON_ONCE(test_bit(KTHREAD_SHOULD_PARK, &kthread->flags)))
+ return -EBUSY;
+
+ set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ if (k != current) {
+ wake_up_process(k);
+ wait_for_completion(&kthread->parked);
}
return 0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7,6 +7,8 @@
*/
#include "sched.h"
+#include <linux/kthread.h>
+
#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -2718,20 +2720,28 @@ static struct rq *finish_task_switch(str
membarrier_mm_sync_core_before_usermode(mm);
mmdrop(mm);
}
- if (unlikely(prev_state == TASK_DEAD)) {
- if (prev->sched_class->task_dead)
- prev->sched_class->task_dead(prev);
+ if (unlikely(prev_state & (TASK_DEAD|TASK_PARKED))) {
+ switch (prev_state) {
+ case TASK_DEAD:
+ if (prev->sched_class->task_dead)
+ prev->sched_class->task_dead(prev);
+
+ /*
+ * Remove function-return probe instances associated with this
+ * task and put them back on the free list.
+ */
+ kprobe_flush_task(prev);
- /*
- * Remove function-return probe instances associated with this
- * task and put them back on the free list.
- */
- kprobe_flush_task(prev);
+ /* Task is done with its stack. */
+ put_task_stack(prev);
- /* Task is done with its stack. */
- put_task_stack(prev);
+ put_task_struct(prev);
+ break;
- put_task_struct(prev);
+ case TASK_PARKED:
+ kthread_park_complete(prev);
+ break;
+ }
}
tick_nohz_task_switch();
On 5/1/2018 6:49 PM, Peter Zijlstra wrote:
>> On 5/1/2018 5:01 PM, Peter Zijlstra wrote:
>
>>> Let me ponder what the best solution is, it's a bit of a mess.
>
> So there's:
>
> - TASK_PARKED, which we can make a special state; this solves the
> problem because then wait_task_inactive() is guaranteed to see
> TASK_PARKED and DTRT.
Yes, this should work, as it takes the pi_lock while setting TASK_PARKED in
kthread_parkme(), blocking the racing wakeup.
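For reference, set_special_state() does the state store under ->pi_lock, which
is what serializes it against a concurrent try_to_wake_up(). A rough sketch of
the 4.17-era definition in include/linux/sched.h (a sketch, not the exact
mainline text):

	#define set_special_state(state_value)					\
		do {								\
			unsigned long flags;					\
			raw_spin_lock_irqsave(&current->pi_lock, flags);	\
			current->state = (state_value);				\
			raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
		} while (0)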
>
> - complete(&kthread->parked), which we can do inside schedule(); this
> solves the problem because then kthread_park() will not return early
> and the task really is blocked.
I think the completion will not help, as the problem is as below:
Control Thread                             CPUHP Thread

                                           cpuhp_thread_fun
                                             wake control thread
                                             complete(&st->done);
takedown_cpu
kthread_park
set_bit(KTHREAD_SHOULD_PARK)
                                           Here cpuhp is looping,
                                           // success case: generally, when the
                                           // issue does not hit, it schedules
                                           // out via:
                                           ht->thread_should_run(td->cpu)
                                           scheduler
                                           // failure case: before schedule, the
                                           // loop check kthread_should_park()
                                           // enters here, as PARKED is set
wake_up_process(k)
                                           __kthread_parkme
                                             complete(&self->parked);
SETS RUNNING
                                           schedule
wait_for_completion(&kthread->parked);
So even if we protect the complete(), it may see TASK_RUNNING as the final state.
That's why we took the pi_lock, so that wake_up_process() either sees TASK_RUNNING
and returns, or it finishes early, before the parkme call.
Please correct me if I misunderstood it. With the pi_lock approach we are
not seeing the issue, so your first solution (make TASK_PARKED a
special state) should fix this.
Regards
Gaurav
On Wed, May 02, 2018 at 10:45:52AM +0530, Kohli, Gaurav wrote:
> On 5/1/2018 6:49 PM, Peter Zijlstra wrote:
> > - complete(&kthread->parked), which we can do inside schedule(); this
> > solves the problem because then kthread_park() will not return early
> > and the task really is blocked.
>
> I think complete will not help, as problem is like below :
>
> Control Thread CPUHP thread
>
> cpuhp_thread_fun
> Wake control thread
> complete(&st->done);
>
> takedown_cpu
> kthread_park
> set_bit(KTHREAD_SHOULD_PARK
>
> Here cpuhp is looping,
> //success case
> Generally when issue is not
> coming
> it schedule out by below :
> ht->thread_should_run(td->cpu
> scheduler
> //failure case
> before schedule
> loop check
> (kthread_should_park()
> enter here as PARKED set
>
> wake_up_process(k)
If k has TASK_PARKED, then wake_up_process() which uses TASK_NORMAL will
no-op, because:
TASK_PARKED & TASK_NORMAL == 0
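(For reference, assuming the 4.17-era definitions in include/linux/sched.h and
kernel/sched/core.c, roughly:

	#define TASK_RUNNING		0x0000
	#define TASK_INTERRUPTIBLE	0x0001
	#define TASK_UNINTERRUPTIBLE	0x0002
	#define TASK_PARKED		0x0040
	#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

	int wake_up_process(struct task_struct *p)
	{
		return try_to_wake_up(p, TASK_NORMAL, 0);
	}

so a task that is already in TASK_PARKED does not match the wakeup mask and the
wakeup does nothing.)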
> __kthread_parkme
> complete(&self->parked);
> SETS RUNNING
> schedule
But suppose you do get that store, and we get to schedule() with
TASK_RUNNING: then schedule() will no-op and we'll go around the loop and
not complete.
See also: lkml.kernel.org/r/[email protected]
Either TASK_RUNNING gets set before we do schedule() and we go around
again, re-set TASK_PARKED, recheck the condition and re-call schedule(),
or we schedule() first and ttwu() will not issue the TASK_RUNNING store.
In either case, we'll eventually hit schedule() with TASK_PARKED. Then,
and only then will the complete() happen.
> wait_for_completion(&kthread->parked);
The point is, we'll only ever complete ^ that completion when we've
scheduled out the task in TASK_PARKED state. If the task didn't get
parked, no completion.
And that is the reason I like this approach above the others. It
guarantees the task really is parked when we ask for it. We don't have
to deal with the task still running and getting migrated to another CPU
nonsense.
On 5/2/2018 1:50 PM, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 10:45:52AM +0530, Kohli, Gaurav wrote:
>> On 5/1/2018 6:49 PM, Peter Zijlstra wrote:
>
>>> - complete(&kthread->parked), which we can do inside schedule(); this
>>> solves the problem because then kthread_park() will not return early
>>> and the task really is blocked.
>>
>> I think complete will not help, as problem is like below :
>>
>> Control Thread CPUHP thread
>>
>> cpuhp_thread_fun
>> Wake control thread
>> complete(&st->done);
>>
>> takedown_cpu
>> kthread_park
>> set_bit(KTHREAD_SHOULD_PARK
>>
>> Here cpuhp is looping,
>> //success case
>> Generally when issue is not
>> coming
>> it schedule out by below :
>> ht->thread_should_run(td->cpu
>> scheduler
>> //failure case
>> before schedule
>> loop check
>> (kthread_should_park()
>> enter here as PARKED set
>>
>> wake_up_process(k)
>
> If k has TASK_PARKED, then wake_up_process() which uses TASK_NORMAL will
> no-op, because:
>
> TASK_PARKED & TASK_NORMAL == 0
>
>> __kthread_parkme
>> complete(&self->parked);
>> SETS RUNNING
>> schedule
>
> But suppose, you do get that store, and we get to schedule with
> TASK_RUNNING, then schedule will no-op and we'll go around the loop and
> not complete.
>
> See also: lkml.kernel.org/r/[email protected]
>
> Either TASK_RUNNING gets set before we do schedule() and we go around
> again, re-set TASK_PARKED, recheck the condition and re-call schedule(),
> or we schedule() first and ttwu() will not issue the TASK_RUNNING store.
>
> In either case, we'll eventually hit schedule() with TASK_PARKED. Then,
> and only then will the complete() happen.
>
>> wait_for_completion(&kthread->parked);
>
> The point is, we'll only ever complete ^ that completion when we've
> scheduled out the task in TASK_PARKED state. If the task didn't get
> parked, no completion.
Thanks for the detailed explanation; yes, in all cases unpark will
observe the parked state only.
>
>
> And that is the reason I like this approach above the others. It
> guarantees the task really is parked when we ask for it. We don't have
> to deal with the task still running and getting migrated to another CPU
> nonsense.
>
Hi Peter,
We have tested with the new patch and are still seeing the same issue. In these
dumps we don't have debug traces, but it seems there still exists a race, from
code review. Can you please check it once:
Controller Thread                          CPUHP Thread

takedown_cpu
kthread_park
kthread_parkme
Set KTHREAD_SHOULD_PARK
                                           smpboot_thread_fn
                                           set Task interruptible
wake_up_process
                                           Kthread_parkme
                                           SET TASK_PARKED
                                           schedule
                                           raw_spin_lock(&rq->lock)
                                           context_switch
                                           finish_lock_switch
                                           Case TASK_PARKED
                                           kthread_park_complete
SET TASK_INTERRUPTIBLE
And we are also seeing the same warning during unpark of the cpuhp thread from the controller:
if (!wait_task_inactive(p, state)) {
WARN_ON(1);
return;
}
325.065893] [<ffffff8920ed0200>] kthread_unpark+0x80/0xd8
[ 325.065902] [<ffffff8920eab754>] bringup_cpu+0xa0/0x12c
[ 325.065910] [<ffffff8920eaae90>] cpuhp_invoke_callback+0xb4/0x5c8
[ 325.065917] [<ffffff8920eabd98>] cpuhp_up_callbacks+0x3c/0x154
[ 325.065924] [<ffffff8920ead220>] _cpu_up+0x134/0x208
[ 325.065931] [<ffffff8920ead45c>] do_cpu_up+0x168/0x1a0
[ 325.065938] [<ffffff8920ead4b8>] cpu_up+0x24/0x30
[ 325.065948] [<ffffff89215b1408>] cpu_subsys_online+0x20/0x2c
[ 325.065956] [<ffffff89215aac64>] device_online+0x70/0xb4
[ 325.065962] [<ffffff89215aad78>] online_store+0xd0/0xdc
[ 325.065971] [<ffffff89215a7424>] dev_attr_store+0x40/0x54
[ 325.065982] [<ffffff89210d8a98>] sysfs_kf_write+0x5c/0x74
[ 325.065988] [<ffffff89210d7b9c>] kernfs_fop_write+0xcc/0x1ec
[ 325.065999] [<ffffff8921049288>] vfs_write+0xb4/0x1d0
[ 325.066006] [<ffffff892104a858>] SyS_write+0x60/0xc0
[ 325.066014] [<ffffff8920e83770>] el0_svc_naked+0x24/0x28
And after this, the same crash occurred:
[ 325.521307] [<ffffff8920ed4aac>] smpboot_thread_fn+0x26c/0x2c8
[ 325.527295] [<ffffff8920ecfb24>] kthread+0xf4/0x108
I will add more ftrace debugging to check what is going on exactly.
Regards
Gaurav
Hi Peter,
As mentioned in the last mail, we are still seeing the issue with the latest
approach, and below is the suspected race, as mentioned earlier:
Controller Thread                          CPUHP Thread

takedown_cpu
kthread_park
kthread_parkme
Set KTHREAD_SHOULD_PARK
                                           smpboot_thread_fn
                                           set Task interruptible
wake_up_process
if (!(p->state & state))
        goto out;
                                           Kthread_parkme
                                           SET TASK_PARKED
                                           schedule
                                           raw_spin_lock(&rq->lock)
ttwu_remote
        waiting for __task_rq_lock
                                           context_switch
                                           finish_lock_switch
                                           Case TASK_PARKED
                                           kthread_park_complete
SET Running
So it seems the issue is still there with the latest mentioned fix,
"kthread, sched/wait: Fix kthread_parkme() completion issue".
Regards
Gaurav
I have to admit that I didn't try to follow this discussion, somehow I thought
that the plan was to use set_special_state(PARKED)...
On 06/05, Kohli, Gaurav wrote:
>
> As last mentioned on mail, we are still seeing issue with the latest
> approach and below is the susceptible race as mentioned earlier..
> controller Thread CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
> smpboot_thread_fn
> set Task interruptible
>
>
> wake_up_process
> if (!(p->state & state))
> goto out;
>
> Kthread_parkme
> SET TASK_PARKED
> schedule
> raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
> context_switch
>
> finish_lock_switch
>
>
>
> Case TASK_PARKED
> kthread_park_complete
>
>
> SET Running
I think you are right.
And, now that I look at 85f1abe0019fcb3ea10df7029056cf42702283a8
("kthread, sched/wait: Fix kthread_parkme() completion issue") I see this note
in the changelog:
The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migrating the still runnable thing.
OK, but __kthread_parkme() can be preempted before it calls schedule(), so the
caller still can be migrated? Plus kthread_park_complete() can be called twice.
No?
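To spell it out against the __kthread_parkme() loop from that commit, the
window would be here (annotated sketch, not new code):

	static void __kthread_parkme(struct kthread *self)
	{
		for (;;) {
			set_current_state(TASK_PARKED);
			if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
				break;
			/*
			 * Nothing disables preemption after the TASK_PARKED
			 * store above. If we are preempted here,
			 * finish_task_switch() sees prev->state == TASK_PARKED
			 * and completes kthread->parked even though we are
			 * still runnable and can be migrated; on the next pass
			 * we set TASK_PARKED again and the eventual schedule()
			 * completes a second time.
			 */
			schedule();
		}
		__set_current_state(TASK_RUNNING);
	}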
Oleg.
On Tue, Jun 05, 2018 at 05:08:41PM +0200, Oleg Nesterov wrote:
> On 06/05, Kohli, Gaurav wrote:
> >
> > As last mentioned on mail, we are still seeing issue with the latest
> > approach and below is the susceptible race as mentioned earlier..
> > controller Thread CPUHP Thread
> > takedown_cpu
> > kthread_park
> > kthread_parkme
> > Set KTHREAD_SHOULD_PARK
> > smpboot_thread_fn
> > set Task interruptible
> >
> >
> > wake_up_process
> > if (!(p->state & state))
> > goto out;
> >
> > Kthread_parkme
> > SET TASK_PARKED
> > schedule
> > raw_spin_lock(&rq->lock)
> > ttwu_remote
> > waiting for __task_rq_lock
> > context_switch
> >
> > finish_lock_switch
> >
> >
> >
> > Case TASK_PARKED
> > kthread_park_complete
> >
> >
> > SET Running
>
> I think you are right.
>
> And, now that I look at 85f1abe0019fcb3ea10df7029056cf42702283a8
> ("kthread, sched/wait: Fix kthread_parkme() completion issue") I see this note
> int the changelog:
>
> The alternative is to promote TASK_PARKED to a special state, this
> guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
> and we'll end up doing the right thing, but this preserves the whole
> icky business of potentially migrating the still runnable thing.
>
> OK, but __kthread_parkme() can be preempted before it calls schedule(), so the
> caller still can be migrated? Plus kthread_park_complete() can be called twice.
Argh... I forgot TASK_DEAD does the whole thing with preempt_disable().
Let me stare at that a bit.
On Tue, Jun 05, 2018 at 05:22:12PM +0200, Peter Zijlstra wrote:
> > OK, but __kthread_parkme() can be preempted before it calls schedule(), so the
> > caller still can be migrated? Plus kthread_park_complete() can be called twice.
>
> Argh... I forgot TASK_DEAD does the whole thing with preempt_disable().
> Let me stare at that a bit.
This should ensure we only ever complete when we read PARKED, right?
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b259af4a..e513b4600796 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2641,7 +2641,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
* past. prev == current is still correct but we need to recalculate this_rq
* because prev may have moved to another CPU.
*/
-static struct rq *finish_task_switch(struct task_struct *prev)
+static struct rq *finish_task_switch(struct task_struct *prev, bool preempt)
__releases(rq->lock)
{
struct rq *rq = this_rq();
@@ -2674,7 +2674,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*
* We must observe prev->state before clearing prev->on_cpu (in
* finish_task), otherwise a concurrent wakeup can get prev
- * running on another CPU and we could rave with its RUNNING -> DEAD
+ * running on another CPU and we could race with its RUNNING -> DEAD
* transition, resulting in a double drop.
*/
prev_state = prev->state;
@@ -2720,7 +2720,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
break;
case TASK_PARKED:
- kthread_park_complete(prev);
+ if (!preempt)
+ kthread_park_complete(prev);
break;
}
}
@@ -2784,7 +2785,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
* PREEMPT_COUNT kernels).
*/
- rq = finish_task_switch(prev);
+ rq = finish_task_switch(prev, false);
balance_callback(rq);
preempt_enable();
@@ -2797,7 +2798,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
- struct task_struct *next, struct rq_flags *rf)
+ struct task_struct *next, bool preempt, struct rq_flags *rf)
{
struct mm_struct *mm, *oldmm;
@@ -2839,7 +2840,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_to(prev, next, prev);
barrier();
- return finish_task_switch(prev);
+ return finish_task_switch(prev, preempt);
}
/*
@@ -3478,7 +3479,7 @@ static void __sched notrace __schedule(bool preempt)
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
- rq = context_switch(rq, prev, next, &rf);
+ rq = context_switch(rq, prev, next, preempt, &rf);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
@@ -3487,6 +3488,7 @@ static void __sched notrace __schedule(bool preempt)
balance_callback(rq);
}
+/* called with preemption disabled */
void __noreturn do_task_dead(void)
{
/* Causes final put_task_struct in finish_task_switch(): */
On 06/05, Peter Zijlstra wrote:
>
> On Tue, Jun 05, 2018 at 05:22:12PM +0200, Peter Zijlstra wrote:
>
> > > OK, but __kthread_parkme() can be preempted before it calls schedule(), so the
> > > caller still can be migrated? Plus kthread_park_complete() can be called twice.
> >
> > Argh... I forgot TASK_DEAD does the whole thing with preempt_disable().
> > Let me stare at that a bit.
>
> This should ensure we only ever complete when we read PARKED, right?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8d59b259af4a..e513b4600796 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,7 +2641,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> * past. prev == current is still correct but we need to recalculate this_rq
> * because prev may have moved to another CPU.
> */
> -static struct rq *finish_task_switch(struct task_struct *prev)
> +static struct rq *finish_task_switch(struct task_struct *prev, bool preempt)
> __releases(rq->lock)
> {
> struct rq *rq = this_rq();
> @@ -2674,7 +2674,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> *
> * We must observe prev->state before clearing prev->on_cpu (in
> * finish_task), otherwise a concurrent wakeup can get prev
> - * running on another CPU and we could rave with its RUNNING -> DEAD
> + * running on another CPU and we could race with its RUNNING -> DEAD
> * transition, resulting in a double drop.
> */
> prev_state = prev->state;
> @@ -2720,7 +2720,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> break;
>
> case TASK_PARKED:
> - kthread_park_complete(prev);
> + if (!preempt)
> + kthread_park_complete(prev);
Yes, but this won't fix the race described by Kohli...
Plus this complicates the schedule() paths for the very special case, and to me
it seems that all this kthread_park/unpark logic needs some serious cleanups...
Not that I can suggest something better right now.
Oleg.
Hi,
Just for info, the patch that I shared earlier with the pi_lock
approach has been tested for the last one month and no issue has been
observed:
https://lkml.org/lkml/2018/4/25/189
Can we take this if it looks good?
Regards
Gaurav
On Tue, Jun 05, 2018 at 06:35:16PM +0200, Oleg Nesterov wrote:
> Yes, but this won't fix the race described by Kohli...
Clearly I'm not going strong today... and yes, looking at that again I
didn't address that.
> Plus this complicates the schedule() paths for the very special case
I checked: we already spilled @preempt onto the stack, and context_switch()
is inlined, so the extra argument is a no-op. If finish_task_switch()
also gets inlined (possible, could force it), the only additional cost is
the @preempt load from the stack in the slow path. If it doesn't get inlined
we get the additional function call overhead, which should be minimal.
But yes, I wasn't a fan either.
> and to me it seems that all this kthread_park/unpark logic needs some
> serious cleanups...
I really hated how even with TASK_PARKED special the smpboot thread
could still get migrated, which is why I moved the completion. Do you
have any saner suggestions?
Humm, I suppose we could do wait_task_inactive() after
wait_for_completion(). I absolutely abhor wait_task_inactive(), but I
remember us both failing to fix that at least twice :/
Also, I think we still need TASK_PARKED as a special state for that.
How's the below?
---
include/linux/kthread.h | 1 -
include/linux/sched.h | 2 +-
kernel/kthread.c | 32 +++++++++++++++++++++++++-------
kernel/sched/core.c | 31 +++++++++++--------------------
4 files changed, 37 insertions(+), 29 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 2803264c512f..c1961761311d 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -62,7 +62,6 @@ void *kthread_probe_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
-void kthread_park_complete(struct task_struct *k);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 14e4f9c12337..4e32c1cc7794 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -117,7 +117,7 @@ struct task_group;
* the comment with set_special_state().
*/
#define is_special_task_state(state) \
- ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_DEAD))
+ ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
#define __set_current_state(state_value) \
do { \
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 481951bf091d..8f66a3dc767a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,24 @@ void *kthread_probe_data(struct task_struct *task)
static void __kthread_parkme(struct kthread *self)
{
for (;;) {
- set_current_state(TASK_PARKED);
+ /*
+ * TASK_PARKED is a special state; we must serialize against
+ * possible pending wakeups to avoid store-store collisions on
+ * task->state.
+ *
+ * Such a collision might possibly result in the task state
+ * changing from TASK_PARKED and us failing the
+ * wait_task_inactive() in kthread_park().
+ */
+ set_special_state(TASK_PARKED);
if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
break;
+
+ complete_all(&self->parked);
schedule();
}
__set_current_state(TASK_RUNNING);
+ reinit_completion(&self->parked);
}
void kthread_parkme(void)
@@ -191,11 +203,6 @@ void kthread_parkme(void)
}
EXPORT_SYMBOL_GPL(kthread_parkme);
-void kthread_park_complete(struct task_struct *k)
-{
- complete_all(&to_kthread(k)->parked);
-}
-
static int kthread(void *_create)
{
/* Copy data: it's on kthread's stack */
@@ -459,8 +466,10 @@ void kthread_unpark(struct task_struct *k)
if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags))
__kthread_bind(k, kthread->cpu, TASK_PARKED);
- reinit_completion(&kthread->parked);
clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ /*
+ * __kthread_parkme() will either see !SHOULD_PARK or get the wakeup.
+ */
wake_up_state(k, TASK_PARKED);
}
EXPORT_SYMBOL_GPL(kthread_unpark);
@@ -487,7 +496,16 @@ int kthread_park(struct task_struct *k)
set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
if (k != current) {
wake_up_process(k);
+ /*
+ * Wait for __kthread_parkme() to complete(), this means we
+ * _will_ have TASK_PARKED and are about to call schedule().
+ */
wait_for_completion(&kthread->parked);
+ /*
+ * Now wait for that schedule() to complete and the task to
+ * get scheduled out.
+ */
+ WARN_ON_ONCE(!wait_task_inactive(p, TASK_PARKED));
}
return 0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b259af4a..cf72c4eed7da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7,7 +7,6 @@
*/
#include "sched.h"
-#include <linux/kthread.h>
#include <linux/nospec.h>
#include <asm/switch_to.h>
@@ -2701,28 +2700,20 @@ static struct rq *finish_task_switch(struct task_struct *prev)
membarrier_mm_sync_core_before_usermode(mm);
mmdrop(mm);
}
- if (unlikely(prev_state & (TASK_DEAD|TASK_PARKED))) {
- switch (prev_state) {
- case TASK_DEAD:
- if (prev->sched_class->task_dead)
- prev->sched_class->task_dead(prev);
+ if (unlikely(prev_state == TASK_DEAD)) {
+ if (prev->sched_class->task_dead)
+ prev->sched_class->task_dead(prev);
- /*
- * Remove function-return probe instances associated with this
- * task and put them back on the free list.
- */
- kprobe_flush_task(prev);
-
- /* Task is done with its stack. */
- put_task_stack(prev);
+ /*
+ * Remove function-return probe instances associated with this
+ * task and put them back on the free list.
+ */
+ kprobe_flush_task(prev);
- put_task_struct(prev);
- break;
+ /* Task is done with its stack. */
+ put_task_stack(prev);
- case TASK_PARKED:
- kthread_park_complete(prev);
- break;
- }
+ put_task_struct(prev);
}
tick_nohz_task_switch();
On 06/05, Peter Zijlstra wrote:
>
> Also, I think we still need TASK_PARKED as a special state for that.
I think it would be nice to kill the TASK_PARKED state altogether. But I don't
know how. I'll try to look at this code later, but I am not sure I will find a
way to clean it up...
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -177,12 +177,24 @@ void *kthread_probe_data(struct task_struct *task)
> static void __kthread_parkme(struct kthread *self)
> {
> for (;;) {
> - set_current_state(TASK_PARKED);
> + /*
> + * TASK_PARKED is a special state; we must serialize against
> + * possible pending wakeups to avoid store-store collisions on
> + * task->state.
> + *
> + * Such a collision might possibly result in the task state
> + * changing from TASK_PARKED and us failing the
> + * wait_task_inactive() in kthread_park().
> + */
> + set_special_state(TASK_PARKED);
Agreed,
> if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
> break;
> +
> + complete_all(&self->parked);
> schedule();
> }
> __set_current_state(TASK_RUNNING);
> + reinit_completion(&self->parked);
But how can we know that all the callers of kthread_park() have already returned
from wait_for_completion() ?
Oh. The very fact that __kthread_parkme() does complete_all() proves that we need
some serious cleanups. In particular, I think that kthread_park() on a parked kthread
must not be possible.
Just look at this code. It looks as if __kthread_parkme() can race with _unpark()
and thus we need this wait-event-like loop.
But if it can race with _unpark() then kthread_park() can block forever.
For the start, can't we change kthread_park()
- set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+ if (test_and_set_bit(...))
+ return -EAGAIN;
and s/complete_all/complete/ in __kthread_parkme() ?
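Something like the below, just to sketch the idea (on top of the patch above,
not even compile tested):

--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ static void __kthread_parkme(struct kthread *self)
 		if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
 			break;
 
-		complete_all(&self->parked);
+		complete(&self->parked);
 		schedule();
@@ int kthread_park(struct task_struct *k)
-	set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
+	if (test_and_set_bit(KTHREAD_SHOULD_PARK, &kthread->flags))
+		return -EAGAIN;
+
 	if (k != current) {
 		wake_up_process(k);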
IIUC, this will only affect smpboot_update_cpumask_percpu_thread() which can hit
an already parked thread, but it doesn't need to wait.
And it seems that smpboot_update_cpumask_percpu_thread() in turn needs some cleanups.
Hmm. and its single user: kernel/watchdog.c.
And speaking of watchdog.c, can't we simply kill the "watchdog/%u" threads? This is
off-topic, but can't watchdog_timer_fn() use stop_one_cpu_nowait(watchdog) ?
And I really think we should unexport kthread_park/unpark(), only smpboot_thread_fn()
should use them. kthread() should not play with __kthread_parkme(). And even
KTHREAD_SHOULD_PARK must die, I mean it should live in struct smp_hotplug_thread,
not in struct kthread.
OK, this is off-topic too.
In short, I think this patch is fine but I didn't read it carefully, will try tomorrow.
And, let me repeat, can't we avoid complete_all() ?
Oleg.
On Wed, Jun 06, 2018 at 03:51:16PM +0200, Oleg Nesterov wrote:
> And I really think we should unexport kthread_park/unpark(), only smpboot_thread_fn()
> should use them. kthread() should not play with __kthread_parkme(). And even
> KTHREAD_SHOULD_PARK must die, I mean it should live in struct smp_hotplug_thread,
> not in struct kthread.
I tend to agree; however we need to fix drm and md first, because they
already use them :-(
On Wed, Jun 06, 2018 at 03:51:16PM +0200, Oleg Nesterov wrote:
> In particular, I think that kthread_park() on a parked kthread
> must not be possible.
It happens though; I put in a WARN and someone triggered it -- although
I could not reproduce :/
On Wed, Jun 06, 2018 at 03:51:16PM +0200, Oleg Nesterov wrote:
> On 06/05, Peter Zijlstra wrote:
> >
> > Also, I think we still need TASK_PARKED as a special state for that.
>
> I think it would be nice to kill the TASK_PARKED state altogether. But I don't
> know how. I'll try to look at this code later, but I am not sure I will find a
> way to cleanup it...
So the nice thing about having TASK_PARKED is that it guarantees no
spurious wakeups. You know the thread will not wake up while its CPU is
gone.
We could possibly re-purpose TASK_STOPPED (because kernel threads don't
do that) but that seems dodgy at best.
On Wed, Jun 06, 2018 at 03:51:16PM +0200, Oleg Nesterov wrote:
> IIUC, this will only affect smpboot_update_cpumask_percpu_thread() which can hit
> an already parked thread, but it doesn't need to wait.
>
> And it seems that smpboot_update_cpumask_percpu_thread() in turn needs some cleanups.
> Hmm. and its single user: kernel/watchdog.c.
>
> And speaking of watchdog.c, can't we simply kill the "watchdog/%u" threads? This is
> off-topic, but can't watchdog_timer_fn() use stop_one_cpu_nowait(watchdog) ?
>
> And I really think we should unexport kthread_park/unpark(), only smpboot_thread_fn()
> should use them. kthread() should not play with __kthread_parkme(). And even
> KTHREAD_SHOULD_PARK must die, I mean it should live in struct smp_hotplug_thread,
> not in struct kthread.
>
> OK, this is off-topic too.
> And, let me repeat, can't we avoid complete_all() ?
Yes, or at least if that watchdog crap is the only user.
I have most of the patch reworking watchdog.c to use stop_one_cpu*(),
and that cleans up lots -- of course, I've not tested it yet, so it
could also be breaking lots :-)
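Roughly along these lines, just to sketch the shape of it (names and placement
are made up here, this is not that patch):

	#include <linux/stop_machine.h>	/* stop_one_cpu_nowait(), struct cpu_stop_work */

	static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);

	/* Runs in the per-cpu stopper thread, so it cannot be preempted or
	 * migrated off the CPU it is checking. */
	static int softlockup_fn(void *data)
	{
		/* ... per-cpu softlockup bookkeeping goes here ... */
		return 0;
	}

	/* and from the watchdog hrtimer callback, instead of waking the parked
	 * "watchdog/%u" smpboot thread: */
	stop_one_cpu_nowait(smp_processor_id(), softlockup_fn, NULL,
			    this_cpu_ptr(&softlockup_stop_work));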
Hi,
In the latest patch mentioned, k should be there instead of p:
-WARN_ON_ONCE(!wait_task_inactive(p, TASK_PARKED))
+WARN_ON_ONCE(!wait_task_inactive(k, TASK_PARKED))
Regards
Gaurav