2014-10-07 19:54:08

by Oleg Nesterov

Subject: [PATCH 0/1] sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

And note that another caller of task_preempt_count(), set_task_cpu(), is
fine but it doesn't really need this helper.

And afaics we do not need ->saved_preempt_count at all, the trivial
patch below makes it unnecessary, we can kill it and all its users.

Not only will this simplify the code, it will also make the per-cpu
preempt counter (well, almost) arch-agnostic.

Or am I missing something?

Do you think this makes sense? If yes, I'll try to make the patches.

Oleg.
---

OBVIOUSLY INCOMPLETE, BREAKS EVERYTHING EXCEPT x86. Just to explain what
I mean.

(this depends on !__ARCH_WANT_UNLOCKED_CTXSW, but it was already removed
and x86 doesn't use it anyway).


--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2279,6 +2279,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
{
struct rq *rq = this_rq();

+ preempt_count_set(PREEMPT_DISABLED);
finish_task_switch(rq, prev);

/*
@@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
+ int pc;

prepare_task_switch(rq, prev, next);

@@ -2338,10 +2340,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
#endif

context_tracking_task_switch(prev, next);
+
/* Here we just switch the register state and the stack. */
+ pc = this_cpu_read(__preempt_count);
switch_to(prev, next, prev);
-
barrier();
+ preempt_count_set(pc);
+
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack


2014-10-07 19:54:27

by Oleg Nesterov

Subject: [PATCH 1/1] sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

task_preempt_count() has nothing to do with the actual preempt counter,
thread_info->saved_preempt_count is only valid right after switch_to().

__trace_sched_switch_state() can use preempt_count(), prev is still the
current task when trace_sched_switch() is called.

Signed-off-by: Oleg Nesterov <[email protected]>
---
include/trace/events/sched.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..a7d67bc 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -100,7 +100,7 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
/*
* For all intents and purposes a preempted task is a running task.
*/
- if (task_preempt_count(p) & PREEMPT_ACTIVE)
+ if (preempt_count() & PREEMPT_ACTIVE)
state = TASK_RUNNING | TASK_STATE_MAX;
#endif

--
1.5.5.1
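
For reference, the helper this patch stops using is just a thin wrapper around
the saved (x86) or raw (generic) counter; in the 3.17-era headers it reads
roughly as below, and the later "sched: kill task_preempt_count()" patch removes
exactly these definitions:

/* arch/x86/include/asm/preempt.h */
#define task_preempt_count(p) \
	(task_thread_info(p)->saved_preempt_count & ~PREEMPT_NEED_RESCHED)

/* include/asm-generic/preempt.h */
#define task_preempt_count(p) \
	(task_thread_info(p)->preempt_count & ~PREEMPT_NEED_RESCHED)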

2014-10-08 08:00:20

by Peter Zijlstra

Subject: Re: [PATCH 0/1] sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

On Tue, Oct 07, 2014 at 09:50:46PM +0200, Oleg Nesterov wrote:
> And note that another caller of task_preempt_count(), set_task_cpu(), is
> fine but it doesn't really need this helper.
>
> And afaics we do not need ->saved_preempt_count at all, the trivial
> patch below makes it unnecessary, we can kill it and all its users.
>
> Not only will this simplify the code, it will also make the per-cpu
> preempt counter (well, almost) arch-agnostic.
>
> Or am I missing something?

Two things, per-cpu isn't always faster on some archs, and load-store
archs have problems with PREEMPT_NEED_RESCHED, although arguably you can
do per-cpu preempt count without that.

> Do you think this makes sense? If yes, I'll try to make the patches.

It penalizes everything but x86 I think. There is no other arch that has
per-cpu preempt count atm.
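
For context, a simplified sketch of the x86-only scheme under discussion,
loosely based on the 3.17-era arch/x86/include/asm/preempt.h (the identifiers
follow that header): the preempt counter lives in a per-cpu int, with the
need-resched state folded into its top bit and stored inverted, so the
preempt_enable() fast path can be a single decrement-and-test.

DECLARE_PER_CPU(int, __preempt_count);

static __always_inline int preempt_count(void)
{
	/* mask out the folded need-resched bit, callers want the pure count */
	return raw_cpu_read_4(__preempt_count) & ~PREEMPT_NEED_RESCHED;
}

static __always_inline void set_preempt_need_resched(void)
{
	/* the bit is stored inverted: clearing it means "reschedule needed" */
	raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED);
}

A load-store architecture cannot do this folded read-modify-write as one cheap
per-cpu instruction, which is presumably the PREEMPT_NEED_RESCHED problem
mentioned above.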

2014-10-08 18:36:28

by Oleg Nesterov

Subject: [PATCH 0/2] (Was: sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state())

On 10/08, Peter Zijlstra wrote:
>
> On Tue, Oct 07, 2014 at 09:50:46PM +0200, Oleg Nesterov wrote:
> > And note that another caller of task_preempt_count(), set_task_cpu(), is
> > fine but it doesn't really need this helper.
> >
> > And afaics we do not need ->saved_preempt_count at all, the trivial
> > patch below makes it unnecessary, we can kill it and all its users.
> >
> > Not only will this simplify the code, it will also make the per-cpu
> > preempt counter (well, almost) arch-agnostic.
> >
> > Or am I missing something?
>
> Two things, per-cpu isn't always faster on some archs, and load-store
> archs have problems with PREEMPT_NEED_RESCHED, although arguably you can
> do per-cpu preempt count without that.

Ah, but I didn't mean we should make it per-cpu on every arch.

I meant that (imo) this change can clean up the x86 code, and it can also help
if we want to change another arch to use per-cpu preempt_count.

> > Do you think this makes sense? If yes, I'll try to make the patches.
>
> It penalizes everything but x86 I think.

I don't think so.

But please forget it for the moment, let's discuss this later. Let me start
with 2 simple preparations which imho make sense anyway. Then we will see.

1/2 looks like the obvious bugfix (iirc we already discussed this a bit),
2/2 depends on this patch.

Oleg.

2014-10-08 18:36:47

by Oleg Nesterov

Subject: [PATCH 1/2] sched: schedule_tail() should disable preemption

finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.

Another problem is that finish_task_switch() itself runs with preempt
enabled after finish_lock_switch(). If nothing else this means that
->sched_in() notifier can't trust its "cpu" arg.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sched/core.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 703c7e6..3f267e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,15 +2277,14 @@ static inline void post_schedule(struct rq *rq)
asmlinkage __visible void schedule_tail(struct task_struct *prev)
__releases(rq->lock)
{
- struct rq *rq = this_rq();
+ struct rq *rq;

+ /* finish_task_switch() drops rq->lock and enables preemption */
+ preempt_disable();
+ rq = this_rq();
finish_task_switch(rq, prev);
-
- /*
- * FIXME: do we need to worry about rq being invalidated by the
- * task_switch?
- */
post_schedule(rq);
+ preempt_enable();

if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);
--
1.5.5.1
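
The "enables preemption" in the changelog is implicit: finish_task_switch() ends
in finish_lock_switch(), and releasing rq->lock there is what re-enables
preemption. A simplified sketch, with the lockdep/debug details omitted
(assuming the 3.17-era kernel/sched/sched.h):

static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
{
	/*
	 * raw_spin_unlock_irq() ends with local_irq_enable() and
	 * preempt_enable(), so everything after this point in
	 * finish_task_switch()/schedule_tail() runs preemptible,
	 * unless the caller disabled preemption itself as the patch
	 * above does.
	 */
	raw_spin_unlock_irq(&rq->lock);
}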

2014-10-08 18:37:11

by Oleg Nesterov

Subject: [PATCH 2/2] sched: kill task_preempt_count()

task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.

Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.

Signed-off-by: Oleg Nesterov <[email protected]>
---
arch/x86/include/asm/preempt.h | 3 ---
include/asm-generic/preempt.h | 3 ---
kernel/sched/core.c | 2 +-
3 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 4008734..8f32718 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -30,9 +30,6 @@ static __always_inline void preempt_count_set(int pc)
/*
* must be macros to avoid header recursion hell
*/
-#define task_preempt_count(p) \
- (task_thread_info(p)->saved_preempt_count & ~PREEMPT_NEED_RESCHED)
-
#define init_task_preempt_count(p) do { \
task_thread_info(p)->saved_preempt_count = PREEMPT_DISABLED; \
} while (0)
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 1cd3f5d..eb6f9e6 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -23,9 +23,6 @@ static __always_inline void preempt_count_set(int pc)
/*
* must be macros to avoid header recursion hell
*/
-#define task_preempt_count(p) \
- (task_thread_info(p)->preempt_count & ~PREEMPT_NEED_RESCHED)
-
#define init_task_preempt_count(p) do { \
task_thread_info(p)->preempt_count = PREEMPT_DISABLED; \
} while (0)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f267e8..cfe9905 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1056,7 +1056,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* ttwu() will sort out the placement.
*/
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
- !(task_preempt_count(p) & PREEMPT_ACTIVE));
+ !p->on_rq);

#ifdef CONFIG_LOCKDEP
/*
--
1.5.5.1

2014-10-08 19:40:06

by Oleg Nesterov

Subject: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On 10/08, Oleg Nesterov wrote:
>
> Another problem is that finish_task_switch() itself runs with preempt
> enabled after finish_lock_switch(). If nothing else this means that
> ->sched_in() notifier can't trust its "cpu" arg.

OOPS, this obviously can't happen, ->preempt_notifiers must be empty.
Remove this part from the changelog, please see v2.

I am not sure about finish_arch_post_lock_switch() ... but probably
it should be fine without preempt_disable.

-------------------------------------------------------------------------------
Subject: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sched/core.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 703c7e6..3f267e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,15 +2277,14 @@ static inline void post_schedule(struct rq *rq)
asmlinkage __visible void schedule_tail(struct task_struct *prev)
__releases(rq->lock)
{
- struct rq *rq = this_rq();
+ struct rq *rq;

+ /* finish_task_switch() drops rq->lock and enables preemption */
+ preempt_disable();
+ rq = this_rq();
finish_task_switch(rq, prev);
-
- /*
- * FIXME: do we need to worry about rq being invalidated by the
- * task_switch?
- */
post_schedule(rq);
+ preempt_enable();

if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);
--
1.5.5.1

2014-10-08 21:37:35

by Kirill Tkhai

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On Wed, 08/10/2014 at 21:36 +0200, Oleg Nesterov wrote:
> On 10/08, Oleg Nesterov wrote:
> >
> > Another problem is that finish_task_switch() itself runs with preempt
> > enabled after finish_lock_switch(). If nothing else this means that
> > ->sched_in() notifier can't trust its "cpu" arg.
>
> OOPS, this obviously can't happen, ->preempt_notifiers must be empty.
> Remove this part from the changelog, please see v2.
>
> I am not sure about finish_arch_post_lock_switch() ... but probably
> it should be fine without preempt_disable.
>
> -------------------------------------------------------------------------------
> Subject: [PATCH v2 1/2] sched: schedule_tail() should disable preemption
>
> finish_task_switch() enables preemption, so post_schedule(rq) can be
> called on the wrong (and even dead) CPU. Afaics, nothing really bad
> can happen, but in this case we can wrongly clear rq->post_schedule
> on that CPU. And this simply looks wrong in any case.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/sched/core.c | 11 +++++------
> 1 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 703c7e6..3f267e8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2277,15 +2277,14 @@ static inline void post_schedule(struct rq *rq)
> asmlinkage __visible void schedule_tail(struct task_struct *prev)
> __releases(rq->lock)
> {
> - struct rq *rq = this_rq();
> + struct rq *rq;
>
> + /* finish_task_switch() drops rq->lock and enables preemption */
> + preempt_disable();

Maybe the code would look simpler if we change
init_task_preempt_count() and create new tasks
with preempt_count() == 2, so this preempt_disable()
won't be necessary. But it's more or less subjective.

Reviewed-by: Kirill Tkhai <[email protected]>

> + rq = this_rq();
> finish_task_switch(rq, prev);
> -
> - /*
> - * FIXME: do we need to worry about rq being invalidated by the
> - * task_switch?
> - */
> post_schedule(rq);
> + preempt_enable();
>
> if (current->set_child_tid)
> put_user(task_pid_vnr(current), current->set_child_tid);



2014-10-09 15:01:08

by Oleg Nesterov

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On 10/09, Kirill Tkhai wrote:
>
> On Wed, 08/10/2014 at 21:36 +0200, Oleg Nesterov wrote:
> > @@ -2277,15 +2277,14 @@ static inline void post_schedule(struct rq *rq)
> > asmlinkage __visible void schedule_tail(struct task_struct *prev)
> > __releases(rq->lock)
> > {
> > - struct rq *rq = this_rq();
> > + struct rq *rq;
> >
> > + /* finish_task_switch() drops rq->lock and enables preemption */
> > + preempt_disable();
>
> Maybe the code would look simpler if we change
> init_task_preempt_count() and create new tasks
> with preempt_count() == 2, so this preempt_disable()
> won't be necessary. But it's more or less subjective.

Yes, yes, I thought about the same.

Except I think we should kill init_task_preempt_count() and change
schedule_tail()

- preempt_disable();
+ preempt_count_set(PREEMPT_DISABLED + 1);

but first we need to remove ->saved_preempt_count.

> Reviewed-by: Kirill Tkhai <[email protected]>

Thanks!

Oleg.

2014-10-09 15:17:39

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On Thu, Oct 09, 2014 at 04:57:26PM +0200, Oleg Nesterov wrote:

> but first we need to remove ->saved_preempt_count.

Why do you want to kill that? Your earlier proposal would penalize every
!x86 arch by adding extra code to the scheduler core while they already
automagically preserve their thread_info::preempt_count.

2014-10-09 17:00:43

by Oleg Nesterov

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

Peter,

let me first say that I understand that cleanups are always subjective.
So if you do not like it - I won't argue at all.

On 10/09, Peter Zijlstra wrote:
>
> On Thu, Oct 09, 2014 at 04:57:26PM +0200, Oleg Nesterov wrote:
>
> > but first we need to remove ->saved_preempt_count.
>
> Why do you want to kill that?

Because imo this makes the code a bit simpler. But (perhaps) mostly because
personally I dislike any "special" member in task_struct/thread_info, and
it seems to me that ->saved_preempt_count buys nothing. We only need it
to record/restore the counter before/after switch_to(); a local variable
looks better to me.
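
The record/restore in question is done inside the x86 __switch_to();
approximately (from the 3.17-era arch/x86/kernel/process_64.c, with the same
pattern in process_32.c):

#ifdef CONFIG_PREEMPT_COUNT
	/*
	 * Save prev's counter into its thread_info and load next's saved
	 * counter into the per-cpu variable. This pair is what a local
	 * variable around switch_to() in context_switch() could replace.
	 */
	task_thread_info(prev_p)->saved_preempt_count = this_cpu_read(__preempt_count);
	this_cpu_write(__preempt_count, task_thread_info(next_p)->saved_preempt_count);
#endif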

But again, see above. If the maintainer doesn't like the cleanup - then
it should be counted as uglification ;)

> Your earlier proposal would penalize every
> !x86 arch by adding extra code to the scheduler core while they already
> automagically preserve their thread_info::preempt_count.

Sure, and it can't even be compiled on !x86.

But this is simple, we just need a new helper, preempt_count_restore(),
defined as a nop in asm-generic/preempt.h. Well, perhaps another helper
makes sense, preempt_count_raw(), which simply reads the counter, but
this is minor.

After the patch below we can remove ->saved_preempt_count, including
init_task_preempt_count(), which is no longer needed after the change in
schedule_tail().

No?

Oleg.


diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 8f32718..695307f 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -27,6 +27,11 @@ static __always_inline void preempt_count_set(int pc)
raw_cpu_write_4(__preempt_count, pc);
}

+static __always_inline void preempt_count_restore(int pc)
+{
+ preempt_count_set(pc);
+}
+
/*
* must be macros to avoid header recursion hell
*/
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index eb6f9e6..14de30e 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -20,6 +20,10 @@ static __always_inline void preempt_count_set(int pc)
*preempt_count_ptr() = pc;
}

+static __always_inline void preempt_count_restore(int pc)
+{
+}
+
/*
* must be macros to avoid header recursion hell
*/
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfe9905..ad8ca02 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2279,6 +2279,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
{
struct rq *rq;

+ preempt_count_set(PREEMPT_DISABLED);
+
/* finish_task_switch() drops rq->lock and enables preemption */
preempt_disable();
rq = this_rq();
@@ -2299,6 +2301,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
+ int pc;

prepare_task_switch(rq, prev, next);

@@ -2333,10 +2336,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
#endif

context_tracking_task_switch(prev, next);
+
+ pc = preempt_count();
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
-
barrier();
+ preempt_count_restore(pc);
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack

2014-10-09 17:28:48

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On Thu, Oct 09, 2014 at 06:57:13PM +0200, Oleg Nesterov wrote:
> > Your earlier proposal would penalize every
> > !x86 arch by adding extra code to the scheduler core while they already
> > automagically preserve their thread_info::preempt_count.
>
> Sure, and it can't even be compiled on !x86.
>
> But this is simple, we just need a new helper, preempt_count_restore(),
> defined as a nop in asm-generic/preempt.h. Well, perhaps another helper
> makes sense, preempt_count_raw(), which simply reads the counter, but
> this is minor.
>
> After the patch below we can remove ->saved_preempt_count, including
> init_task_preempt_count(), which is no longer needed after the change in
> schedule_tail().

Ah, right, this makes more sense.

> @@ -2333,10 +2336,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> #endif
>
> context_tracking_task_switch(prev, next);
> +
> + pc = preempt_count();

The only problem here is that you can lose PREEMPT_NEED_RESCHED; I
haven't thought about whether that is a problem here or not.

> /* Here we just switch the register state and the stack. */
> switch_to(prev, next, prev);
> -
> barrier();
> + preempt_count_restore(pc);
> /*
> * this_rq must be evaluated again because prev may have moved
> * CPUs since it called schedule(), thus the 'rq' on its stack
>

2014-10-09 17:33:05

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On Thu, Oct 09, 2014 at 07:28:34PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 09, 2014 at 06:57:13PM +0200, Oleg Nesterov wrote:
> > > Your earlier proposal would penalize every
> > > !x86 arch by adding extra code to the scheduler core while they already
> > > automagically preserve their thread_info::preempt_count.
> >
> > Sure, and it can't even be compiled on !x86.
> >
> > But this is simple, we just need a new helper, preempt_count_restore(),
> > defined as a nop in asm-generic/preempt.h. Well, perhaps another helper
> > makes sense, preempt_count_raw(), which simply reads the counter, but
> > this is minor.
> >
> > After the patch below we can remove ->saved_preempt_count, including
> > init_task_preempt_count(), which is no longer needed after the change in
> > schedule_tail().
>
> Ah, right, this makes more sense.
>
> > @@ -2333,10 +2336,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> > #endif
> >
> > context_tracking_task_switch(prev, next);
> > +
> > + pc = preempt_count();
>
> The only problem here is that you can lose PREEMPT_NEED_RESCHED; I
> haven't thought about whether that is a problem here or not.

Also, if you make that preempt_count_save(), to mirror the restore, you
can both preserve PREEMPT_NEED_RESCHED and avoid emitting code on !x86.
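
A hypothetical sketch of that preempt_count_save() (not an actual patch): the
x86 variant returns the raw per-cpu value with the NEED_RESCHED bit intact,
while the asm-generic variant can return a dummy value since
thread_info::preempt_count already survives switch_to() and
preempt_count_restore() is a nop there.

/* arch/x86/include/asm/preempt.h */
static __always_inline int preempt_count_save(void)
{
	return raw_cpu_read_4(__preempt_count);
}

/* include/asm-generic/preempt.h */
static __always_inline int preempt_count_save(void)
{
	return 0;	/* nothing to save, the restore side is a nop anyway */
}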

2014-10-09 17:58:51

by Oleg Nesterov

Subject: Re: [PATCH v2 1/2] sched: schedule_tail() should disable preemption

On 10/09, Peter Zijlstra wrote:
>
> On Thu, Oct 09, 2014 at 07:28:34PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 09, 2014 at 06:57:13PM +0200, Oleg Nesterov wrote:
> > > @@ -2333,10 +2336,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> > > #endif
> > >
> > > context_tracking_task_switch(prev, next);
> > > +
> > > + pc = preempt_count();
> >
> > The only problem here is that you can lose PREEMPT_NEED_RESCHED; I
> > haven't thought about whether that is a problem here or not.

No, it must not be set. It was cleared by clear_preempt_need_resched()
in __schedule(), and nobody can set it.

(schedule_tail() also relies on the fact that it runs with irqs disabled).

But,

> Also, if you make that preempt_count_save(), to mirror the restore,

Yes, agreed. That is why I said we probably want preempt_count_raw()
(or _save). This way we avoid the unnecessary "&= ~PREEMPT_NEED_RESCHED"
on x86.

> can both preserve PREEMPT_NEED_RESCHED and avoid emitting code on !x86.

This too, although gcc should optimize this code out anyway. At least
it seems to do so on x86 if I make preempt_count_restore() empty.

Oleg.

2014-10-09 19:35:37

by Oleg Nesterov

Subject: [PATCH 0/1] sched: make finish_task_switch() return struct rq *

speaking of minor cleanups...

Let me send another one before the ->saved_preempt_count removal; it
looks simple and less subjective.

Again, this is just a minor cleanup so please feel free to ignore, but
somehow the duplicated this_rq() (and the comment!) looks a bit
annoying to me. Plus it saves a few insns, but this is minor.

Depends on "sched: schedule_tail() should disable preemption".

Oleg.

2014-10-09 19:36:01

by Oleg Nesterov

Subject: [PATCH 1/1] sched: make finish_task_switch() return struct rq *

Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().

It would be simpler to call this_rq() once in finish_task_switch()
and return the rq to the callers.

Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sched/core.c | 37 +++++++++++++++----------------------
1 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfe9905..3bfbd3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2200,10 +2200,16 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
* so, we finish that here outside of the runqueue lock. (Doing it
* with the lock held can cause deadlocks; see schedule() for
* details.)
+ *
+ * The context switch has flipped the stack from under us and restored the
+ * local variables which were saved when this task called schedule() in the
+ * past. prev == current is still correct but we need to recalculate this_rq
+ * because prev may have moved to another CPU.
*/
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
{
+ struct rq *rq = this_rq();
struct mm_struct *mm = rq->prev_mm;
long prev_state;

@@ -2243,6 +2249,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
}

tick_nohz_task_switch(current);
+ return rq;
}

#ifdef CONFIG_SMP
@@ -2281,8 +2288,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)

/* finish_task_switch() drops rq->lock and enables preemption */
preempt_disable();
- rq = this_rq();
- finish_task_switch(rq, prev);
+ rq = finish_task_switch(prev);
post_schedule(rq);
preempt_enable();

@@ -2291,10 +2297,9 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
}

/*
- * context_switch - switch to the new MM and the new
- * thread's register state.
+ * context_switch - switch to the new MM and the new thread's register state.
*/
-static inline void
+static inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
@@ -2335,14 +2340,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
context_tracking_task_switch(prev, next);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
-
barrier();
- /*
- * this_rq must be evaluated again because prev may have moved
- * CPUs since it called schedule(), thus the 'rq' on its stack
- * frame will be invalid.
- */
- finish_task_switch(this_rq(), prev);
+
+ return finish_task_switch(prev);
}

/*
@@ -2802,15 +2802,8 @@ need_resched:
rq->curr = next;
++*switch_count;

- context_switch(rq, prev, next); /* unlocks the rq */
- /*
- * The context switch have flipped the stack from under us
- * and restored the local variables which were saved when
- * this task called schedule() in the past. prev == current
- * is still correct, but it can be moved to another cpu/rq.
- */
- cpu = smp_processor_id();
- rq = cpu_rq(cpu);
+ rq = context_switch(rq, prev, next); /* unlocks the rq */
+ cpu = rq->cpu;
} else
raw_spin_unlock_irq(&rq->lock);

--
1.5.5.1

2014-10-10 09:54:59

by Peter Zijlstra

Subject: Re: [PATCH 1/1] sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

On Tue, Oct 07, 2014 at 09:51:08PM +0200, Oleg Nesterov wrote:
> task_preempt_count() has nothing to do with the actual preempt counter,
> thread_info->saved_preempt_count is only valid right after switch_to().
>
> __trace_sched_switch_state() can use preempt_count(), prev is still the
> current task when trace_sched_switch() is called.

This is true, but the paranoid in me would like a BUG_ON(p != current)
right along with that to make sure we don't accidentally change this.

But yes, it's good to get rid of task_preempt_count(), it's horrible.

> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> include/trace/events/sched.h | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 0a68d5a..a7d67bc 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -100,7 +100,7 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
> /*
> * For all intents and purposes a preempted task is a running task.
> */
> - if (task_preempt_count(p) & PREEMPT_ACTIVE)
> + if (preempt_count() & PREEMPT_ACTIVE)
> state = TASK_RUNNING | TASK_STATE_MAX;
> #endif
>
> --
> 1.5.5.1
>
>

2014-10-10 09:56:39

by Peter Zijlstra

Subject: Re: [PATCH 1/1] sched: fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

On Fri, Oct 10, 2014 at 11:54:52AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 07, 2014 at 09:51:08PM +0200, Oleg Nesterov wrote:
> > task_preempt_count() has nothing to do with the actual preempt counter,
> > thread_info->saved_preempt_count is only valid right after switch_to().
> >
> > __trace_sched_switch_state() can use preempt_count(), prev is still the
> > current task when trace_sched_switch() is called.
>
> This is true, but the paranoid in me would like a BUG_ON(p != current)
> right along with that to make sure we don't accidentally change this.
>
> But yes, it's good to get rid of task_preempt_count(), it's horrible.

Like so then?

---
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -97,16 +97,19 @@ static inline long __trace_sched_switch_
long state = p->state;

#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_SCHED_DEBUG
+ BUG_ON(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
/*
* For all intents and purposes a preempted task is a running task.
*/
- if (task_preempt_count(p) & PREEMPT_ACTIVE)
+ if (preempt_count() & PREEMPT_ACTIVE)
state = TASK_RUNNING | TASK_STATE_MAX;
-#endif
+#endif /* CONFIG_PREEMPT */

return state;
}
-#endif
+#endif /* CREATE_TRACE_POINTS */

/*
* Tracepoint for task switches, performed by the scheduler:

2014-10-10 10:06:48

by Peter Zijlstra

Subject: Re: [PATCH 1/1] sched: make finish_task_switch() return struct rq *

On Thu, Oct 09, 2014 at 09:32:32PM +0200, Oleg Nesterov wrote:
> @@ -2802,15 +2802,8 @@ need_resched:
> rq->curr = next;
> ++*switch_count;
>
> - context_switch(rq, prev, next); /* unlocks the rq */
> - /*
> - * The context switch have flipped the stack from under us
> - * and restored the local variables which were saved when
> - * this task called schedule() in the past. prev == current
> - * is still correct, but it can be moved to another cpu/rq.
> - */
> - cpu = smp_processor_id();
> - rq = cpu_rq(cpu);
> + rq = context_switch(rq, prev, next); /* unlocks the rq */
> + cpu = rq->cpu;

This won't compile on UP, cpu_of(rq) works though.

2014-10-10 17:10:11

by Oleg Nesterov

Subject: Re: [PATCH 1/1] sched: make finish_task_switch() return struct rq *

On 10/10, Peter Zijlstra wrote:
>
> On Thu, Oct 09, 2014 at 09:32:32PM +0200, Oleg Nesterov wrote:
> > @@ -2802,15 +2802,8 @@ need_resched:
> > rq->curr = next;
> > ++*switch_count;
> >
> > - context_switch(rq, prev, next); /* unlocks the rq */
> > - /*
> > - * The context switch have flipped the stack from under us
> > - * and restored the local variables which were saved when
> > - * this task called schedule() in the past. prev == current
> > - * is still correct, but it can be moved to another cpu/rq.
> > - */
> > - cpu = smp_processor_id();
> > - rq = cpu_rq(cpu);
> > + rq = context_switch(rq, prev, next); /* unlocks the rq */
> > + cpu = rq->cpu;
>
> This won't compile on UP, cpu_of(rq) works though.

Heh ;) and I swear I was going to write cpu_of().

and I agree with BUG_ON() in __trace_sched_switch() if CONFIG_SCHED_DEBUG.

I am travelling till next Wednesday, will resend everything once I return.

Thanks a lot!

Oleg.

2014-10-10 20:32:28

by Peter Zijlstra

Subject: Re: [PATCH 1/1] sched: make finish_task_switch() return struct rq *

On Fri, Oct 10, 2014 at 07:06:42PM +0200, Oleg Nesterov wrote:
> On 10/10, Peter Zijlstra wrote:
> >
> > On Thu, Oct 09, 2014 at 09:32:32PM +0200, Oleg Nesterov wrote:
> > > @@ -2802,15 +2802,8 @@ need_resched:
> > > rq->curr = next;
> > > ++*switch_count;
> > >
> > > - context_switch(rq, prev, next); /* unlocks the rq */
> > > - /*
> > > - * The context switch have flipped the stack from under us
> > > - * and restored the local variables which were saved when
> > > - * this task called schedule() in the past. prev == current
> > > - * is still correct, but it can be moved to another cpu/rq.
> > > - */
> > > - cpu = smp_processor_id();
> > > - rq = cpu_rq(cpu);
> > > + rq = context_switch(rq, prev, next); /* unlocks the rq */
> > > + cpu = rq->cpu;
> >
> > This won't compile on UP, cpu_of(rq) works though.
>
> Heh ;) and I swear I was going to write cpu_of().
>
> and I agree with BUG_ON() in __trace_sched_switch() if CONFIG_SCHED_DEBUG.
>
> I am travelling till next Wednesday, will resend everything once I return.

Not towards DUS by any chance? I fixed up the patches, they should be in
here:
https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=sched/core

If I missed some, please do resend (or ping me on) those.

2014-10-14 18:01:18

by Oleg Nesterov

Subject: Re: [PATCH 1/1] sched: make finish_task_switch() return struct rq *

On 10/10, Peter Zijlstra wrote:
>
> On Fri, Oct 10, 2014 at 07:06:42PM +0200, Oleg Nesterov wrote:
> >
> > I am travelling till next Wednesday, will resend everything once I return.
>
> Not towards DUS by any chance?

No, Moscow ;)

> I fixed up the patches, they should be in
> here:
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=sched/core

Thanks! In particular thanks for taking the preempt_schedule_context() fix.

Oleg.

Subject: [tip:sched/core] sched: Fix schedule_tail() to disable preemption

Commit-ID: 1a43a14a5bd9c32dbd7af35e35a5afa703944bcb
Gitweb: http://git.kernel.org/tip/1a43a14a5bd9c32dbd7af35e35a5afa703944bcb
Author: Oleg Nesterov <[email protected]>
AuthorDate: Wed, 8 Oct 2014 21:36:44 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 Oct 2014 10:47:54 +0100

sched: Fix schedule_tail() to disable preemption

finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cde8481..b493560 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2309,15 +2309,14 @@ static inline void post_schedule(struct rq *rq)
asmlinkage __visible void schedule_tail(struct task_struct *prev)
__releases(rq->lock)
{
- struct rq *rq = this_rq();
+ struct rq *rq;

+ /* finish_task_switch() drops rq->lock and enables preemption */
+ preempt_disable();
+ rq = this_rq();
finish_task_switch(rq, prev);
-
- /*
- * FIXME: do we need to worry about rq being invalidated by the
- * task_switch?
- */
post_schedule(rq);
+ preempt_enable();

if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);

Subject: [tip:sched/core] sched: Fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

Commit-ID: 8f9fbf092cd0ae31722b42c9abb427a87d55c18a
Gitweb: http://git.kernel.org/tip/8f9fbf092cd0ae31722b42c9abb427a87d55c18a
Author: Oleg Nesterov <[email protected]>
AuthorDate: Tue, 7 Oct 2014 21:51:08 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 Oct 2014 10:47:53 +0100

sched: Fix the PREEMPT_ACTIVE check in __trace_sched_switch_state()

task_preempt_count() has nothing to do with the actual preempt counter,
thread_info->saved_preempt_count is only valid right after switch_to().

__trace_sched_switch_state() can use preempt_count(), prev is still the
current task when trace_sched_switch() is called.

Signed-off-by: Oleg Nesterov <[email protected]>
[ Added BUG_ON(). ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/trace/events/sched.h | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..30fedaf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -97,16 +97,19 @@ static inline long __trace_sched_switch_state(struct task_struct *p)
long state = p->state;

#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_SCHED_DEBUG
+ BUG_ON(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
/*
* For all intents and purposes a preempted task is a running task.
*/
- if (task_preempt_count(p) & PREEMPT_ACTIVE)
+ if (preempt_count() & PREEMPT_ACTIVE)
state = TASK_RUNNING | TASK_STATE_MAX;
-#endif
+#endif /* CONFIG_PREEMPT */

return state;
}
-#endif
+#endif /* CREATE_TRACE_POINTS */

/*
* Tracepoint for task switches, performed by the scheduler:

Subject: [tip:sched/core] sched: Make finish_task_switch() return 'struct rq *'

Commit-ID: dfa50b605c2a933b7bb1c1d575a0da4e897e3c7d
Gitweb: http://git.kernel.org/tip/dfa50b605c2a933b7bb1c1d575a0da4e897e3c7d
Author: Oleg Nesterov <[email protected]>
AuthorDate: Thu, 9 Oct 2014 21:32:32 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 Oct 2014 10:47:55 +0100

sched: Make finish_task_switch() return 'struct rq *'

Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().

It would be simpler to call this_rq() once in finish_task_switch()
and return the rq to the callers.

Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 38 +++++++++++++++-----------------------
1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b493560..1b69603 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2220,7 +2220,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,

/**
* finish_task_switch - clean up after a task-switch
- * @rq: runqueue associated with task-switch
* @prev: the thread we just switched away from.
*
* finish_task_switch must be called after the context switch, paired
@@ -2232,10 +2231,16 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
* so, we finish that here outside of the runqueue lock. (Doing it
* with the lock held can cause deadlocks; see schedule() for
* details.)
+ *
+ * The context switch has flipped the stack from under us and restored the
+ * local variables which were saved when this task called schedule() in the
+ * past. prev == current is still correct but we need to recalculate this_rq
+ * because prev may have moved to another CPU.
*/
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
{
+ struct rq *rq = this_rq();
struct mm_struct *mm = rq->prev_mm;
long prev_state;

@@ -2275,6 +2280,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
}

tick_nohz_task_switch(current);
+ return rq;
}

#ifdef CONFIG_SMP
@@ -2313,8 +2319,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)

/* finish_task_switch() drops rq->lock and enables preemption */
preempt_disable();
- rq = this_rq();
- finish_task_switch(rq, prev);
+ rq = finish_task_switch(prev);
post_schedule(rq);
preempt_enable();

@@ -2323,10 +2328,9 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
}

/*
- * context_switch - switch to the new MM and the new
- * thread's register state.
+ * context_switch - switch to the new MM and the new thread's register state.
*/
-static inline void
+static inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
@@ -2365,14 +2369,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
context_tracking_task_switch(prev, next);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
-
barrier();
- /*
- * this_rq must be evaluated again because prev may have moved
- * CPUs since it called schedule(), thus the 'rq' on its stack
- * frame will be invalid.
- */
- finish_task_switch(this_rq(), prev);
+
+ return finish_task_switch(prev);
}

/*
@@ -2854,15 +2853,8 @@ need_resched:
rq->curr = next;
++*switch_count;

- context_switch(rq, prev, next); /* unlocks the rq */
- /*
- * The context switch have flipped the stack from under us
- * and restored the local variables which were saved when
- * this task called schedule() in the past. prev == current
- * is still correct, but it can be moved to another cpu/rq.
- */
- cpu = smp_processor_id();
- rq = cpu_rq(cpu);
+ rq = context_switch(rq, prev, next); /* unlocks the rq */
+ cpu = cpu_of(rq);
} else
raw_spin_unlock_irq(&rq->lock);

Subject: [tip:sched/core] sched: Kill task_preempt_count()

Commit-ID: e2336f6e51edda875a49770b616ed5b02a74665b
Gitweb: http://git.kernel.org/tip/e2336f6e51edda875a49770b616ed5b02a74665b
Author: Oleg Nesterov <[email protected]>
AuthorDate: Wed, 8 Oct 2014 20:33:48 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 Oct 2014 10:47:56 +0100

sched: Kill task_preempt_count()

task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.

Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/preempt.h | 3 ---
include/asm-generic/preempt.h | 3 ---
kernel/sched/core.c | 2 +-
3 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 4008734..8f327184 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -30,9 +30,6 @@ static __always_inline void preempt_count_set(int pc)
/*
* must be macros to avoid header recursion hell
*/
-#define task_preempt_count(p) \
- (task_thread_info(p)->saved_preempt_count & ~PREEMPT_NEED_RESCHED)
-
#define init_task_preempt_count(p) do { \
task_thread_info(p)->saved_preempt_count = PREEMPT_DISABLED; \
} while (0)
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 1cd3f5d..eb6f9e6 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -23,9 +23,6 @@ static __always_inline void preempt_count_set(int pc)
/*
* must be macros to avoid header recursion hell
*/
-#define task_preempt_count(p) \
- (task_thread_info(p)->preempt_count & ~PREEMPT_NEED_RESCHED)
-
#define init_task_preempt_count(p) do { \
task_thread_info(p)->preempt_count = PREEMPT_DISABLED; \
} while (0)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1b69603..5c067fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1054,7 +1054,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* ttwu() will sort out the placement.
*/
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
- !(task_preempt_count(p) & PREEMPT_ACTIVE));
+ !p->on_rq);

#ifdef CONFIG_LOCKDEP
/*