2014-02-13 15:51:40

by Kirill Tkhai

Subject: [PATCH] sched/core: Create new task with twice disabled preemption

The preemption state on entry to finish_task_switch() differs
between the context_switch() and schedule_tail() cases.

In the first case preemption is disabled twice: at the start of
schedule() and by the rq spin lock. In the second it is disabled only
once: the value set by init_task_preempt_count().

For archs without __ARCH_WANT_UNLOCKED_CTXSW set, this means
that all newly created tasks execute finish_arch_post_lock_switch()
and post_schedule() with preemption enabled.

It seems a problem is possible in rare situations on arm64,
when one freshly created thread preempts another before
finish_arch_post_lock_switch() has finished. If the mm is the same,
then TIF_SWITCH_MM won't be set on the second thread.

The second rare but possible issue is the zeroing of post_schedule()
on the wrong CPU.

So, let's fix this and unify the preempt_count state.

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/preempt.h | 6 ++++--
include/asm-generic/preempt.h | 6 ++++--
include/linux/sched.h | 2 ++
kernel/sched/core.c | 6 ++----
4 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index c8b0519..07fdf52 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -32,9 +32,11 @@ static __always_inline void preempt_count_set(int pc)
*/
#define task_preempt_count(p) \
(task_thread_info(p)->saved_preempt_count & ~PREEMPT_NEED_RESCHED)
-
+/*
+ * Disable it twice to enter schedule_tail() with preemption disabled.
+ */
#define init_task_preempt_count(p) do { \
- task_thread_info(p)->saved_preempt_count = PREEMPT_DISABLED; \
+ task_thread_info(p)->saved_preempt_count = PREEMPT_DISABLED_TWICE; \
} while (0)

#define init_idle_preempt_count(p, cpu) do { \
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 1cd3f5d..0f67846 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -25,9 +25,11 @@ static __always_inline void preempt_count_set(int pc)
*/
#define task_preempt_count(p) \
(task_thread_info(p)->preempt_count & ~PREEMPT_NEED_RESCHED)
-
+/*
+ * Disable it twice to enter schedule_tail() with preemption disabled.
+ */
#define init_task_preempt_count(p) do { \
- task_thread_info(p)->preempt_count = PREEMPT_DISABLED; \
+ task_thread_info(p)->preempt_count = PREEMPT_DISABLED_TWICE; \
} while (0)

#define init_idle_preempt_count(p, cpu) do { \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c49a258..f6a6c1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,8 +520,10 @@ struct task_cputime {

#ifdef CONFIG_PREEMPT_COUNT
#define PREEMPT_DISABLED (1 + PREEMPT_ENABLED)
+#define PREEMPT_DISABLED_TWICE (2 + PREEMPT_ENABLED)
#else
#define PREEMPT_DISABLED PREEMPT_ENABLED
+#define PREEMPT_DISABLED_TWICE PREEMPT_ENABLED
#endif

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764f..18aa7f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2203,12 +2203,10 @@ asmlinkage void schedule_tail(struct task_struct *prev)

finish_task_switch(rq, prev);

- /*
- * FIXME: do we need to worry about rq being invalidated by the
- * task_switch?
- */
post_schedule(rq);

+ preempt_enable();
+
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
/* In this case, finish_task_switch does not reenable preemption */
preempt_enable();
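
For reference, a simplified sketch of the two entry paths the changelog
compares and the preemption state before this patch (call chain as in the
3.13-era scheduler, on an arch without __ARCH_WANT_UNLOCKED_CTXSW):

/*
 * Path 1: ordinary context switch
 *   schedule() -> __schedule()
 *     preempt_disable()              preemption disabled once
 *     raw_spin_lock_irq(&rq->lock)   ...and a second time
 *     context_switch()
 *       finish_task_switch()         the rq unlock drops one level, so
 *                                    preemption stays disabled for the
 *                                    rest of the function
 *
 * Path 2: first schedule of a new task
 *   ret_from_fork
 *     schedule_tail()
 *       finish_task_switch()         only the single level set by
 *                                    init_task_preempt_count(); the rq
 *                                    unlock drops it to zero, so
 *                                    finish_arch_post_lock_switch() and
 *                                    post_schedule() run preemptible
 */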


2014-02-13 16:00:19

by Peter Zijlstra

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> that all newly created tasks execute finish_arch_post_lock_switch()
> and post_schedule() with preemption enabled.

That's IA64 and MIPS; do they have a 'good' reason to use this?

That is, the alternative is to fix those two archs and remove the
__ARCH_WANT_UNLOCKED_CTXSW clutter altogether, which seems like a big
win to me.

2014-02-13 17:45:48

by Kirill Tkhai

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On 13.02.2014 20:00, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
>> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
>> that all newly created tasks execute finish_arch_post_lock_switch()
>> and post_schedule() with preemption enabled.
>
> That's IA64 and MIPS; do they have a 'good' reason to use this?

It seems my description was misleading; I'm sorry if so.

I mean all architectures *except* IA64 and MIPS: all of those
which do not have __ARCH_WANT_UNLOCKED_CTXSW defined.

IA64 and MIPS already have preempt_enable() in schedule_tail():

#ifdef __ARCH_WANT_UNLOCKED_CTXSW
/* In this case, finish_task_switch does not reenable preemption */
preempt_enable();
#endif

Their initial preempt count is not decremented in finish_lock_switch().

So, we are speaking about x86, ARM64, etc.

Look at ARM64's finish_arch_post_lock_switch(). It looks like a task
must not be preempted between switch_mm() and this function.
But in the case of a new task this is possible.

Example:
RT thread p0 and RT thread p1 share the same mm. The system has 2 CPUs.

p0 is bound to CPU0.
p1 is bound to CPU1.

p1 has set a timer and is sleeping.

p0 creates fair thread f. Task f wakes on CPU1.

When f is between raw_spin_unlock_irq() and
finish_arch_post_lock_switch(), preemption is enabled.
At this moment process p1 is waking on CPU1.

For p1 the check

if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)) || prev != next)

in switch_mm() does not pass, because the mm is the same. So, later
we do not do cpu_switch_mm() in finish_arch_post_lock_switch(),
and we just go to userspace.

This is the problem I tried to solve. I don't know arm64, and I can't
say how serious it is.

But it looks like the place is buggy.
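
A rough timeline of the scenario (a sketch only; TIF_SWITCH_MM and
cpu_switch_mm() as in the arm64 code discussed in this thread):

/*
 * CPU1, before this patch:
 *
 *   switch to f:  switch_mm() runs with IRQs off, so the real
 *                 cpu_switch_mm() is deferred and TIF_SWITCH_MM is set on f
 *   f:            finish_lock_switch() releases rq->lock, the preempt
 *                 count drops to zero, preemption is enabled
 *
 *         <-- p1's timer fires and p1 preempts f right here -->
 *
 *   switch to p1: switch_mm() sees the same mm (the cpu is already in
 *                 mm_cpumask(), prev == next), so nothing is deferred
 *                 for p1
 *   p1:           finish_arch_post_lock_switch() finds TIF_SWITCH_MM
 *                 clear on p1, so cpu_switch_mm() never runs
 *   p1:           returns to userspace still on the old translation table
 */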

Kirill

> That is; the alternative is to fix those two archs and remove the
> __ARCH_WANT_UNLOCKED_CTXSW clutter alltogether; which seems like a big
> win to me.

2014-02-14 10:53:04

by Catalin Marinas

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> On 13.02.2014 20:00, Peter Zijlstra wrote:
> > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> >> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> >> that all newly created tasks execute finish_arch_post_lock_switch()
> >> and post_schedule() with preemption enabled.
> >
> > That's IA64 and MIPS; do they have a 'good' reason to use this?
>
> It seems my description misleads reader, I'm sorry if so.
>
> I mean all architectures *except* IA64 and MIPS. All, which
> has no __ARCH_WANT_UNLOCKED_CTXSW defined.
>
> IA64 and MIPS already have preempt_enable() in schedule_tail():
>
> #ifdef __ARCH_WANT_UNLOCKED_CTXSW
> /* In this case, finish_task_switch does not reenable preemption */
> preempt_enable();
> #endif
>
> Their initial preemption is not decremented in finish_lock_switch().
>
> So, we speak about x86, ARM64 etc.
>
> Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> must to not be preempted between switch_mm() and this function.
> But in case of new task this is possible.

We had a thread about this at the end of last year:

https://lkml.org/lkml/2013/11/15/82

There is indeed a problem on arm64, something like this (and I think
s390 also needs a fix):

1. switch_mm() via check_and_switch_context() defers the actual mm
switch by setting TIF_SWITCH_MM
2. the context switch is considered 'done' by the kernel before
finish_arch_post_lock_switch() and therefore we can be preempted to a
new thread before finish_arch_post_lock_switch()
3. The new thread has the same mm as the preempted thread but we
actually missed the mm switch in finish_arch_post_lock_switch()
because TIF_SWITCH_MM is per thread rather than per mm (see the sketch below)
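
Schematically, the per-thread deferral that step 3 refers to (a simplified
sketch of the mechanism only, not the literal arm64 source; ASID handling
omitted):

/* switch_mm() ends up here, possibly with IRQs off: */
static inline void check_and_switch_context(struct mm_struct *mm,
					    struct task_struct *tsk)
{
	if (irqs_disabled())
		/*
		 * Defer the real switch until the hook below runs with
		 * IRQs enabled.  The flag lives on the task, not the mm.
		 */
		set_ti_thread_flag(task_thread_info(tsk), TIF_SWITCH_MM);
	else
		cpu_switch_mm(mm->pgd, mm);
}

/* ...and the hook only looks at the *current* thread's flag: */
static inline void finish_arch_post_lock_switch(void)
{
	if (test_and_clear_thread_flag(TIF_SWITCH_MM)) {
		struct mm_struct *mm = current->mm;

		cpu_switch_mm(mm->pgd, mm);
	}
}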

> This is the problem I tried to solve. I don't know arm64, and I can't
> say how it is serious.

Have you managed to reproduce this? I don't say it doesn't exist, but I
want to make sure that any patch actually fixes it.

So we have several solutions, one of the first two being suitable for stable:

1. Propagate the TIF_SWITCH_MM to the next thread (suggested by Martin)
2. Get rid of TIF_SWITCH_MM and use mm_cpumask for tracking (I already
have the patch, it just needs a lot more testing)
3. Re-write the ASID allocation algorithm to no longer require IPIs and
therefore drop finish_arch_post_lock_switch() (this can be done, though it's
pretty intrusive for stable)
4. Replace finish_arch_post_lock_switch() with finish_mm_switch() as per
Martin's patch. I think this would guarantee the call is always made; we can
move the mm switching from switch_mm() to finish_mm_switch() with no
need for flags to mark deferred mm switching

For arm64, we'll most likely go with 2 for stable and move to 3 shortly
after, no need for other deferred mm switching.

--
Catalin

2014-02-14 11:15:54

by Kirill Tkhai

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, 14/02/2014 at 10:52 +0000, Catalin Marinas wrote:
> On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > On 13.02.2014 20:00, Peter Zijlstra wrote:
> > > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > >> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > >> that all newly created tasks execute finish_arch_post_lock_switch()
> > >> and post_schedule() with preemption enabled.
> > >
> > > That's IA64 and MIPS; do they have a 'good' reason to use this?
> >
> > It seems my description misleads reader, I'm sorry if so.
> >
> > I mean all architectures *except* IA64 and MIPS. All, which
> > has no __ARCH_WANT_UNLOCKED_CTXSW defined.
> >
> > IA64 and MIPS already have preempt_enable() in schedule_tail():
> >
> > #ifdef __ARCH_WANT_UNLOCKED_CTXSW
> > /* In this case, finish_task_switch does not reenable preemption */
> > preempt_enable();
> > #endif
> >
> > Their initial preemption is not decremented in finish_lock_switch().
> >
> > So, we speak about x86, ARM64 etc.
> >
> > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > must to not be preempted between switch_mm() and this function.
> > But in case of new task this is possible.
>
> We had a thread about this at the end of last year:
>
> https://lkml.org/lkml/2013/11/15/82
>
> There is indeed a problem on arm64, something like this (and I think
> s390 also needs a fix):
>
> 1. switch_mm() via check_and_switch_context() defers the actual mm
> switch by setting TIF_SWITCH_MM
> 2. the context switch is considered 'done' by the kernel before
> finish_arch_post_lock_switch() and therefore we can be preempted to a
> new thread before finish_arch_post_lock_switch()
> 3. The new thread has the same mm as the preempted thread but we
> actually missed the mm switching in finish_arch_post_lock_switch()
> because TIF_SWITCH_MM is per thread rather than mm
>
> > This is the problem I tried to solve. I don't know arm64, and I can't
> > say how it is serious.
>
> Have you managed to reproduce this? I don't say it doesn't exist, but I
> want to make sure that any patch actually fixes it.

No, I have not tried. I found this place while analysing the scheduler code.
But it seems that with the RT technique suggested in the previous message
it's quite possible.

> So we have more solutions, one of the first two suitable for stable:
>
> 1. Propagate the TIF_SWITCH_MM to the next thread (suggested by Martin)
> 2. Get rid of TIF_SWITCH_MM and use mm_cpumask for tracking (I already
> have the patch, it just needs a lot more testing)
> 3. Re-write the ASID allocation algorithm to no longer require IPIs and
> therefore drop finish_arch_post_lock_switch() (this can be done, so
> pretty intrusive for stable)
> 4. Replace finish_arch_post_lock_switch() with finish_mm_switch() as per
> Martin's patch and I think this would guarantee a call always, we can
> move the mm switching from switch_mm() to finish_mm_switch() and no
> need for flags to mark deferred mm switching
>
> For arm64, we'll most likely go with 2 for stable and move to 3 shortly
> after, no need for other deferred mm switching.
>

It's good, but a single architecture is a corner case. It seems to me
it's better to have the same environment/preemption state for the first
schedule as for the others.

This minimises the number of rare errors, and it may help in the future.
The first schedule happens rarely and is tested less well.

Kirill

2014-02-14 12:22:01

by Catalin Marinas

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, Feb 14, 2014 at 11:16:09AM +0000, Kirill Tkhai wrote:
> On Fri, 14/02/2014 at 10:52 +0000, Catalin Marinas wrote:
> > On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > > must to not be preempted between switch_mm() and this function.
> > > But in case of new task this is possible.
> >
> > We had a thread about this at the end of last year:
> >
> > https://lkml.org/lkml/2013/11/15/82
> >
> > There is indeed a problem on arm64, something like this (and I think
> > s390 also needs a fix):
> >
> > 1. switch_mm() via check_and_switch_context() defers the actual mm
> > switch by setting TIF_SWITCH_MM
> > 2. the context switch is considered 'done' by the kernel before
> > finish_arch_post_lock_switch() and therefore we can be preempted to a
> > new thread before finish_arch_post_lock_switch()
> > 3. The new thread has the same mm as the preempted thread but we
> > actually missed the mm switching in finish_arch_post_lock_switch()
> > because TIF_SWITCH_MM is per thread rather than mm
> >
> > > This is the problem I tried to solve. I don't know arm64, and I can't
> > > say how it is serious.
> >
> > Have you managed to reproduce this? I don't say it doesn't exist, but I
> > want to make sure that any patch actually fixes it.
>
> No, I have not tried. I found this place while analysing scheduler code.
> But it seems with the RT technics suggested previous message it's quite
> possible.

Now I think I confused myself. Looking through the __schedule() code,
context_switch() and therefore finish_arch_post_lock_switch() are called
with preemption disabled. So the scenario above cannot exist since the
current thread cannot be preempted between switch_mm() and
finish_arch_post_lock_switch(). Am I missing anything?

Now I get your point about schedule_tail() which calls
finish_task_switch() with a preempt count of 0. I'll get back to your
original patch.

Thanks.

--
Catalin

2014-02-14 12:33:12

by Kirill Tkhai

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, 14/02/2014 at 12:21 +0000, Catalin Marinas wrote:
> On Fri, Feb 14, 2014 at 11:16:09AM +0000, Kirill Tkhai wrote:
> > On Fri, 14/02/2014 at 10:52 +0000, Catalin Marinas wrote:
> > > On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > > > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > > > must to not be preempted between switch_mm() and this function.
> > > > But in case of new task this is possible.
> > >
> > > We had a thread about this at the end of last year:
> > >
> > > https://lkml.org/lkml/2013/11/15/82
> > >
> > > There is indeed a problem on arm64, something like this (and I think
> > > s390 also needs a fix):
> > >
> > > 1. switch_mm() via check_and_switch_context() defers the actual mm
> > > switch by setting TIF_SWITCH_MM
> > > 2. the context switch is considered 'done' by the kernel before
> > > finish_arch_post_lock_switch() and therefore we can be preempted to a
> > > new thread before finish_arch_post_lock_switch()
> > > 3. The new thread has the same mm as the preempted thread but we
> > > actually missed the mm switching in finish_arch_post_lock_switch()
> > > because TIF_SWITCH_MM is per thread rather than mm
> > >
> > > > This is the problem I tried to solve. I don't know arm64, and I can't
> > > > say how it is serious.
> > >
> > > Have you managed to reproduce this? I don't say it doesn't exist, but I
> > > want to make sure that any patch actually fixes it.
> >
> > No, I have not tried. I found this place while analysing scheduler code.
> > But it seems with the RT technics suggested previous message it's quite
> > possible.
>
> Now I think I confused myself. Looking through the __schedule() code,
> context_switch() and therefore finish_arch_post_lock_switch() are called
> with preemption disabled. So the scenario above cannot exist since the
> current thread cannot be preempted between switch_mm() and
> finish_arch_post_lock_switch(). Do I miss anything?

Everything is right; the only case we have to worry about is
schedule_tail().

> Now I get your point about schedule_tail() which calls
> finish_task_switch() with a preempt count of 0. I'll get back to your
> original patch.
>
> Thanks.
>

Kirill

2014-02-14 12:35:43

by Catalin Marinas

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> Preemption state on enter in finish_task_switch() is different
> in cases of context_switch() and schedule_tail().
>
> In the first case we have it twice disabled: at the start of
> schedule() and during spin locking. In the second it is only
> once: the value which was set in init_task_preempt_count().
>
> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> that all newly created tasks execute finish_arch_post_lock_switch()
> and post_schedule() with preemption enabled.
>
> It seems there is possible a problem in rare situations on arm64,
> when one freshly created thread preempts another before
> finish_arch_post_lock_switch() has finished. If mm is the same,
> then TIF_SWITCH_MM on the second won't be set.
>
> The second rare but possible issue is zeroing of post_schedule()
> on a wrong cpu.
>
> So, lets fix this and unify preempt_count state.

An alternative to your patch:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131ef6aab..b932b6a4c716 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2210,6 +2210,10 @@ asmlinkage void schedule_tail(struct task_struct *prev)
{
struct rq *rq = this_rq();

+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
+ /* In this case, finish_task_switch reenables preemption */
+ preempt_disable();
+#endif
finish_task_switch(rq, prev);

/*
@@ -2218,10 +2222,7 @@ asmlinkage void schedule_tail(struct task_struct *prev)
*/
post_schedule(rq);

-#ifdef __ARCH_WANT_UNLOCKED_CTXSW
- /* In this case, finish_task_switch does not reenable preemption */
preempt_enable();
-#endif
if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);
}
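
With this alternative, schedule_tail() enters finish_task_switch() the same
way the context_switch() path does; roughly (a sketch of the intended counts,
assuming CONFIG_PREEMPT_COUNT):

/*
 * schedule_tail(), !__ARCH_WANT_UNLOCKED_CTXSW:
 *   init_task_preempt_count()     one level
 *   preempt_disable()             two levels
 *   finish_task_switch()
 *     rq unlock                   back to one level, so
 *                                 finish_arch_post_lock_switch() and
 *                                 post_schedule() run non-preemptible
 *   post_schedule()
 *   preempt_enable()              zero, preemption enabled again
 */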

--
Catalin

2014-02-14 12:43:51

by Kirill Tkhai

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, 14/02/2014 at 12:35 +0000, Catalin Marinas wrote:
> On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > Preemption state on enter in finish_task_switch() is different
> > in cases of context_switch() and schedule_tail().
> >
> > In the first case we have it twice disabled: at the start of
> > schedule() and during spin locking. In the second it is only
> > once: the value which was set in init_task_preempt_count().
> >
> > For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > that all newly created tasks execute finish_arch_post_lock_switch()
> > and post_schedule() with preemption enabled.
> >
> > It seems there is possible a problem in rare situations on arm64,
> > when one freshly created thread preempts another before
> > finish_arch_post_lock_switch() has finished. If mm is the same,
> > then TIF_SWITCH_MM on the second won't be set.
> >
> > The second rare but possible issue is zeroing of post_schedule()
> > on a wrong cpu.
> >
> > So, lets fix this and unify preempt_count state.
>
> An alternative to your patch:

It looks better than the initial one.

You may add my Acked-by if you want.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b46131ef6aab..b932b6a4c716 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2210,6 +2210,10 @@ asmlinkage void schedule_tail(struct task_struct *prev)
> {
> struct rq *rq = this_rq();
>
> +#ifndef __ARCH_WANT_UNLOCKED_CTXSW
> + /* In this case, finish_task_switch reenables preemption */
> + preempt_disable();
> +#endif
> finish_task_switch(rq, prev);
>
> /*
> @@ -2218,10 +2222,7 @@ asmlinkage void schedule_tail(struct task_struct *prev)
> */
> post_schedule(rq);
>
> -#ifdef __ARCH_WANT_UNLOCKED_CTXSW
> - /* In this case, finish_task_switch does not reenable preemption */
> preempt_enable();
> -#endif
> if (current->set_child_tid)
> put_user(task_pid_vnr(current), current->set_child_tid);
> }
>

2014-02-14 15:49:17

by Catalin Marinas

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, Feb 14, 2014 at 12:44:01PM +0000, Kirill Tkhai wrote:
> On Fri, 14/02/2014 at 12:35 +0000, Catalin Marinas wrote:
> > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > > Preemption state on enter in finish_task_switch() is different
> > > in cases of context_switch() and schedule_tail().
> > >
> > > In the first case we have it twice disabled: at the start of
> > > schedule() and during spin locking. In the second it is only
> > > once: the value which was set in init_task_preempt_count().
> > >
> > > For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > > that all newly created tasks execute finish_arch_post_lock_switch()
> > > and post_schedule() with preemption enabled.
> > >
> > > It seems there is possible a problem in rare situations on arm64,
> > > when one freshly created thread preempts another before
> > > finish_arch_post_lock_switch() has finished. If mm is the same,
> > > then TIF_SWITCH_MM on the second won't be set.
> > >
> > > The second rare but possible issue is zeroing of post_schedule()
> > > on a wrong cpu.
> > >
> > > So, lets fix this and unify preempt_count state.
> >
> > An alternative to your patch:
>
> It looks better, than the initial.
>
> You may add my Acked-by if you want.

Thanks for the ack. But apart from arm64, are there any other problems
with running part of finish_task_switch() and post_schedule() with
preemption enabled?

The finish_arch_post_lock_switch() is currently only used by arm and
arm64 (the former UP only) and arm no longer has the preemption problem
(see commit bdae73cd374e2). So I can either disable preemption
around the schedule_tail() call on arm64 or do it globally as per your
patch or mine.

Peter, Ingo, any thoughts? Do we care about preempt count consistency
across finish_task_switch() and post_schedule()?

Thanks.

--
Catalin

2014-02-17 09:37:48

by Martin Schwidefsky

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, 14 Feb 2014 10:52:55 +0000
Catalin Marinas <[email protected]> wrote:

> On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > On 13.02.2014 20:00, Peter Zijlstra wrote:
> > > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > >> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > >> that all newly created tasks execute finish_arch_post_lock_switch()
> > >> and post_schedule() with preemption enabled.
> > >
> > > That's IA64 and MIPS; do they have a 'good' reason to use this?
> >
> > It seems my description misleads reader, I'm sorry if so.
> >
> > I mean all architectures *except* IA64 and MIPS. All, which
> > has no __ARCH_WANT_UNLOCKED_CTXSW defined.
> >
> > IA64 and MIPS already have preempt_enable() in schedule_tail():
> >
> > #ifdef __ARCH_WANT_UNLOCKED_CTXSW
> > /* In this case, finish_task_switch does not reenable preemption */
> > preempt_enable();
> > #endif
> >
> > Their initial preemption is not decremented in finish_lock_switch().
> >
> > So, we speak about x86, ARM64 etc.
> >
> > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > must to not be preempted between switch_mm() and this function.
> > But in case of new task this is possible.
>
> We had a thread about this at the end of last year:
>
> https://lkml.org/lkml/2013/11/15/82
>
> There is indeed a problem on arm64, something like this (and I think
> s390 also needs a fix):
>
> 1. switch_mm() via check_and_switch_context() defers the actual mm
> switch by setting TIF_SWITCH_MM
> 2. the context switch is considered 'done' by the kernel before
> finish_arch_post_lock_switch() and therefore we can be preempted to a
> new thread before finish_arch_post_lock_switch()
> 3. The new thread has the same mm as the preempted thread but we
> actually missed the mm switching in finish_arch_post_lock_switch()
> because TIF_SWITCH_MM is per thread rather than mm
>
> > This is the problem I tried to solve. I don't know arm64, and I can't
> > say how it is serious.
>
> Have you managed to reproduce this? I don't say it doesn't exist, but I
> want to make sure that any patch actually fixes it.
>
> So we have more solutions, one of the first two suitable for stable:
>
> 1. Propagate the TIF_SWITCH_MM to the next thread (suggested by Martin)

This is what I put in place for s390, but with the name TIF_TLB_WAIT instead
of TIF_SWITCH_MM. I took the liberty of adding the code to the features branch
of the linux-s390 tree, including the common code change that is necessary:

https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=09ddfb4d5602095aad04eada8bc8df59e873a6ef
https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=525d65f8f66ac29136ba6d2336f5a73b038701e2

These patches will be included in a please-pull request for the next merge
window.

> 2. Get rid of TIF_SWITCH_MM and use mm_cpumask for tracking (I already
> have the patch, it just needs a lot more testing)
> 3. Re-write the ASID allocation algorithm to no longer require IPIs and
> therefore drop finish_arch_post_lock_switch() (this can be done, so
> pretty intrusive for stable)
> 4. Replace finish_arch_post_lock_switch() with finish_mm_switch() as per
> Martin's patch and I think this would guarantee a call always, we can
> move the mm switching from switch_mm() to finish_mm_switch() and no
> need for flags to mark deferred mm switching
>
> For arm64, we'll most likely go with 2 for stable and move to 3 shortly
> after, no need for other deferred mm switching.
>


--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2014-02-17 10:40:20

by Catalin Marinas

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Mon, Feb 17, 2014 at 09:37:38AM +0000, Martin Schwidefsky wrote:
> On Fri, 14 Feb 2014 10:52:55 +0000
> Catalin Marinas <[email protected]> wrote:
>
> > On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > > On 13.02.2014 20:00, Peter Zijlstra wrote:
> > > > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > > >> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > > >> that all newly created tasks execute finish_arch_post_lock_switch()
> > > >> and post_schedule() with preemption enabled.
> > > >
> > > > That's IA64 and MIPS; do they have a 'good' reason to use this?
> > >
> > > It seems my description misleads reader, I'm sorry if so.
> > >
> > > I mean all architectures *except* IA64 and MIPS. All, which
> > > has no __ARCH_WANT_UNLOCKED_CTXSW defined.
> > >
> > > IA64 and MIPS already have preempt_enable() in schedule_tail():
> > >
> > > #ifdef __ARCH_WANT_UNLOCKED_CTXSW
> > > /* In this case, finish_task_switch does not reenable preemption */
> > > preempt_enable();
> > > #endif
> > >
> > > Their initial preemption is not decremented in finish_lock_switch().
> > >
> > > So, we speak about x86, ARM64 etc.
> > >
> > > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > > must to not be preempted between switch_mm() and this function.
> > > But in case of new task this is possible.
> >
> > We had a thread about this at the end of last year:
> >
> > https://lkml.org/lkml/2013/11/15/82
> >
> > There is indeed a problem on arm64, something like this (and I think
> > s390 also needs a fix):
> >
> > 1. switch_mm() via check_and_switch_context() defers the actual mm
> > switch by setting TIF_SWITCH_MM
> > 2. the context switch is considered 'done' by the kernel before
> > finish_arch_post_lock_switch() and therefore we can be preempted to a
> > new thread before finish_arch_post_lock_switch()
> > 3. The new thread has the same mm as the preempted thread but we
> > actually missed the mm switching in finish_arch_post_lock_switch()
> > because TIF_SWITCH_MM is per thread rather than mm
> >
> > > This is the problem I tried to solve. I don't know arm64, and I can't
> > > say how it is serious.
> >
> > Have you managed to reproduce this? I don't say it doesn't exist, but I
> > want to make sure that any patch actually fixes it.
> >
> > So we have more solutions, one of the first two suitable for stable:
> >
> > 1. Propagate the TIF_SWITCH_MM to the next thread (suggested by Martin)
>
> This is what I put in place for s390 but with the name TIF_TLB_WAIT instead
> of TIF_SWITCH_MM. I took the liberty to add the code to the features branch
> of the linux-s390 tree including the common code change that is necessary:
>
> https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=09ddfb4d5602095aad04eada8bc8df59e873a6ef

I don't see a problem with additional calls to
finish_arch_post_lock_switch() on arm and arm64 but I would have done
this in more than one step:

1. Introduce finish_switch_mm()
2. Convert arm and arm64 to finish_switch_mm() (which means we no longer
check whether interrupts are disabled in switch_mm() to defer the
switch)
3. Remove generic finish_arch_post_lock_switch() because its
functionality has been entirely replaced by finish_switch_mm()

Anyway, we probably end up in the same place.

But does this solve the problem of being preempted between switch_mm()
and finish_arch_post_lock_switch()? I guess we still need the same
guarantees that both switch_mm() and the hook happen on the same CPU.

> https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=525d65f8f66ac29136ba6d2336f5a73b038701e2

That's a way to solve it for s390. I don't particularly like
transferring the mm switch pending TIF flag to the next task but I think
it does the job (just personal preference).

--
Catalin

2014-02-17 12:55:39

by Martin Schwidefsky

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Mon, 17 Feb 2014 10:40:06 +0000
Catalin Marinas <[email protected]> wrote:

> On Mon, Feb 17, 2014 at 09:37:38AM +0000, Martin Schwidefsky wrote:
> > On Fri, 14 Feb 2014 10:52:55 +0000
> > Catalin Marinas <[email protected]> wrote:
> >
> > > On Thu, Feb 13, 2014 at 09:32:22PM +0400, Kirill Tkhai wrote:
> > > > On 13.02.2014 20:00, Peter Zijlstra wrote:
> > > > > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > > > >> For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > > > >> that all newly created tasks execute finish_arch_post_lock_switch()
> > > > >> and post_schedule() with preemption enabled.
> > > > >
> > > > > That's IA64 and MIPS; do they have a 'good' reason to use this?
> > > >
> > > > It seems my description misleads reader, I'm sorry if so.
> > > >
> > > > I mean all architectures *except* IA64 and MIPS. All, which
> > > > has no __ARCH_WANT_UNLOCKED_CTXSW defined.
> > > >
> > > > IA64 and MIPS already have preempt_enable() in schedule_tail():
> > > >
> > > > #ifdef __ARCH_WANT_UNLOCKED_CTXSW
> > > > /* In this case, finish_task_switch does not reenable preemption */
> > > > preempt_enable();
> > > > #endif
> > > >
> > > > Their initial preemption is not decremented in finish_lock_switch().
> > > >
> > > > So, we speak about x86, ARM64 etc.
> > > >
> > > > Look at ARM64's finish_arch_post_lock_switch(). It looks a task
> > > > must to not be preempted between switch_mm() and this function.
> > > > But in case of new task this is possible.
> > >
> > > We had a thread about this at the end of last year:
> > >
> > > https://lkml.org/lkml/2013/11/15/82
> > >
> > > There is indeed a problem on arm64, something like this (and I think
> > > s390 also needs a fix):
> > >
> > > 1. switch_mm() via check_and_switch_context() defers the actual mm
> > > switch by setting TIF_SWITCH_MM
> > > 2. the context switch is considered 'done' by the kernel before
> > > finish_arch_post_lock_switch() and therefore we can be preempted to a
> > > new thread before finish_arch_post_lock_switch()
> > > 3. The new thread has the same mm as the preempted thread but we
> > > actually missed the mm switching in finish_arch_post_lock_switch()
> > > because TIF_SWITCH_MM is per thread rather than mm
> > >
> > > > This is the problem I tried to solve. I don't know arm64, and I can't
> > > > say how it is serious.
> > >
> > > Have you managed to reproduce this? I don't say it doesn't exist, but I
> > > want to make sure that any patch actually fixes it.
> > >
> > > So we have more solutions, one of the first two suitable for stable:
> > >
> > > 1. Propagate the TIF_SWITCH_MM to the next thread (suggested by Martin)
> >
> > This is what I put in place for s390 but with the name TIF_TLB_WAIT instead
> > of TIF_SWITCH_MM. I took the liberty to add the code to the features branch
> > of the linux-s390 tree including the common code change that is necessary:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=09ddfb4d5602095aad04eada8bc8df59e873a6ef
>
> I don't see a problem with additional calls to
> finish_arch_post_lock_switch() on arm and arm64 but I would have done
> this in more than one step:
>
> 1. Introduce finish_switch_mm()
> 2. Convert arm and arm64 to finish_switch_mm() (which means we no longer
> check whether the interrupts are disabled in switch_mm() to defer the
> switch
> 3. Remove generic finish_arch_post_lock_switch() because its
> functionality has been entirely replaced by finish_switch_mm()
>
> Anyway, we probably end up in the same place anyway.

Peter pointed me to finish_arch_post_lock_switch as a replacement for
finish_switch_mm. They are basically doing the same thing and I do not care
too much how the function is called. finish_arch_post_lock_switch is ok from
my point of view. If you want to change it, be aware of the header file hell
you are getting into.

> But does this solve the problem of being preempted between switch_mm()
> and finish_arch_post_lock_switch()? I guess we still need the same
> guarantees that both switch_mm() and the hook happen on the same CPU.

By itself no, I do not think so. finish_arch_post_lock_switch is supposed
to be called with no locks held, so preemption should be enabled as well.

> > https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=525d65f8f66ac29136ba6d2336f5a73b038701e2
>
> That's a way to solve it for s390. I don't particularly like
> transferring the mm switch pending TIF flag to the next task but I think
> it does the job (just personal preference).

It gets the job done. The alternative is another per-cpu bitmap for each
mm. I prefer the transfer of the TIF flag to the next task; we do that
for the machine check flag TIF_MCCK_PENDING anyway. One more bit to transfer
does not hurt.
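
The flag transfer being discussed amounts to something like the following in
the context switch path (a hypothetical sketch of the idea only, not the
linked s390 commit; the helper name is made up):

/*
 * Hypothetical: hand a still-pending deferred mm switch over from the
 * task being switched out to the task being switched in.
 */
static inline void transfer_pending_mm_switch(struct task_struct *prev,
					      struct task_struct *next)
{
	if (test_and_clear_tsk_thread_flag(prev, TIF_SWITCH_MM))
		set_tsk_thread_flag(next, TIF_SWITCH_MM);
}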

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2014-02-17 14:43:32

by Kirill Tkhai

Subject: Re: [PATCH] sched/core: Create new task with twice disabled preemption

On Fri, 14/02/2014 at 15:49 +0000, Catalin Marinas wrote:
> On Fri, Feb 14, 2014 at 12:44:01PM +0000, Kirill Tkhai wrote:
> > On Fri, 14/02/2014 at 12:35 +0000, Catalin Marinas wrote:
> > > On Thu, Feb 13, 2014 at 07:51:56PM +0400, Kirill Tkhai wrote:
> > > > Preemption state on enter in finish_task_switch() is different
> > > > in cases of context_switch() and schedule_tail().
> > > >
> > > > In the first case we have it twice disabled: at the start of
> > > > schedule() and during spin locking. In the second it is only
> > > > once: the value which was set in init_task_preempt_count().
> > > >
> > > > For archs without __ARCH_WANT_UNLOCKED_CTXSW set this means
> > > > that all newly created tasks execute finish_arch_post_lock_switch()
> > > > and post_schedule() with preemption enabled.
> > > >
> > > > It seems there is possible a problem in rare situations on arm64,
> > > > when one freshly created thread preempts another before
> > > > finish_arch_post_lock_switch() has finished. If mm is the same,
> > > > then TIF_SWITCH_MM on the second won't be set.
> > > >
> > > > The second rare but possible issue is zeroing of post_schedule()
> > > > on a wrong cpu.
> > > >
> > > > So, lets fix this and unify preempt_count state.
> > >
> > > An alternative to your patch:
> >
> > It looks better, than the initial.
> >
> > You may add my Acked-by if you want.
>
> Thanks for the ack. But apart from arm64, are there any other problems
> with running part of finish_task_switch() and post_schedule() with
> preemption enabled?

1) We have the architecture-dependent finish_arch_post_lock_switch(),
which can possibly(?) be fixed for every arch at the moment,
but someone may run into it in the future.

2) The second is fire_sched_in_preempt_notifiers(). It's a generic interface
which is currently unused. It notifies about preemption, so it's bad
if additional preemption happens before it has finished.

3) tick_nohz_task_switch() seems to be without problems. Just a very, very
slight performance effect.

4) If post_schedule() happens on the wrong CPU, the system may become
imbalanced for a short period. This happens when the post_schedule() of the
wrong class is executed (sketched below).

If we fix this once in the scheduler, we'll settle everything above, now and
in the future. Also we'll decrease the number of rare situations.
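
For point 4, the rq whose flag gets cleared is the one captured by this_rq()
in schedule_tail(); post_schedule() in the 3.13-era kernel/sched/core.c is
roughly the following (paraphrased, treat it as a sketch):

static void post_schedule(struct rq *rq)
{
	if (rq->post_schedule) {
		unsigned long flags;

		raw_spin_lock_irqsave(&rq->lock, flags);
		if (rq->curr->sched_class->post_schedule)
			rq->curr->sched_class->post_schedule(rq);
		raw_spin_unlock_irqrestore(&rq->lock, flags);

		/*
		 * If the new task was preempted and migrated after the
		 * this_rq() above, this clears the flag of a CPU the task
		 * no longer runs on.
		 */
		rq->post_schedule = 0;
	}
}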

> The finish_arch_post_lock_switch() is currently only used by arm and
> arm64 (the former UP only) and arm no longer has the preemption problem
> (see commit bdae73cd374e2). So I can either disable the preemption
> around schedule_tail() call in arm64 or do it globally as per yours or
> my patch.
>
> Peter, Ingo, any thoughts? Do we care about preempt count consistency
> across finish_task_switch() and post_schedule()?
>
> Thanks.
>