2014-12-15 02:24:46

by Frederic Weisbecker

Subject: [PATCH 0/2] sched: Preemption fixlet and cleanup

Here is the reworked fix for the missing preemption check that Linus
reported, plus the need_resched() loop pulled up to the __schedule()
callers, which he also suggested.

Thanks.

Frederic Weisbecker (2):
sched: Fix missing preemption check in cond_resched()
sched: Pull resched loop to __schedule() callers

kernel/sched/core.c | 51 ++++++++++++++++++++++++++-------------------------
1 file changed, 26 insertions(+), 25 deletions(-)

--
2.1.3


2014-12-15 02:24:50

by Frederic Weisbecker

Subject: [PATCH 1/2] sched: Fix missing preemption check in cond_resched()

If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrement, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored due
to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
being delayed because we are interrupting a preempt-disabled area, is
usually fixed up once preemption is re-enabled, with an explicit call to
preempt_schedule().

This is what preempt_enable() does, but a raw preempt count decrement, as
performed by __preempt_count_sub(PREEMPT_ACTIVE), doesn't perform that
delayed preemption check. Therefore, when such a race happens, the
rescheduling is delayed until the next scheduler or preemption entry point.
This can be a problem for scheduler-latency-sensitive workloads.

Let's fix that by consolidating cond_resched() with the preempt_schedule()
internals.
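
For illustration, here is the pre-patch __cond_resched() (removed further
down in this patch), annotated with the window described above; the
annotations are commentary only, the code is unchanged:

	static void __cond_resched(void)
	{
		__preempt_count_add(PREEMPT_ACTIVE);
		__schedule();
		/*
		 * An interrupt firing here may set TIF_NEED_RESCHED, but
		 * preempt_schedule_irq() bails out because PREEMPT_ACTIVE
		 * is still set in the preempt count...
		 */
		__preempt_count_sub(PREEMPT_ACTIVE);
		/*
		 * ...and this raw decrement doesn't re-check need_resched(),
		 * so the reschedule stays pending until the next scheduler
		 * or preemption entry point.
		 */
	}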

Reported-by: Linus Torvalds <[email protected]>
Reported-and-inspired-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/sched/core.c | 40 +++++++++++++++++++---------------------
1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b5797b7..069a2d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2877,6 +2877,21 @@ void __sched schedule_preempt_disabled(void)
preempt_disable();
}

+static void preempt_schedule_common(void)
+{
+ do {
+ __preempt_count_add(PREEMPT_ACTIVE);
+ __schedule();
+ __preempt_count_sub(PREEMPT_ACTIVE);
+
+ /*
+ * Check again in case we missed a preemption opportunity
+ * between schedule and now.
+ */
+ barrier();
+ } while (need_resched());
+}
+
#ifdef CONFIG_PREEMPT
/*
* this is the entry point to schedule() from in-kernel preemption
@@ -2892,17 +2907,7 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
if (likely(!preemptible()))
return;

- do {
- __preempt_count_add(PREEMPT_ACTIVE);
- __schedule();
- __preempt_count_sub(PREEMPT_ACTIVE);
-
- /*
- * Check again in case we missed a preemption opportunity
- * between schedule and now.
- */
- barrier();
- } while (need_resched());
+ preempt_schedule_common();
}
NOKPROBE_SYMBOL(preempt_schedule);
EXPORT_SYMBOL(preempt_schedule);
@@ -4202,17 +4207,10 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
}

-static void __cond_resched(void)
-{
- __preempt_count_add(PREEMPT_ACTIVE);
- __schedule();
- __preempt_count_sub(PREEMPT_ACTIVE);
-}
-
int __sched _cond_resched(void)
{
if (should_resched()) {
- __cond_resched();
+ preempt_schedule_common();
return 1;
}
return 0;
@@ -4237,7 +4235,7 @@ int __cond_resched_lock(spinlock_t *lock)
if (spin_needbreak(lock) || resched) {
spin_unlock(lock);
if (resched)
- __cond_resched();
+ preempt_schedule_common();
else
cpu_relax();
ret = 1;
@@ -4253,7 +4251,7 @@ int __sched __cond_resched_softirq(void)

if (should_resched()) {
local_bh_enable();
- __cond_resched();
+ preempt_schedule_common();
local_bh_disable();
return 1;
}
--
2.1.3

2014-12-15 02:25:00

by Frederic Weisbecker

Subject: [PATCH 2/2] sched: Pull resched loop to __schedule() callers

__schedule() disables preemption for the duration of its job and re-enables
it afterward without performing a preemption check, in order to avoid
recursion.

But if an event that requires rescheduling happens after the context
switch, we need to check again whether a higher-priority task needs the
CPU. A preemption irq can raise such a situation. To handle that,
__schedule() loops on need_resched().

But the preempt_schedule_*() functions, which call __schedule(), also loop
on need_resched() to handle missed preemption irqs. Hence we end up with
the same loop happening twice.

Let's simplify that by making all __schedule() callers responsible for the
need_resched() loop.

There is a risk that the outer loop now handles reschedules that used to be
handled by the inner loop, with the added overhead of the caller details
(inc/dec of PREEMPT_ACTIVE, irq save/restore), but assuming those inner
rescheduling loops weren't run too often, this shouldn't matter. Especially
since the whole preemption path now loses one loop level in any case.
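
To make the doubled loop concrete, here is a sketch of how the two loops
nest before this patch (assembled from the code touched by this series,
abridged):

	/* Outer loop, e.g. in preempt_schedule_common(): */
	do {
		__preempt_count_add(PREEMPT_ACTIVE);
		__schedule();			/* contains the inner loop */
		__preempt_count_sub(PREEMPT_ACTIVE);
		barrier();
	} while (need_resched());

	/* Inner loop, inside __schedule() before this patch: */
	need_resched:
		preempt_disable();
		/* ... pick and switch to the next task ... */
		sched_preempt_enable_no_resched();
		if (need_resched())
			goto need_resched;

This patch removes the inner loop and makes plain schedule() loop the same
way the preempt_schedule*() callers already do.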

Suggested-by: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/sched/core.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 069a2d8..368c8f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2748,6 +2748,10 @@ again:
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
+ *
+ * WARNING: all callers must re-check need_resched() afterward and reschedule
+ * accordingly in case an event triggered the need for rescheduling (such as
+ * an interrupt waking up a task) while preemption was disabled in __schedule().
*/
static void __sched __schedule(void)
{
@@ -2756,7 +2760,6 @@ static void __sched __schedule(void)
struct rq *rq;
int cpu;

-need_resched:
preempt_disable();
cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -2821,8 +2824,6 @@ need_resched:
post_schedule(rq);

sched_preempt_enable_no_resched();
- if (need_resched())
- goto need_resched;
}

static inline void sched_submit_work(struct task_struct *tsk)
@@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
struct task_struct *tsk = current;

sched_submit_work(tsk);
- __schedule();
+ do {
+ __schedule();
+ } while (need_resched());
}
EXPORT_SYMBOL(schedule);

--
2.1.3

2014-12-15 02:46:35

by Linus Torvalds

Subject: Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers

On Sun, Dec 14, 2014 at 6:24 PM, Frederic Weisbecker <[email protected]> wrote:
> -need_resched:
> preempt_disable();
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> @@ -2821,8 +2824,6 @@ need_resched:
> post_schedule(rq);
>
> sched_preempt_enable_no_resched();
> - if (need_resched())
> - goto need_resched;


So my suggestion was to move the
"preempt_disable()/enable_no_resched()" into the callers too.

Everybody already does that - except for "schedule()" itself. So that
would involve adding it here too:

> @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
> struct task_struct *tsk = current;
>
> sched_submit_work(tsk);
> - __schedule();
> + do {
preempt_disable();
> + __schedule();
sched_preempt_enable_no_resched();
> + } while (need_resched());
> }

Hmm?
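
Untangling the quoting, schedule() would then end up looking something
like this:

	asmlinkage __visible void __sched schedule(void)
	{
		struct task_struct *tsk = current;

		sched_submit_work(tsk);
		do {
			preempt_disable();
			__schedule();
			sched_preempt_enable_no_resched();
		} while (need_resched());
	}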

That would mean that we have one less preempt update in the
preempt_schedule() cases. If somebody cares about the preempt counter
value, we'd have to increment the preempt count add value too, i.e. do

    __preempt_count_add(PREEMPT_ACTIVE+1);

there, but I'm not convinced anybody cares about the exact count.

As it is, you end up doing the preempt count things twice in the
preempt_schedule() case: first the PREEMPT_ACTIVE count, and then the
count inside __schedule().

Hmm?

Linus

2014-12-15 16:32:41

by Frederic Weisbecker

Subject: Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers

On Sun, Dec 14, 2014 at 06:46:27PM -0800, Linus Torvalds wrote:
> On Sun, Dec 14, 2014 at 6:24 PM, Frederic Weisbecker <[email protected]> wrote:
> > -need_resched:
> > preempt_disable();
> > cpu = smp_processor_id();
> > rq = cpu_rq(cpu);
> > @@ -2821,8 +2824,6 @@ need_resched:
> > post_schedule(rq);
> >
> > sched_preempt_enable_no_resched();
> > - if (need_resched())
> > - goto need_resched;
>
>
> So my suggestion was to move the
> "preempt_disable()/enable_no_resched()" into the callers too.
>
> Everybody already does that - except for "schedule()" itself. So that
> would involve adding it here too:
>
> > @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
> > struct task_struct *tsk = current;
> >
> > sched_submit_work(tsk);
> > - __schedule();
> > + do {
> preempt_disable();
> > + __schedule();
> sched_preempt_enable_no_resched();
> > + } while (need_resched());
> > }
>
> Hmm?

Indeed!

>
> That would mean that we have one less preempt update in the
> preempt_schedule() cases. If somebody cares about the preempt counter
> value, we'd have to increment the preempt count add value too, i.e. do
>
> __preempt_count_add(PREEMPT_ACTIVE+1);
>
> there, but I'm not convinced anybody cares about the exact count.

It _looks_ fine to only add PREEMPT_ACTIVE without PREEMPT_OFFSET. I guess
we are only interested in avoiding preemption.

But it may have an impact on some context checkers that rely on
in_atomic*(), which ignores the PREEMPT_ACTIVE value. It shouldn't ignore
it, but I guess that's a hack for some specific situation. Now, if we do
what we plan, PREEMPT_ACTIVE won't imply PREEMPT_OFFSET anymore, so
in_atomic() should take PREEMPT_ACTIVE into account.

schedule_debug(), for example, relies on in_atomic_preempt_off(), which
wants preemption disabled through PREEMPT_OFFSET. But we can tweak that.

This is the only context check I know of; let's hope there are no others lurking.
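
For reference, the checks in question look roughly like this in v3.18
(quoting from memory, so the exact names may be slightly off):

	#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != 0)
	#define in_atomic_preempt_off() \
		((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)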

Ah, and there is also the preempt-off tracer, which ignores the
PREEMPT_ACTIVE counts because they use __preempt_count_add/sub() and used
to be immediately followed by a preempt disable. So we need to use the
non-underscored version of preempt_count_add/sub() in preempt_schedule*()
to trace correctly (note though that ftrace only records
preempt_count() & PREEMPT_MASK, so it's going to record the
preemption-disabled area with a preempt count of 0).

> As it is, you end up doing the preempt count things twice in the
> preempt_schedule() case: first the PREEMPT_ACTIVE count, and then the
> count inside __schedule().

Indeed, so I'm going to split the changes into two steps:

1) Disable preemption from schedule(), add PREEMPT_ACTIVE + PREEMPT_OFFSET
from the preempt_schedule*() callers, make them use the non-underscored
preempt_count_add(), and remove the preempt_disable() from __schedule()
(see the sketch after this list). That change should be safe and removes
the overhead of the double preempt_disable().

2) Optional change: remove the PREEMPT_OFFSET from the preempt_schedule*()
callers and tweak schedule_debug(). Having that as a separate patch makes
it easier to ignore, drop or revert, as it looks like a sensitive change.
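
A rough sketch of step 1, just to fix the idea (not a patch, details may
still change):

	static void preempt_schedule_common(void)
	{
		do {
			/*
			 * Non-underscored so the preempt-off tracer sees it;
			 * PREEMPT_OFFSET compensates for __schedule() no
			 * longer doing preempt_disable() itself.
			 */
			preempt_count_add(PREEMPT_ACTIVE + PREEMPT_OFFSET);
			__schedule();
			preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_OFFSET);

			/*
			 * Check again in case we missed a preemption
			 * opportunity between schedule and now.
			 */
			barrier();
		} while (need_resched());
	}

And plain schedule() would loop with preempt_disable() /
sched_preempt_enable_no_resched() around __schedule(), as suggested above.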

2014-12-15 19:16:16

by Linus Torvalds

Subject: Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers

On Mon, Dec 15, 2014 at 8:32 AM, Frederic Weisbecker <[email protected]> wrote:
>
> But it may have an impact on some context checkers that rely on in_atomic*()
> which ignore the PREEMPT_ACTIVE value. It shouldn't ignore that though but I
> guess it's a hack for some specific situation.

I think we should remove it. The only reason for it is the scheduler
itself, which used to have the in_atomic() check (ok, it still has it, it's
just called "in_atomic_preempt_off()").

But yes, if we keep the "mask off PREEMPT_ACTIVE" in in_atomic(), then
we do need to update the counts with "PREEMPT_ACTIVE+1" instead. Or
something like that.

Linus

2014-12-18 12:49:01

by Sedat Dilek

Subject: Re: [PATCH 0/2] sched: Preemption fixlet and cleanup

Hi,

I am following the "frequent lockups in 3.18rc4" thread (see [0]).

So, people are still working on it, and one fixlet was heading towards sched/x86.
Is there an updated version of your patchset [1] (especially of patch 2/2 [2]) around?

One good thing to have in the discussion of [0] would be a short summary
and the config parameters to test with.

Thanks in advance.

Regards,
- Sedat -

[0] http://marc.info/?t=141600075000006&r=1&w=1
[1] http://marc.info/?l=linux-kernel&m=141861029214340&w=1
[2] http://marc.info/?t=141861036000004&r=1&w=1

P.S.: I have applied the 3 patches from Linus on top of Linux v3.18.1
and am testing with that.

Linus Torvalds (3):
x86: mm: move mmap_sem unlock from mm_fault_error() to caller
x86: mm: consolidate VM_FAULT_RETRY handling
x86: mm: fix VM_FAULT_RETRY handling