2008-07-18 06:41:38

by Eric Miao

[permalink] [raw]
Subject: [PATCH] sched: do not stop ticks when cpu is not idle

Issue: the sched tick would be stopped in some race conditions.

One of issues caused by that is:

Since there is no timer ticks any more from then, the jiffies update will be
up to other interrupt to happen. The jiffies will not be updated for a long
time, until next interrupt happens. That will cause APIs like
wait_for_completion_timeout(&complete, timeout) to return timeout by mistake,
since it is using a old jiffies as start time.

Please see comments (1)~(6) inline for how the ticks are stopped
by mistake when cpu is not idle:

void cpu_idle(void)
{
...
while (1) {
void (*idle)(void) = pm_idle;
if (!idle)
idle = default_idle;
leds_event(led_idle_start);
tick_nohz_stop_sched_tick();
while (!need_resched())
idle();
leds_event(led_idle_end);
tick_nohz_restart_sched_tick();
(1) ticks are retarted before switch to other tasks
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}

asmlinkage void __sched schedule(void)
{
...
...
need_resched:
(6) the idle task will be scheduled out again and switch to next task,
with ticks stopped in (5). So the next task will be running with tick stopped.
preempt_disable();
cpu = smp_processor_id();
rq = cpu_rq(cpu);
rcu_qsctr_inc(cpu);
prev = rq->curr;
switch_count = &prev->nivcsw;

release_kernel_lock(prev);
need_resched_nonpreemptible:

schedule_debug(prev);

hrtick_clear(rq);

/*
* Do the rq-clock update outside the rq lock:
*/
local_irq_disable();
__update_rq_clock(rq);
spin_lock(&rq->lock);
clear_tsk_need_resched(prev); (2) resched flag is clear from idle task

....

context_switch(rq, prev, next); /* unlocks the rq */
(3) IRQ will be enabled at end of context_swtich( ).
...
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
(4) the idle task is scheduled back. If an interrupt happen here,
The irq_exit( ) will be called at end of the irq handler.
goto need_resched;
}

void irq_exit(void)
{
...
/* Make sure that timer wheel updates are propagated */
if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
tick_nohz_stop_sched_tick();
(5) The ticks will be stopped again since current
task is idle task and its resched flag is clear in (2).
rcu_irq_exit();
preempt_enable_no_resched();
}

Signed-off-by: Jack Ren <[email protected]>
---
kernel/sched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ff0a7e2..fd17d74 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4027,7 +4027,8 @@ need_resched_nonpreemptible:
rq->nr_switches++;
rq->curr = next;
++*switch_count;
-
+ if (rq->curr != rq->idle)
+ tick_nohz_restart_sched_tick();
context_switch(rq, prev, next); /* unlocks the rq */
/*
* the context switch might have flipped the stack from under
--
1.5.4


2008-07-18 10:25:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle


* eric miao <[email protected]> wrote:

> Issue: the sched tick would be stopped in some race conditions.

> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4027,7 +4027,8 @@ need_resched_nonpreemptible:
> rq->nr_switches++;
> rq->curr = next;
> ++*switch_count;
> -
> + if (rq->curr != rq->idle)
> + tick_nohz_restart_sched_tick();
> context_switch(rq, prev, next); /* unlocks the rq */

applied to tip/sched/urgent, thanks Eric.

Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)

It looks a bit ugly to me in the middle of schedule() - is there no wait
to solve this within kernel/time/*.c ?

Ingo

-------------->
commit ca1b5a8a9abb3db57562a838f41cdba842f13fe8
Author: eric miao <[email protected]>
Date: Fri Jul 18 14:41:29 2008 +0800

sched: do not stop ticks when cpu is not idle

Issue: the sched tick would be stopped in some race conditions.

One of issues caused by that is:

Since there is no timer ticks any more from then, the jiffies update will be
up to other interrupt to happen. The jiffies will not be updated for a long
time, until next interrupt happens. That will cause APIs like
wait_for_completion_timeout(&complete, timeout) to return timeout by mistake,
since it is using a old jiffies as start time.

Please see comments (1)~(6) inline for how the ticks are stopped
by mistake when cpu is not idle:

void cpu_idle(void)
{
...
while (1) {
void (*idle)(void) = pm_idle;
if (!idle)
idle = default_idle;
leds_event(led_idle_start);
tick_nohz_stop_sched_tick();
while (!need_resched())
idle();
leds_event(led_idle_end);
tick_nohz_restart_sched_tick();
(1) ticks are retarted before switch to other tasks
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}

asmlinkage void __sched schedule(void)
{
...
...
need_resched:
(6) the idle task will be scheduled out again and switch to next task,
with ticks stopped in (5). So the next task will be running with tick stopped.
preempt_disable();
cpu = smp_processor_id();
rq = cpu_rq(cpu);
rcu_qsctr_inc(cpu);
prev = rq->curr;
switch_count = &prev->nivcsw;

release_kernel_lock(prev);
need_resched_nonpreemptible:

schedule_debug(prev);

hrtick_clear(rq);

/*
* Do the rq-clock update outside the rq lock:
*/
local_irq_disable();
__update_rq_clock(rq);
spin_lock(&rq->lock);
clear_tsk_need_resched(prev); (2) resched flag is clear from idle task

....

context_switch(rq, prev, next); /* unlocks the rq */
(3) IRQ will be enabled at end of context_swtich( ).
...
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
(4) the idle task is scheduled back. If an interrupt happen here,
The irq_exit( ) will be called at end of the irq handler.
goto need_resched;
}

void irq_exit(void)
{
...
/* Make sure that timer wheel updates are propagated */
if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
tick_nohz_stop_sched_tick();
(5) The ticks will be stopped again since current
task is idle task and its resched flag is clear in (2).
rcu_irq_exit();
preempt_enable_no_resched();
}

Signed-off-by: Jack Ren <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1ee18db..e0e0162 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4446,7 +4446,8 @@ need_resched_nonpreemptible:
rq->nr_switches++;
rq->curr = next;
++*switch_count;
-
+ if (rq->curr != rq->idle)
+ tick_nohz_restart_sched_tick();
context_switch(rq, prev, next); /* unlocks the rq */
/*
* the context switch might have flipped the stack from under

2008-07-18 10:54:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle


* Ingo Molnar <[email protected]> wrote:

> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4446,7 +4446,8 @@ need_resched_nonpreemptible:
> rq->nr_switches++;
> rq->curr = next;
> ++*switch_count;
> -
> + if (rq->curr != rq->idle)
> + tick_nohz_restart_sched_tick();
> context_switch(rq, prev, next); /* unlocks the rq */

hm, one problem i can see is lock dependencies. This code is executed
with the rq lock while tick_nohz_restart_sched_tick() takes hr locks =>
not good. So i havent applied this just yet - this needs to be solved
differently.

Ingo

2008-07-18 11:07:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, 2008-07-18 at 12:54 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4446,7 +4446,8 @@ need_resched_nonpreemptible:
> > rq->nr_switches++;
> > rq->curr = next;
> > ++*switch_count;
> > -
> > + if (rq->curr != rq->idle)
> > + tick_nohz_restart_sched_tick();
> > context_switch(rq, prev, next); /* unlocks the rq */
>
> hm, one problem i can see is lock dependencies. This code is executed
> with the rq lock while tick_nohz_restart_sched_tick() takes hr locks =>
> not good. So i havent applied this just yet - this needs to be solved
> differently.

Actually, that should work these days...

Also, I assume Eric actually tested this with lockdep enabled (right,
Eric?) and that'll shout - or rather, lockup hard in this case - if you
got it wrong.

2008-07-18 11:21:21

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

2008/7/18 eric miao <[email protected]>:
> Issue: the sched tick would be stopped in some race conditions.
>
> One of issues caused by that is:
>
> Since there is no timer ticks any more from then, the jiffies update will be
> up to other interrupt to happen. The jiffies will not be updated for a long
> time, until next interrupt happens. That will cause APIs like
> wait_for_completion_timeout(&complete, timeout) to return timeout by mistake,
> since it is using a old jiffies as start time.
>
> Please see comments (1)~(6) inline for how the ticks are stopped
> by mistake when cpu is not idle:
>
> void cpu_idle(void)
> {
> ...
> while (1) {
> void (*idle)(void) = pm_idle;
> if (!idle)
> idle = default_idle;
> leds_event(led_idle_start);
> tick_nohz_stop_sched_tick();
> while (!need_resched())
> idle();
> leds_event(led_idle_end);
> tick_nohz_restart_sched_tick();
> (1) ticks are retarted before switch to other tasks
> preempt_enable_no_resched();
> schedule();
> preempt_disable();
> }
> }
>
> asmlinkage void __sched schedule(void)
> {
> ...
> ...
> need_resched:
> (6) the idle task will be scheduled out again and switch to next task,
> with ticks stopped in (5). So the next task will be running with tick stopped.
> preempt_disable();
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> rcu_qsctr_inc(cpu);
> prev = rq->curr;
> switch_count = &prev->nivcsw;
>
> release_kernel_lock(prev);
> need_resched_nonpreemptible:
>
> schedule_debug(prev);
>
> hrtick_clear(rq);
>
> /*
> * Do the rq-clock update outside the rq lock:
> */
> local_irq_disable();
> __update_rq_clock(rq);
> spin_lock(&rq->lock);
> clear_tsk_need_resched(prev); (2) resched flag is clear from idle task
>
> ....
>
> context_switch(rq, prev, next); /* unlocks the rq */
> (3) IRQ will be enabled at end of context_swtich( ).
> ...
> preempt_enable_no_resched();
> if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
> (4) the idle task is scheduled back. If an interrupt happen here,
> The irq_exit( ) will be called at end of the irq handler.
> goto need_resched;

(I've taken just a quick look so far, that's maybe why I'm a bit confused)

So what did set TIF_NEED_RESCHED flag here in (4)?

- at first, it was cleared in (2) - ok.
- An interrupt happens somewhere after context_switch() [ btw., what's
about archs that do ctx-switches with interrupts enabled... ]

irq_exit() calls tick_nohz_stop_sched_tick() _only_ when
!need_resched(), meaning TIF_NEED_RESCHED is _not_ set for rq->idle
(no new tasks were activated)
.

So do we have 2 interruts in a raw?

my (likely wrong) interpretation:

(1) schedule() some task - (switch to) -> idle

idle becomes active but is still running in schedule()

(2) an interrupt happens at (3), then irq_exit() calls
tick_nohz_stop_sched_tick()

so far, idle should still run -> this interrupt didn't lead to new
tasks having been activated

(3) another interrupt happens which actually wakes up a task;

TIF_NEED_RESCHED is set

(4) this fact is detected at (4) and --> goto need_resched() to pick
up a new task.

(5) we kind of have idle -> new task reschedule but cpu_idle() never
happened to be called _so_ sched-ticks were not resterted...

is it like this or I'm missing something?


> }
>
> void irq_exit(void)
> {
> ...
> /* Make sure that timer wheel updates are propagated */
> if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
> tick_nohz_stop_sched_tick();
> (5) The ticks will be stopped again since current
> task is idle task and its resched flag is clear in (2).
> rcu_irq_exit();
> preempt_enable_no_resched();
> }
>
> Signed-off-by: Jack Ren <[email protected]>
> ---
> kernel/sched.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ff0a7e2..fd17d74 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4027,7 +4027,8 @@ need_resched_nonpreemptible:
> rq->nr_switches++;
> rq->curr = next;
> ++*switch_count;
> -
> + if (rq->curr != rq->idle)
> + tick_nohz_restart_sched_tick();
> context_switch(rq, prev, next); /* unlocks the rq */
> /*
> * the context switch might have flipped the stack from under
> --
> 1.5.4
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>



--
Best regards,
Dmitry Adamushko

2008-07-18 13:52:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, 18 Jul 2008, Ingo Molnar wrote:
>
> * eric miao <[email protected]> wrote:
>
> > Issue: the sched tick would be stopped in some race conditions.
>
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4027,7 +4027,8 @@ need_resched_nonpreemptible:
> > rq->nr_switches++;
> > rq->curr = next;
> > ++*switch_count;
> > -
> > + if (rq->curr != rq->idle)
> > + tick_nohz_restart_sched_tick();
> > context_switch(rq, prev, next); /* unlocks the rq */
>
> applied to tip/sched/urgent, thanks Eric.
>
> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)

Yes. I did not understand the issue when Jack pointed it out to me,
but with Erics explanation it's really clear. Thanks for tracking that
down.

> It looks a bit ugly to me in the middle of schedule() - is there no wait
> to solve this within kernel/time/*.c ?

Hmm, yes. I think the proper fix is to enable the tick stop mechanism
in the idle loop and disable it before we go to schedule. That takes
an additional parameter to tick_nohz_stop_sched_tick(), but we then
gain a very clear section where the nohz mimic can be active.

I'll whip up a patch.

Thanks,

tglx

2008-07-18 14:39:12

by Eric Miao

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, Jul 18, 2008 at 9:52 PM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 18 Jul 2008, Ingo Molnar wrote:
>>
>> * eric miao <[email protected]> wrote:
>>
>> > Issue: the sched tick would be stopped in some race conditions.
>>
>> > --- a/kernel/sched.c
>> > +++ b/kernel/sched.c
>> > @@ -4027,7 +4027,8 @@ need_resched_nonpreemptible:
>> > rq->nr_switches++;
>> > rq->curr = next;
>> > ++*switch_count;
>> > -
>> > + if (rq->curr != rq->idle)
>> > + tick_nohz_restart_sched_tick();
>> > context_switch(rq, prev, next); /* unlocks the rq */
>>
>> applied to tip/sched/urgent, thanks Eric.
>>
>> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)
>
> Yes. I did not understand the issue when Jack pointed it out to me,
> but with Erics explanation it's really clear. Thanks for tracking that
> down.

Actually, Jack did most of the analysis and came up with this quick
fix.

>
>> It looks a bit ugly to me in the middle of schedule() - is there no wait
>> to solve this within kernel/time/*.c ?
>
> Hmm, yes. I think the proper fix is to enable the tick stop mechanism
> in the idle loop and disable it before we go to schedule. That takes
> an additional parameter to tick_nohz_stop_sched_tick(), but we then
> gain a very clear section where the nohz mimic can be active.
>
> I'll whip up a patch.

Sounds great, thanks.

>
> Thanks,
>
> tglx
>



--
Cheers
- eric

2008-07-18 15:28:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, 18 Jul 2008, eric miao wrote:
> On Fri, Jul 18, 2008 at 9:52 PM, Thomas Gleixner <[email protected]> wrote:
> >> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)
> >
> > Yes. I did not understand the issue when Jack pointed it out to me,
> > but with Erics explanation it's really clear. Thanks for tracking that
> > down.
>
> Actually, Jack did most of the analysis and came up with this quick
> fix.
>
> >
> >> It looks a bit ugly to me in the middle of schedule() - is there no wait
> >> to solve this within kernel/time/*.c ?
> >
> > Hmm, yes. I think the proper fix is to enable the tick stop mechanism
> > in the idle loop and disable it before we go to schedule. That takes
> > an additional parameter to tick_nohz_stop_sched_tick(), but we then
> > gain a very clear section where the nohz mimic can be active.
> >
> > I'll whip up a patch.
>
> Sounds great, thanks.

Hey, thanks for tracking that down. I was banging my head against the
wall when I understood the problem.

I tried to pinpoint the occasional softlockup bug reports, but I
probably stared too long into that code so I just saw what I expected
to see.

Can you give the patch below a try please ?

Thanks,

tglx

------------------->
Subject: nohz: prevent tick stop outside of the idle loop
From: Thomas Gleixner <[email protected]>

Jack Ran and Eric Miao tracked down the following long standing
problem in the NOHZ code:

scheduler switch to idle task
enable interrupts

Window starts here

----> interrupt happens (does not set NEED_RESCHED)
irq_exit() stops the tick

----> interrupt happens (does set NEED_RESCHED)

return from schedule()

cpu_idle(): preempt_disable();

Window ends here

The interrupts can happen at any point inside the race window. The
first interrupt stops the tick, the second one causes the scheduler to
rerun and switch away from idle again and we end up with the tick
disabled.

The fact that it needs two interrupts where the first one does not set
NEED_RESCHED and the second one does made the bug obscure and extremly
hard to reproduce and analyse. Kudos to Jack and Eric.

Solution: Limit the NOHZ functionality to the idle loop to make sure
that we can not run into such a situation ever again.

cpu_idle()
{
preempt_disable();

while(1) {
tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
are in the idle loop

while (!need_resched())
halt();

tick_nohz_restart_sched_tick(); <- disables NOHZ mode
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}

In hindsight we should have done this forever, but ...

/me grabs a large brown paperbag.

Debugged-by: Jack Ren <[email protected]>,
Debugged-by: eric miao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>

---
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 199b368..89bfded 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -162,7 +162,7 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;
leds_event(led_idle_start);
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched())
idle();
leds_event(led_idle_end);
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 6cf9df1..ff820a9 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -31,7 +31,7 @@ void cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched())
cpu_idle_sleep();
tick_nohz_restart_sched_tick();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 53c2cd2..77800dd 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -105,7 +105,7 @@ void cpu_idle(void)
#endif
if (!idle)
idle = default_idle;
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched())
idle();
tick_nohz_restart_sched_tick();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index c06f5b5..b16facd 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -53,7 +53,7 @@ void __noreturn cpu_idle(void)
{
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched()) {
#ifdef CONFIG_SMTC_IDLE_HOOK_DEBUG
extern void smtc_idle_loop_hook(void);
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index c3cf0e8..d308a9f 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -60,7 +60,7 @@ void cpu_idle(void)

set_thread_flag(TIF_POLLING_NRFLAG);
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched() && !cpu_should_die()) {
ppc64_runlatch_off();

diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index b721207..70b688c 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -561,7 +561,7 @@ static void yield_shared_processor(void)
static void iseries_shared_idle(void)
{
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched() && !hvlpevent_is_pending()) {
local_irq_disable();
ppc64_runlatch_off();
@@ -591,7 +591,7 @@ static void iseries_dedicated_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
if (!need_resched()) {
while (!need_resched()) {
ppc64_runlatch_off();
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index b98e37a..921892c 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -86,7 +86,7 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;

- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched())
idle();
tick_nohz_restart_sched_tick();
diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index 2084f81..0798928 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -97,7 +97,7 @@ void cpu_idle(void)
set_thread_flag(TIF_POLLING_NRFLAG);

while(1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);

while (!need_resched() && !cpu_is_offline(cpu))
sparc64_yield(cpu);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 83603cf..a1c6d07 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -243,7 +243,7 @@ void default_idle(void)
if (need_resched())
schedule();

- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
nsecs = disable_timer();
idle_sleep(nsecs);
tick_nohz_restart_sched_tick();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0c3927a..53bc653 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -128,7 +128,7 @@ void cpu_idle(void)

/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched()) {

check_pgt_cache();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a8e5362..9a10c18 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -120,7 +120,7 @@ void cpu_idle(void)
current_thread_info()->status |= TS_POLLING;
/* endless idle loop with no priority at all */
while (1) {
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(1);
while (!need_resched()) {

rmb();
diff --git a/include/linux/tick.h b/include/linux/tick.h
index a881c65..d3c0269 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -49,6 +49,7 @@ struct tick_sched {
unsigned long check_clocks;
enum tick_nohz_mode nohz_mode;
ktime_t idle_tick;
+ int inidle;
int tick_stopped;
unsigned long idle_jiffies;
unsigned long idle_calls;
@@ -105,14 +106,14 @@ static inline int tick_check_oneshot_change(int allow_nohz) { return 0; }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */

# ifdef CONFIG_NO_HZ
-extern void tick_nohz_stop_sched_tick(void);
+extern void tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
extern void tick_nohz_update_jiffies(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern void tick_nohz_stop_idle(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
# else
-static inline void tick_nohz_stop_sched_tick(void) { }
+static inline void tick_nohz_stop_sched_tick(int inidle) { }
static inline void tick_nohz_restart_sched_tick(void) { }
static inline void tick_nohz_update_jiffies(void) { }
static inline ktime_t tick_nohz_get_sleep_length(void)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 81e2fe0..f6b03d5 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -286,7 +286,7 @@ void irq_exit(void)
#ifdef CONFIG_NO_HZ
/* Make sure that timer wheel updates are propagated */
if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
- tick_nohz_stop_sched_tick();
+ tick_nohz_stop_sched_tick(0);
rcu_irq_exit();
#endif
preempt_enable_no_resched();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index beef7cc..a5c26d2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -195,7 +195,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
* Called either from the idle loop or from irq_exit() when an idle period was
* just interrupted by an interrupt which did not cause a reschedule.
*/
-void tick_nohz_stop_sched_tick(void)
+void tick_nohz_stop_sched_tick(int inidle)
{
unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
struct tick_sched *ts;
@@ -224,6 +224,11 @@ void tick_nohz_stop_sched_tick(void)
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
goto end;

+ if (!inidle && !ts->inidle)
+ goto end;
+
+ ts->inidle = 1;
+
if (need_resched())
goto end;

@@ -373,11 +378,14 @@ void tick_nohz_restart_sched_tick(void)
local_irq_disable();
tick_nohz_stop_idle(cpu);

- if (!ts->tick_stopped) {
+ if (!ts->inidle || !ts->tick_stopped) {
+ ts->inidle = 0;
local_irq_enable();
return;
}

+ ts->inidle = 0;
+
rcu_exit_nohz();

/* Update jiffies first */

2008-07-18 16:30:58

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, Jul 18, 2008 at 05:27:28PM +0200, Thomas Gleixner wrote:
> On Fri, 18 Jul 2008, eric miao wrote:
> > On Fri, Jul 18, 2008 at 9:52 PM, Thomas Gleixner <[email protected]> wrote:
> > >> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)
> > >
> > > Yes. I did not understand the issue when Jack pointed it out to me,
> > > but with Erics explanation it's really clear. Thanks for tracking that
> > > down.
> >
> > Actually, Jack did most of the analysis and came up with this quick
> > fix.
> >
> > >
> > >> It looks a bit ugly to me in the middle of schedule() - is there no wait
> > >> to solve this within kernel/time/*.c ?
> > >
> > > Hmm, yes. I think the proper fix is to enable the tick stop mechanism
> > > in the idle loop and disable it before we go to schedule. That takes
> > > an additional parameter to tick_nohz_stop_sched_tick(), but we then
> > > gain a very clear section where the nohz mimic can be active.
> > >
> > > I'll whip up a patch.
> >
> > Sounds great, thanks.
>
> Hey, thanks for tracking that down. I was banging my head against the
> wall when I understood the problem.
>
> I tried to pinpoint the occasional softlockup bug reports, but I
> probably stared too long into that code so I just saw what I expected
> to see.
>
> Can you give the patch below a try please ?

If this patch works, could you also add the s390 hunk before submitting?
Thanks! ;)

2008-07-18 22:28:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2008-07-18 at 12:54 +0200, Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -4446,7 +4446,8 @@ need_resched_nonpreemptible:
> > > rq->nr_switches++;
> > > rq->curr = next;
> > > ++*switch_count;
> > > -
> > > + if (rq->curr != rq->idle)
> > > + tick_nohz_restart_sched_tick();
> > > context_switch(rq, prev, next); /* unlocks the rq */
> >
> > hm, one problem i can see is lock dependencies. This code is executed
> > with the rq lock while tick_nohz_restart_sched_tick() takes hr locks =>
> > not good. So i havent applied this just yet - this needs to be solved
> > differently.
>
> Actually, that should work these days...
>
> Also, I assume Eric actually tested this with lockdep enabled (right,
> Eric?) and that'll shout - or rather, lockup hard in this case - if
> you got it wrong.

nope:

[ 0.188011] =================================
[ 0.188011] [ INFO: inconsistent lock state ]
[ 0.188011] 2.6.26-tip-03835-g9d964b9-dirty #20198
[ 0.188011] ---------------------------------
[ 0.188011] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
[ 0.188011] swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 0.188011] (&rq->rq_lock_key){+...}, at: [<ffffffff816a4b35>] schedule+0x191/0x900
[ 0.188011] {in-hardirq-W} state was registered at:
[ 0.188011] [<ffffffffffffffff>] 0xffffffffffffffff

Ingo

2008-07-19 07:32:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Fri, 18 Jul 2008, Heiko Carstens wrote:
> On Fri, Jul 18, 2008 at 05:27:28PM +0200, Thomas Gleixner wrote:
> > Can you give the patch below a try please ?
>
> If this patch works, could you also add the s390 hunk before submitting?

Sure. It should have been there already :(

Thanks,
tglx

2008-07-21 07:34:44

by Jack Ren

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

Hi Thomas,
I applied your patch, and rerun the test on my environment, the
issue went away. I also analyzed your patch and did not find any other
race condition so far.

cheers,
Jack Ren

On Fri, Jul 18, 2008 at 11:27 PM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 18 Jul 2008, eric miao wrote:
>> On Fri, Jul 18, 2008 at 9:52 PM, Thomas Gleixner <[email protected]> wrote:
>> >> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)
>> >
>> > Yes. I did not understand the issue when Jack pointed it out to me,
>> > but with Erics explanation it's really clear. Thanks for tracking that
>> > down.
>>
>> Actually, Jack did most of the analysis and came up with this quick
>> fix.
>>
>> >
>> >> It looks a bit ugly to me in the middle of schedule() - is there no wait
>> >> to solve this within kernel/time/*.c ?
>> >
>> > Hmm, yes. I think the proper fix is to enable the tick stop mechanism
>> > in the idle loop and disable it before we go to schedule. That takes
>> > an additional parameter to tick_nohz_stop_sched_tick(), but we then
>> > gain a very clear section where the nohz mimic can be active.
>> >
>> > I'll whip up a patch.
>>
>> Sounds great, thanks.
>
> Hey, thanks for tracking that down. I was banging my head against the
> wall when I understood the problem.
>
> I tried to pinpoint the occasional softlockup bug reports, but I
> probably stared too long into that code so I just saw what I expected
> to see.
>
> Can you give the patch below a try please ?
>
> Thanks,
>
> tglx
>
> ------------------->
> Subject: nohz: prevent tick stop outside of the idle loop
> From: Thomas Gleixner <[email protected]>
>
> Jack Ran and Eric Miao tracked down the following long standing
> problem in the NOHZ code:
>
> scheduler switch to idle task
> enable interrupts
>
> Window starts here
>
> ----> interrupt happens (does not set NEED_RESCHED)
> irq_exit() stops the tick
>
> ----> interrupt happens (does set NEED_RESCHED)
>
> return from schedule()
>
> cpu_idle(): preempt_disable();
>
> Window ends here
>
> The interrupts can happen at any point inside the race window. The
> first interrupt stops the tick, the second one causes the scheduler to
> rerun and switch away from idle again and we end up with the tick
> disabled.
>
> The fact that it needs two interrupts where the first one does not set
> NEED_RESCHED and the second one does made the bug obscure and extremly
> hard to reproduce and analyse. Kudos to Jack and Eric.
>
> Solution: Limit the NOHZ functionality to the idle loop to make sure
> that we can not run into such a situation ever again.
>
> cpu_idle()
> {
> preempt_disable();
>
> while(1) {
> tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
> are in the idle loop
>
> while (!need_resched())
> halt();
>
> tick_nohz_restart_sched_tick(); <- disables NOHZ mode
> preempt_enable_no_resched();
> schedule();
> preempt_disable();
> }
> }
>
> In hindsight we should have done this forever, but ...
>
> /me grabs a large brown paperbag.
>
> Debugged-by: Jack Ren <[email protected]>,
> Debugged-by: eric miao <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
>
> ---
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index 199b368..89bfded 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -162,7 +162,7 @@ void cpu_idle(void)
> if (!idle)
> idle = default_idle;
> leds_event(led_idle_start);
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched())
> idle();
> leds_event(led_idle_end);
> diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
> index 6cf9df1..ff820a9 100644
> --- a/arch/avr32/kernel/process.c
> +++ b/arch/avr32/kernel/process.c
> @@ -31,7 +31,7 @@ void cpu_idle(void)
> {
> /* endless idle loop with no priority at all */
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched())
> cpu_idle_sleep();
> tick_nohz_restart_sched_tick();
> diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
> index 53c2cd2..77800dd 100644
> --- a/arch/blackfin/kernel/process.c
> +++ b/arch/blackfin/kernel/process.c
> @@ -105,7 +105,7 @@ void cpu_idle(void)
> #endif
> if (!idle)
> idle = default_idle;
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched())
> idle();
> tick_nohz_restart_sched_tick();
> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> index c06f5b5..b16facd 100644
> --- a/arch/mips/kernel/process.c
> +++ b/arch/mips/kernel/process.c
> @@ -53,7 +53,7 @@ void __noreturn cpu_idle(void)
> {
> /* endless idle loop with no priority at all */
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched()) {
> #ifdef CONFIG_SMTC_IDLE_HOOK_DEBUG
> extern void smtc_idle_loop_hook(void);
> diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
> index c3cf0e8..d308a9f 100644
> --- a/arch/powerpc/kernel/idle.c
> +++ b/arch/powerpc/kernel/idle.c
> @@ -60,7 +60,7 @@ void cpu_idle(void)
>
> set_thread_flag(TIF_POLLING_NRFLAG);
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched() && !cpu_should_die()) {
> ppc64_runlatch_off();
>
> diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
> index b721207..70b688c 100644
> --- a/arch/powerpc/platforms/iseries/setup.c
> +++ b/arch/powerpc/platforms/iseries/setup.c
> @@ -561,7 +561,7 @@ static void yield_shared_processor(void)
> static void iseries_shared_idle(void)
> {
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched() && !hvlpevent_is_pending()) {
> local_irq_disable();
> ppc64_runlatch_off();
> @@ -591,7 +591,7 @@ static void iseries_dedicated_idle(void)
> set_thread_flag(TIF_POLLING_NRFLAG);
>
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> if (!need_resched()) {
> while (!need_resched()) {
> ppc64_runlatch_off();
> diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
> index b98e37a..921892c 100644
> --- a/arch/sh/kernel/process_32.c
> +++ b/arch/sh/kernel/process_32.c
> @@ -86,7 +86,7 @@ void cpu_idle(void)
> if (!idle)
> idle = default_idle;
>
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched())
> idle();
> tick_nohz_restart_sched_tick();
> diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
> index 2084f81..0798928 100644
> --- a/arch/sparc64/kernel/process.c
> +++ b/arch/sparc64/kernel/process.c
> @@ -97,7 +97,7 @@ void cpu_idle(void)
> set_thread_flag(TIF_POLLING_NRFLAG);
>
> while(1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
>
> while (!need_resched() && !cpu_is_offline(cpu))
> sparc64_yield(cpu);
> diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
> index 83603cf..a1c6d07 100644
> --- a/arch/um/kernel/process.c
> +++ b/arch/um/kernel/process.c
> @@ -243,7 +243,7 @@ void default_idle(void)
> if (need_resched())
> schedule();
>
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> nsecs = disable_timer();
> idle_sleep(nsecs);
> tick_nohz_restart_sched_tick();
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 0c3927a..53bc653 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -128,7 +128,7 @@ void cpu_idle(void)
>
> /* endless idle loop with no priority at all */
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched()) {
>
> check_pgt_cache();
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index a8e5362..9a10c18 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -120,7 +120,7 @@ void cpu_idle(void)
> current_thread_info()->status |= TS_POLLING;
> /* endless idle loop with no priority at all */
> while (1) {
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(1);
> while (!need_resched()) {
>
> rmb();
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index a881c65..d3c0269 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -49,6 +49,7 @@ struct tick_sched {
> unsigned long check_clocks;
> enum tick_nohz_mode nohz_mode;
> ktime_t idle_tick;
> + int inidle;
> int tick_stopped;
> unsigned long idle_jiffies;
> unsigned long idle_calls;
> @@ -105,14 +106,14 @@ static inline int tick_check_oneshot_change(int allow_nohz) { return 0; }
> #endif /* !CONFIG_GENERIC_CLOCKEVENTS */
>
> # ifdef CONFIG_NO_HZ
> -extern void tick_nohz_stop_sched_tick(void);
> +extern void tick_nohz_stop_sched_tick(int inidle);
> extern void tick_nohz_restart_sched_tick(void);
> extern void tick_nohz_update_jiffies(void);
> extern ktime_t tick_nohz_get_sleep_length(void);
> extern void tick_nohz_stop_idle(int cpu);
> extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> # else
> -static inline void tick_nohz_stop_sched_tick(void) { }
> +static inline void tick_nohz_stop_sched_tick(int inidle) { }
> static inline void tick_nohz_restart_sched_tick(void) { }
> static inline void tick_nohz_update_jiffies(void) { }
> static inline ktime_t tick_nohz_get_sleep_length(void)
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 81e2fe0..f6b03d5 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -286,7 +286,7 @@ void irq_exit(void)
> #ifdef CONFIG_NO_HZ
> /* Make sure that timer wheel updates are propagated */
> if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
> - tick_nohz_stop_sched_tick();
> + tick_nohz_stop_sched_tick(0);
> rcu_irq_exit();
> #endif
> preempt_enable_no_resched();
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index beef7cc..a5c26d2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -195,7 +195,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
> * Called either from the idle loop or from irq_exit() when an idle period was
> * just interrupted by an interrupt which did not cause a reschedule.
> */
> -void tick_nohz_stop_sched_tick(void)
> +void tick_nohz_stop_sched_tick(int inidle)
> {
> unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
> struct tick_sched *ts;
> @@ -224,6 +224,11 @@ void tick_nohz_stop_sched_tick(void)
> if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> goto end;
>
> + if (!inidle && !ts->inidle)
> + goto end;
> +
> + ts->inidle = 1;
> +
> if (need_resched())
> goto end;
>
> @@ -373,11 +378,14 @@ void tick_nohz_restart_sched_tick(void)
> local_irq_disable();
> tick_nohz_stop_idle(cpu);
>
> - if (!ts->tick_stopped) {
> + if (!ts->inidle || !ts->tick_stopped) {
> + ts->inidle = 0;
> local_irq_enable();
> return;
> }
>
> + ts->inidle = 0;
> +
> rcu_exit_nohz();
>
> /* Update jiffies first */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-07-21 19:13:31

by Philippe Troin

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

Thomas Gleixner <[email protected]> writes:

> On Fri, 18 Jul 2008, eric miao wrote:
> > On Fri, Jul 18, 2008 at 9:52 PM, Thomas Gleixner <[email protected]> wrote:
> > >> Thomas, Peter, Dmitry, do you concur with the analysis? (commit below)
> > >
> > > Yes. I did not understand the issue when Jack pointed it out to me,
> > > but with Erics explanation it's really clear. Thanks for tracking that
> > > down.
> >
> > Actually, Jack did most of the analysis and came up with this quick
> > fix.
> >
> > >
> > >> It looks a bit ugly to me in the middle of schedule() - is there no wait
> > >> to solve this within kernel/time/*.c ?
> > >
> > > Hmm, yes. I think the proper fix is to enable the tick stop mechanism
> > > in the idle loop and disable it before we go to schedule. That takes
> > > an additional parameter to tick_nohz_stop_sched_tick(), but we then
> > > gain a very clear section where the nohz mimic can be active.
> > >
> > > I'll whip up a patch.
> >
> > Sounds great, thanks.
>
> Hey, thanks for tracking that down. I was banging my head against the
> wall when I understood the problem.
>
> I tried to pinpoint the occasional softlockup bug reports, but I
> probably stared too long into that code so I just saw what I expected
> to see.
>
> Can you give the patch below a try please ?

Hi Thomas,

I've seen weird timer behavior on both i386 and x86_64 on SMP
machines. By weird I mean:

- time stops for a few hours, then resumes as if nothing happened;

- time flows too fast or slow (4x faster to 2x slower depending on
phase of the moon);

- the last one I've seen (yesterday), was:
sleep(1) sleeps for 1 second, but
select(0, NULL, NULL, NULL, 0.5) sleeps for nine seconds.

I have been trying to track this problem for a few weeks now, without
success. Booting a CONFIG_NO_HZ-enabled kernel with "highres=off
nohz=off" does not make a difference. However booting a kernel with
CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS disabled seems to be working
(I cannot garantee that since I've been using that for 48h so far, but
sometimes the problem takes a few days to manifest itself).

After a cursory reading of your patch, it looks to me that the race
could happen on a kernel compiled with CONFIG_NO_HZ and
CONFIG_HIGH_RES_TIMERS and booted with "nohz=off highres=off". Can
you confirm that?

If you need more details (dmesg, lspci, etc), I have posted some
details on LKML ( http://lkml.org/lkml/2008/7/9/330 ) and I have a bug
posted on the Fedora/RH bugzilla (
https://bugzilla.redhat.com/show_bug.cgi?id=451824 ).

Phil.

2008-07-21 20:25:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

On Mon, 21 Jul 2008, Philippe Troin wrote:
> Thomas Gleixner <[email protected]> writes:
> I've seen weird timer behavior on both i386 and x86_64 on SMP
> machines. By weird I mean:
>
> - time stops for a few hours, then resumes as if nothing happened;
>
> - time flows too fast or slow (4x faster to 2x slower depending on
> phase of the moon);
>
> - the last one I've seen (yesterday), was:
> sleep(1) sleeps for 1 second, but
> select(0, NULL, NULL, NULL, 0.5) sleeps for nine seconds.
>
> I have been trying to track this problem for a few weeks now, without
> success. Booting a CONFIG_NO_HZ-enabled kernel with "highres=off
> nohz=off" does not make a difference. However booting a kernel with
> CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS disabled seems to be working
> (I cannot garantee that since I've been using that for 48h so far, but
> sometimes the problem takes a few days to manifest itself).
>
> After a cursory reading of your patch, it looks to me that the race
> could happen on a kernel compiled with CONFIG_NO_HZ and
> CONFIG_HIGH_RES_TIMERS and booted with "nohz=off highres=off". Can
> you confirm that?

No, I can not confirm that. With nohz=off / highres=off that code path
is not invoked.

> If you need more details (dmesg, lspci, etc), I have posted some
> details on LKML ( http://lkml.org/lkml/2008/7/9/330 ) and I have a bug
> posted on the Fedora/RH bugzilla (
> https://bugzilla.redhat.com/show_bug.cgi?id=451824 ).

Will have a look.

Question: which clocksource is active ?

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Thanks,

tglx

2008-07-21 20:53:47

by Philippe Troin

[permalink] [raw]
Subject: Re: [PATCH] sched: do not stop ticks when cpu is not idle

Thomas Gleixner <[email protected]> writes:

> On Mon, 21 Jul 2008, Philippe Troin wrote:
> > Thomas Gleixner <[email protected]> writes:
> > I've seen weird timer behavior on both i386 and x86_64 on SMP
> > machines. By weird I mean:
> >
> > - time stops for a few hours, then resumes as if nothing happened;
> >
> > - time flows too fast or slow (4x faster to 2x slower depending on
> > phase of the moon);
> >
> > - the last one I've seen (yesterday), was:
> > sleep(1) sleeps for 1 second, but
> > select(0, NULL, NULL, NULL, 0.5) sleeps for nine seconds.
> >
> > I have been trying to track this problem for a few weeks now, without
> > success. Booting a CONFIG_NO_HZ-enabled kernel with "highres=off
> > nohz=off" does not make a difference. However booting a kernel with
> > CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS disabled seems to be working
> > (I cannot garantee that since I've been using that for 48h so far, but
> > sometimes the problem takes a few days to manifest itself).
> >
> > After a cursory reading of your patch, it looks to me that the race
> > could happen on a kernel compiled with CONFIG_NO_HZ and
> > CONFIG_HIGH_RES_TIMERS and booted with "nohz=off highres=off". Can
> > you confirm that?
>
> No, I can not confirm that. With nohz=off / highres=off that code path
> is not invoked.

Darn. You're right, on a more detailed reading:

With CONFIG_NO_HZ set, the tick_nohz_stop_sched_tick() function is
defined (declared in tick.h and defined in tick-sched.c).

There's nothing to prevent tick_nohz_stop_sched_tick() to be called
from cpu_idle().

However in tick_nohz_stop_sched_tick(), ts->nohz_mode ==
NOHZ_MODE_INACTIVE is true and the function bails out early. And
just before the section which was patched.

> > If you need more details (dmesg, lspci, etc), I have posted some
> > details on LKML ( http://lkml.org/lkml/2008/7/9/330 ) and I have a bug
> > posted on the Fedora/RH bugzilla (
> > https://bugzilla.redhat.com/show_bug.cgi?id=451824 ).
>
> Will have a look.
>
> Question: which clocksource is active ?
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource

As mentionned earlier I found two systems showing up the problem, a
dual Pentium III system (i386) and a dual Opteron system running in
64-bit (x86_64).

On the i386:

current_clocksource is jiffies

On this one, the symptoms tend to be that the clock goes too fast or
too slow, always by an integer multiple (seen 2x slower and 4x
faster so far).

Once on this system, while the clock was running 4x faster, changing
current_clocksource to tsc (the only other available choice)
reestablished the "normal flow of time" :) Back to jiffies, and the
clock went back to 4x faster. I could switch back and forth.

On the x86_64:

current_clocksource is hpet

On the dual Opteron system, the symptoms I've seen are that the
system becomes unresponsive, with some "stuck" processes, and the
time not changing for long periods of time (like a few hours).

It's also on this sytem that I saw yesterday:

sleep(1) takes 1 seconds.
select(0, NULL, NULL, NULL, .5) takes 9 seconds.
date was reporting a wall time flowing normally.

A question I had was: when the system(s) gets wedged, what kind of
debugging information could I gather on the live system before I
reboot?

Phil.