LinuxLists.cc - [nohz] 2a16fc93d2c: kernel lockup on idle injection

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

Cc'ing Thomas as well..

On 12 December 2014 at 01:12, Fengguang Wu <[email protected]> wrote:
> Hi Viresh,
>
> We noticed the below lockup regression on commit 2a16fc93d2c ("nohz:
> Avoid tick's double reprogramming in highres mode").
>
> testbox/testcase/testparams: ivb42/idle-inject/60s-200%-10cp
>
> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
> ---------------- --------------------------
> fail:runs %reproduction fail:runs
> | | |
> :5 100% 1:1 last_state.is_incomplete_run
> :5 100% 1:1 last_state.running
>
> testbox/testcase/testparams: lkp-sb03/idle-inject/60s-200%-10cp
>
> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
> ---------------- --------------------------
> :7 100% 1:1 last_state.is_incomplete_run
> :7 100% 1:1 last_state.running
>
> Where test box ivb42 is Ivy Bridge-EP and lkp-sb03 is Sandy Bridge-EP.
>
> To reproduce:
>
> apt-get install ruby ruby-oj
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
> cd lkp-tests
> bin/setup-local job.yaml # the job file attached in this email
> bin/run-local job.yaml
>
> Basically what the test case does is to
>
> - find a Sandy Bridge or newer machine
> - look for a cooling device with type “intel_powerclamp”
> - set cur_state to 10
> - run any CPU extensive workload
>
> Then expect soft lockup. It's very reproducible.

Thanks Fengguang. Yes I am able to reproduce it, but don't know yet what
went wrong..

--
viresh

2014-12-15 07:26:06

by Preeti U Murthy

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

Hi Viresh,

Let me explain why I think this is happening.

1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle*
and receives an interrupt.

2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer
in its handler, assuming that tick_nohz_irq_exit() will take care of
programming the clock event device appropriately, and hence it would
requeue or cancel the tick_sched timer.

3. But the intel_powerclamp driver injects an idle period only.
*The CPU however is not idle*. It has work on its runqueue and the
rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not
get called on any interrupt*.

4. As a consequence, when we get a hrtimer interrupt during the period
that the powerclamp driver is mimicking idle, the exit path of the
interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer
that would have got removed due to the above commit will not get
enqueued back on for any pending timers that there might be. Besides
this, *jiffies never gets updated*.

5. If you look at the code of the powerclamp driver, clamp_thread()
loops on jiffies getting updated. It continues to do so with preemption
disabled and no tick_sched timer to force a scheduler tick to update the
jiffies. Since this happens on cpus in a package, all of them get soft
lockedup.

Hope the above explanation makes sense.

Regards
Preeti U Murthy

On 12/12/2014 05:27 PM, Viresh Kumar wrote:
> Cc'ing Thomas as well..
>
> On 12 December 2014 at 01:12, Fengguang Wu <[email protected]> wrote:
>> Hi Viresh,
>>
>> We noticed the below lockup regression on commit 2a16fc93d2c ("nohz:
>> Avoid tick's double reprogramming in highres mode").
>>
>> testbox/testcase/testparams: ivb42/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> ---------------- --------------------------
>> fail:runs %reproduction fail:runs
>> | | |
>> :5 100% 1:1 last_state.is_incomplete_run
>> :5 100% 1:1 last_state.running
>>
>> testbox/testcase/testparams: lkp-sb03/idle-inject/60s-200%-10cp
>>
>> b5e995e671d8e4d7 2a16fc93d2c9568e16d45db77c
>> ---------------- --------------------------
>> :7 100% 1:1 last_state.is_incomplete_run
>> :7 100% 1:1 last_state.running
>>
>> Where test box ivb42 is Ivy Bridge-EP and lkp-sb03 is Sandy Bridge-EP.
>>
>> To reproduce:
>>
>> apt-get install ruby ruby-oj
>> git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
>> cd lkp-tests
>> bin/setup-local job.yaml # the job file attached in this email
>> bin/run-local job.yaml
>>
>> Basically what the test case does is to
>>
>> - find a Sandy Bridge or newer machine
>> - look for a cooling device with type “intel_powerclamp”
>> - set cur_state to 10
>> - run any CPU extensive workload
>>
>> Then expect soft lockup. It's very reproducible.
>
> Thanks Fengguang. Yes I am able to reproduce it, but don't know yet what
> went wrong..
>
> --
> viresh
>

2014-12-15 09:32:21

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On 15 December 2014 at 12:55, Preeti U Murthy <[email protected]> wrote:
> Hi Viresh,
>
> Let me explain why I think this is happening.
>
> 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle*
> and receives an interrupt.

Bang on target. Yeah that's the part we missed while writing this patch :)

> 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer
> in its handler, assuming that tick_nohz_irq_exit() will take care of
> programming the clock event device appropriately, and hence it would
> requeue or cancel the tick_sched timer.

Correct.

> 3. But the intel_powerclamp driver injects an idle period only.
> *The CPU however is not idle*. It has work on its runqueue and the
> rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not
> get called on any interrupt*.

Still good..

> 4. As a consequence, when we get a hrtimer interrupt during the period
> that the powerclamp driver is mimicking idle, the exit path of the
> interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer
> that would have got removed due to the above commit will not get
> enqueued back on for any pending timers that there might be. Besides
> this, *jiffies never gets updated*.

Jiffies can be updated by any CPU and there is something called a control
cpu with powerclamp driver. BUT we may have got interrupted before the
powerclamp timer expired and so we are stuck in the

while (time_before(jiffies, target_jiffies))

loop for ever.

> Hope the above explanation makes sense.

Mostly good. Thanks for helping out.

Now, what's the right solution going forward ?

- Revert the offending commit ..
- Or still try to avoid reprogramming if we can ..

This is what I could come up with to still avoid reprogramming of tick:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cc0a5b6f741b..49f4278f69e2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1100,7 +1100,7 @@ static enum hrtimer_restart
tick_sched_timer(struct hrtimer *timer)
tick_sched_handle(ts, regs);

/* No need to reprogram if we are in idle or full dynticks mode */
- if (unlikely(ts->tick_stopped))
+ if (unlikely(ts->tick_stopped && (is_idle_task(current) ||
!ts->inidle)))
return HRTIMER_NORESTART;

hrtimer_forward(timer, now, tick_period);

Above change checks why we have stopped tick..
- The cpu has gone idle (really): is_idle_task(current)
- The cpu isn't in idle mode, i.e. its in nohz-full mode: !ts->inidle

This fixed the issues with powerclamp in my case.

@Fengguang: Can you please check if this fixes it for you as well?

@Thomas: Please let me know if you want me to send this fix
or you want to revert the original commit itself.

Thanks.

--
Viresh

2014-12-15 09:43:56

by Preeti U Murthy

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On 12/15/2014 03:02 PM, Viresh Kumar wrote:
> On 15 December 2014 at 12:55, Preeti U Murthy <[email protected]> wrote:
>> Hi Viresh,
>>
>> Let me explain why I think this is happening.
>>
>> 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle*
>> and receives an interrupt.
>
> Bang on target. Yeah that's the part we missed while writing this patch :)
>
>> 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer
>> in its handler, assuming that tick_nohz_irq_exit() will take care of
>> programming the clock event device appropriately, and hence it would
>> requeue or cancel the tick_sched timer.
>
> Correct.
>
>> 3. But the intel_powerclamp driver injects an idle period only.
>> *The CPU however is not idle*. It has work on its runqueue and the
>> rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not
>> get called on any interrupt*.
>
> Still good..
>
>> 4. As a consequence, when we get a hrtimer interrupt during the period
>> that the powerclamp driver is mimicking idle, the exit path of the
>> interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer
>> that would have got removed due to the above commit will not get
>> enqueued back on for any pending timers that there might be. Besides
>> this, *jiffies never gets updated*.
>
> Jiffies can be updated by any CPU and there is something called a control
> cpu with powerclamp driver. BUT we may have got interrupted before the
> powerclamp timer expired and so we are stuck in the
>
> while (time_before(jiffies, target_jiffies))
>
> loop for ever.
>
>> Hope the above explanation makes sense.
>
> Mostly good. Thanks for helping out.
>
> Now, what's the right solution going forward ?
>
> - Revert the offending commit ..
> - Or still try to avoid reprogramming if we can ..
>
> This is what I could come up with to still avoid reprogramming of tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index cc0a5b6f741b..49f4278f69e2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1100,7 +1100,7 @@ static enum hrtimer_restart
> tick_sched_timer(struct hrtimer *timer)
> tick_sched_handle(ts, regs);
>
> /* No need to reprogram if we are in idle or full dynticks mode */
> - if (unlikely(ts->tick_stopped))
> + if (unlikely(ts->tick_stopped && (is_idle_task(current) ||
> !ts->inidle)))
> return HRTIMER_NORESTART;
>
> hrtimer_forward(timer, now, tick_period);
>
>

Looks good to me. You can add my Reviewed-by to the above patch.

>
> Above change checks why we have stopped tick..
> - The cpu has gone idle (really): is_idle_task(current)
> - The cpu isn't in idle mode, i.e. its in nohz-full mode: !ts->inidle
>
> This fixed the issues with powerclamp in my case.
>
> @Fengguang: Can you please check if this fixes it for you as well?
>
> @Thomas: Please let me know if you want me to send this fix
> or you want to revert the original commit itself.

Regards
Preeti U Murthy
>
> Thanks.
>
> --
> Viresh
>

2014-12-15 21:25:07

by Jacob Pan

[permalink] [raw]

Subject: RE: [nohz] 2a16fc93d2c: kernel lockup on idle injection

-----Original Message-----
From: Preeti U Murthy [mailto:[email protected]]
Sent: Monday, December 15, 2014 1:44 AM
To: Viresh Kumar; Thomas Gleixner; Wu, Fengguang
Cc: Frederic Weisbecker; Pan, Jacob jun; LKML; LKP
Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On 12/15/2014 03:02 PM, Viresh Kumar wrote:
> On 15 December 2014 at 12:55, Preeti U Murthy <[email protected]> wrote:
>> Hi Viresh,
>>
>> Let me explain why I think this is happening.
>>
>> 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is
>> idle* and receives an interrupt.
>
> Bang on target. Yeah that's the part we missed while writing this
> patch :)
>
>> 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer
>> in its handler, assuming that tick_nohz_irq_exit() will take care of
>> programming the clock event device appropriately, and hence it would
>> requeue or cancel the tick_sched timer.
>
> Correct.
>
>> 3. But the intel_powerclamp driver injects an idle period only.
>> *The CPU however is not idle*. It has work on its runqueue and the
>> rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will
>> rq->not
>> get called on any interrupt*.
>
> Still good..
>
>> 4. As a consequence, when we get a hrtimer interrupt during the
>> period that the powerclamp driver is mimicking idle, the exit path of
>> the interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched
>> timer that would have got removed due to the above commit will not
>> get enqueued back on for any pending timers that there might be.
>> Besides this, *jiffies never gets updated*.
>
> Jiffies can be updated by any CPU and there is something called a
> control cpu with powerclamp driver. BUT we may have got interrupted
> before the powerclamp timer expired and so we are stuck in the
>
> while (time_before(jiffies, target_jiffies))
>
> loop for ever.
>
>> Hope the above explanation makes sense.
>
> Mostly good. Thanks for helping out.
>
> Now, what's the right solution going forward ?
>
> - Revert the offending commit ..
> - Or still try to avoid reprogramming if we can ..
>
> This is what I could come up with to still avoid reprogramming of tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index
> cc0a5b6f741b..49f4278f69e2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1100,7 +1100,7 @@ static enum hrtimer_restart
> tick_sched_timer(struct hrtimer *timer)
> tick_sched_handle(ts, regs);
>
> /* No need to reprogram if we are in idle or full dynticks mode */
> - if (unlikely(ts->tick_stopped))
> + if (unlikely(ts->tick_stopped && (is_idle_task(current) ||
> !ts->inidle)))
> return HRTIMER_NORESTART;
>
> hrtimer_forward(timer, now, tick_period);
>
>

Looks good to me. You can add my Reviewed-by to the above patch.

>
> Above change checks why we have stopped tick..
> - The cpu has gone idle (really): is_idle_task(current)
> - The cpu isn't in idle mode, i.e. its in nohz-full mode: !ts->inidle
>
> This fixed the issues with powerclamp in my case.
>
> @Fengguang: Can you please check if this fixes it for you as well?
>
I have tested this fix and confirm powerclamp is working properly now.

However, we also have a planned patch for consolidated idle loop. With this patch it causes some erratic behavior in idle injection.
I can’t seem to synchronize/align idle time around jiffies with this patch + fix.

Any suggestions welcome.
https://lkml.org/lkml/2014/6/4/56

+peter

> @Thomas: Please let me know if you want me to send this fix or you
> want to revert the original commit itself.

Regards
Preeti U Murthy
>
> Thanks.
>
> --
> Viresh
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2014-12-15 23:44:51

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Mon, Dec 15, 2014 at 03:02:17PM +0530, Viresh Kumar wrote:
> On 15 December 2014 at 12:55, Preeti U Murthy <[email protected]> wrote:
> > Hi Viresh,
> >
> > Let me explain why I think this is happening.
> >
> > 1. tick_nohz_irq_enter/exit() both get called *only if the cpu is idle*
> > and receives an interrupt.
>
> Bang on target. Yeah that's the part we missed while writing this patch :)
>
> > 2. Commit 2a16fc93d2c9568e1, cancels programming of tick_sched timer
> > in its handler, assuming that tick_nohz_irq_exit() will take care of
> > programming the clock event device appropriately, and hence it would
> > requeue or cancel the tick_sched timer.
>
> Correct.
>
> > 3. But the intel_powerclamp driver injects an idle period only.
> > *The CPU however is not idle*. It has work on its runqueue and the
> > rq->curr != idle. This means that *tick_nohz_irq_enter()/exit() will not
> > get called on any interrupt*.
>
> Still good..
>
> > 4. As a consequence, when we get a hrtimer interrupt during the period
> > that the powerclamp driver is mimicking idle, the exit path of the
> > interrupt never calls tick_nohz_irq_exit(). Hence the tick_sched timer
> > that would have got removed due to the above commit will not get
> > enqueued back on for any pending timers that there might be. Besides
> > this, *jiffies never gets updated*.
>
> Jiffies can be updated by any CPU and there is something called a control
> cpu with powerclamp driver. BUT we may have got interrupted before the
> powerclamp timer expired and so we are stuck in the
>
> while (time_before(jiffies, target_jiffies))
>
> loop for ever.
>
> > Hope the above explanation makes sense.
>
> Mostly good. Thanks for helping out.
>
> Now, what's the right solution going forward ?
>
> - Revert the offending commit ..
> - Or still try to avoid reprogramming if we can ..
>
> This is what I could come up with to still avoid reprogramming of tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index cc0a5b6f741b..49f4278f69e2 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1100,7 +1100,7 @@ static enum hrtimer_restart
> tick_sched_timer(struct hrtimer *timer)
> tick_sched_handle(ts, regs);
>
> /* No need to reprogram if we are in idle or full dynticks mode */
> - if (unlikely(ts->tick_stopped))
> + if (unlikely(ts->tick_stopped && (is_idle_task(current) ||
> !ts->inidle)))
> return HRTIMER_NORESTART;
>
> hrtimer_forward(timer, now, tick_period);

So to summarize: I see it enqueues a timer then it loops on that timer expiration.
On that loop we stop the CPU and we expect the timer to fire and wake the thread up.
But if the delayed tick fires too early, before the timer actually
expires, then we go to sleep again because we haven't yet reached the delay,
but since tick_nohz_irq_exit() is only called on idle tasks and not for threads
like powerclamp, the tick isn't rescheduled to handle the remaining timer slice
so we sleep forever.

Hence if we really want to stop the tick when we mimic idle from powerclamp driver,
we must call tick_nohz_irq_exit() on irq exit to do it correctly.

It happened to work by accident before the commit because we were rescheduling the
tick from itself without tick_nohz_irq_exit() to cancel anything. And that restored
the periodic behaviour necessary to complete the delay.

So the above change is rather a hack than a solution.

We have several choices:

1) Revert the commit. But this has to be a temporary solution really. Powerclamp has
to be fixed and handle tick_nohz_irq_exit().

2) Remove powerclamp tick stop until somebody fixes it to handle nohz properly.

2) Fix it directly. But I believe there is a release that is going to miss the fix
and suffer the regression. Does the regression matter for anybody? Is powerclamp
meant for anything else than testing (I have no idea what it's used for)?

So to fix powerclamp to handle nohz correctly, tick_nohz_irq_exit() must be called
for both idle tasks and powerclamp kthreads. Checking ts->inidle can be a good way to match
both. That means we might need to use a reduced part of idle_cpu() to avoid redundant checks.
tick_irq_enter() must be called as well for powerclamp, in case it's the only CPU running, it
has to fixup the timekeeping alone.

2014-12-16 04:18:44

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On 16 December 2014 at 02:54, Pan, Jacob jun <[email protected]> wrote:

> Looks good to me. You can add my Reviewed-by to the above patch.

Thanks.

> I have tested this fix and confirm powerclamp is working properly now.

Oh, nice.

> However, we also have a planned patch for consolidated idle loop. With this patch it causes some erratic behavior in idle injection.
> I can’t seem to synchronize/align idle time around jiffies with this patch + fix.
>
> Any suggestions welcome.
> https://lkml.org/lkml/2014/6/4/56

And all works fine without this patch ?
2a16fc93d2c9 ("nohz: Avoid tick's double reprogramming in highres mode")

I really don't know what stuff out of the two patches I posted (The above one
and the fix I posted yesterday), will possible make the synchronization bad ..

--
viresh

2014-12-16 04:53:16

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

+ Peter from Jacob's mail ..

On 16 December 2014 at 05:14, Frederic Weisbecker <[email protected]> wrote:
> So to summarize: I see it enqueues a timer then it loops on that timer expiration.
> On that loop we stop the CPU and we expect the timer to fire and wake the thread up.
> But if the delayed tick fires too early, before the timer actually
> expires, then we go to sleep again because we haven't yet reached the delay,
> but since tick_nohz_irq_exit() is only called on idle tasks and not for threads
> like powerclamp, the tick isn't rescheduled to handle the remaining timer slice
> so we sleep forever.

Perfect !!

> Hence if we really want to stop the tick when we mimic idle from powerclamp driver,
> we must call tick_nohz_irq_exit() on irq exit to do it correctly.
>
> It happened to work by accident before the commit because we were rescheduling the
> tick from itself without tick_nohz_irq_exit() to cancel anything. And that restored
> the periodic behaviour necessary to complete the delay.
>
> So the above change is rather a hack than a solution.
>
> We have several choices:
>
> 1) Revert the commit. But this has to be a temporary solution really. Powerclamp has
> to be fixed and handle tick_nohz_irq_exit().
>
> 2) Remove powerclamp tick stop until somebody fixes it to handle nohz properly.
>
> 2) Fix it directly. But I believe there is a release that is going to miss the fix
> and suffer the regression. Does the regression matter for anybody? Is powerclamp
> meant for anything else than testing (I have no idea what it's used for)?
>
> So to fix powerclamp to handle nohz correctly, tick_nohz_irq_exit() must be called
> for both idle tasks and powerclamp kthreads. Checking ts->inidle can be a good way to match
> both. That means we might need to use a reduced part of idle_cpu() to avoid redundant checks.
> tick_irq_enter() must be called as well for powerclamp, in case it's the only CPU running, it
> has to fixup the timekeeping alone.

Yeah, you can call my fix a Hacky one. I agree..

But I don't know if calling tick_nohz_irq_exit() from these threads wouldn't
be hacky as well. And ofcourse my knowledge wouldn't be adequate here to
judge that :)

It looked a bit odd to me. Why should we call irq-exit from the threads working
with idle? And that's not what we do even for the real-idle loop as well ..

Also from Jacob's referenced patch, we would be moving to a consolidated
idle-loop for both real idle and threads like powerclamp, and this part may be
tricky then..

Untested, but what about something like this?

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 5918d227730f..5e4bfc367735 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -321,7 +321,7 @@ asmlinkage __visible void do_softirq(void)
void irq_enter(void)
{
rcu_irq_enter();
- if (is_idle_task(current) && !in_interrupt()) {
+ if (tick_idle_active() && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
* here, as softirq will be serviced on return from interrupt.
@@ -363,7 +363,7 @@ static inline void tick_irq_exit(void)
int cpu = smp_processor_id();

/* Make sure that timer wheel updates are propagated */
- if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
+ if ((tick_idle_active() && !need_resched()) ||
tick_nohz_full_cpu(cpu)) {
if (!in_interrupt())
tick_nohz_irq_exit();
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 07c2bad0afce..e52b76037c0a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -46,6 +46,13 @@ struct tick_sched *tick_get_tick_sched(int cpu)
return &per_cpu(tick_cpu_sched, cpu);
}

+bool tick_idle_active(void)
+{
+ struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
+
+ return ts->idle_active;
+}
+
/*
* Must be called with interrupts disabled !
*/

I am not sure of the purpose of the idle-checks in the irq-paths.
All I understood from git logs is, these are to check if we
came out of dynticks mode and need to start/stop ticks again..

Then why not handle these idle mimicking threads as well?
We just need to check ts->idle_active which will take care of both
the cases, real idle and these threads..

Sorry, if this will break things further :(

--
viresh

2014-12-16 09:36:45

by Preeti U Murthy

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On 12/16/2014 10:23 AM, Viresh Kumar wrote:
> + Peter from Jacob's mail ..
>
> On 16 December 2014 at 05:14, Frederic Weisbecker <[email protected]> wrote:
>> So to summarize: I see it enqueues a timer then it loops on that timer expiration.
>> On that loop we stop the CPU and we expect the timer to fire and wake the thread up.
>> But if the delayed tick fires too early, before the timer actually
>> expires, then we go to sleep again because we haven't yet reached the delay,
>> but since tick_nohz_irq_exit() is only called on idle tasks and not for threads
>> like powerclamp, the tick isn't rescheduled to handle the remaining timer slice
>> so we sleep forever.
>
> Perfect !!
>
>> Hence if we really want to stop the tick when we mimic idle from powerclamp driver,
>> we must call tick_nohz_irq_exit() on irq exit to do it correctly.
>>
>> It happened to work by accident before the commit because we were rescheduling the
>> tick from itself without tick_nohz_irq_exit() to cancel anything. And that restored
>> the periodic behaviour necessary to complete the delay.
>>
>> So the above change is rather a hack than a solution.
>>
>> We have several choices:
>>
>> 1) Revert the commit. But this has to be a temporary solution really. Powerclamp has
>> to be fixed and handle tick_nohz_irq_exit().
>>
>> 2) Remove powerclamp tick stop until somebody fixes it to handle nohz properly.
>>
>> 2) Fix it directly. But I believe there is a release that is going to miss the fix
>> and suffer the regression. Does the regression matter for anybody? Is powerclamp
>> meant for anything else than testing (I have no idea what it's used for)?
>>
>> So to fix powerclamp to handle nohz correctly, tick_nohz_irq_exit() must be called
>> for both idle tasks and powerclamp kthreads. Checking ts->inidle can be a good way to match

As far as I can see, the primary purpose of tick_nohz_irq_enter()/exit()
paths was to take care of *tick stopped* cases.

Before handling interrupts we would want jiffies to be updated, which is
done in tick_nohz_irq_enter(). And after handling interrupts we need to
verify if any timers/RCU callbacks were queued during an interrupt.
Both of these will not be handled properly
*only when the tick is stopped* right?

If so, the checks which permit entry into these functions should at a
minimum include a check on ts->tick_stopped(). The rest of the checks
around if the CPU is idle/need_resched() may be needed in conjunction,
but will not be complete without checking if ticks are stopped. I see
that tick_nohz_irq_enter() has a check on ts->tick_stopped, which is
correct, but tick_noz_irq_exit() does not.

Adding this check to tick_nohz_irq_exit() will not only solve this
regression but is probably a fix to a long standing bug in the
conditional check in tick_nohz_irq_exit().

>> both. That means we might need to use a reduced part of idle_cpu() to avoid redundant checks.
>> tick_irq_enter() must be called as well for powerclamp, in case it's the only CPU running, it
>> has to fixup the timekeeping alone.
>
> Yeah, you can call my fix a Hacky one. I agree..
>
> But I don't know if calling tick_nohz_irq_exit() from these threads wouldn't
> be hacky as well. And ofcourse my knowledge wouldn't be adequate here to
> judge that :)
>
> It looked a bit odd to me. Why should we call irq-exit from the threads working
> with idle? And that's not what we do even for the real-idle loop as well ..
>
> Also from Jacob's referenced patch, we would be moving to a consolidated
> idle-loop for both real idle and threads like powerclamp, and this part may be
> tricky then..
>
> Untested, but what about something like this?
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 5918d227730f..5e4bfc367735 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -321,7 +321,7 @@ asmlinkage __visible void do_softirq(void)
> void irq_enter(void)
> {
> rcu_irq_enter();
> - if (is_idle_task(current) && !in_interrupt()) {
> + if (tick_idle_active() && !in_interrupt()) {
> /*
> * Prevent raise_softirq from needlessly waking up ksoftirqd
> * here, as softirq will be serviced on return from interrupt.
> @@ -363,7 +363,7 @@ static inline void tick_irq_exit(void)
> int cpu = smp_processor_id();
>
> /* Make sure that timer wheel updates are propagated */
> - if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
> + if ((tick_idle_active() && !need_resched()) ||

For reasons mentioned above, this check alone will not do. It may solve
this regression,but the check is incomplete.

IMO it should simply be :
if (tick_nohz_tick_stopped() || tick_nohz_full_cpu())

Regards
Preeti U Murthy

2014-12-16 12:49:18

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Preeti U Murthy wrote:
> As far as I can see, the primary purpose of tick_nohz_irq_enter()/exit()
> paths was to take care of *tick stopped* cases.
>
> Before handling interrupts we would want jiffies to be updated, which is
> done in tick_nohz_irq_enter(). And after handling interrupts we need to
> verify if any timers/RCU callbacks were queued during an interrupt.
> Both of these will not be handled properly
> *only when the tick is stopped* right?
>
> If so, the checks which permit entry into these functions should at a
> minimum include a check on ts->tick_stopped(). The rest of the checks
> around if the CPU is idle/need_resched() may be needed in conjunction,
> but will not be complete without checking if ticks are stopped. I see
> that tick_nohz_irq_enter() has a check on ts->tick_stopped, which is
> correct, but tick_noz_irq_exit() does not.
>
> Adding this check to tick_nohz_irq_exit() will not only solve this
> regression but is probably a fix to a long standing bug in the
> conditional check in tick_nohz_irq_exit().

This is a complete mess caused by the full nohz hackery and you're
just making it worse. Lets ignore the powerclamp crap for now.

The initial nohz idle logic was simple:

irq_enter
if (ts->tick_stopped)
update_timekeeping()

irq_exit
if (ts->inidle)
tick_nohz_update_sched_tick();

And that was completely correct. On irq_enter the only relevant
information is whether the tick is stopped or not. On irq_exit the
tick handling is only interesting if we are inidle.

If we're not inidle then the tick is not stopped either. If we are
inidle the tick might be stopped or not. So the interrupt can have
added/removed/modified timers which have consequences on the
tick_stopped state and possibly kick the machine out of idle
completely.

If the tick was not stopped, then the removal/modification of a timer
in the interrupt can cause the tick to be stoppable at irq_exit.

So making irq_exit:

if (!ts->tick_stopped)
return;

is fundamentally wrong as you miss the oportunity to stop the tick
after the changes done by the interrupt handler.

So the possible states are:

ts->inidle ts->tick_stopped
0 0 valid
0 1 BUG
1 0 valid
1 1 valid

And we are transitioning in and out of that via:

tick_nohz_idle_enter()
ts->inidle = true;
if (stop_possible)
stop_tick(ts)

tick_nohz_idle_exit()
ts->inidle = false;
if (ts->tick_stopped)
start_tick(ts)

irq_exit needs to take care of the following situations if we are
inidle:

if (ts->tick_stopped) {
if (stop_possible) {
/* handle timer changes (earlier/later) */
update_tick(ts);
} else {
kick_out_of_idle();
}
} else {
if (stop_possible)
stop_tick(ts)
}

Now with nohzfull the state space looks like this:

ts->inidle ts->infullnohz ts->tick_stopped
0 0 0 valid
0 0 1 BUG
1 0 0 valid
1 0 1 valid

0 1 0 valid
0 1 1 valid
1 1 0 BUG
1 1 1 BUG

You might have noticed that we have neither ts->infullnohz nor
functions which transitions in and out of that state.

And that's where the whole problem starts. The nohz full stuff is
trying to evaluate everything dynamically which is just insane.

So we want to have functions which do:

tick_nohz_full_enter()
ts->infullnohz = true;
if (stop_possible)
stop_tick(ts);

tick_nohz_full_exit()
ts->infullnohz = false;
if (ts->tick_stopped)
start_tick(ts);

Plus irq_exit would become:

irq_exit
if (ts->inidle)
tick_nohz_update_sched_tick();

else if (ts->infullnohz)
tick_nohz_full_update_sched_tick();

You need to keep track of the fact that the cpu entered fullnohz and
work from there. Whether the tick is stopped or not does not matter at
all because that is a seperate decision like in the nohz idle case.

Everything else is voodoo programming.

Now the powerclamp mess is a different story.

Calling tick_nohz_idle_enter()/exit() from outside the idle task is
just broken. Period.

Trying to work around that madness in the core code is just fiddling
with the symptoms and ignoring the root cause. And the root cause is
simply that powerclamp calls tick_nohz_idle_enter()/exit().

So there are two things to solve:

1) Remove the powerclamp stupidity

2) Fix the whole nohz full mess with proper state tracking.

If you folks insist on curing the symptoms and ignoring the root
causes I'm going to mark NOHZ_FULL broken and NOHZ depend on
POWERCLAMP=n.

The commit in question, does not really cause a regression, it merily
unearths the utter broken crap which existed before in both NOHZ FULL
and powerclamp and just ever worked by chance.

Thanks,

tglx

2014-12-16 14:20:39

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, Dec 16, 2014 at 01:49:03PM +0100, Thomas Gleixner wrote:
> And that's where the whole problem starts. The nohz full stuff is
> trying to evaluate everything dynamically which is just insane.
>
> So we want to have functions which do:
>
> tick_nohz_full_enter()
> ts->infullnohz = true;
> if (stop_possible)
> stop_tick(ts);
>
> tick_nohz_full_exit()
> ts->infullnohz = false;
> if (ts->tick_stopped)
> start_tick(ts);
>
> Plus irq_exit would become:
>
> irq_exit
> if (ts->inidle)
> tick_nohz_update_sched_tick();
>
> else if (ts->infullnohz)
> tick_nohz_full_update_sched_tick();

So I can do that indeed. But then it's going to break the jump label that's
off in 99.99% of the case.

Now I can wrap that into parallel functions:

irq_exit
if (tick_nohz_idle())
tick_nohz_update_sched_tick();

else if (tick_nohz_full())
tick_nohz_full_update_sched_tick();

Just to be sure I understand you well. By ts->infullnohz, you mean the fact
that a CPU _wants_ to be in full nohz, not whether it _can_ right? Whether
the CPU wants to be in full nohz is decided in boottime with nohz_full= parameter.
Whether it can is dynamically checked on top of scheduler, perf, posix cpu timers, etc...

But I agree with you on the design. First check which nohz flavour we are
under, second call the appropriate function. That's what it actually does but
it's badly buried in functions, I'm going to fix that.

>
> You need to keep track of the fact that the cpu entered fullnohz and
> work from there. Whether the tick is stopped or not does not matter at
> all because that is a seperate decision like in the nohz idle case.
>
> Everything else is voodoo programming.
>
> Now the powerclamp mess is a different story.
>
> Calling tick_nohz_idle_enter()/exit() from outside the idle task is
> just broken. Period.
>
> Trying to work around that madness in the core code is just fiddling
> with the symptoms and ignoring the root cause. And the root cause is
> simply that powerclamp calls tick_nohz_idle_enter()/exit().
>
> So there are two things to solve:
>
> 1) Remove the powerclamp stupidity
>
> 2) Fix the whole nohz full mess with proper state tracking.
>
> If you folks insist on curing the symptoms and ignoring the root
> causes I'm going to mark NOHZ_FULL broken and NOHZ depend on
> POWERCLAMP=n.
>
> The commit in question, does not really cause a regression, it merily
> unearths the utter broken crap which existed before in both NOHZ FULL
> and powerclamp and just ever worked by chance.

Agreed.

Thanks.

2014-12-16 14:32:46

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Thomas Gleixner wrote:
> Now the powerclamp mess is a different story.
>
> Calling tick_nohz_idle_enter()/exit() from outside the idle task is
> just broken. Period.
>
> Trying to work around that madness in the core code is just fiddling
> with the symptoms and ignoring the root cause. And the root cause is
> simply that powerclamp calls tick_nohz_idle_enter()/exit().

The only sane solution for that insanity is to rip out the thread code
from powerclamp and replace it with a simple per cpu timer.

timerfn()
{
too_hot = magic();
sched_fair_this_cpu_wreckaged(too_hot);
restart_clamp_timer();
}

And have the following in sched

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df2cdf77f899..d257b62ee533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4989,6 +4989,34 @@ preempt:
set_last_buddy(se);
}

+#ifdef CONFIG_WRECKAGE_SCHED_FAIR
+static struct static_key wreckage_key = STATIC_KEY_INIT;
+static DEFINE_PER_CPU(cpu_is_wreckaged, bool);
+
+void sched_fair_this_cpu_wreckaged(bool wreckaged)
+{
+ this_cpu_write(sched_block_fair, wreckaged);
+}
+
+void sched_fair_cpu_wreckage_control(bool on)
+{
+ if (on)
+ static_key_slow_inc(&wreckage_key);
+ else
+ static_key_slow_dec(&wreckage_key);
+}
+
+static inline bool class_fair_disabled(void)
+{
+ if (static_key_false(&wreckage_key))
+ return false;
+ return this_cpu_read(sched_block_fair);
+}
+
+#else
+static inline bool class_fair_disabled(void) { return false; }
+#endif
+
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev)
{
@@ -4997,6 +5025,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
struct task_struct *p;
int new_tasks;

+ if (class_fair_disabled())
+ goto idle;
again:
#ifdef CONFIG_FAIR_GROUP_SCHED
if (!cfs_rq->nr_running)

The static key is enabled once the powerclamp mess starts. So nobody
else than powerclamp users are affected by this and rightfully so.

Not pretty, but better than a gazillion workarounds all over the place
to make "pretending I'm idle" actually work. This is basically the
same mechanism as we have with the RT throttler, where a RT hog will
be put onto hold for some time. We just put all sched other tasks on
hold while still allowing RT tasks and everything else to work.

Thoughts?

Thanks,

tglx

2014-12-16 14:50:53

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Frederic Weisbecker wrote:
> On Tue, Dec 16, 2014 at 01:49:03PM +0100, Thomas Gleixner wrote:
> > And that's where the whole problem starts. The nohz full stuff is
> > trying to evaluate everything dynamically which is just insane.
> >
> > So we want to have functions which do:
> >
> > tick_nohz_full_enter()
> > ts->infullnohz = true;
> > if (stop_possible)
> > stop_tick(ts);
> >
> > tick_nohz_full_exit()
> > ts->infullnohz = false;
> > if (ts->tick_stopped)
> > start_tick(ts);
> >
> > Plus irq_exit would become:
> >
> > irq_exit
> > if (ts->inidle)
> > tick_nohz_update_sched_tick();
> >
> > else if (ts->infullnohz)
> > tick_nohz_full_update_sched_tick();
>
> So I can do that indeed. But then it's going to break the jump label that's
> off in 99.99% of the case.

Why so?

> Now I can wrap that into parallel functions:
>
> irq_exit
> if (tick_nohz_idle())
> tick_nohz_update_sched_tick();
>
> else if (tick_nohz_full())
> tick_nohz_full_update_sched_tick();
>
> Just to be sure I understand you well. By ts->infullnohz, you mean
> the fact that a CPU _wants_ to be in full nohz, not whether it _can_
> right? Whether the CPU wants to be in full nohz is decided in
> boottime with nohz_full= parameter.

NO. Whether I read it from ts->infullnohz or from the cpumask is the
same. What I care about is the state change.

In the NOHZ idle case we do not care whether NOHZ=y and nohz=on/off on
the kernel command line. All we care about is ts->inidle and we have
very clear points of changing that state. See below.

> Whether it can is dynamically checked on top of scheduler, perf,
> posix cpu timers, etc...

Look at it the same way as idle. We do not care whether the cpu marks
itself as idle at some random place, we care about the internal state
of NOHZ.

>From NOHZ point of view the relevant places are tick_nohz_idle_enter()
and tick_nohz_idle_exit(). We mark that with ts->inidle and we take
care of preparing on enter (evtl. stopping the tick) and cleaning up
on exit (evtl. restarting the tick). And the information that we are
inidle is important in the irq_exit() path. If the NOHZ facility is
NOT inidle then it has nothing to do. You CANNOT rely on asynchronous
state. That simply never works reliably and just leads to band aids
and random state checks all over the place.

Now in the nohz full case, we want proper functions which say:

this_cpu_enters_nohz_full_now()
and
this_cpu_exits_nohz_full_now()

So like we do in tick_nohz_idle_enter() and tick_nohz_idle_exit() we
have a clear state change in the nohz code and not something which is
randomly deduced from async state all over the place.

So that only tells the NOHZ core that the rest of the system decided
that the cpu can actually go into full nohz mode. Whether the tick
code disables the tick or not is decided by the tick code.

It's not different from idle.

Thanks,

tglx

2014-12-16 14:57:12

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, Dec 16, 2014 at 03:32:28PM +0100, Thomas Gleixner wrote:
> @@ -4997,6 +5025,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> struct task_struct *p;
> int new_tasks;
>
> + if (class_fair_disabled())
> + goto idle;

We don't want to do new idle balancing here I think, just return NULL.

> again:
> #ifdef CONFIG_FAIR_GROUP_SCHED
> if (!cfs_rq->nr_running)
>
> The static key is enabled once the powerclamp mess starts. So nobody
> else than powerclamp users are affected by this and rightfully so.
>
> Not pretty, but better than a gazillion workarounds all over the place
> to make "pretending I'm idle" actually work. This is basically the
> same mechanism as we have with the RT throttler, where a RT hog will
> be put onto hold for some time. We just put all sched other tasks on
> hold while still allowing RT tasks and everything else to work.
>
> Thoughts?

Other than hating it on sight right? ;-)

So let me try and understand the problem with the emulated idle thing
better (running idle from FIFO threads).

Suppose we are in nohz_full:

ts->inidle ts->infullnohz ts->tick_stopped

0 1 1 valid

Then the powerclamp fake idle thread comes in, this increase nr_running
and will result in leaving infullnohz and will re-start the
tick_stopped.

0 0 0 valid

Then we 'start' the idle loop, and end up in:

1 0 1 valid

No problem there, right? And it looks to be the same in reverse.

I suppose the tricky bit is what happens when the cpu was idle; in that
case we'll end up with 1 running thread in state:

1 0 1 valid

But need to avoid ending up in:

1 1 1 BUG

Which should be relatively simple by never entering nohzfull when 'idle'.

However with your proposed thingy, I think we'll end up in:

1 1 1 BUG

Because we don't start another thread, so infullnohz will stay valid,
however we'll also be 'forced' into idle (with nr_running > 0) and stop
the tick.

A remote wakeup might result in nr_running going from 1->2 and seeing
infullnohz == 1, try and restart the tick, while we're idle!

Of course, we can fix that too, by clearing nohzfull when going 'idle',
after all, nohzfull will re-establish itself automagically when the tick
detects but the one task afterwards.

So both cases need work, neither works out of the box afaict. But I
can't see one really being better than the other either -- am I missing
obvious things again?

2014-12-16 16:55:19

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Peter Zijlstra wrote:
> On Tue, Dec 16, 2014 at 03:32:28PM +0100, Thomas Gleixner wrote:
> So let me try and understand the problem with the emulated idle thing
> better (running idle from FIFO threads).
>
> I suppose the tricky bit is what happens when the cpu was idle; in that
> case we'll end up with 1 running thread in state:
>
> 1 0 1 valid
>
> But need to avoid ending up in:
>
> 1 1 1 BUG
>
> Which should be relatively simple by never entering nohzfull when 'idle'.

Well, yes.

> However with your proposed thingy, I think we'll end up in:
>
> 1 1 1 BUG
>
> Because we don't start another thread, so infullnohz will stay valid,
> however we'll also be 'forced' into idle (with nr_running > 0) and stop
> the tick.

Indeed.

> A remote wakeup might result in nr_running going from 1->2 and seeing
> infullnohz == 1, try and restart the tick, while we're idle!
>
> Of course, we can fix that too, by clearing nohzfull when going 'idle',
> after all, nohzfull will re-establish itself automagically when the tick
> detects but the one task afterwards.
>
> So both cases need work, neither works out of the box afaict. But I

Agreed.

> can't see one really being better than the other either -- am I missing
> obvious things again?

No, it's both butt ugly, though what really bothers me is that we
pretend to be idle without being in the idle thread. That makes a lot
of stuff really weird.

Anything which relies on is_idle_task(current), idle_cpu() and a bunch
of other things just fail to work while we pretend being "the idle"
task.

That's just wrong. That powerclamp stuff needs to put the cpu
(including the scheduler, etc.) into a well defined state.

The fake idle task is everything else than well defined state, really.

Thanks,

tglx

2014-12-16 17:26:59

by Jacob Pan

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014 09:48:42 +0530
Viresh Kumar <[email protected]> wrote:

> On 16 December 2014 at 02:54, Pan, Jacob jun
> <[email protected]> wrote:
>
> > Looks good to me. You can add my Reviewed-by to the above patch.
>
> Thanks.
>
> > I have tested this fix and confirm powerclamp is working properly
> > now.
>
> Oh, nice.
>
> > However, we also have a planned patch for consolidated idle loop.
> > With this patch it causes some erratic behavior in idle injection.
> > I can’t seem to synchronize/align idle time around jiffies with
> > this patch + fix.
> >
> > Any suggestions welcome.
> > https://lkml.org/lkml/2014/6/4/56
>
> And all works fine without this patch ?
well, there are other things i need to improve. but this patch
definitely causes some new unwanted behavior.
> 2a16fc93d2c9 ("nohz: Avoid tick's double reprogramming in highres
> mode")
>
> I really don't know what stuff out of the two patches I posted (The
> above one and the fix I posted yesterday), will possible make the
> synchronization bad ..
>
But since your patch has real benefits and does not cause regression
with the current code, there is no reason to hold it back. I was just
hoping to get help on debugging this.

thanks,

Jacob
> --
> viresh

2014-12-16 21:15:51

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Jacob Pan wrote:
> On Tue, 16 Dec 2014 09:48:42 +0530
> Viresh Kumar <[email protected]> wrote:
> > I really don't know what stuff out of the two patches I posted (The
> > above one and the fix I posted yesterday), will possible make the
> > synchronization bad ..
> >
> But since your patch has real benefits and does not cause regression
> with the current code, there is no reason to hold it back. I was just
> hoping to get help on debugging this.

As I said elsewhere in this thread, this is not going anywhere. The
only sane decision is to revert 2a16fc93d2c9 for now.

Aside of that both NOHZ full and powerclamp need to be fixed
proper. Both of them are a complete disaster and just a steaming pile
of abuse and bandaids fixing the abuse. Read my other replies on that
before you ponder to argue.

The real solution for both things at the moment would be to make them
depend on BROKEN. I'm serious about that, really.

Your's grumpy

tglx

2014-12-16 21:21:41

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Thomas Gleixner wrote:
> On Tue, 16 Dec 2014, Frederic Weisbecker wrote:
> So like we do in tick_nohz_idle_enter() and tick_nohz_idle_exit() we
> have a clear state change in the nohz code and not something which is
> randomly deduced from async state all over the place.
>
> So that only tells the NOHZ core that the rest of the system decided
> that the cpu can actually go into full nohz mode. Whether the tick
> code disables the tick or not is decided by the tick code.
>
> It's not different from idle.

That made me look deeper into it. So we have a full zoo of functions
related to this nohz full stuff

extern void __tick_nohz_full_check(void);
extern void tick_nohz_full_kick(void);
extern void tick_nohz_full_kick_cpu(int cpu);
extern void tick_nohz_full_kick_all(void);
extern void __tick_nohz_task_switch(struct task_struct *tsk);

Plus a few inline functions

static inline bool tick_nohz_full_enabled(void)
{
if (!context_tracking_is_enabled())
return false;

return tick_nohz_full_running;
}

static inline void tick_nohz_full_check(void)
{
if (tick_nohz_full_enabled())
__tick_nohz_full_check();
}

static inline void tick_nohz_task_switch(struct task_struct *tsk)
{
if (tick_nohz_full_enabled())
__tick_nohz_task_switch(tsk);
}

context_tracking_is_enabled() is actually the static key you talked
about, right? Very intuitive - NOT!

Now lets look at the call site of tick_nohz_task_switch(). That's
invoked at the end of finish_task_switch(). Looking at one of the
worst case call chains here:

finish_task_switch()
tick_nohz_task_switch()
__tick_nohz_task_switch()
tick_nohz_tick_stopped()
can_stop_full_tick()
sched_can_stop_tick()
posix_cpu_timers_can_stop_tick()
perf_event_can_stop_tick()
tick_nohz_full_kick()
tick_nohz_full_kick()
irq_work_queue()
arch_irq_work_raise()

And then you take the self ipi and do:

irq_work_run()
can_stop_full_tick()
sched_can_stop_tick()
posix_cpu_timers_can_stop_tick()
perf_event_can_stop_tick()
tick_nohz_restart_sched_tick();

and right after that:

irq_exit()
tick_nohz_full_stop_tick()
can_stop_full_tick()
sched_can_stop_tick()
posix_cpu_timers_can_stop_tick()
perf_event_can_stop_tick()

What a trainwreck in the scheduler hotpath!

So you have three states which can change the nohzfull state:

1) rq->nr_running > 1
2) current->uses_posix_timer
3) perf needing the tick

So instead of evaluating the whole nonsense a gazillion times in a row
and firing pointless self ipis why are you not looking at the obvious
solution of sane state change tracking?

DEFINE_PER_CPU(nohz_full_must_tick, unsigned long);

enum {
NOHZ_SCHED_NEEDS_TICK,
NOHZ_POSIXTIMER_NEEDS_TICK,
NOHZ_PERF_NEEEDS_TICK,
};

/* rq->lock is held for evaluating rq->nr_running */
static void sched_ttwu_nohz(struct rq *rq)
{
if (nohz_full_disabled())
return;

if (rq->nr_running != 2)
return;
set_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
}

/* rq->lock is held for evaluating rq->nr_running */
static void sched_ttwu_remote_nohz(struct rq *rq)
{
if (nohz_full_disabled())
return;

if (rq->nr_running != 2)
return;
/*
* Force smp_send_reschedule(). irq_exit() on the
* remote cpu will handle the rest.
*/
set_bit(NOHZ_SCHED_NEEDS_TICK, per_cpu_ptr(nohz_full_must_tick, rq->cpu);
rq->force_send_resched = true;
}

/* rq->lock is held for evaluating rq->nr_running */
static void sched_out_nohz(struct rq *rq)
{
if (nohz_full_disabled())
return;

if (rq->nr_running >= 2)
return;
clear_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
}

/* preeemption is disabled */
static void sched_in_nohz(struct task_struct *task)
{
if (nohz_full_disabled())
return;

if (!task_uses_posix_timers(task))
clear_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
this_cpu_ptr(nohz_full_must_tick));
else
set_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
this_cpu_ptr(nohz_full_must_tick));

local_irq_disable();
tick_full_nohz_update_state();
local_irq_enable();
}

/* interrupts are disabled */
void tick_full_nohz_update_state(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

if (this_cpu_read(nohz_full_must_tick)) {
if (ts->infullnohz)
tick_full_nohz_exit();
} else {
if (!ts->infullnohz)
tick_full_nohz_enter();
else
__tick_full_nohz_enter();
}
}

So the irq_exit() path becomes:

/* interrupts are disabled */
void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

if (ts->inidle) {
__tick_nohz_idle_enter();
return;
}
if (nohz_full_disabled())
return;
tick_nohz_update_state();
}

No self IPI, no irq work, no triple evaluation of the same condition
chain and proper state tracking of infullnohz inside of the tick code.

Precicely TWO points where the NOHZ internal state gets updated:

switch_to() and irq_exit()

No conflict between infullnohz and inidle because it's all internal to
the NOHZ code.

Simple, isn't it?

Now for perf and posix cpu timers I'm sure you'll find equally simple
solutions which do not add extra update points. switch_to() and
irq_exit() are enough. It's really not rocket science.

While at it let me rant about some other things.

1) Remote accounting

I really cannot understand why all of you NOHZ enthusiasts just
avoid to make that work in the first place as it reduces the
problem space significantly.

Instead of that you folks come up with broken patches for
debugfs/sysfs/sysctl/... to forcefully disable the tick. Sure it's
harder to implement remote accounting proper than hacking up some
random disable mechanism, but the latter is just utter waste of
time and resources.

I told all of you from the very beginning that remote accounting
is a key mechanism for this, but you keep insisting on hacking
around it.

If done right, then the whole kick ALL cpus nonsense in posix cpu
timers can just go away. You can do the complete posix cpu timer
thing from the housekeeper cpu(s) once you have remote accounting
implemented.

2) Perf

I doubt that the tick_nohz_full_kick() in __perf_event_overflow()
is actually necessary when you implement the above scheme.

The nr_freq_events kick ALL cpus is horrible as well. This really
wants to be restricted on the target cpus and not wholesale global
just because it is conveniant.

Managerial summary:

1) Stop fiddling on symptoms. Tackle the root causes

2) Stop creating overengineered monstrosities which just create more
race conditions and corner cases than actual functionality

3) Fix that ASAP before I slap BROKEN on it or get annoyed enough to
rip it out completely.

Thanks,

tglx

2014-12-16 22:49:19

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, Dec 16, 2014 at 10:21:27PM +0100, Thomas Gleixner wrote:
> DEFINE_PER_CPU(nohz_full_must_tick, unsigned long);
>
> enum {
> NOHZ_SCHED_NEEDS_TICK,
> NOHZ_POSIXTIMER_NEEDS_TICK,
> NOHZ_PERF_NEEEDS_TICK,
> };
>
> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_ttwu_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running != 2)
> return;
> set_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
> }
>
> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_ttwu_remote_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running != 2)
> return;
> /*
> * Force smp_send_reschedule(). irq_exit() on the
> * remote cpu will handle the rest.
> */

smp_send_reschedule() is magic and does not guarantee irq_{enter,exit}()
being called, although we could audit and fix that.

> set_bit(NOHZ_SCHED_NEEDS_TICK, per_cpu_ptr(nohz_full_must_tick, rq->cpu);
> rq->force_send_resched = true;
> }

I'd make that force_send_resched the return value or so..

> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_out_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running >= 2)
> return;
> clear_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
> }
>
> /* preeemption is disabled */
> static void sched_in_nohz(struct task_struct *task)
> {
> if (nohz_full_disabled())
> return;
>
> if (!task_uses_posix_timers(task))
> clear_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> this_cpu_ptr(nohz_full_must_tick));
> else
> set_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> this_cpu_ptr(nohz_full_must_tick));
>

/me hands you a few spare {} :-)

Arguably test state before doing a possibly pointless update?

> local_irq_disable();
> tick_full_nohz_update_state();
> local_irq_enable();
> }

But yes, that should work just fine..

2014-12-16 22:55:11

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, 16 Dec 2014, Peter Zijlstra wrote:
> On Tue, Dec 16, 2014 at 10:21:27PM +0100, Thomas Gleixner wrote:
> > /* rq->lock is held for evaluating rq->nr_running */
> > static void sched_ttwu_remote_nohz(struct rq *rq)
> > {
> > if (nohz_full_disabled())
> > return;
> >
> > if (rq->nr_running != 2)
> > return;
> > /*
> > * Force smp_send_reschedule(). irq_exit() on the
> > * remote cpu will handle the rest.
> > */
>
> smp_send_reschedule() is magic and does not guarantee irq_{enter,exit}()
> being called, although we could audit and fix that.

I know. I did not want to do all the work for the nohz
folks. Otherwise I would have simply sent a patch. :)

> > if (!task_uses_posix_timers(task))
> > clear_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> > this_cpu_ptr(nohz_full_must_tick));
> > else
> > set_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> > this_cpu_ptr(nohz_full_must_tick));
> >
>
> /me hands you a few spare {} :-)

Without the proper instruction manual they are pretty useless.

> Arguably test state before doing a possibly pointless update?
>
> > local_irq_disable();
> > tick_full_nohz_update_state();
> > local_irq_enable();
> > }
>
> But yes, that should work just fine..

So I'm not the only one who thinks that this needs a proper
reimplementation :)

Thanks,

tglx

2014-12-17 00:12:33

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, Dec 16, 2014 at 10:21:27PM +0100, Thomas Gleixner wrote:
> Now lets look at the call site of tick_nohz_task_switch(). That's
> invoked at the end of finish_task_switch(). Looking at one of the
> worst case call chains here:
>
> finish_task_switch()
> tick_nohz_task_switch()
> __tick_nohz_task_switch()
> tick_nohz_tick_stopped()
> can_stop_full_tick()
> sched_can_stop_tick()
> posix_cpu_timers_can_stop_tick()
> perf_event_can_stop_tick()
> tick_nohz_full_kick()
> tick_nohz_full_kick()
> irq_work_queue()
> arch_irq_work_raise()
>
> And then you take the self ipi and do:
>
> irq_work_run()
> can_stop_full_tick()
> sched_can_stop_tick()
> posix_cpu_timers_can_stop_tick()
> perf_event_can_stop_tick()
> tick_nohz_restart_sched_tick();
>
> and right after that:
>
> irq_exit()
> tick_nohz_full_stop_tick()
> can_stop_full_tick()
> sched_can_stop_tick()
> posix_cpu_timers_can_stop_tick()
> perf_event_can_stop_tick()
>
> What a trainwreck in the scheduler hotpath!
>
> So you have three states which can change the nohzfull state:
>
> 1) rq->nr_running > 1
> 2) current->uses_posix_timer
> 3) perf needing the tick
>
> So instead of evaluating the whole nonsense a gazillion times in a row
> and firing pointless self ipis why are you not looking at the obvious
> solution of sane state change tracking?

Because I had to do it wrong first before somebody like you with a fresh mind
comes, look at the whole picture and propose a rethink :-)

>
> DEFINE_PER_CPU(nohz_full_must_tick, unsigned long);
>
> enum {
> NOHZ_SCHED_NEEDS_TICK,
> NOHZ_POSIXTIMER_NEEDS_TICK,
> NOHZ_PERF_NEEEDS_TICK,
> };
>
> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_ttwu_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running != 2)
> return;
> set_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
> }
>
> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_ttwu_remote_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running != 2)
> return;
> /*
> * Force smp_send_reschedule(). irq_exit() on the
> * remote cpu will handle the rest.
> */
> set_bit(NOHZ_SCHED_NEEDS_TICK, per_cpu_ptr(nohz_full_must_tick, rq->cpu);
> rq->force_send_resched = true;
> }
>
> /* rq->lock is held for evaluating rq->nr_running */
> static void sched_out_nohz(struct rq *rq)
> {
> if (nohz_full_disabled())
> return;
>
> if (rq->nr_running >= 2)
> return;
> clear_bit(NOHZ_SCHED_NEEDS_TICK, this_cpu_ptr(nohz_full_must_tick);
> }
>
> /* preeemption is disabled */
> static void sched_in_nohz(struct task_struct *task)
> {
> if (nohz_full_disabled())
> return;
>
> if (!task_uses_posix_timers(task))
> clear_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> this_cpu_ptr(nohz_full_must_tick));
> else
> set_bit(NOHZ_POSIXTIMER_NEEDS_TICK,
> this_cpu_ptr(nohz_full_must_tick));
>
> local_irq_disable();
> tick_full_nohz_update_state();
> local_irq_enable();
> }

Yeah, checking/toggling the nohz full pre-requirement synchronously
indeed looks like a very good idea!

>
> /* interrupts are disabled */
> void tick_full_nohz_update_state(void)
> {
> struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
> if (this_cpu_read(nohz_full_must_tick)) {
> if (ts->infullnohz)
> tick_full_nohz_exit();
> } else {
> if (!ts->infullnohz)
> tick_full_nohz_enter();
> else
> __tick_full_nohz_enter();

Ah and doing both stop and restart from irq exit is something I have indeed
planned on trying for a while. That indeed leaves the need for a specific IPI.
And that can even be done in a standalone change (with the rest of your proposal
incrementally).

Just lack of time, there are tons of things to do and I'm alone working on this.

> }
> }
>
> So the irq_exit() path becomes:
>
> /* interrupts are disabled */
> void tick_nohz_irq_exit(void)
> {
> struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
> if (ts->inidle) {
> __tick_nohz_idle_enter();
> return;
> }
> if (nohz_full_disabled())
> return;
> tick_nohz_update_state();
> }
>
> No self IPI, no irq work, no triple evaluation of the same condition
> chain and proper state tracking of infullnohz inside of the tick code.

Yeah it looks like we can piggyback the scheduler IPI that way since we
restart the tick from irq exit. That looks good.

Probably perf and posix cpu timer still need IPIs when armed, but they are
corner cases.

>
> Precicely TWO points where the NOHZ internal state gets updated:
>
> switch_to() and irq_exit()

Yes. Side note: we've been using self IPIs on switch_to() until now when tick
restart is needed (use of perf or posix cpu timer) because we hold the rq lock
and the task could be under any random locking scenario that could conflict with what's
used in the nohz restart path. Now probably it's fine to restart inline in post schedule
(no rq lock) as the nohz path doesn't use mutex or any sleeping lock that could conflict
with what holds a scheduling in task.

>
> No conflict between infullnohz and inidle because it's all internal to
> the NOHZ code.
>
> Simple, isn't it?

Definetly, I'll rework that.

>
> Now for perf and posix cpu timers I'm sure you'll find equally simple
> solutions which do not add extra update points. switch_to() and
> irq_exit() are enough. It's really not rocket science.
>
> While at it let me rant about some other things.
>
> 1) Remote accounting
>
> I really cannot understand why all of you NOHZ enthusiasts
> just avoid to make that work in the first place as it reduces the
> problem space significantly.
>
> Instead of that you folks come up with broken patches for
> debugfs/sysfs/sysctl/... to forcefully disable the tick. Sure it's
> harder to implement remote accounting proper than hacking up some
> random disable mechanism, but the latter is just utter waste of
> time and resources.
>
> I told all of you from the very beginning that remote accounting
> is a key mechanism for this, but you keep insisting on hacking
> around it.

I don't, and I think everybody has understood we are not going to accept
hacks to solve the 1hz issue.

>
> If done right, then the whole kick ALL cpus nonsense in posix cpu
> timers can just go away. You can do the complete posix cpu timer
> thing from the housekeeper cpu(s) once you have remote accounting
> implemented.

I guess by remote accounting you're referring to the scheduler_tick()
offloading to housekeeper. Well I totally agree that we should try this
solution, I have agreed with that direction since you proposed it for the
first time.

I just can't duplicate myself so somebody else has to get his hands dirty
(besides the change to make is neither difficult nor huge) or you can
wait for me to handle that task in one year (optimistic schedule), once
I'll be done with cputime cleanups, workqueue and timers affinity, context
tracking cleanups reported by Oleg, nohz reorganization just reported by you, etc...

Full dynticks is growing into a complicated and not so small anymore project.
More precisely it can't remain anymore as a one-man project, it could still
be one-man at this stage only if it was an out-of-tree project.

I'm pretty sure you know what this is about to maintain a project with a
(variably) cruel lack of manpower ;-)

>
> 2) Perf
>
> I doubt that the tick_nohz_full_kick() in __perf_event_overflow()
> is actually necessary when you implement the above scheme.
>
> The nr_freq_events kick ALL cpus is horrible as well. This really
> wants to be restricted on the target cpus and not wholesale global
> just because it is conveniant.
>
> Managerial summary:
>
> 1) Stop fiddling on symptoms. Tackle the root causes
>
> 2) Stop creating overengineered monstrosities which just create more
> race conditions and corner cases than actual functionality
>
> 3) Fix that ASAP before I slap BROKEN on it or get annoyed enough to
> rip it out completely.

May I remind you that I introduced and now maintain nohz full because you asked
it in the first place. If you wanted a clear and palatable design on the very first
go, you should have reviewed it before it got merged in the upstream subsystem you
maintain.

Moreover I'm not the kind of kernel developer who ignores reviews (whether pre or
post merge).

So, no point in threatening a dependency to BROKEN or ripping out except perhaps to
push me out to other projects with more thankful managers.

2014-12-17 00:26:40

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Tue, Dec 16, 2014 at 11:54:51PM +0100, Thomas Gleixner wrote:
> > But yes, that should work just fine..
>
> So I'm not the only one who thinks that this needs a proper
> reimplementation :)

Besides, if this works, we'll have way less IPIs, and people
usually like that :-)

2014-12-17 09:12:14

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Wed, 17 Dec 2014, Frederic Weisbecker wrote:
> On Tue, Dec 16, 2014 at 10:21:27PM +0100, Thomas Gleixner wrote:
> > So instead of evaluating the whole nonsense a gazillion times in a row
> > and firing pointless self ipis why are you not looking at the obvious
> > solution of sane state change tracking?
>
> Because I had to do it wrong first before somebody like you with a fresh mind
> comes, look at the whole picture and propose a rethink :-)

Point taken.

> > I told all of you from the very beginning that remote accounting
> > is a key mechanism for this, but you keep insisting on hacking
> > around it.
>
> I don't, and I think everybody has understood we are not going to accept
> hacks to solve the 1hz issue.

You wish.

> I'm pretty sure you know what this is about to maintain a project with a
> (variably) cruel lack of manpower ;-)

Oh yes! :(

> Moreover I'm not the kind of kernel developer who ignores reviews (whether pre or
> post merge).

I know that.

> So, no point in threatening a dependency to BROKEN or ripping out except perhaps to
> push me out to other projects with more thankful managers.

This was not directed to you, really.

This was directed at the folks who randomly "fix" that code by voodoo
bandaids instead of analyzing the root causes in the first
place. You're the least one to blame for that.

Thanks,

tglx

2014-12-17 12:31:58

by Preeti Murthy

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

Hi Thomas,

On Tue, Dec 16, 2014 at 6:19 PM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 16 Dec 2014, Preeti U Murthy wrote:
>> As far as I can see, the primary purpose of tick_nohz_irq_enter()/exit()
>> paths was to take care of *tick stopped* cases.
>>
>> Before handling interrupts we would want jiffies to be updated, which is
>> done in tick_nohz_irq_enter(). And after handling interrupts we need to
>> verify if any timers/RCU callbacks were queued during an interrupt.
>> Both of these will not be handled properly
>> *only when the tick is stopped* right?
>>
>> If so, the checks which permit entry into these functions should at a
>> minimum include a check on ts->tick_stopped(). The rest of the checks
>> around if the CPU is idle/need_resched() may be needed in conjunction,
>> but will not be complete without checking if ticks are stopped. I see
>> that tick_nohz_irq_enter() has a check on ts->tick_stopped, which is
>> correct, but tick_noz_irq_exit() does not.
>>
>> Adding this check to tick_nohz_irq_exit() will not only solve this
>> regression but is probably a fix to a long standing bug in the
>> conditional check in tick_nohz_irq_exit().
>
> This is a complete mess caused by the full nohz hackery and you're
> just making it worse. Lets ignore the powerclamp crap for now.
>
> The initial nohz idle logic was simple:
>
> irq_enter
> if (ts->tick_stopped)
> update_timekeeping()
>
> irq_exit
> if (ts->inidle)
> tick_nohz_update_sched_tick();
>
> And that was completely correct. On irq_enter the only relevant
> information is whether the tick is stopped or not. On irq_exit the
> tick handling is only interesting if we are inidle.
>
> If we're not inidle then the tick is not stopped either. If we are
> inidle the tick might be stopped or not. So the interrupt can have
> added/removed/modified timers which have consequences on the
> tick_stopped state and possibly kick the machine out of idle
> completely.
>
> If the tick was not stopped, then the removal/modification of a timer
> in the interrupt can cause the tick to be stoppable at irq_exit.
>
> So making irq_exit:
>
> if (!ts->tick_stopped)
> return;
>
> is fundamentally wrong as you miss the oportunity to stop the tick
> after the changes done by the interrupt handler.

Ok I see now. Thanks a lot for clarifying this.

>
> So the possible states are:
>
> ts->inidle ts->tick_stopped
> 0 0 valid
> 0 1 BUG
> 1 0 valid
> 1 1 valid
>
> And we are transitioning in and out of that via:
>
> tick_nohz_idle_enter()
> ts->inidle = true;
> if (stop_possible)
> stop_tick(ts)
>
> tick_nohz_idle_exit()
> ts->inidle = false;
> if (ts->tick_stopped)
> start_tick(ts)
>
> irq_exit needs to take care of the following situations if we are
> inidle:
>
> if (ts->tick_stopped) {
> if (stop_possible) {
> /* handle timer changes (earlier/later) */
> update_tick(ts);
> } else {
> kick_out_of_idle();
> }
> } else {
> if (stop_possible)
> stop_tick(ts)
> }

Just for my understanding, Isn't the entire above logic minus the
update_tick() contained in __tick_nohz_idle_enter() ? And why
do we need to update ticks during irq_exit path? We do it during
the irq_enter() path if ticks are stopped.

>
> Now with nohzfull the state space looks like this:
>
> ts->inidle ts->infullnohz ts->tick_stopped
> 0 0 0 valid
> 0 0 1 BUG
> 1 0 0 valid
> 1 0 1 valid
>
> 0 1 0 valid
> 0 1 1 valid
> 1 1 0 BUG
> 1 1 1 BUG
>
> You might have noticed that we have neither ts->infullnohz nor
> functions which transitions in and out of that state.
>
> And that's where the whole problem starts. The nohz full stuff is
> trying to evaluate everything dynamically which is just insane.
>
> So we want to have functions which do:
>
> tick_nohz_full_enter()
> ts->infullnohz = true;
> if (stop_possible)
> stop_tick(ts);
>
> tick_nohz_full_exit()
> ts->infullnohz = false;
> if (ts->tick_stopped)
> start_tick(ts);
>
> Plus irq_exit would become:
>
> irq_exit
> if (ts->inidle)
> tick_nohz_update_sched_tick();
>
> else if (ts->infullnohz)
> tick_nohz_full_update_sched_tick();

I agree, this would be very helpful.

>
> You need to keep track of the fact that the cpu entered fullnohz and
> work from there. Whether the tick is stopped or not does not matter at
> all because that is a seperate decision like in the nohz idle case.
>
> Everything else is voodoo programming.
>
> Now the powerclamp mess is a different story.
>
> Calling tick_nohz_idle_enter()/exit() from outside the idle task is
> just broken. Period.

I do agree that the powerclamp driver has brought in a new scenario.
But I do not find it to be a messy implementation of mimicking idle.
Here is why:

Powerclamp driver injects idle duration when there is a need to
stay within the power budget. This is a fair enough problem.
Now for the implementation of the idle injection. Currently
powerclamp schedules in an idle thread, when there is work in the rq.
This means although the thread is idle, the CPU is not. It also requires
the tick to be stopped, so that the idle period is not interrupted.

I looked at your patch for scheduling in an idle thread to solve this issue,
but I think it will break things further. Because although it is an idle thread,
it cannot belong to the idle sched class because the function of an idle task
is to loop in cpu_idle(), calling into the idle governors and more extra frills,
none of which are relevant to the powerclamp driver.
Don't you think that the one function that a thread should call into in the
above situation is tick_nohz_idle_enter() ?

Looking at the implementation of tick_nohz_idle_enter/exit() I see that
fundamentally they declare the tick on that cpu to be idle and verify
if they have to be stopped/restarted; exactly what powerclamp is asking for.
Nothing here that critically mandates that the cpu calling it has to have no
work on it.

I agree that tick_nohz_idle_enter()/exit() were meant to be used by the
idle task alone when they were written. But I do not see why they
cannot be used for the usecase that powerclamp brings to the table.
They have the potential, which is why we are able to use
the functions that they call into, for nohz_full scenarios as well.
tick_nohz_stop_sched_tick() for instance. This shows that they need
not merely be used just before cpus go idle.

For this reason, I suggest the below patch to fix this issue for the time
being. Like you and Frederic pointed we need to depend on ts->inidle now.
Basing the decision on the idleness of the cpu will no longer work.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0699add..a31864d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -361,9 +361,10 @@ static inline void tick_irq_exit(void)
{
#ifdef CONFIG_NO_HZ_COMMON
int cpu = smp_processor_id();
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

/* Make sure that timer wheel updates are propagated */
- if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
+ if (ts->inidle || tick_nohz_full_cpu(cpu)) {
if (!in_interrupt())
tick_nohz_irq_exit();
}

Regards
Preeti U Murthy

2014-12-17 12:47:41

[permalink] [raw]

Subject: Re: [nohz] 2a16fc93d2c: kernel lockup on idle injection

On Wed, Dec 17, 2014 at 10:11:58AM +0100, Thomas Gleixner wrote:
> On Wed, 17 Dec 2014, Frederic Weisbecker wrote:
> > On Tue, Dec 16, 2014 at 10:21:27PM +0100, Thomas Gleixner wrote:
> > > So instead of evaluating the whole nonsense a gazillion times in a row
> > > and firing pointless self ipis why are you not looking at the obvious
> > > solution of sane state change tracking?
> >
> > Because I had to do it wrong first before somebody like you with a fresh mind
> > comes, look at the whole picture and propose a rethink :-)
>
> Point taken.
>
> > > I told all of you from the very beginning that remote accounting
> > > is a key mechanism for this, but you keep insisting on hacking
> > > around it.
> >
> > I don't, and I think everybody has understood we are not going to accept
> > hacks to solve the 1hz issue.
>
> You wish.
>
> > I'm pretty sure you know what this is about to maintain a project with a
> > (variably) cruel lack of manpower ;-)
>
> Oh yes! :(
>
> > Moreover I'm not the kind of kernel developer who ignores reviews (whether pre or
> > post merge).
>
> I know that.
>
> > So, no point in threatening a dependency to BROKEN or ripping out except perhaps to
> > push me out to other projects with more thankful managers.
>
> This was not directed to you, really.
>
> This was directed at the folks who randomly "fix" that code by voodoo
> bandaids instead of analyzing the root causes in the first
> place. You're the least one to blame for that.

Well I was about to propose a bad fix for powerclamp and I sometimes take
overengineering directions. I'm still to blame for several things 8-)

Sorry I got too touchy :-)

Anyway, I'll give a try to your proposition. Somehow I have always felt that
something was wrong with this tick_nohz_can_stop_tick() called all over the place,
that plus the IPI that's often near the scheduler IPI.

I'm pretty sure this is going to help fixing some overhead issues I got reported.

Thanks a lot!

2014-12-17 15:43:04