commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.
The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.
Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable.
The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.
So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports.
Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <[email protected]>
---
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index b46c706e1cac..e98b4249187c 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
* allowed. thus jiffies are updated properly.
*/
preempt_disable();
- tick_nohz_idle_enter();
/* mwait until target jiffies is reached */
while (time_before(jiffies, target_jiffies)) {
unsigned long ecx = 1;
@@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
start_critical_timings();
atomic_inc(&idle_wakeup_counter);
}
- tick_nohz_idle_exit();
preempt_enable();
}
del_timer_sync(&wakeup_timer);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4d54b7540585..1363d58f07e9 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
/**
* tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
Hello Thomas,
On Thu, Dec 18, 2014 at 11:51:01AM +0100, Thomas Gleixner wrote:
> commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
> use" was merged via the thermal tree without an explicit ack from the
> relevant maintainers.
>
This is a shame. Rui, do you have any comments on this?
> The exports are abused by the intel powerclamp driver which implements
> a fake idle state from a sched FIFO task. This causes all kinds of
> wreckage in the NOHZ core code which rightfully assumes that
> tick_nohz_idle_enter/exit() are only called from the idle task itself.
>
OK.
> Recent changes in the NOHZ core lead to a failure of the powerclamp
> driver and now people try to hack completely broken and backwards
> workarounds into the NOHZ core code. This is completely unacceptable.
>
I see.
> The real solution is to fix the powerclamp driver by rewriting it with
> a sane concept, but that's beyond the scope of this.
>
Do you have suggestions on what exactly is the expected rewriting or the
correct sane concepts?
> So the only solution for now is to remove the calls into the core NOHZ
> code from the powerclamp trainwreck along with the exports.
>
> Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
If I got it right, the driver is currently broken due to changes in NOHZ
core. So, does this patch fix power clamp behavior ?
If I got your proposal right, in the end power clamp will be still
broken, but at least won't be abusing NOHZ. Is that what you are
proposing?
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index b46c706e1cac..e98b4249187c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
> * allowed. thus jiffies are updated properly.
> */
> preempt_disable();
> - tick_nohz_idle_enter();
> /* mwait until target jiffies is reached */
> while (time_before(jiffies, target_jiffies)) {
> unsigned long ecx = 1;
> @@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
> start_critical_timings();
> atomic_inc(&idle_wakeup_counter);
> }
> - tick_nohz_idle_exit();
> preempt_enable();
> }
> del_timer_sync(&wakeup_timer);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4d54b7540585..1363d58f07e9 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
>
> local_irq_enable();
> }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>
> /**
> * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
>
> local_irq_enable();
> }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
>
> static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
> {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Thu, 18 Dec 2014, Eduardo Valentin wrote:
> > The real solution is to fix the powerclamp driver by rewriting it with
> > a sane concept, but that's beyond the scope of this.
> >
>
> Do you have suggestions on what exactly is the expected rewriting or the
> correct sane concepts?
There was quite some discussion about this in this very thread.
> > So the only solution for now is to remove the calls into the core NOHZ
> > code from the powerclamp trainwreck along with the exports.
> >
> > Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
>
> If I got it right, the driver is currently broken due to changes in NOHZ
> core. So, does this patch fix power clamp behavior ?
The driver has been broken forever. It just worked by chance.
Now a very well justified and correct change in the core code exposed
that wreckage. So we have 2 choices:
1) Get rid of the abuse and let powerclamp deal with the problem.
2) Revert a correct patch for the sake of a 'works by chance' driver
or put hacky workarounds in the core. Either of that will just
paper over the real root cause until the next thing breaks in
subtle ways.
#1 is the only sane decision. We cannot deal with misdesigned driver
code in the NOHZ core.
> If I got your proposal right, in the end power clamp will be still
> broken, but at least won't be abusing NOHZ. Is that what you are
> proposing?
Yes, the design of powerclamp stays broken, but the NOHZ abuse is
gone. powerclamp will work, but it can't benefit from the possible
longer idle times anymore.
Thanks,
tglx
Hi Thomas,
On 12/18/2014 04:21 PM, Thomas Gleixner wrote:
> commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
> use" was merged via the thermal tree without an explicit ack from the
> relevant maintainers.
>
> The exports are abused by the intel powerclamp driver which implements
> a fake idle state from a sched FIFO task. This causes all kinds of
> wreckage in the NOHZ core code which rightfully assumes that
> tick_nohz_idle_enter/exit() are only called from the idle task itself.
>
> Recent changes in the NOHZ core lead to a failure of the powerclamp
> driver and now people try to hack completely broken and backwards
> workarounds into the NOHZ core code. This is completely unacceptable.
>
> The real solution is to fix the powerclamp driver by rewriting it with
> a sane concept, but that's beyond the scope of this.
>
> So the only solution for now is to remove the calls into the core NOHZ
> code from the powerclamp trainwreck along with the exports.
>
> Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index b46c706e1cac..e98b4249187c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
> * allowed. thus jiffies are updated properly.
> */
> preempt_disable();
> - tick_nohz_idle_enter();
> /* mwait until target jiffies is reached */
> while (time_before(jiffies, target_jiffies)) {
> unsigned long ecx = 1;
> @@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
> start_critical_timings();
> atomic_inc(&idle_wakeup_counter);
> }
> - tick_nohz_idle_exit();
> preempt_enable();
> }
> del_timer_sync(&wakeup_timer);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 4d54b7540585..1363d58f07e9 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
>
> local_irq_enable();
> }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>
> /**
> * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
>
> local_irq_enable();
> }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
>
> static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
> {
>
Ok the solution looks apt to me.
Let me see if I can come up with a sane solution for powerclamp based on
the suggestions that you gave in the previous thread. I was thinking of
the below steps towards its implementation. The idea is based on the
throttling mechanism that you had suggested.
1. Queue a deferable periodic timer whose handler checks if idle needs
to be injected. If so, it sets rq->need_throttle for the cpu. If its
already in the fake idle period, it clears rq->need_throttle and sets
need_resched.
2. pick_next_task_fair() checks rq->need_throttle and dequeues all tasks
in the rq if this is set and puts them on a throttled list. This
mechanism is similar to throttling cfs rq today. This function hence
fails to return a task, and if no task from any other sched class
exists, idle task is picked.
Peter thoughts?
3. So we are now in the idle injected period. The scheduler state is
sane because the cpu is idle, rq->nr_running = 0, rq->curr = rq->idle.
The nohz state is sane, because ts->inidle = 1 and tick_stopped may or
may not be 1 and they are set by an idle task.
4. When need_resched is set again, the idle task of course unsets inidle
and restarts tick. In the following scheduler tick,
pick_next_task_fair() sees that rq->need_throttle is cleared, enqueues
back the tasks and returns one of them to run.
Of course there may be several points that I have missed. But how does
the approach appear? If it looks sane enough, the cases which do not
obviously fall in place can be worked upon.
Regards
Preeti U Murthy
>
>
> -----Original Message-----
> From: Preeti U Murthy [mailto:[email protected]]
> Sent: Thursday, December 18, 2014 9:28 AM
> To: Thomas Gleixner; Preeti Murthy; Pan, Jacob jun; Peter Zijlstra
> Cc: Viresh Kumar; Frederic Weisbecker; Wu, Fengguang; Frederic
> Weisbecker; LKML; LKP; Zhang, Rui Subject: Re: [PATCH]
> tick/powerclamp: Remove tick_nohz_idle abuse
>
> Hi Thomas,
>
> On 12/18/2014 04:21 PM, Thomas Gleixner wrote:
> > commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
> > use" was merged via the thermal tree without an explicit ack from
> > the relevant maintainers.
> >
> > The exports are abused by the intel powerclamp driver which
> > implements a fake idle state from a sched FIFO task. This causes
> > all kinds of wreckage in the NOHZ core code which rightfully
> > assumes that tick_nohz_idle_enter/exit() are only called from the
> > idle task itself.
> >
> > Recent changes in the NOHZ core lead to a failure of the powerclamp
> > driver and now people try to hack completely broken and backwards
> > workarounds into the NOHZ core code. This is completely
> > unacceptable.
> >
> > The real solution is to fix the powerclamp driver by rewriting it
> > with a sane concept, but that's beyond the scope of this.
> >
> > So the only solution for now is to remove the calls into the core
> > NOHZ code from the powerclamp trainwreck along with the exports.
> >
> > Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > ---
> > diff --git a/drivers/thermal/intel_powerclamp.c
> > b/drivers/thermal/intel_powerclamp.c
> > index b46c706e1cac..e98b4249187c 100644
> > --- a/drivers/thermal/intel_powerclamp.c
> > +++ b/drivers/thermal/intel_powerclamp.c
> > @@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
> > * allowed. thus jiffies are updated properly.
> > */
> > preempt_disable();
> > - tick_nohz_idle_enter();
> > /* mwait until target jiffies is reached */
> > while (time_before(jiffies, target_jiffies)) {
> > unsigned long ecx = 1;
> > @@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
> > start_critical_timings();
> > atomic_inc(&idle_wakeup_counter);
> > }
> > - tick_nohz_idle_exit();
> > preempt_enable();
> > }
> > del_timer_sync(&wakeup_timer);
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 4d54b7540585..1363d58f07e9 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
> >
> > local_irq_enable();
> > }
> > -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
> >
> > /**
> > * tick_nohz_irq_exit - update next tick event from interrupt exit
> > @@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
> >
> > local_irq_enable();
> > }
> > -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
> >
> > static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t
> > now) {
> >
>
(switching to my linux email)
OK I agree, also as I mentioned earlier, Peter already has a patch for
consolidated idle loop and remove tick_nohz_idle_enter/exit call from
powerclamp driver. I have been working on a few tweaks to maintain the
functionality and efficiency with the consolidated idle loop.
We can apply the patches on top of yours.
> Ok the solution looks apt to me.
>
> Let me see if I can come up with a sane solution for powerclamp based
> on the suggestions that you gave in the previous thread. I was
> thinking of the below steps towards its implementation. The idea is
> based on the throttling mechanism that you had suggested.
>
> 1. Queue a deferable periodic timer whose handler checks if idle
> needs to be injected. If so, it sets rq->need_throttle for the cpu.
> If its already in the fake idle period, it clears rq->need_throttle
> and sets need_resched.
>
The key to powerclamp driver is to achieve package level idle
states, which implies synchronized idle injection. From
power/performance standpoint, only package level idle states is worth
injection.
IMHO, percpu the deferrable timer based solution makes it hard to
synchronize. And you have to be able to request deepest idle.
Some background on why we do this:
As the power consumption in package level idle goes lower and lower with
new processors, it became negligible compared to running states.
Therefore, powerclamp driver can give you near linear power-performance
throttling. Idle injection at per cpu core level may not be worthwhile
in most of todays' cpus.
Just some background on the use case, if you want to try powerclamp on
your ultrabook, you will be able compare the effectiveness in
controlling cpu thermal. You can use tmon tool in kernel source.
e.g.
$tools/thermal/tmon$ sudo ./tmon -z 1 -c intel_powerclamp
(choose -z thermal zone of your processor zone, pkg-temp or acpi tz)
> 2. pick_next_task_fair() checks rq->need_throttle and dequeues all
> tasks in the rq if this is set and puts them on a throttled list.
> This mechanism is similar to throttling cfs rq today. This function
> hence fails to return a task, and if no task from any other sched
> class exists, idle task is picked.
>
> Peter thoughts?
>
> 3. So we are now in the idle injected period. The scheduler state is
> sane because the cpu is idle, rq->nr_running = 0, rq->curr =
> rq->idle. The nohz state is sane, because ts->inidle = 1 and
> tick_stopped may or may not be 1 and they are set by an idle task.
>
> 4. When need_resched is set again, the idle task of course unsets
> inidle and restarts tick. In the following scheduler tick,
> pick_next_task_fair() sees that rq->need_throttle is cleared,
> enqueues back the tasks and returns one of them to run.
>
> Of course there may be several points that I have missed. But how
> does the approach appear? If it looks sane enough, the cases which do
> not obviously fall in place can be worked upon.
>
> Regards
> Preeti U Murthy
>
On Thu, 18 Dec 2014, Jacob Pan wrote:
> OK I agree, also as I mentioned earlier, Peter already has a patch for
> consolidated idle loop and remove tick_nohz_idle_enter/exit call from
> powerclamp driver. I have been working on a few tweaks to maintain the
> functionality and efficiency with the consolidated idle loop.
> We can apply the patches on top of yours.
No. This is equally wrong as I pointed out before. The 'unified' idle
loop is still fake and just pretending to be idle.
If simple standard interfaces like cpu_idle() are not working from
idle code anymore then this simply stinks. And that's what any fake
idle thread will do.
The whole approach is wrong. Implement a sched fair throttler and you
can avoid the whoile trainwreck.
> > 1. Queue a deferable periodic timer whose handler checks if idle
> > needs to be injected. If so, it sets rq->need_throttle for the cpu.
> > If its already in the fake idle period, it clears rq->need_throttle
> > and sets need_resched.
> >
> The key to powerclamp driver is to achieve package level idle
> states, which implies synchronized idle injection. From
> power/performance standpoint, only package level idle states is worth
> injection.
Then use a synchronized non deferrable timer on all cpus. It's simple
enough.
Thanks,
tglx
Commit-ID: a5fd9733a30d18d7ac23f17080e7e07bb3205b69
Gitweb: http://git.kernel.org/tip/a5fd9733a30d18d7ac23f17080e7e07bb3205b69
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 18 Dec 2014 11:51:01 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 19 Dec 2014 14:05:52 +0100
tick/powerclamp: Remove tick_nohz_idle abuse
commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.
The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.
Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable
and just papers over the real problem. There are way more subtle
issues lurking around the corner.
The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.
So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports.
Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Preeti U Murthy <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Pan Jacob jun <[email protected]>
Cc: LKP <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
Signed-off-by: Thomas Gleixner <[email protected]>
---
drivers/thermal/intel_powerclamp.c | 2 --
kernel/time/tick-sched.c | 2 --
2 files changed, 4 deletions(-)
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 95cb7fc..6cb7849 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -435,7 +435,6 @@ static int clamp_thread(void *arg)
* allowed. thus jiffies are updated properly.
*/
preempt_disable();
- tick_nohz_idle_enter();
/* mwait until target jiffies is reached */
while (time_before(jiffies, target_jiffies)) {
unsigned long ecx = 1;
@@ -451,7 +450,6 @@ static int clamp_thread(void *arg)
start_critical_timings();
atomic_inc(&idle_wakeup_counter);
}
- tick_nohz_idle_exit();
preempt_enable();
}
del_timer_sync(&wakeup_timer);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1f43560..ff3ec34 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -847,7 +847,6 @@ void tick_nohz_idle_enter(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
/**
* tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -974,7 +973,6 @@ void tick_nohz_idle_exit(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
On Thu, 18 Dec 2014 22:12:57 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> On Thu, 18 Dec 2014, Jacob Pan wrote:
> > OK I agree, also as I mentioned earlier, Peter already has a patch
> > for consolidated idle loop and remove tick_nohz_idle_enter/exit
> > call from powerclamp driver. I have been working on a few tweaks to
> > maintain the functionality and efficiency with the consolidated
> > idle loop. We can apply the patches on top of yours.
>
> No. This is equally wrong as I pointed out before. The 'unified' idle
> loop is still fake and just pretending to be idle.
>
In terms of efficiency, the consolidated idle loop will allow turning
off sched tick during idle injection period. If we just take out the
tick_nohz_idle_xxx call, the effectiveness of powerclamp is going down
significantly. I am not arguing the design but from fixing regression
perspective or short term solution.
> If simple standard interfaces like cpu_idle() are not working from
> idle code anymore then this simply stinks. And that's what any fake
> idle thread will do.
>
> The whole approach is wrong. Implement a sched fair throttler and you
> can avoid the whoile trainwreck.
>
I agree with you that fake idle always have dilemma. on one
side, idle injection threads are really busy injecting idle, so from
that standpoint it is not idle. But on the other side, we have to stop
tick to avoid being woken up all the time. So we can't just simply take
control of the cpu periodically and execute mwait.
I need to do some study here, thanks for the pointers,
Jacob
> > > 1. Queue a deferable periodic timer whose handler checks if idle
> > > needs to be injected. If so, it sets rq->need_throttle for the
> > > cpu. If its already in the fake idle period, it clears
> > > rq->need_throttle and sets need_resched.
> > >
> > The key to powerclamp driver is to achieve package level idle
> > states, which implies synchronized idle injection. From
> > power/performance standpoint, only package level idle states is
> > worth injection.
>
> Then use a synchronized non deferrable timer on all cpus. It's simple
> enough.
>
> Thanks,
>
> tglx
[Jacob Pan]
On Fri, 19 Dec 2014, Jacob Pan wrote:
> On Thu, 18 Dec 2014 22:12:57 +0100 (CET)
> Thomas Gleixner <[email protected]> wrote:
>
> > On Thu, 18 Dec 2014, Jacob Pan wrote:
> > > OK I agree, also as I mentioned earlier, Peter already has a patch
> > > for consolidated idle loop and remove tick_nohz_idle_enter/exit
> > > call from powerclamp driver. I have been working on a few tweaks to
> > > maintain the functionality and efficiency with the consolidated
> > > idle loop. We can apply the patches on top of yours.
> >
> > No. This is equally wrong as I pointed out before. The 'unified' idle
> > loop is still fake and just pretending to be idle.
> >
> In terms of efficiency, the consolidated idle loop will allow turning
> off sched tick during idle injection period. If we just take out the
> tick_nohz_idle_xxx call, the effectiveness of powerclamp is going down
> significantly. I am not arguing the design but from fixing regression
> perspective or short term solution.
There is no perspective. Period.
Its violates every rightful assumption of the nohz_IDLE_* code and
just ever worked by chance. There is so much subtle wreckage lurking
there that the only sane solution is to forbid it. End of story.
Thanks,
tglx
On 12/20/2014 01:26 AM, Thomas Gleixner wrote:
> On Fri, 19 Dec 2014, Jacob Pan wrote:
>
>> On Thu, 18 Dec 2014 22:12:57 +0100 (CET)
>> Thomas Gleixner <[email protected]> wrote:
>>
>>> On Thu, 18 Dec 2014, Jacob Pan wrote:
>>>> OK I agree, also as I mentioned earlier, Peter already has a patch
>>>> for consolidated idle loop and remove tick_nohz_idle_enter/exit
>>>> call from powerclamp driver. I have been working on a few tweaks to
>>>> maintain the functionality and efficiency with the consolidated
>>>> idle loop. We can apply the patches on top of yours.
>>>
>>> No. This is equally wrong as I pointed out before. The 'unified' idle
>>> loop is still fake and just pretending to be idle.
>>>
>> In terms of efficiency, the consolidated idle loop will allow turning
>> off sched tick during idle injection period. If we just take out the
>> tick_nohz_idle_xxx call, the effectiveness of powerclamp is going down
>> significantly. I am not arguing the design but from fixing regression
>> perspective or short term solution.
>
> There is no perspective. Period.
>
> Its violates every rightful assumption of the nohz_IDLE_* code and
> just ever worked by chance. There is so much subtle wreckage lurking
> there that the only sane solution is to forbid it. End of story.
>
> Thanks,
>
> tglx
>
Hi Jacob,
Like Thomas pointed out, we can design a sane solution for powerclamp.
Idle injection is nothing but throttling of runqueue. If the runqueue is
throttled, no fair tasks will be selected and the natural choice in the
absence of tasks from any other sched class is the idle task.
The idle loop will automatically be called and the nohz state will also
fall in place. The cpu is really idle now: the runqueue has no tasks and
the task running on the cpu is the idle thread. The throttled tasks are
on a separate list.
When the period of idle injection is over, we unthrottle the runqueue.
All this being taken care of my a non-deferrable timer. This design
ensures that the intention of powerclamp is not hampered while at the
same time maintaining a sane state for nohz; you will get the efficiency
you want.
Of course there may be corner cases and challenges around
synchronization of package idle, which I am sure we can work around with
a better design such as the above. I am working on that patchset and
will post out in a day. You can take a look and let us know the pieces
we are missing.
I find that implementing the above design is not too hard.
Regards
Preeti U Murthy
On Sat, 20 Dec 2014 07:01:12 +0530
Preeti U Murthy <[email protected]> wrote:
> On 12/20/2014 01:26 AM, Thomas Gleixner wrote:
> > On Fri, 19 Dec 2014, Jacob Pan wrote:
> >
> >> On Thu, 18 Dec 2014 22:12:57 +0100 (CET)
> >> Thomas Gleixner <[email protected]> wrote:
> >>
> >>> On Thu, 18 Dec 2014, Jacob Pan wrote:
> >>>> OK I agree, also as I mentioned earlier, Peter already has a
> >>>> patch for consolidated idle loop and remove
> >>>> tick_nohz_idle_enter/exit call from powerclamp driver. I have
> >>>> been working on a few tweaks to maintain the functionality and
> >>>> efficiency with the consolidated idle loop. We can apply the
> >>>> patches on top of yours.
> >>>
> >>> No. This is equally wrong as I pointed out before. The 'unified'
> >>> idle loop is still fake and just pretending to be idle.
> >>>
> >> In terms of efficiency, the consolidated idle loop will allow
> >> turning off sched tick during idle injection period. If we just
> >> take out the tick_nohz_idle_xxx call, the effectiveness of
> >> powerclamp is going down significantly. I am not arguing the
> >> design but from fixing regression perspective or short term
> >> solution.
> >
> > There is no perspective. Period.
> >
> > Its violates every rightful assumption of the nohz_IDLE_* code and
> > just ever worked by chance. There is so much subtle wreckage lurking
> > there that the only sane solution is to forbid it. End of story.
> >
> > Thanks,
> >
> > tglx
> >
> Hi Jacob,
>
> Like Thomas pointed out, we can design a sane solution for powerclamp.
> Idle injection is nothing but throttling of runqueue. If the runqueue
> is throttled, no fair tasks will be selected and the natural choice
> in the absence of tasks from any other sched class is the idle task.
>
> The idle loop will automatically be called and the nohz state will
> also fall in place. The cpu is really idle now: the runqueue has no
> tasks and the task running on the cpu is the idle thread. The
> throttled tasks are on a separate list.
>
> When the period of idle injection is over, we unthrottle the runqueue.
> All this being taken care of my a non-deferrable timer. This design
> ensures that the intention of powerclamp is not hampered while at the
> same time maintaining a sane state for nohz; you will get the
> efficiency you want.
>
> Of course there may be corner cases and challenges around
> synchronization of package idle, which I am sure we can work around
> with a better design such as the above. I am working on that patchset
> and will post out in a day. You can take a look and let us know the
> pieces we are missing.
>
> I find that implementing the above design is not too hard.
>
Hi Preeti,
Yeah, it seems to be a good approach. looking forward to work with you
on this. Timer may scale better for larger systems. One question, will
timer irq gets unpredictable delays if run by ksoftirqd?
BTW, I may not be able to respond quickly during the holidays. If
things workout, it may benefit ACPI PAD driver as well.
Thanks,
Jacob
> Regards
> Preeti U Murthy
>
[Jacob Pan]
Hi Jacob,
On 12/23/2014 08:27 AM, Jacob Pan wrote:
> On Sat, 20 Dec 2014 07:01:12 +0530
> Preeti U Murthy <[email protected]> wrote:
>
>> On 12/20/2014 01:26 AM, Thomas Gleixner wrote:
>>> On Fri, 19 Dec 2014, Jacob Pan wrote:
>>>
>>>> On Thu, 18 Dec 2014 22:12:57 +0100 (CET)
>>>> Thomas Gleixner <[email protected]> wrote:
>>>>
>>>>> On Thu, 18 Dec 2014, Jacob Pan wrote:
>>>>>> OK I agree, also as I mentioned earlier, Peter already has a
>>>>>> patch for consolidated idle loop and remove
>>>>>> tick_nohz_idle_enter/exit call from powerclamp driver. I have
>>>>>> been working on a few tweaks to maintain the functionality and
>>>>>> efficiency with the consolidated idle loop. We can apply the
>>>>>> patches on top of yours.
>>>>>
>>>>> No. This is equally wrong as I pointed out before. The 'unified'
>>>>> idle loop is still fake and just pretending to be idle.
>>>>>
>>>> In terms of efficiency, the consolidated idle loop will allow
>>>> turning off sched tick during idle injection period. If we just
>>>> take out the tick_nohz_idle_xxx call, the effectiveness of
>>>> powerclamp is going down significantly. I am not arguing the
>>>> design but from fixing regression perspective or short term
>>>> solution.
>>>
>>> There is no perspective. Period.
>>>
>>> Its violates every rightful assumption of the nohz_IDLE_* code and
>>> just ever worked by chance. There is so much subtle wreckage lurking
>>> there that the only sane solution is to forbid it. End of story.
>>>
>>> Thanks,
>>>
>>> tglx
>>>
>> Hi Jacob,
>>
>> Like Thomas pointed out, we can design a sane solution for powerclamp.
>> Idle injection is nothing but throttling of runqueue. If the runqueue
>> is throttled, no fair tasks will be selected and the natural choice
>> in the absence of tasks from any other sched class is the idle task.
>>
>> The idle loop will automatically be called and the nohz state will
>> also fall in place. The cpu is really idle now: the runqueue has no
>> tasks and the task running on the cpu is the idle thread. The
>> throttled tasks are on a separate list.
>>
>> When the period of idle injection is over, we unthrottle the runqueue.
>> All this being taken care of my a non-deferrable timer. This design
>> ensures that the intention of powerclamp is not hampered while at the
>> same time maintaining a sane state for nohz; you will get the
>> efficiency you want.
>>
>> Of course there may be corner cases and challenges around
>> synchronization of package idle, which I am sure we can work around
>> with a better design such as the above. I am working on that patchset
>> and will post out in a day. You can take a look and let us know the
>> pieces we are missing.
>>
>> I find that implementing the above design is not too hard.
>>
> Hi Preeti,
> Yeah, it seems to be a good approach. looking forward to work with you
> on this. Timer may scale better for larger systems. One question, will
> timer irq gets unpredictable delays if run by ksoftirqd?
I am sorry I could not respond earlier; I was on vacation as well. Yes,
we may have a problem here. Let alone synchronization between cpus in
performing clamping, there are two other functionality issues that I see.
1. Since periodic timers get executed in the softirq context,
scheduler_tick() would have passed by by then. i.e.
hrtimer_interrupt()
|__tick_sched_handle()
|__scheduler_tick()
|__raise_softirq(TIMER_SOFTIRQ)
ksoftirqd runs on local_bh_enable()-->powerclamp_timer handler runs
Although runqueues are throttled in the powerclamp timer handler, it has
to wait till the next scheduler tick to select an idle task to run.
A precious 4-10ms depending on the config would have passed by then.
2. For the same reason as 1, when the ksoftirqd has to run on the cpus
during the tick after the one in which throttling is enabled, cpus are
unavailable to run the daemon because they are throttled. However there
is no other way to unthrottle the runqueues now except by running the
ksoftirqd; a chicken and egg problem.
I think both the above problems could be solved by using hrtimers
instead of periodic timers to perform clamping/unclamping, since
hrtimers are serviced in the interrupt context. But we cannot
initialize/start/modify hrtimers on a remote cpu. We will end up using
IPIs for handling the hrtimers during start/end of powerclamp or
modification of the idle duration of clamping, which is not a tempting
option either.
So I am currently stuck at this point. I would be glad to have some
suggestions.
Thanks
Regards
Preeti U Murthy
> BTW, I may not be able to respond quickly during the holidays. If
> things workout, it may benefit ACPI PAD driver as well.
>
>
> Thanks,
>
> Jacob
>> Regards
>> Preeti U Murthy
>>
>
> [Jacob Pan]
>