Hi Thomas,
Recently I have been experimenting with CPU isolation on a 3.10 kernel,
but I find that the isolated CPU is periodically woken up by IPI interrupts.

By checking the trace, I find those IPIs are generated by add_timer_on(),
which calls wake_up_nohz_cpu() and wakes up the already idle CPU.
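For reference, the path looks roughly like this in a 3.10 tree (an abridged and
paraphrased sketch, not the exact code):

/*
 * Abridged sketch of add_timer_on() in kernel/timer.c, ~v3.10.
 * wake_up_nohz_cpu() itself lives in kernel/sched/core.c.
 */
void add_timer_on(struct timer_list *timer, int cpu)
{
        struct tvec_base *base = per_cpu(tvec_bases, cpu);
        unsigned long flags;

        spin_lock_irqsave(&base->lock, flags);
        timer_set_base(timer, base);
        internal_add_timer(base, timer);
        /*
         * This wakeup is issued unconditionally, even when the timer is
         * deferrable; for a remote idle CPU it ends up in
         * smp_send_reschedule(), which is the IPI seen in the trace.
         */
        wake_up_nohz_cpu(cpu);
        spin_unlock_irqrestore(&base->lock, flags);
}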
With further checking, I find this timer is added by the ondemand governor of
cpufreq, which periodically checks each core's state.
The problem I see here is that cpufreq_governor already uses
INIT_DEFERRABLE_WORK, so the timer is made deferrable anyway.
What is more, the cpufreq check is very frequent: in my case, the
isolated CPU is woken up by an IPI every 5ms.
So why does the kernel need to wake the remote processor when arming a
deferrable timer? As per my understanding, we'd better keep the CPU idle
when using a deferrable timer.
Thanks,
Lei
On Wed, 22 Jan 2014, Lei Wen wrote:
> Recently I have been experimenting with CPU isolation on a 3.10 kernel,
> but I find that the isolated CPU is periodically woken up by IPI interrupts.
>
> By checking the trace, I find those IPIs are generated by add_timer_on(),
> which calls wake_up_nohz_cpu() and wakes up the already idle CPU.
>
> With further checking, I find this timer is added by the ondemand governor of
> cpufreq, which periodically checks each core's state.
> The problem I see here is that cpufreq_governor already uses
> INIT_DEFERRABLE_WORK, so the timer is made deferrable anyway.
> What is more, the cpufreq check is very frequent: in my case, the
> isolated CPU is woken up by an IPI every 5ms.
>
> So why does the kernel need to wake the remote processor when arming a
> deferrable timer? As per my understanding, we'd better keep the CPU idle
> when using a deferrable timer.
Indeed, we can avoid the wakeup of the remote cpu when the timer is
deferrable.
Though you really want to figure out why the cpufreq governor is
arming timers on other cores every 5ms. That smells like an utterly
stupid approach.
Thanks,
tglx
On Wed, Jan 22, 2014 at 10:07 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 22 Jan 2014, Lei Wen wrote:
>> Recently I have been experimenting with CPU isolation on a 3.10 kernel,
>> but I find that the isolated CPU is periodically woken up by IPI interrupts.
>>
>> By checking the trace, I find those IPIs are generated by add_timer_on(),
>> which calls wake_up_nohz_cpu() and wakes up the already idle CPU.
>>
>> With further checking, I find this timer is added by the ondemand governor of
>> cpufreq, which periodically checks each core's state.
>> The problem I see here is that cpufreq_governor already uses
>> INIT_DEFERRABLE_WORK, so the timer is made deferrable anyway.
>> What is more, the cpufreq check is very frequent: in my case, the
>> isolated CPU is woken up by an IPI every 5ms.
>>
>> So why does the kernel need to wake the remote processor when arming a
>> deferrable timer? As per my understanding, we'd better keep the CPU idle
>> when using a deferrable timer.
>
> Indeed, we can avoid the wakeup of the remote cpu when the timer is
> deferrable.
Glad to hear that we could fix this unwanted wakeup.
Do you have related patches already?
>
> Though you really want to figure out why the cpufreq governor is
> arming timers on other cores every 5ms. That smells like an utterly
> stupid approach.
Not sure why cpufreq chose such frequent profiling on each CPU.
As per my understanding, since the kernel is SMP, running the profiling on
one CPU should be enough...
Thanks,
Lei
On 23 January 2014 11:11, Lei Wen <[email protected]> wrote:
> On Wed, Jan 22, 2014 at 10:07 PM, Thomas Gleixner <[email protected]> wrote:
>> On Wed, 22 Jan 2014, Lei Wen wrote:
>>> Recently I have been experimenting with CPU isolation on a 3.10 kernel,
>>> but I find that the isolated CPU is periodically woken up by IPI interrupts.
>>>
>>> By checking the trace, I find those IPIs are generated by add_timer_on(),
>>> which calls wake_up_nohz_cpu() and wakes up the already idle CPU.
>>>
>>> With further checking, I find this timer is added by the ondemand governor of
>>> cpufreq, which periodically checks each core's state.
>>> The problem I see here is that cpufreq_governor already uses
>>> INIT_DEFERRABLE_WORK, so the timer is made deferrable anyway.
>>> What is more, the cpufreq check is very frequent: in my case, the
>>> isolated CPU is woken up by an IPI every 5ms.
>>>
>>> So why does the kernel need to wake the remote processor when arming a
>>> deferrable timer? As per my understanding, we'd better keep the CPU idle
>>> when using a deferrable timer.
>>
>> Indeed, we can avoid the wakeup of the remote cpu when the timer is
>> deferrable.
>
> Glad to hear that we could fix this unwanted wakeup.
> Do you have related patches already?
>
>>
>> Though you really want to figure out why the cpufreq governor is
>> arming timers on other cores every 5ms. That smells like an utterly
>> stupid approach.
>
> Not sure why cpufreq chose such frequent profiling on each CPU.
> As per my understanding, since the kernel is SMP, running the profiling on
> one CPU should be enough...
Hi Guys,
So the first question is why cpufreq needs it, and is it really stupid?
Yes, it is stupid, but that's how it has been implemented for a long time. It does
so to get data about the load on CPUs, so that the frequency can be scaled up/down.

Though there is a solution in discussion currently, which will take
inputs from the scheduler, and then these background timers will go away.
But we need to wait until that time.

Now, why do we need that for every CPU when a single CPU might
be enough? The answer is cpuidle: what if the CPU responsible for
running the timer goes to sleep? Who will evaluate the load then? And if we
make this timer run on one CPU in non-deferrable mode, then that CPU
would be woken up again and again from idle. So it was decided to have
a per-CPU deferrable timer. To improve efficiency, once it fires
on any CPU, the timers on all other CPUs are rescheduled so that they
don't fire before 5ms (the sampling time).
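To make that concrete, here is a simplified sketch of what the governor
effectively does every sampling period (the structure and variable names below
are made up for illustration; the real code is in
drivers/cpufreq/cpufreq_governor.c):

#include <linux/cpufreq.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* Illustrative stand-in for the governor's per-CPU bookkeeping. */
struct sample_dbs_info {
        struct delayed_work work;       /* set up with INIT_DEFERRABLE_WORK() */
};
static DEFINE_PER_CPU(struct sample_dbs_info, sample_dbs);

static void gov_queue_sampling_work(struct cpufreq_policy *policy,
                                    unsigned long delay)
{
        int cpu;

        for_each_cpu(cpu, policy->cpus) {
                struct sample_dbs_info *dbs = &per_cpu(sample_dbs, cpu);

                /*
                 * The work's timer is deferrable, but queueing it on a
                 * remote CPU still goes through add_timer_on(), which
                 * calls wake_up_nohz_cpu() and therefore sends the IPI
                 * that Lei sees on the isolated CPU.
                 */
                mod_delayed_work_on(cpu, system_wq, &dbs->work, delay);
        }
}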
I think the diff below might fix this for you, though I am not sure whether it
breaks something else. Probably Thomas/Frederic can answer here.
If this looks fine I will send it formally:
diff --git a/kernel/timer.c b/kernel/timer.c
index accfd24..3a2c7fa 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
* makes sure that a CPU on the way to stop its tick can not
* evaluate the timer wheel.
*/
- wake_up_nohz_cpu(cpu);
+ if (!tbase_get_deferrable(timer->base))
+ wake_up_nohz_cpu(cpu);
spin_unlock_irqrestore(&base->lock, flags);
}
EXPORT_SYMBOL_GPL(add_timer_on);
On Thu, Jan 23, 2014 at 11:22:32AM +0530, Viresh Kumar wrote:
> I think the diff below might fix this for you, though I am not sure whether it
> breaks something else. Probably Thomas/Frederic can answer here.
> If this looks fine I will send it formally:
>
> diff --git a/kernel/timer.c b/kernel/timer.c
> index accfd24..3a2c7fa 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
> * makes sure that a CPU on the way to stop its tick can not
> * evaluate the timer wheel.
> */
> - wake_up_nohz_cpu(cpu);
> + if (!tbase_get_deferrable(timer->base))
> + wake_up_nohz_cpu(cpu);
So you simply rely on the next tick to see the new timer. This should work with
CONFIG_NO_HZ_IDLE but not with CONFIG_NO_HZ_FULL since the target may be running
without the tick.
Basically, in the case of a deferrable timer you need to manage to call
wake_up_full_nohz_cpu() but not wake_up_idle_cpu().
It should even be possible to spare the IPI on a full dynticks CPU if it is
running idle. But that's an optional bonus, because it requires some deep
care about complicated races against the call to tick_nohz_idle_exit().
I also realize that when we enqueue a timer on a full nohz CPU, we should set_need_resched()
the target before sending the IPI if it is idle, like wake_up_idle_cpu() does. Otherwise the
IPI will be ignored without exiting the idle loop or reevaluating the tick on irq exit.
If you can fix that along the way, that will be much appreciated.
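For context, those helpers decompose roughly like this (a paraphrased sketch,
not verbatim code; the real versions carry a few extra checks):

static void wake_up_idle_cpu(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        if (cpu == smp_processor_id() || rq->curr != rq->idle)
                return;

        /*
         * This is the set_need_resched() step mentioned above: without it
         * the IPI is ignored, because nothing tells the target to leave
         * the idle loop and reevaluate its tick on irq exit.
         */
        set_tsk_need_resched(rq->idle);
        smp_mb();
        if (!tsk_is_polling(rq->idle))
                smp_send_reschedule(cpu);
}

static bool wake_up_full_nohz_cpu(int cpu)
{
        /* Full dynticks CPUs may run without any tick, so kick them. */
        if (tick_nohz_full_cpu(cpu)) {
                smp_send_reschedule(cpu);
                return true;
        }
        return false;
}

void wake_up_nohz_cpu(int cpu)
{
        if (!wake_up_full_nohz_cpu(cpu))
                wake_up_idle_cpu(cpu);
}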
Thanks!
On 23 January 2014 19:05, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 11:22:32AM +0530, Viresh Kumar wrote:
>> I think the diff below might fix this for you, though I am not sure whether it
>> breaks something else. Probably Thomas/Frederic can answer here.
>> If this looks fine I will send it formally:
>>
>> diff --git a/kernel/timer.c b/kernel/timer.c
>> index accfd24..3a2c7fa 100644
>> --- a/kernel/timer.c
>> +++ b/kernel/timer.c
>> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
>> * makes sure that a CPU on the way to stop its tick can not
>> * evaluate the timer wheel.
>> */
>> - wake_up_nohz_cpu(cpu);
>> + if (!tbase_get_deferrable(timer->base))
>> + wake_up_nohz_cpu(cpu);
Wait, I got the wrong code here; that wasn't my initial intention.
I actually wanted to write something like this:
- wake_up_nohz_cpu(cpu);
+ if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu))
+ wake_up_nohz_cpu(cpu);
Will that work?
> So you simply rely on the next tick to see the new timer. This should work with
> CONFIG_NO_HZ_IDLE but not with CONFIG_NO_HZ_FULL since the target may be running
> without the tick.
>
> Basically, in the case of a deferrable timer you need to manage to call
> wake_up_full_nohz_cpu() but not wake_up_idle_cpu().
>
> It should even be possible to spare the IPI on a full dynticks CPU if it is
> running idle. But that's an optional bonus, because it requires some deep
> care about complicated races against the call to tick_nohz_idle_exit().
>
> I also realize that when we enqueue a timer on a full nohz CPU, we should set_need_resched()
> the target before sending the IPI if it is idle, like wake_up_idle_cpu() does. Otherwise the
> IPI will be ignored without exiting the idle loop or reevaluating the tick on irq exit.
> If you can fix that along the way, that will be much appreciated.
I haven't thought much about this currently as I have limited knowledge of
these routines. Though the problem we were facing wasn't related to
NO_HZ_FULL. It was just about waking up an idle cpu.
On Thu, Jan 23, 2014 at 07:50:40PM +0530, Viresh Kumar wrote:
> On 23 January 2014 19:05, Frederic Weisbecker <[email protected]> wrote:
> > On Thu, Jan 23, 2014 at 11:22:32AM +0530, Viresh Kumar wrote:
> >> I think the diff below might fix this for you, though I am not sure whether it
> >> breaks something else. Probably Thomas/Frederic can answer here.
> >> If this looks fine I will send it formally:
> >>
> >> diff --git a/kernel/timer.c b/kernel/timer.c
> >> index accfd24..3a2c7fa 100644
> >> --- a/kernel/timer.c
> >> +++ b/kernel/timer.c
> >> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
> >> * makes sure that a CPU on the way to stop its tick can not
> >> * evaluate the timer wheel.
> >> */
> >> - wake_up_nohz_cpu(cpu);
> >> + if (!tbase_get_deferrable(timer->base))
> >> + wake_up_nohz_cpu(cpu);
>
> Wait, I got the wrong code here; that wasn't my initial intention.
> I actually wanted to write something like this:
>
> - wake_up_nohz_cpu(cpu);
> + if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu))
> + wake_up_nohz_cpu(cpu);
>
> Will that work?
Well, this is going to wake up the target from its idle state, which is
what we want to avoid if the timer is deferrable, right?
The simplest thing we want is:
if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
wake_up_nohz_cpu(cpu);
This spares the IPI for the common case where the timer is deferrable and we run
in periodic or dynticks-idle mode (which should be 99.99% of the existing workloads).
Then we can later optimize that and spare the IPI on full dynticks CPUs when they run
idle, but that requires some special care about subtle races which can't be dealt
with by a simple test on "idle_cpu(target)". And power consumption in full dynticks
is already quite suboptimal anyway.
So I suggest we start simple with the above test, and a big fat comment which explains
what we are doing and what needs to be done in the future.
Thanks!
>
> > So you simply rely on the next tick to see the new timer. This should work with
> > CONFIG_NO_HZ_IDLE but not with CONFIG_NO_HZ_FULL since the target may be running
> > without the tick.
> >
> > Basically, in the case of a deferrable timer you need to manage to call
> > wake_up_full_nohz_cpu() but not wake_up_idle_cpu().
> >
> > It should even be possible to spare the IPI on a full dynticks CPU if it is
> > running idle. But that's an optional bonus, because it requires some deep
> > care about complicated races against the call to tick_nohz_idle_exit().
> >
> > I also realize that when we enqueue a timer on a full nohz CPU, we should set_need_resched()
> > the target before sending the IPI if it is idle, like wake_up_idle_cpu() does. Otherwise the
> > IPI will be ignored without exiting the idle loop or reevaluating the tick on irq exit.
> > If you can fix that along the way, that will be much appreciated.
>
> I haven't thought much about this currently as I have limited knowledge of
> these routines. Though the problem we were facing wasn't related to
> NO_HZ_FULL. It was just about waking up an idle cpu.
Sure, that's fine. I'll fix that alongside your change.
Hi,
On Thu, Jan 23, 2014 at 11:22 AM, Viresh Kumar <[email protected]> wrote:
> Hi Guys,
>
> So the first question is why cpufreq needs it, and is it really stupid?
> Yes, it is stupid, but that's how it has been implemented for a long time. It does
> so to get data about the load on CPUs, so that the frequency can be scaled up/down.
>
> Though there is a solution in discussion currently, which will take
> inputs from the scheduler, and then these background timers will go away.
> But we need to wait until that time.
>
> Now, why do we need that for every CPU when a single CPU might
> be enough? The answer is cpuidle: what if the CPU responsible for
> running the timer goes to sleep? Who will evaluate the load then? And if we
> make this timer run on one CPU in non-deferrable mode, then that CPU
> would be woken up again and again from idle. So it was decided to have
> a per-CPU deferrable timer. To improve efficiency, once it fires
> on any CPU, the timers on all other CPUs are rescheduled so that they
> don't fire before 5ms (the sampling time).
How about simplifying this design by doing the below?
1. Since anyway cpufreq governors monitor load on the cpu once every
5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu
enters nohz_idle.
2. To overcome the problem of running this job of monitoring the load
on every cpu, have the *time keeping* cpu do it for you.
The time keeping cpu has the property that if it has to go to idle, it will do
so and let the next cpu that runs the periodic timer become the time keeper.
Hence no cpu is prevented from entering nohz_idle and the cpu that is busy
and first executes periodic timer will take over as the time keeper.
The result would be:
1. One cpu at any point in time will be monitoring cpu load, at every sched tick
as long as it's busy. If it goes to sleep, then it gives up this duty
and enters idle.
The next cpu that runs the periodic timer becomes the cpu to monitor the load
and will continue to do so as long as it's busy. Hence we do not miss monitoring
the cpu load.
2. This will avoid an additional timer for cpufreq.
3. It avoids sending IPIs each time this timer gets modified since there is just
one CPU doing the monitoring.
4. The downside to this could be that we are stretching the functions of the
periodic timer into the power management domain which does not seem like
the right thing to do.
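A purely hypothetical sketch of points 1 and 2 above, just to show the shape of
the idea (the hook and the cpufreq_sample_all_cpus() helper are invented names;
nothing like this exists in the tree):

/* Hypothetical illustration only. */
static unsigned long next_sample_jiffies;

static void cpufreq_tick_sample(void)
{
        /*
         * Only the current timekeeper samples load; the duty moves to
         * whichever busy CPU takes over timekeeping when this one idles.
         */
        if (tick_do_timer_cpu != smp_processor_id())
                return;

        if (time_before(jiffies, next_sample_jiffies))
                return;

        next_sample_jiffies = jiffies + msecs_to_jiffies(5);
        cpufreq_sample_all_cpus();      /* invented helper */
}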
Having said the above, the fix that Viresh has proposed along with the nohz_full
condition that Frederic added looks to solve this problem.
But just a thought on if there is scope to improve this part of the
cpufreq code.
What do you all think?
Thanks
Regards
Preeti U Murthy
>
> I think the diff below might fix this for you, though I am not sure whether it
> breaks something else. Probably Thomas/Frederic can answer here.
> If this looks fine I will send it formally:
>
> diff --git a/kernel/timer.c b/kernel/timer.c
> index accfd24..3a2c7fa 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
> * makes sure that a CPU on the way to stop its tick can not
> * evaluate the timer wheel.
> */
> - wake_up_nohz_cpu(cpu);
> + if (!tbase_get_deferrable(timer->base))
> + wake_up_nohz_cpu(cpu);
> spin_unlock_irqrestore(&base->lock, flags);
> }
> EXPORT_SYMBOL_GPL(add_timer_on);
On Wed, Jan 29, 2014 at 10:57:59AM +0530, Preeti Murthy wrote:
> Hi,
>
> On Thu, Jan 23, 2014 at 11:22 AM, Viresh Kumar <[email protected]> wrote:
>
> > Hi Guys,
> >
> > So the first question is why cpufreq needs it, and is it really stupid?
> > Yes, it is stupid, but that's how it has been implemented for a long time. It does
> > so to get data about the load on CPUs, so that the frequency can be scaled up/down.
> >
> > Though there is a solution in discussion currently, which will take
> > inputs from the scheduler, and then these background timers will go away.
> > But we need to wait until that time.
> >
> > Now, why do we need that for every CPU when a single CPU might
> > be enough? The answer is cpuidle: what if the CPU responsible for
> > running the timer goes to sleep? Who will evaluate the load then? And if we
> > make this timer run on one CPU in non-deferrable mode, then that CPU
> > would be woken up again and again from idle. So it was decided to have
> > a per-CPU deferrable timer. To improve efficiency, once it fires
> > on any CPU, the timers on all other CPUs are rescheduled so that they
> > don't fire before 5ms (the sampling time).
>
> How about simplifying this design by doing the below?
>
> 1. Since anyway cpufreq governors monitor load on the cpu once every
> 5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu
> enters nohz_idle.
>
> 2. To overcome the problem of running this job of monitoring the load
> on every cpu, have the *time keeping* cpu do it for you.
>
> The time keeping cpu has the property that if it has to go to idle, it will do
> so and let the next cpu that runs the periodic timer become the time keeper.
> Hence no cpu is prevented from entering nohz_idle and the cpu that is busy
> and first executes periodic timer will take over as the time keeper.
>
> The result would be:
>
> 1. One cpu at any point in time will be monitoring cpu load, at every sched tick
> as long as it's busy. If it goes to sleep, then it gives up this duty
> and enters idle.
> The next cpu that runs the periodic timer becomes the cpu to monitor the load
> and will continue to do so as long as it's busy. Hence we do not miss monitoring
> the cpu load.
Well that's basically what an unbound deferrable timer does. It's deferrable, so
it doesn't prevent the CPU from entering dynticks idle mode, and it's not affine to any
particular CPU, so it's going to be tied to a busy CPU according to the scheduler
(see get_nohz_timer_target()).
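To illustrate (a minimal sketch with invented names, not actual cpufreq code):
an unbound deferrable work item is simply queued without the _on() variant, so
its timer is neither pinned to a CPU nor able to wake an idle one:

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static struct delayed_work sampling_work;       /* illustrative name */

static void sampling_fn(struct work_struct *work)
{
        /* ... sample the load here ..., then re-arm roughly 5ms later. */
        queue_delayed_work(system_wq, &sampling_work, msecs_to_jiffies(5));
}

static void sampling_start(void)
{
        /* Deferrable: the underlying timer won't wake an idle CPU. */
        INIT_DEFERRABLE_WORK(&sampling_work, sampling_fn);
        /*
         * No _on() variant: the timer is not pinned, so it can be
         * migrated to a busy CPU (see get_nohz_timer_target() above).
         */
        queue_delayed_work(system_wq, &sampling_work, msecs_to_jiffies(5));
}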
>
> 2. This will avoid an additional timer for cpufreq.
That doesn't look like a problem.
>
> 3. It avoids sending IPIs each time this timer gets modified since there is just
> one CPU doing the monitoring.
If we fix the initial issue properly, we shouldn't need to send an IPI anymore.
>
> 4. The downside to this could be that we are stretching the functions of the
> periodic timer into the power management domain which does not seem like
> the right thing to do.
Indeed, that's what I'm worried about. The tick has grown into a Big Kernel Timer
that any subsystem can hook into for any kind of periodic event. This is why it
was not easy to implement full dynticks, and it's not even complete yet due
to the complicated dependencies involved.
>
> Having said the above, the fix that Viresh has proposed along with the nohz_full
> condition that Frederic added looks to solve this problem.
In any case I believe we want Viresh's patch since there are other users
of deferrable timers that can profit from this.
So I'm queueing it.
>
> But just a thought on if there is scope to improve this part of the
> cpufreq code.
> What do you all think?
I fear I don't know the problem well enough to offer any serious advice.
It depends on what kind of measurement is needed. For example, aren't there some
load statistics already available from the scheduler that you could reuse?
The scheduler alone takes gazillions of different load and power statistics
in interesting paths such as the tick or sched switches. Aren't there some read-only metrics
that could be interesting?
Hi Frederic,
On 01/31/2014 10:00 PM, Frederic Weisbecker wrote:
> On Wed, Jan 29, 2014 at 10:57:59AM +0530, Preeti Murthy wrote:
>> Hi,
>>
>> On Thu, Jan 23, 2014 at 11:22 AM, Viresh Kumar <[email protected]> wrote:
>>
>>> Hi Guys,
>>>
>>> So the first question is why cpufreq needs it, and is it really stupid?
>>> Yes, it is stupid, but that's how it has been implemented for a long time. It does
>>> so to get data about the load on CPUs, so that the frequency can be scaled up/down.
>>>
>>> Though there is a solution in discussion currently, which will take
>>> inputs from the scheduler, and then these background timers will go away.
>>> But we need to wait until that time.
>>>
>>> Now, why do we need that for every CPU when a single CPU might
>>> be enough? The answer is cpuidle: what if the CPU responsible for
>>> running the timer goes to sleep? Who will evaluate the load then? And if we
>>> make this timer run on one CPU in non-deferrable mode, then that CPU
>>> would be woken up again and again from idle. So it was decided to have
>>> a per-CPU deferrable timer. To improve efficiency, once it fires
>>> on any CPU, the timers on all other CPUs are rescheduled so that they
>>> don't fire before 5ms (the sampling time).
>>
>> How about simplifying this design by doing the below?
>>
>> 1. Since anyway cpufreq governors monitor load on the cpu once every
>> 5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu
>> enters nohz_idle.
>>
>> 2. To overcome the problem of running this job of monitoring the load
>> on every cpu, have the *time keeping* cpu do it for you.
>>
>> The time keeping cpu has the property that if it has to go to idle, it will do
>> so and let the next cpu that runs the periodic timer become the time keeper.
>> Hence no cpu is prevented from entering nohz_idle and the cpu that is busy
>> and first executes periodic timer will take over as the time keeper.
>>
>> The result would be:
>>
>> 1. One cpu at any point in time will be monitoring cpu load, at every sched tick
>> as long as it's busy. If it goes to sleep, then it gives up this duty
>> and enters idle.
>> The next cpu that runs the periodic timer becomes the cpu to monitor the load
>> and will continue to do so as long as it's busy. Hence we do not miss monitoring
>> the cpu load.
>
> Well that's basically what an unbound deferrable timer does. It's deferrable, so
> it doesn't prevent the CPU from entering dynticks idle mode, and it's not affine to any
> particular CPU, so it's going to be tied to a busy CPU according to the scheduler
> (see get_nohz_timer_target()).
>
>>
>> 2. This will avoid an additional timer for cpufreq.
>
> That doesn't look like a problem.
>
>>
>> 3. It avoids sending IPIs each time this timer gets modified since there is just
>> one CPU doing the monitoring.
>
> If we fix the initial issue properly, we shouldn't need to send an IPI anymore.
That's right. I am sorry I missed that we were completely avoiding IPIs
in case of deferrable timers.
>
>>
>> 4. The downside to this could be that we are stretching the functions of the
>> periodic timer into the power management domain which does not seem like
>> the right thing to do.
>
> Indeed, that's what I'm worried about. The tick has grown into a Big Kernel Timer
> that any subsystem can hook into for any kind of periodic event. This is why it
> was not easy to implement full dynticks, and it's not even complete yet due
> to the complicated dependencies involved.
I see your point. Yes point 4 is a bad idea.
>
>>
>> Having said the above, the fix that Viresh has proposed along with the nohz_full
>> condition that Frederic added looks to solve this problem.
>
> In any case I believe we want Viresh's patch since there are other users
> of deferrable timers that can profit from this.
>
> So I'm queueing it.
Yeah, it solves the problem reported.
But on a different note, I was wondering if we could avoid running this
timer on every CPU by having a model similar to the time-keeping CPU:
a cpu-frequency tracking CPU which moves dynamically and serves
its purpose at all times as long as there are busy CPUs. This
would help avoid IPIs telling busy CPUs to modify their
timers and delay their respective frequency updates each time one of
the CPUs does the update on their behalf.
Viresh?
But your point below is very true: we will need to see if cpufreq
can use statistics about CPU load from the scheduler and avoid having to
re-calculate it. Let me take a look at this.
>
>>
>> But just a thought on if there is scope to improve this part of the
>> cpufreq code.
>> What do you all think?
>
> I fear I don't know the problem well enough to offer any serious advice.
> It depends on what kind of measurement is needed. For example, aren't there some
> load statistics already available from the scheduler that you could reuse?
>
> The scheduler alone takes gazillions of different load and power statistics
> in interesting paths such as the tick or sched switches. Aren't there some read-only metrics
> that could be interesting?
>
Thanks
Regards
Preeti U Murthy
Sorry, I was away for a short vacation.
On 28 January 2014 19:20, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Jan 23, 2014 at 07:50:40PM +0530, Viresh Kumar wrote:
>> Wait, I got the wrong code here; that wasn't my initial intention.
>> I actually wanted to write something like this:
>>
>> - wake_up_nohz_cpu(cpu);
>> + if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu))
>> + wake_up_nohz_cpu(cpu);
>>
>> Will that work?
Something is seriously wrong with me; I wrote rubbish code again.
Let me phrase what I wanted to write :)
"Don't send an IPI to an idle CPU for a deferrable timer."
Hopefully I have coded it correctly this time, at least.
- wake_up_nohz_cpu(cpu);
+ if (!(tbase_get_deferrable(timer->base) && idle_cpu(cpu)))
+ wake_up_nohz_cpu(cpu);
> Well, this is going to wake up the target from its idle state, which is
> what we want to avoid if the timer is deferrable, right?
Yeah, sorry for doing it a second time :(
> The simplest thing we want is:
>
> if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
> wake_up_nohz_cpu(cpu);
>
> This spares the IPI for the common case where the timer is deferrable and we run
> in periodic or dynticks-idle mode (which should be 99.99% of the existing workloads).
I wasn't looking at this problem with NO_HZ_FULL in mind, as I thought it is
only about whether the CPU is idle or not. And so the solution I was
talking about was:
"Don't send an IPI to an idle CPU for a deferrable timer."
But I see that still failing with the code you wrote: for normal cases where we
don't enable NO_HZ_FULL, we will still end up waking up idle CPUs, which
is what Lei Wen reported initially.
Also, if a CPU is marked for NO_HZ_FULL and is not currently idle, then we
wouldn't send an IPI for a deferrable timer. But we actually need that, so that
we can reevaluate the timer order again?
> Then we can later optimize that and spare the IPI on full dynticks CPUs when they run
> idle, but that requires some special care about subtle races which can't be dealt
> with by a simple test on "idle_cpu(target)". And power consumption in full dynticks
> is already quite suboptimal anyway.
>
> So I suggest we start simple with the above test, and a big fat comment which explains
> what we are doing and what needs to be done in the future.
On 29 January 2014 10:57, Preeti Murthy <[email protected]> wrote:
> How about simplifying this design by doing the below?
>
> 1. Since anyway cpufreq governors monitor load on the cpu once every
> 5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu
> enters nohz_idle.
It's configurable; we can change the sampling time to whatever we want. Some
might select 30 ms.
> 2. To overcome the problem of running this job of monitoring the load
> on every cpu, have the *time keeping* cpu do it for you.
>
> The time keeping cpu has the property that if it has to go to idle, it will do
> so and let the next cpu that runs the periodic timer become the time keeper.
> Hence no cpu is prevented from entering nohz_idle and the cpu that is busy
> and first executes periodic timer will take over as the time keeper.
>
> The result would be:
>
> 1. One cpu at any point in time will be monitoring cpu load, at every sched tick
> as long as it's busy. If it goes to sleep, then it gives up this duty
> and enters idle.
> The next cpu that runs the periodic timer becomes the cpu to monitor the load
> and will continue to do so as long as it's busy. Hence we do not miss monitoring
> the cpu load.
>
> 2. This will avoid an additional timer for cpufreq.
>
> 3. It avoids sending IPIs each time this timer gets modified since there is just
> one CPU doing the monitoring.
>
> 4. The downside to this could be that we are stretching the functions of the
> periodic timer into the power management domain which does not seem like
> the right thing to do.
Looks good, but AFAIK such a timekeeper scheme is still to be implemented? Also, the best
solution is to get rid of this timer completely by getting inputs from the
scheduler.
Probably some ARM/Linaro folks are working on it.
On Mon, Feb 03, 2014 at 12:21:16PM +0530, Viresh Kumar wrote:
> Sorry, I was away for a short vacation.
>
> On 28 January 2014 19:20, Frederic Weisbecker <[email protected]> wrote:
> > On Thu, Jan 23, 2014 at 07:50:40PM +0530, Viresh Kumar wrote:
> >> Wait, I got the wrong code here; that wasn't my initial intention.
> >> I actually wanted to write something like this:
> >>
> >> - wake_up_nohz_cpu(cpu);
> >> + if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu))
> >> + wake_up_nohz_cpu(cpu);
> >>
> >> Will that work?
>
> Something is seriously wrong with me; I wrote rubbish code again.
> Let me phrase what I wanted to write :)
>
> "Don't send an IPI to an idle CPU for a deferrable timer."
>
> Hopefully I have coded it correctly this time, at least.
>
> - wake_up_nohz_cpu(cpu);
> + if (!(tbase_get_deferrable(timer->base) && idle_cpu(cpu)))
> + wake_up_nohz_cpu(cpu);
Yeah, but that's racy if the target is nohz full. We may see it as idle whereas
it has just woken up and has been running tickless in userspace for a while.
>
> > Well, this is going to wake up the target from its idle state, which is
> > what we want to avoid if the timer is deferrable, right?
>
> Yeah, sorry for doing it a second time :(
I'm certainly not blaming you for being confused, that would be the pot calling the kettle black ;)
>
> > The simplest thing we want is:
> >
> > if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
> > wake_up_nohz_cpu(cpu);
> >
> > This spares the IPI for the common case where the timer is deferrable and we run
> > in periodic or dynticks-idle mode (which should be 99.99% of the existing workloads).
>
> I wasn't looking at this problem with NO_HZ_FULL in mind, as I thought it is
> only about whether the CPU is idle or not. And so the solution I was
> talking about was:
>
> "Don't send an IPI to an idle CPU for a deferrable timer."
>
> But I see that still failing with the code you wrote: for normal cases where we
> don't enable NO_HZ_FULL, we will still end up waking up idle CPUs, which
> is what Lei Wen reported initially.
Not with the small change I proposed above.
I'm applying it.
>
> Also, if a CPU is marked for NO_HZ_FULL and is not currently idle, then we
> wouldn't send an IPI for a deferrable timer. But we actually need that, so that
> we can reevaluate the timer order again?
Right.
Hi Viresh,
On Thu, Jan 23, 2014 at 11:22:32AM +0530, Viresh Kumar wrote:
>
> Hi Guys,
>
> So the first question is why cpufreq needs it, and is it really stupid?
> Yes, it is stupid, but that's how it has been implemented for a long time. It does
> so to get data about the load on CPUs, so that the frequency can be scaled up/down.
>
> Though there is a solution in discussion currently, which will take
> inputs from the scheduler, and then these background timers will go away.
> But we need to wait until that time.
>
> Now, why do we need that for every CPU when a single CPU might
> be enough? The answer is cpuidle: what if the CPU responsible for
> running the timer goes to sleep? Who will evaluate the load then? And if we
> make this timer run on one CPU in non-deferrable mode, then that CPU
> would be woken up again and again from idle. So it was decided to have
> a per-CPU deferrable timer. To improve efficiency, once it fires
> on any CPU, the timers on all other CPUs are rescheduled so that they
> don't fire before 5ms (the sampling time).
>
> I think the diff below might fix this for you, though I am not sure whether it
> breaks something else. Probably Thomas/Frederic can answer here.
> If this looks fine I will send it formally:
>
> diff --git a/kernel/timer.c b/kernel/timer.c
> index accfd24..3a2c7fa 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
> * makes sure that a CPU on the way to stop its tick can not
> * evaluate the timer wheel.
> */
> - wake_up_nohz_cpu(cpu);
> + if (!tbase_get_deferrable(timer->base))
> + wake_up_nohz_cpu(cpu);
The change I'm applying is strongly inspired by the above. Can I use your Signed-off-by?
Thanks.
> spin_unlock_irqrestore(&base->lock, flags);
> }
> EXPORT_SYMBOL_GPL(add_timer_on);
On 12 February 2014 20:36, Frederic Weisbecker <[email protected]> wrote:
> The change I'm applying is strongly inspired by the above. Can I use your Signed-off-by?
Sure :)
Commit-ID: 8ba14654282ed6bb386d0a2f1ab329bfb293403f
Gitweb: http://git.kernel.org/tip/8ba14654282ed6bb386d0a2f1ab329bfb293403f
Author: Viresh Kumar <[email protected]>
AuthorDate: Mon, 10 Feb 2014 17:09:54 +0100
Committer: Frederic Weisbecker <[email protected]>
CommitDate: Fri, 14 Feb 2014 17:59:14 +0100
timer: Spare IPI when deferrable timer is queued on idle remote targets
When a timer is enqueued or modified on a remote target, the latter is
expected to see and handle this timer on its next tick. However if the
target is idle and CONFIG_NO_HZ_IDLE=y, the CPU may be sleeping tickless
and the timer may be ignored.
wake_up_nohz_cpu() takes care of that by setting TIF_NEED_RESCHED and
sending an IPI to idle targets so that the tick is reevaluated on the
idle loop through the tick_nohz_idle_*() APIs.
Now this is all performed regardless of the power properties of the
timer. If the timer is deferrable, idle targets don't need to be woken
up. Only the next busy tick needs to care about it, and no IPI kick
is needed for that to happen.
So let's spare the IPI on idle targets when the timer is deferrable.
Meanwhile we keep the current behaviour on full dynticks targets. We can
spare IPIs on idle full dynticks targets as well, but some tricky races
against idle_cpu() must be dealt with to make sure that the timer
is properly handled after idle exit. We can deal with that later since
NO_HZ_FULL already has more important powersaving issues.
Reported-by: Thomas Gleixner <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/CAKohpomMZ0TAN2e6N76_g4ZRzxd5vZ1XfuZfxrP7GMxfTNiLVw@mail.gmail.com
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/timer.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/timer.c b/kernel/timer.c
index accfd24..b75e789 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -939,8 +939,15 @@ void add_timer_on(struct timer_list *timer, int cpu)
* with the timer by holding the timer base lock. This also
* makes sure that a CPU on the way to stop its tick can not
* evaluate the timer wheel.
+ *
+ * Spare the IPI for deferrable timers on idle targets though.
+ * The next busy ticks will take care of it. Except full dynticks
+ * require special care against races with idle_cpu(), lets deal
+ * with that later.
*/
- wake_up_nohz_cpu(cpu);
+ if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu))
+ wake_up_nohz_cpu(cpu);
+
spin_unlock_irqrestore(&base->lock, flags);
}
EXPORT_SYMBOL_GPL(add_timer_on);