LinuxLists.cc - [RFC] (How to) Let idle CPUs sleep

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Rusty Russell wrote:
> On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
>
>>Two solutions have been proposed so far:
>>
>> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
>> say a second, Nick?) on how long a idle CPU can avoid taking

Yeah probably something around that order of magnitude. I suspect
there will fast be a point where either you'll get other timers
going off more frequently, and / or you simply get very quickly
diminishing returns on the amount of power saving gained from
increasing the period.

>> local-timer ticks. As a result, the load imbalance could exist only
>> for this max duration, after which the sleeping CPU will wake up
>> and balance itself. If there is no imbalance, it can go and sleep
>> again for the max duration.
>>
>> For ex, lets say a idle CPU found that it doesn't have any near timer
>> for the next 1 minute. Instead of letting it sleep for 1 minute in
>> a single stretch, we let it sleep in bursts of 100 msec (or whatever
>> is the max. duration chosen). This still is better than having
>> the idle CPU take HZ ticks a second.
>>
>> As a special case, when all the CPUs of an image go idle, we
>> could consider completely shutting off local timer ticks
>> across all CPUs (till the next non-timer interrupt).
>>
>>
>> B. Don't impose any max limit on how long a idle CPU can sleep.
>> Here we let the idle CPU sleep as long as it wants. It is
>> woken up by a "busy" CPU when it detects an imbalance. The
>> busy CPU acts as a watchdog here. If there are no such
>> busy CPUs, then it means that nobody will acts as watchdogs
>> and idle CPUs sleep as long as they want. A possible watchdog
>> implementation has been discussed at:
>>
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
>
>
> My preference would be the second: fix the scheduler so it doesn't rely
> on regular polling.

It is not so much a matter of "fixing" the scheduler as just adding
more heuristics. When are we too busy? When should we wake another
CPU? What if that CPU is an SMT sibling? What if it is across the
other side of the topology, and other CPUs closer to it are busy
as well? What if they're busy but not as busy as we are? etc.

We've already got that covered in the existing periodic pull balancing,
so instead of duplicating this logic and moving this extra work to busy
CPUs, we can just use the existing framework.

At least we should try method A first, and if that isn't good enough
(though I suspect it will be), then think about adding more complexity
to the scheduler.

> However, as long as the UP case runs with no timer
> interrupts when idle, many people will be happy (eg. most embedded).
>

Well in the UP case, both A and B should basically degenerate to the
same thing.

Probably the more important case for the scheduler is to be able to
turn off idle SMP hypervisor clients, Srivatsa?

--
SUSE Labs, Novell Inc.

2005-05-08 10:14:40

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Sun, 2005-05-08 at 13:50 +1000, Rusty Russell wrote:
> My preference would be the second: fix the scheduler so it doesn't rely
> on regular polling. However, as long as the UP case runs with no timer
> interrupts when idle, many people will be happy (eg. most embedded).

alternatively; if a CPU is idle a long time we could do a software level
hotunplug on it (after setting it to the lowest possible frequency and
power state), and have some sort of thing that keeps track of "spare but
unplugged" cpus that can plug cpus back in on demand.

That also be nice for all the virtual environments where this could
interact with the hypervisor etc

2005-05-08 12:19:14

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.

I am looking at it from the other perspective also i.e, virtualized
env. Any amount of unnecessary timer ticks will lead to equivalent amount
of unnecessary context switches among the guest OSes.

> It is not so much a matter of "fixing" the scheduler as just adding
> more heuristics. When are we too busy? When should we wake another
> CPU? What if that CPU is an SMT sibling? What if it is across the
> other side of the topology, and other CPUs closer to it are busy
> as well? What if they're busy but not as busy as we are? etc.
>
> We've already got that covered in the existing periodic pull balancing,
> so instead of duplicating this logic and moving this extra work to busy
> CPUs, we can just use the existing framework.

I don't think we have to duplicate the logic, just "reuse" whatever logic
exists (in find_busiest_group etc). However I do agree there is movement
of extra work to busy CPUs, but that is only to help the idle CPU sleep longer.
Whether it justifies the additional complexity or not is what this RFC is
about I guess!

FWIW, I have also made some modifications in the original proposal
for reducing the watchdog workload (instead of the same non-idle cpu waking
up all the sleeping CPUs it finds in the same rebalance_tick, the task
is spread over multiple non-idle tasks in different rebalance_ticks).
New (lightly tested) patch is in the mail below.

> At least we should try method A first, and if that isn't good enough
> (though I suspect it will be), then think about adding more complexity
> to the scheduler.

What would be good to measure between the two approaches is the CPU utilization
(over a period of time - say 10 hrs) of somewhat lightly loaded SMP guest OSes
(i.e some CPUs are idle and other CPUs of the same guest are not idle), when
multiple such guest OSes are running simultaneously on the same box. This
means I need a port of VST to UML :(

> Well in the UP case, both A and B should basically degenerate to the
> same thing.

I agree.

> Probably the more important case for the scheduler is to be able to
> turn off idle SMP hypervisor clients, Srivatsa?

True. To make a distinction, these SMP clients can be either completely
idle (all their CPUs idle) or partially idle (only fraction of CPUs idle).
It would be good to cater to both kind of clients.

My latest watchdog implementation is below for reference:

---

linux-2.6.12-rc3-mm2-vatsa/include/linux/sched.h | 1
linux-2.6.12-rc3-mm2-vatsa/kernel/sched.c | 150 ++++++++++++++++++++++-
2 files changed, 146 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~sched-nohz kernel/sched.c
--- linux-2.6.12-rc3-mm2/kernel/sched.c~sched-nohz 2005-05-04 18:23:30.000000000 +0530
+++ linux-2.6.12-rc3-mm2-vatsa/kernel/sched.c 2005-05-07 22:09:04.000000000 +0530
@@ -1875,6 +1875,25 @@ out:
return pulled;
}

+static inline struct sched_domain *
+sched_domain_ptr(int dst_cpu, int src_cpu, struct sched_domain *src_ptr)
+{
+ struct sched_domain *tmp, *dst_ptr;
+
+ dst_ptr = cpu_rq(dst_cpu)->sd;
+
+ for_each_domain(src_cpu, tmp) {
+ if (tmp == src_ptr || !dst_ptr)
+ break;
+ dst_ptr = dst_ptr->parent;
+ }
+
+ if (tmp == NULL)
+ dst_ptr = NULL;
+
+ return dst_ptr;
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the number of tasks which should be
@@ -1882,11 +1901,18 @@ out:
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
- unsigned long *imbalance, enum idle_type idle)
+ unsigned long *imbalance, enum idle_type idle,
+ cpumask_t *wakemaskp)
{
struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
int load_idx;
+#ifdef CONFIG_NO_IDLE_HZ
+ int grp_sleeping = 0, woken = 0;
+ cpumask_t tmpmask;
+ struct sched_domain *sd1;
+ unsigned long interval;
+#endif

max_load = this_load = total_load = total_pwr = 0;
if (idle == NOT_IDLE)
@@ -1896,6 +1922,11 @@ find_busiest_group(struct sched_domain *
else
load_idx = sd->idle_idx;

+#ifdef CONFIG_NO_IDLE_HZ
+ if (wakemaskp)
+ cpus_clear(*wakemaskp);
+#endif
+
do {
unsigned long load;
int local_group;
@@ -1906,6 +1937,17 @@ find_busiest_group(struct sched_domain *
/* Tally up the load of all CPUs in the group */
avg_load = 0;

+#ifdef CONFIG_NO_IDLE_HZ
+ grp_sleeping = 0;
+ woken = 0;
+ if (wakemaskp && idle == NOT_IDLE) {
+ /* Are all CPUs in the group sleeping ? */
+ cpus_and(tmpmask, group->cpumask, nohz_cpu_mask);
+ if (cpus_equal(tmpmask, group->cpumask))
+ grp_sleeping = 1;
+ }
+#endif
+
for_each_cpu_mask(i, group->cpumask) {
/* Bias balancing toward cpus of our domain */
if (local_group)
@@ -1914,6 +1956,36 @@ find_busiest_group(struct sched_domain *
load = source_load(i, load_idx);

avg_load += load;
+
+#ifdef CONFIG_NO_IDLE_HZ
+ /* Try to find a CPU that can be woken up from the
+ * sleeping group. After we wake up one CPU, we will let
+ * it wakeup others in its group.
+ */
+ if (!grp_sleeping || woken)
+ continue;
+
+ sd1 = sched_domain_ptr(i, this_cpu, sd);
+
+ if (!sd1 || !sd1->flags & SD_LOAD_BALANCE)
+ continue;
+
+ interval = sd1->balance_interval;
+ /* scale ms to jiffies */
+ interval = msecs_to_jiffies(interval);
+ if (unlikely(!interval))
+ interval = 1;
+
+ if (jiffies - sd1->last_balance >= interval) {
+ /* Lets record this CPU as a possible target
+ * to be woken up. Whether we actually wake it
+ * up or not depends on the CPU's imbalance wrt
+ * others in the domain.
+ */
+ woken = 1;
+ cpu_set(i, *wakemaskp);
+ }
+#endif
}

total_load += avg_load;
@@ -2050,11 +2122,15 @@ static int load_balance(int this_cpu, ru
unsigned long imbalance;
int nr_moved, all_pinned = 0;
int active_balance = 0;
+ cpumask_t wakemask;
+#ifdef CONFIG_NO_IDLE_HZ
+ struct sched_domain *sd1;
+#endif

spin_lock(&this_rq->lock);
schedstat_inc(sd, lb_cnt[idle]);

- group = find_busiest_group(sd, this_cpu, &imbalance, idle);
+ group = find_busiest_group(sd, this_cpu, &imbalance, idle, &wakemask);
if (!group) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
@@ -2130,9 +2206,11 @@ static int load_balance(int this_cpu, ru
sd->balance_interval *= 2;
}

- return nr_moved;
+ goto out_nohz;

out_balanced:
+ nr_moved = 0;
+
spin_unlock(&this_rq->lock);

schedstat_inc(sd, lb_balanced[idle]);
@@ -2143,7 +2221,36 @@ out_balanced:
(sd->balance_interval < sd->max_interval))
sd->balance_interval *= 2;

- return 0;
+out_nohz:
+#ifdef CONFIG_NO_IDLE_HZ
+ if (!cpus_empty(wakemask)) {
+ int i;
+
+ /* Lets try to wakeup one CPU from the mask. Rest of the cpus
+ * in the mask can be woken up by other CPUs when they do load
+ * balancing in this domain. That way, the overhead of watchdog
+ * functionality is spread across (non-idle) CPUs in the domain.
+ */
+
+ for_each_cpu_mask(i, wakemask) {
+
+ sd1 = sched_domain_ptr(i, this_cpu, sd);
+
+ if (!sd1)
+ continue;
+
+ find_busiest_group(sd1, i, &imbalance, SCHED_IDLE,
+ NULL);
+ if (imbalance > 0) {
+ spin_lock(&cpu_rq(i)->lock);
+ resched_task(cpu_rq(i)->idle);
+ spin_unlock(&cpu_rq(i)->lock);
+ break;
+ }
+ }
+ }
+#endif
+ return nr_moved;
}

/*
@@ -2162,7 +2269,7 @@ static int load_balance_newidle(int this
int nr_moved = 0;

schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
- group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
+ group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, NULL);
if (!group) {
schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
goto out_balanced;
@@ -2323,6 +2430,39 @@ static void rebalance_tick(int this_cpu,
}
}
}
+
+#ifdef CONFIG_NO_IDLE_HZ
+/*
+ * Try hard to pull tasks. Called by idle task before it sleeps cutting off
+ * local timer ticks. This clears the various load counters and tries to pull
+ * tasks.
+ *
+ * Returns 1 if tasks were pulled over, 0 otherwise.
+ */
+int idle_balance_retry(void)
+{
+ int j, moved = 0, this_cpu = smp_processor_id();
+ runqueue_t *this_rq = this_rq();
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (j = 0; j < 3; j++)
+ this_rq->cpu_load[j] = 0;
+
+ rebalance_tick(this_cpu, this_rq, SCHED_IDLE);
+
+ if (this_rq->nr_running) {
+ moved = 1;
+ set_tsk_need_resched(current);
+ }
+
+ local_irq_restore(flags);
+
+ return moved;
+}
+#endif
+
#else
/*
* on UP we do not need to balance between CPUs:
diff -puN include/linux/sched.h~sched-nohz include/linux/sched.h
--- linux-2.6.12-rc3-mm2/include/linux/sched.h~sched-nohz 2005-05-04 18:23:30.000000000 +0530
+++ linux-2.6.12-rc3-mm2-vatsa/include/linux/sched.h 2005-05-04 18:23:37.000000000 +0530
@@ -897,6 +897,7 @@ extern int task_curr(const task_t *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern task_t *idle_task(int cpu);
+extern int idle_balance_retry(void);

void yield(void);

_

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-08 13:31:04

by Andi Kleen

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Srivatsa Vaddagiri <[email protected]> writes:

> Hello,
> I need some inputs from the community (specifically from virtual
> machine and embedded/power-management folks) on something that I am working on.

I think the best way is to let other CPUs handle the load balancing
for idle CPUs. Basically when a CPU goes fully idle then you mark
this in some global data structure, and CPUs doing load balancing
after doing their own thing look for others that need to be balanced
too and handle them too. When no CPU is left non idle then nothing needs
to be load balanced anyways. When a idle CPU gets a task it just gets
an reschedule IPI as usual, that wakes it up.

I call this the "scoreboard".

The trick is to evenly load balance the work over the remaining CPUs.
Something simple like never doing work for more than 1/idlecpus is
probably enough. In theory one could even use machine NUMA topology
information for this, but that would be probably overkill for the
first implementation.

With the scoreboard implementation CPus could be virtually idle
forever, which I think is best for virtualization.

BTW we need a very similar thing for RCU too.

-Andi

2005-05-08 13:33:41

by Andi Kleen

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Arjan van de Ven <[email protected]> writes:

> On Sun, 2005-05-08 at 13:50 +1000, Rusty Russell wrote:
>> My preference would be the second: fix the scheduler so it doesn't rely
>> on regular polling. However, as long as the UP case runs with no timer
>> interrupts when idle, many people will be happy (eg. most embedded).
>
> alternatively; if a CPU is idle a long time we could do a software level
> hotunplug on it (after setting it to the lowest possible frequency and
> power state), and have some sort of thing that keeps track of "spare but
> unplugged" cpus that can plug cpus back in on demand.

We need to do this anyways for RCU, because fully idle CPUs don't go
through quiescent states and could stall the whole RCU system.

But it has to be *really* lightweight because these transistion can
happen a lot (consider a CPU that very often goes to sleep for a short time)
>
> That also be nice for all the virtual environments where this could
> interact with the hypervisor etc

I am not sure how useful it is to make this heavy weight by involving
more subsystems. I would try to keep the idle state as lightweight
as possible, to keep the cost of going to sleep/waking up low.

-Andi

2005-05-08 13:44:40

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

> But it has to be *really* lightweight because these transistion can
> happen a lot (consider a CPU that very often goes to sleep for a short time)

lightweight is good of course. But even if it's medium weight.. it just
means you need to be REALLY idle (eg for longer time) for it to trigger.
I guess we need some sort of per arch "idleness threshhold" for this.

2005-05-08 14:54:04

by Andi Kleen

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Sun, May 08, 2005 at 03:44:14PM +0200, Arjan van de Ven wrote:
>
> > But it has to be *really* lightweight because these transistion can
> > happen a lot (consider a CPU that very often goes to sleep for a short time)
>
> lightweight is good of course. But even if it's medium weight.. it just
> means you need to be REALLY idle (eg for longer time) for it to trigger.
> I guess we need some sort of per arch "idleness threshhold" for this.

The question is how useful it is for the hypervisor to even know that.
Why can't it just detect long idle periods by itself if it really wants?

-Andi

2005-05-08 15:25:52

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Sun, May 08, 2005 at 03:31:00PM +0200, Andi Kleen wrote:
> I think the best way is to let other CPUs handle the load balancing
> for idle CPUs. Basically when a CPU goes fully idle then you mark
> this in some global data structure,

nohz_cpu_mask already exists for this purpose.

> and CPUs doing load balancing after doing their own thing look for others
> that need to be balanced too and handle them too.

This is precisely what I had proposed in my watchdog implementation.

> When no CPU is left non idle then nothing needs to be load balanced anyways.
> When a idle CPU gets a task it just gets an reschedule IPI as usual, that
> wakes it up.

True.

>
> I call this the "scoreboard".
>
> The trick is to evenly load balance the work over the remaining CPUs.
> Something simple like never doing work for more than 1/idlecpus is
> probably enough.

Well, there is this imbalance_pct which acts as a trigger threshold before
which load balance won't happen. I do take this into account before
waking up the sleeping idle cpu (the same imbalance_pct logic would
have been followed by the idle CPU if it were to continue taking timer
ticks).

So I guess your 1/idlecpus and the imbalance_pct may act on parallel lines.

> In theory one could even use machine NUMA topology
> information for this, but that would be probably overkill for the
> first implementation.
>
> With the scoreboard implementation CPus could be virtually idle
> forever, which I think is best for virtualization.
>
> BTW we need a very similar thing for RCU too.

RCU is taken care of already, except it is broken. There is a small
race which is not fixed. Following patch (which I wrote aainst 2.6.10 kernel
maybe) should fix that race. I intend to post this patch after test
agaist more recent kernel.

--- kernel/rcupdate.c.org 2005-02-11 11:38:47.000000000 +0530
+++ kernel/rcupdate.c 2005-02-11 11:44:08.000000000 +0530
@@ -199,8 +199,11 @@ static void rcu_start_batch(struct rcu_c
*/
static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
{
+ cpumask_t tmpmask;
+
cpu_clear(cpu, rsp->cpumask);
- if (cpus_empty(rsp->cpumask)) {
+ cpus_andnot(tmpmask, rsp->cpumask, nohz_cpu_mask);
+ if (cpus_empty(tmpmask)) {
/* batch completed ! */
rcp->completed = rcp->cur;
rcu_start_batch(rcp, rsp, 0);

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-09 06:28:07

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Srivatsa Vaddagiri wrote:
> On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
>
>>Yeah probably something around that order of magnitude. I suspect
>>there will fast be a point where either you'll get other timers
>>going off more frequently, and / or you simply get very quickly
>>diminishing returns on the amount of power saving gained from
>>increasing the period.
>
>
> I am looking at it from the other perspective also i.e, virtualized
> env. Any amount of unnecessary timer ticks will lead to equivalent amount
> of unnecessary context switches among the guest OSes.
>

Yep.

>
>>It is not so much a matter of "fixing" the scheduler as just adding
>>more heuristics. When are we too busy? When should we wake another
>>CPU? What if that CPU is an SMT sibling? What if it is across the
>>other side of the topology, and other CPUs closer to it are busy
>>as well? What if they're busy but not as busy as we are? etc.
>>
>>We've already got that covered in the existing periodic pull balancing,
>>so instead of duplicating this logic and moving this extra work to busy
>>CPUs, we can just use the existing framework.
>
>
> I don't think we have to duplicate the logic, just "reuse" whatever logic
> exists (in find_busiest_group etc). However I do agree there is movement

OK, that may possibly be an option... however:

> of extra work to busy CPUs, but that is only to help the idle CPU sleep longer.
> Whether it justifies the additional complexity or not is what this RFC is
> about I guess!
>

Yeah, this is a bit worrying. In general we should not be loading
up busy CPUs with any more work, and sleeping idle CPUs should be
done as a blunt "slowpath" operation. Ie. something that works well
enough.

> FWIW, I have also made some modifications in the original proposal
> for reducing the watchdog workload (instead of the same non-idle cpu waking
> up all the sleeping CPUs it finds in the same rebalance_tick, the task
> is spread over multiple non-idle tasks in different rebalance_ticks).
> New (lightly tested) patch is in the mail below.
>

Mmyeah, I'm not a big fan :)

I could probably find some time to do my implementation if you have
a complete working patch for eg. UML.

>
>
>>At least we should try method A first, and if that isn't good enough
>>(though I suspect it will be), then think about adding more complexity
>>to the scheduler.
>
>
> What would be good to measure between the two approaches is the CPU utilization
> (over a period of time - say 10 hrs) of somewhat lightly loaded SMP guest OSes
> (i.e some CPUs are idle and other CPUs of the same guest are not idle), when
> multiple such guest OSes are running simultaneously on the same box. This
> means I need a port of VST to UML :(
>

Yeah that would be good.

--
SUSE Labs, Novell Inc.

2005-05-11 18:05:35

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

* Nick Piggin <[email protected]> [050507 21:15]:
> Rusty Russell wrote:
> >On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
> >
> >>Two solutions have been proposed so far:
> >>
> >> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
> >> say a second, Nick?) on how long a idle CPU can avoid taking
>
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.
>
> >> local-timer ticks. As a result, the load imbalance could exist
> >> only
> >> for this max duration, after which the sleeping CPU will wake up
> >> and balance itself. If there is no imbalance, it can go and sleep
> >> again for the max duration.
> >>
> >> For ex, lets say a idle CPU found that it doesn't have any near
> >> timer
> >> for the next 1 minute. Instead of letting it sleep for 1 minute in
> >> a single stretch, we let it sleep in bursts of 100 msec (or
> >> whatever
> >> is the max. duration chosen). This still is better than having
> >> the idle CPU take HZ ticks a second.
> >>
> >> As a special case, when all the CPUs of an image go idle, we
> >> could consider completely shutting off local timer ticks
> >> across all CPUs (till the next non-timer interrupt).
> >>
> >>
> >> B. Don't impose any max limit on how long a idle CPU can sleep.
> >> Here we let the idle CPU sleep as long as it wants. It is
> >> woken up by a "busy" CPU when it detects an imbalance. The
> >> busy CPU acts as a watchdog here. If there are no such
> >> busy CPUs, then it means that nobody will acts as watchdogs
> >> and idle CPUs sleep as long as they want. A possible watchdog
> >> implementation has been discussed at:
> >>
> >> http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
> >
> >
> >My preference would be the second: fix the scheduler so it doesn't rely
> >on regular polling.
>
> It is not so much a matter of "fixing" the scheduler as just adding
> more heuristics. When are we too busy? When should we wake another
> CPU? What if that CPU is an SMT sibling? What if it is across the
> other side of the topology, and other CPUs closer to it are busy
> as well? What if they're busy but not as busy as we are? etc.
>
> We've already got that covered in the existing periodic pull balancing,
> so instead of duplicating this logic and moving this extra work to busy
> CPUs, we can just use the existing framework.
>
> At least we should try method A first, and if that isn't good enough
> (though I suspect it will be), then think about adding more complexity
> to the scheduler.
>
> > However, as long as the UP case runs with no timer
> >interrupts when idle, many people will be happy (eg. most embedded).
> >
>
> Well in the UP case, both A and B should basically degenerate to the
> same thing.
>
> Probably the more important case for the scheduler is to be able to
> turn off idle SMP hypervisor clients, Srivatsa?

Sorry to jump in late. For embedded stuff we should be able to skip
ticks until something _really_ happens, like an interrupt.

So we need to be able to skip ticks several seconds at a time. Ticks
should be event driven. For embedded systems option B is really
the only way to go to take advantage of the power savings.

Of course the situation is different on servers, where the goal is
to save ticks to be able to run more virtual machines. And then
cutting down the ticks down to few per second does the trick.

Regards,

Tony

2005-05-12 08:39:11

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Mon, May 09, 2005 at 04:27:26PM +1000, Nick Piggin wrote:
> I could probably find some time to do my implementation if you have
> a complete working patch for eg. UML.

Well, turns out that if we restrict the amount of time idle cpus are
allowed to sleep, then there is very little change reqd in the scheduler.
Most of the calculation of exponential sleep times can be done outside
it (in the idle CPU's code).

First, the scheduler support to zero cpu_load[] counters before idle
cpu sleeps.

---

linux-2.6.12-rc3-mm3-vatsa/include/linux/sched.h | 1
linux-2.6.12-rc3-mm3-vatsa/kernel/sched.c | 33 +++++++++++++++++++++++
2 files changed, 34 insertions(+)

diff -puN kernel/sched.c~sched-nohz kernel/sched.c
--- linux-2.6.12-rc3-mm3/kernel/sched.c~sched-nohz 2005-05-11 17:05:13.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/kernel/sched.c 2005-05-11 17:06:38.000000000 +0530
@@ -2323,6 +2323,39 @@ static void rebalance_tick(int this_cpu,
}
}
}
+
+#ifdef CONFIG_NO_IDLE_HZ
+/*
+ * Try hard to pull tasks. Called by idle task before it sleeps cutting off
+ * local timer ticks. This clears the various load counters and tries to pull
+ * tasks.
+ *
+ * Returns 1 if tasks were pulled over, 0 otherwise.
+ */
+int idle_balance_retry(void)
+{
+ int j, moved = 0, this_cpu = smp_processor_id();
+ runqueue_t *this_rq = this_rq();
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (j = 0; j < 3; j++)
+ this_rq->cpu_load[j] = 0;
+
+ rebalance_tick(this_cpu, this_rq, SCHED_IDLE);
+
+ if (this_rq->nr_running) {
+ moved = 1;
+ set_tsk_need_resched(current);
+ }
+
+ local_irq_restore(flags);
+
+ return moved;
+}
+#endif
+
#else
/*
* on UP we do not need to balance between CPUs:
diff -puN include/linux/sched.h~sched-nohz include/linux/sched.h
--- linux-2.6.12-rc3-mm3/include/linux/sched.h~sched-nohz 2005-05-11 17:05:13.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/include/linux/sched.h 2005-05-11 17:13:19.000000000 +0530
@@ -897,6 +897,7 @@ extern int task_curr(const task_t *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern task_t *idle_task(int cpu);
+extern int idle_balance_retry(void);

void yield(void);

_

A sample patch that implements exponential sleep time is below. Note that this
patch only makes idle cpu pretend as if it is asleep (instead of really cutting
of timer ticks). I used this merely to test the scheduler change.

Martin,
You probably need something like this for S390 arch!

---

linux-2.6.12-rc3-mm3-vatsa/arch/i386/Kconfig | 4 +
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/apic.c | 16 ++++--
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/irq.c | 4 +
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/process.c | 47 ++++++++++++++++--
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/smp.c | 6 ++
5 files changed, 69 insertions(+), 8 deletions(-)

diff -puN arch/i386/Kconfig~vst-sim arch/i386/Kconfig
--- linux-2.6.12-rc3-mm3/arch/i386/Kconfig~vst-sim 2005-05-10 15:53:33.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/Kconfig 2005-05-10 15:54:22.000000000 +0530
@@ -443,6 +443,10 @@ config X86_OOSTORE
depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR
default y

+config NO_IDLE_HZ
+ bool "Tickless Idle CPUs support"
+ default n
+
config HPET_TIMER
bool "HPET Timer Support"
help
diff -puN arch/i386/kernel/process.c~vst-sim arch/i386/kernel/process.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/process.c~vst-sim 2005-05-10 15:53:34.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/process.c 2005-05-12 14:06:16.000000000 +0530
@@ -94,6 +94,12 @@ void enable_hlt(void)

EXPORT_SYMBOL(enable_hlt);

+DEFINE_PER_CPU(int, idle_asleep);
+DEFINE_PER_CPU(unsigned long, sleep_duration);
+
+#define MAX_SLEEP_DURATION 128 /* in tick counts */
+#define MIN_SLEEP_DURATION 8 /* in tick counts */
+
/*
* We use this if we don't have any better
* idle routine..
@@ -102,8 +108,36 @@ void default_idle(void)
{
if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
local_irq_disable();
- if (!need_resched())
- safe_halt();
+ if (!need_resched()) {
+ unsigned long jif_next, jif_delta;
+
+ jif_next = next_timer_interrupt();
+ jif_delta = jif_next - jiffies;
+
+ if (jif_delta > MIN_SLEEP_DURATION) {
+ unsigned long slpint;
+
+ if (idle_balance_retry()) {
+ local_irq_enable();
+ return;
+ }
+
+ slpint = min(__get_cpu_var(sleep_duration),
+ jif_delta);
+
+ jif_next = jiffies + slpint;
+ /* Hack to discard local timer ticks */
+ __get_cpu_var(idle_asleep) = 1;
+ cpu_set(smp_processor_id(), nohz_cpu_mask);
+ local_irq_enable();
+ while ((jiffies < jif_next-1) &&
+ __get_cpu_var(idle_asleep))
+ cpu_relax();
+ __get_cpu_var(idle_asleep) = 0;
+ cpu_clear(smp_processor_id(), nohz_cpu_mask);
+ } else
+ safe_halt();
+ }
else
local_irq_enable();
} else {
@@ -178,6 +212,8 @@ void cpu_idle(void)
{
int cpu = _smp_processor_id();

+ __get_cpu_var(sleep_duration) = MIN_SLEEP_DURATION;
+
/* endless idle loop with no priority at all */
while (1) {
while (!need_resched()) {
@@ -189,7 +225,7 @@ void cpu_idle(void)
rmb();
idle = pm_idle;

- if (!idle)
+ //if (!idle)
idle = default_idle;

if (cpu_is_offline(cpu))
@@ -197,7 +233,12 @@ void cpu_idle(void)

__get_cpu_var(irq_stat).idle_timestamp = jiffies;
idle();
+
+ if (__get_cpu_var(sleep_duration) < MAX_SLEEP_DURATION)
+ __get_cpu_var(sleep_duration) *= 2;
+
}
+ __get_cpu_var(sleep_duration) = MIN_SLEEP_DURATION;
schedule();
}
}
diff -puN arch/i386/kernel/irq.c~vst-sim arch/i386/kernel/irq.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/irq.c~vst-sim 2005-05-10 15:53:34.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/irq.c 2005-05-10 15:53:47.000000000 +0530
@@ -46,6 +46,8 @@ static union irq_ctx *hardirq_ctx[NR_CPU
static union irq_ctx *softirq_ctx[NR_CPUS];
#endif

+DECLARE_PER_CPU(int, idle_asleep);
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -60,6 +62,8 @@ fastcall unsigned int do_IRQ(struct pt_r
u32 *isp;
#endif

+ __get_cpu_var(idle_asleep) = 0;
+
irq_enter();
#ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
diff -puN arch/i386/kernel/smp.c~vst-sim arch/i386/kernel/smp.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/smp.c~vst-sim 2005-05-11 16:59:38.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/smp.c 2005-05-11 16:59:58.000000000 +0530
@@ -309,6 +309,8 @@ static inline void leave_mm (unsigned lo
* 2) Leave the mm if we are in the lazy tlb mode.
*/

+DECLARE_PER_CPU(int, idle_asleep);
+
fastcall void smp_invalidate_interrupt(struct pt_regs *regs)
{
unsigned long cpu;
@@ -336,6 +338,7 @@ fastcall void smp_invalidate_interrupt(s
leave_mm(cpu);
}
ack_APIC_irq();
+ __get_cpu_var(idle_asleep) = 0;
smp_mb__before_clear_bit();
cpu_clear(cpu, flush_cpumask);
smp_mb__after_clear_bit();
@@ -598,6 +601,8 @@ void smp_send_stop(void)
fastcall void smp_reschedule_interrupt(struct pt_regs *regs)
{
ack_APIC_irq();
+
+ __get_cpu_var(idle_asleep) = 0;
}

fastcall void smp_call_function_interrupt(struct pt_regs *regs)
@@ -607,6 +612,7 @@ fastcall void smp_call_function_interrup
int wait = call_data->wait;

ack_APIC_irq();
+ __get_cpu_var(idle_asleep) = 0;
/*
* Notify initiating CPU that I've grabbed the data and am
* about to execute the function
diff -puN arch/i386/kernel/apic.c~vst-sim arch/i386/kernel/apic.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/apic.c~vst-sim 2005-05-10 15:53:36.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/apic.c 2005-05-10 15:53:47.000000000 +0530
@@ -1171,6 +1171,8 @@ inline void smp_local_timer_interrupt(st
*/
}

+DECLARE_PER_CPU(int, idle_asleep);
+
/*
* Local APIC timer interrupt. This is the most natural way for doing
* local interrupts, but local timer interrupts can be emulated by
@@ -1185,15 +1187,19 @@ fastcall void smp_apic_timer_interrupt(s
int cpu = smp_processor_id();

/*
- * the NMI deadlock-detector uses this.
- */
- per_cpu(irq_stat, cpu).apic_timer_irqs++;
-
- /*
* NOTE! We'd better ACK the irq immediately,
* because timer handling can be slow.
*/
ack_APIC_irq();
+
+ if (__get_cpu_var(idle_asleep))
+ return;
+
+ /*
+ * the NMI deadlock-detector uses this.
+ */
+ per_cpu(irq_stat, cpu).apic_timer_irqs++;
+
/*
* update_process_times() expects us to have done irq_enter().
* Besides, if we don't timer interrupts ignore the global

_
--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-12 08:54:48

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> Sorry to jump in late. For embedded stuff we should be able to skip
> ticks until something _really_ happens, like an interrupt.
>
> So we need to be able to skip ticks several seconds at a time. Ticks
> should be event driven. For embedded systems option B is really
> the only way to go to take advantage of the power savings.

I don't know how sensitive embedded platforms are to load imbalance.
If they are not sensitive, then we could let the max time idle
cpus are allowed to sleep to be few seconds. That way, idle CPU
wakes up once in 3 or 4 seconds to check for imbalance and still
be able to save power for those 3/4 seconds that it sleeps.

I guess it is a tradeoff here between the complexity we want to
put and the real benefit we get. Its hard for me to get the numbers
(since I don't have easy access to the right tools to measure them)
to show how much the real benefit is :(

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-12 16:01:24

by Lee Revell

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > Sorry to jump in late. For embedded stuff we should be able to skip
> > ticks until something _really_ happens, like an interrupt.
> >
> > So we need to be able to skip ticks several seconds at a time. Ticks
> > should be event driven. For embedded systems option B is really
> > the only way to go to take advantage of the power savings.
>
> I don't know how sensitive embedded platforms are to load imbalance.
> If they are not sensitive, then we could let the max time idle
> cpus are allowed to sleep to be few seconds. That way, idle CPU
> wakes up once in 3 or 4 seconds to check for imbalance and still
> be able to save power for those 3/4 seconds that it sleeps.

Not very. Embedded systems are usually UP so don't care at all. If an
embedded system is SMP often it's because one CPU is dedicated to RT
tasks, and this model will become less common as RT preemption allows
you to do everything on a single processor.

Lee

2005-05-12 16:17:07

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

* Lee Revell <[email protected]> [050512 09:05]:
> On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> > On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > > Sorry to jump in late. For embedded stuff we should be able to skip
> > > ticks until something _really_ happens, like an interrupt.
> > >
> > > So we need to be able to skip ticks several seconds at a time. Ticks
> > > should be event driven. For embedded systems option B is really
> > > the only way to go to take advantage of the power savings.
> >
> > I don't know how sensitive embedded platforms are to load imbalance.
> > If they are not sensitive, then we could let the max time idle
> > cpus are allowed to sleep to be few seconds. That way, idle CPU
> > wakes up once in 3 or 4 seconds to check for imbalance and still
> > be able to save power for those 3/4 seconds that it sleeps.
>
> Not very. Embedded systems are usually UP so don't care at all. If an
> embedded system is SMP often it's because one CPU is dedicated to RT
> tasks, and this model will become less common as RT preemption allows
> you to do everything on a single processor.

Yes UP mostly & sounds like this only affects MP systems.

Although there is ARM MP support patches available.
I guess embedded MP systems may be used for multimedia
stuff eventually.

Tony

2005-05-12 16:31:09

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thursday, May 12, 2005 9:16 am, Tony Lindgren wrote:
> * Lee Revell <[email protected]> [050512 09:05]:
> > On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> > > On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > > > Sorry to jump in late. For embedded stuff we should be able to
> > > > skip ticks until something _really_ happens, like an interrupt.
> > > >
> > > > So we need to be able to skip ticks several seconds at a time.
> > > > Ticks should be event driven. For embedded systems option B is
> > > > really the only way to go to take advantage of the power
> > > > savings.

That seems like a lot of added complexity. Isn't it possible to go
totally tickless and actually *remove* some of the complexity of the
current design? I know Linus has frowned upon the idea in the past,
but I've had to deal with the tick code a bit in the past, and it seems
like getting rid of ticks entirely might actually simplify things (both
conceptually and code-wise).

Seems like we could schedule timer interrupts based solely on add_timer
type stuff; the scheduler could use it if necessary for load balancing
(along with fork/exec based balancing perhaps) on large machines where
load imbalances hurt throughput a lot. But on small systems if all
your processes were blocked, you'd just go to sleep indefinitely and
save a bit of power and avoid unnecessary overhead.

I haven't looked at the lastest tickless patches, so I'm not sure if my
claims of simplicity are overblown, but especially as multiprocessor
systems become more and more common it just seems wasteful to wakeup
all the CPUs every so often only to have them find that they have
nothing to do.

Jesse

2005-05-12 17:12:12

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thu, May 12, 2005 at 09:28:55AM -0700, Jesse Barnes wrote:
> Seems like we could schedule timer interrupts based solely on add_timer
> type stuff; the scheduler could use it if necessary for load balancing
> (along with fork/exec based balancing perhaps) on large machines where
> load imbalances hurt throughput a lot. But on small systems if all

Even if we were to go for this tickless design, the fundamental question
remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some other
(busy) CPU wake it up (which makes the implementation complex) or the idle CPU
checks imbalance itself at periodic intervals (which restricts the amount of
time a idle CPU may sleep).

> your processes were blocked, you'd just go to sleep indefinitely and
> save a bit of power and avoid unnecessary overhead.
>
> I haven't looked at the lastest tickless patches, so I'm not sure if my
> claims of simplicity are overblown, but especially as multiprocessor
> systems become more and more common it just seems wasteful to wakeup
> all the CPUs every so often only to have them find that they have
> nothing to do.

I guess George's experience in implementing tickless systems is that
it is more of an overhead for a general purpose OS like Linux. George?

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-12 18:00:19

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Srivatsa Vaddagiri wrote:

>Even if we were to go for this tickless design, the fundamental question
>remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some other
>(busy) CPU wake it up (which makes the implementation complex) or the idle CPU
>checks imbalance itself at periodic intervals (which restricts the amount of
>time a idle CPU may sleep).
>
>
Waking it up at fork or exec time might be doable, and having a busy CPU
wake up other CPUs wouldn't add too much complexity, would it?

>I guess George's experience in implementing tickless systems is that
>it is more of an overhead for a general purpose OS like Linux. George?
>
>
The latest patches seem to do tick skipping rather than wholesale
ticklessness. Admittedly, the latter is a more invasive change, but one
that may end up being simpler in the long run. But maybe George did a
design like that in the past and rejected it?

Jesse

2005-05-12 18:10:12

by Martin Schwidefsky

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

> > Seems like we could schedule timer interrupts based solely on add_timer

> > type stuff; the scheduler could use it if necessary for load balancing
> > (along with fork/exec based balancing perhaps) on large machines where
> > load imbalances hurt throughput a lot. But on small systems if all
>
> Even if we were to go for this tickless design, the fundamental question
> remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some
other
> (busy) CPU wake it up (which makes the implementation complex) or the
idle CPU
> checks imbalance itself at periodic intervals (which restricts the amount
of
> time a idle CPU may sleep).

I would prefer a solution where the busy CPU wakes up an idle CPU if the
imbalance is too large. Any scheme that requires an idle CPU to poll at
some intervals will have one of two problem: either the poll intervall
is large then the imbalance will stay around for a long time, or the
poll intervall is small then this will behave badly in a heavily
virtualized environment with many images.

> > your processes were blocked, you'd just go to sleep indefinitely and
> > save a bit of power and avoid unnecessary overhead.
> >
> > I haven't looked at the lastest tickless patches, so I'm not sure if my
> > claims of simplicity are overblown, but especially as multiprocessor
> > systems become more and more common it just seems wasteful to wakeup
> > all the CPUs every so often only to have them find that they have
> > nothing to do.
>
> I guess George's experience in implementing tickless systems is that
> it is more of an overhead for a general purpose OS like Linux. George?

The original implementation of a tickless system introduced an overhead
in the system call path. The recent solution is tickless while in idle
that does not have that overhead any longer. The second source of
overhead is the need to reprogram the timer interrupt source. That
can be expensive (see i386) or cheap (see s390). So it depends, as usual.
In our experience on s390 tickless systems is a huge win.

blue skies,
Martin

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

2005-05-12 18:17:04

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

* Jesse Barnes <[email protected]> [050512 11:01]:
> Srivatsa Vaddagiri wrote:
>
> >Even if we were to go for this tickless design, the fundamental question
> >remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some
> >other
> >(busy) CPU wake it up (which makes the implementation complex) or the idle
> >CPU checks imbalance itself at periodic intervals (which restricts the
> >amount of
> >time a idle CPU may sleep).
> >
> >
> Waking it up at fork or exec time might be doable, and having a busy CPU
> wake up other CPUs wouldn't add too much complexity, would it?

At least then it would be event driven rather than polling approach.

Tony

2005-05-12 18:21:58

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

* Martin Schwidefsky <[email protected]> [050512 11:15]:
> > > Seems like we could schedule timer interrupts based solely on add_timer
>
> > > type stuff; the scheduler could use it if necessary for load balancing
> > > (along with fork/exec based balancing perhaps) on large machines where
> > > load imbalances hurt throughput a lot. But on small systems if all
> >
> > Even if we were to go for this tickless design, the fundamental question
> > remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some
> other
> > (busy) CPU wake it up (which makes the implementation complex) or the
> idle CPU
> > checks imbalance itself at periodic intervals (which restricts the amount
> of
> > time a idle CPU may sleep).
>
> I would prefer a solution where the busy CPU wakes up an idle CPU if the
> imbalance is too large. Any scheme that requires an idle CPU to poll at
> some intervals will have one of two problem: either the poll intervall
> is large then the imbalance will stay around for a long time, or the
> poll intervall is small then this will behave badly in a heavily
> virtualized environment with many images.
>
> > > your processes were blocked, you'd just go to sleep indefinitely and
> > > save a bit of power and avoid unnecessary overhead.
> > >
> > > I haven't looked at the lastest tickless patches, so I'm not sure if my
> > > claims of simplicity are overblown, but especially as multiprocessor
> > > systems become more and more common it just seems wasteful to wakeup
> > > all the CPUs every so often only to have them find that they have
> > > nothing to do.
> >
> > I guess George's experience in implementing tickless systems is that
> > it is more of an overhead for a general purpose OS like Linux. George?
>
> The original implementation of a tickless system introduced an overhead
> in the system call path. The recent solution is tickless while in idle
> that does not have that overhead any longer. The second source of
> overhead is the need to reprogram the timer interrupt source. That
> can be expensive (see i386) or cheap (see s390). So it depends, as usual.
> In our experience on s390 tickless systems is a huge win.

In the dyn-tick patch for ARM and x86, timer reprogramming is only
done during idle. When the system is busy, there is continuous tick
and timer reprogramming is skipped. So the second source of overhead
should be minimal.

Tony

2005-05-12 21:16:45

by George Anzinger

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Srivatsa Vaddagiri wrote:
> On Thu, May 12, 2005 at 09:28:55AM -0700, Jesse Barnes wrote:
>
>>Seems like we could schedule timer interrupts based solely on add_timer
>>type stuff; the scheduler could use it if necessary for load balancing
>>(along with fork/exec based balancing perhaps) on large machines where
>>load imbalances hurt throughput a lot. But on small systems if all
>
>
> Even if we were to go for this tickless design, the fundamental question
> remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some other
> (busy) CPU wake it up (which makes the implementation complex) or the idle CPU
> checks imbalance itself at periodic intervals (which restricts the amount of
> time a idle CPU may sleep).
>
>
>>your processes were blocked, you'd just go to sleep indefinitely and
>>save a bit of power and avoid unnecessary overhead.
>>
>>I haven't looked at the latest tickless patches, so I'm not sure if my
>>claims of simplicity are overblown, but especially as multiprocessor
>>systems become more and more common it just seems wasteful to wakeup
>>all the CPUs every so often only to have them find that they have
>>nothing to do.
>
>
> I guess George's experience in implementing tickless systems is that
> it is more of an overhead for a general purpose OS like Linux. George?

The tickless system differs from VST in that it is designed to only "tick" when
there is something in the time list to do and it does this ALWAYS. The VST
notion is that ticks are not needed if the cpu is idle. This is VASTLY simpler
to do than a tickless system because, mostly, the accounting requirements. When
a VST cpu is not ticking, the full sleep time is charged to the idle task and
the system does not need to time slice or do any time driven user profiling or
execution limiting.

And this is exactly where the tickless system runs into overload. Simply
speaking, each task has with it certain limits and requirements WRT time. It is
almost always time sliced, but it may also have execution limits and settimer
execution time interrupts that it wants. These need to be set up for each task
when it is switched to and removed when the system switches away from it. In
the test I did, I reduced all these timers to one (I think I just used the slice
time, but this is not the thing to do if the user is trying to do profiling. In
any case, only one timer needs to be set up, possibly some work needs to be done
to find the min. of all the execution time timers it has, but only that one
needs to go in the time list.). BUT it needs to happen each context switch time
and thus adds overhead to the switch time. In my testing, the overhead for this
became higher than the ticked system overhead for the same services at a
relatively low context switch rate. From a systems point of view, you just
don't want to increase overhead when the load goes up. This leads to fragile
systems.
>
Now, the question still remains, if a cpu in an SMP system is sleeping because
of VST, who or how is it to be wakened to responded to increases in the system
load. If all CPUs are sleeping, there is some event (i.e. interrupt) that does
this. I think, in an SMP system, any awake CPUs should, during their load
balancing, notice that there are sleeping CPUs and wake them as the load increases.

>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-12 21:37:41

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thursday, May 12, 2005 2:16 pm, George Anzinger wrote:
> The tickless system differs from VST in that it is designed to only
> "tick" when there is something in the time list to do and it does
> this ALWAYS.

Right, that's what I gathered (and what I was getting at).

> And this is exactly where the tickless system runs into overload.
> Simply speaking, each task has with it certain limits and
> requirements WRT time. It is almost always time sliced, but it may
> also have execution limits and settimer execution time interrupts
> that it wants.

Right...

> These need to be set up for each task when it is
> switched to and removed when the system switches away from it.

Why do they need to be switched when the task's slice is up? Can't they
remain in the timer list? I guess I'm imagining a global outstanding
timer list that's manipulated by add/remove_timer, settimer, etc..
When a timeout occurs the timer interrupt handler could mark tasks
runnable again, bump their priority, send SIGALRM, or whatever. Most
timers are deleted before they expire anyway, right? If that's true
you definitely don't want to save and restore them on every context
switch...

> In
> the test I did, I reduced all these timers to one (I think I just
> used the slice time, but this is not the thing to do if the user is
> trying to do profiling. In any case, only one timer needs to be set
> up, possibly some work needs to be done to find the min. of all the
> execution time timers it has, but only that one needs to go in the
> time list.).

Or at least only the nearest one has to be programmed as the next
interrupt, and before going back to sleep the kernel could check if any
timers had expired while the last one was running (aren't missing
jiffies accounted for this way too, but of course jiffies would go away
in this scheme)?

> BUT it needs to happen each context switch time and
> thus adds overhead to the switch time. In my testing, the overhead
> for this became higher than the ticked system overhead for the same
> services at a relatively low context switch rate.

Yeah, that does sound expensive.

> From a systems
> point of view, you just don't want to increase overhead when the load
> goes up. This leads to fragile systems.
>
> Now, the question still remains, if a cpu in an SMP system is
> sleeping because of VST, who or how is it to be wakened to responded
> to increases in the system load. If all CPUs are sleeping, there is
> some event (i.e. interrupt) that does this. I think, in an SMP
> system, any awake CPUs should, during their load balancing, notice
> that there are sleeping CPUs and wake them as the load increases.

Sounds reasonable to me, should be as simple as a 'wake up and load
balance' IPI, right?

Jesse

2005-05-12 22:16:08

by George Anzinger

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Jesse Barnes wrote:
> On Thursday, May 12, 2005 2:16 pm, George Anzinger wrote:
>
>>The tickless system differs from VST in that it is designed to only
>>"tick" when there is something in the time list to do and it does
>>this ALWAYS.
>
>
> Right, that's what I gathered (and what I was getting at).
>
>
>>And this is exactly where the tickless system runs into overload.
>>Simply speaking, each task has with it certain limits and
>>requirements WRT time. It is almost always time sliced, but it may
>>also have execution limits and settimer execution time interrupts
>>that it wants.
>
>
> Right...
>
>
>>These need to be set up for each task when it is
>>switched to and removed when the system switches away from it.
>
>
> Why do they need to be switched when the task's slice is up?

No, that would not be too bad, but they need to be removed when the task is
switched away from. This happens FAR more often than a slice expiring. Most
tasks are switched away from because they block, not because their time is over.

Can't they
> remain in the timer list?

We are timeing the tasks slice. It is only active when the task has the cpu...

I guess I'm imagining a global outstanding
> timer list that's manipulated by add/remove_timer, settimer, etc..
> When a timeout occurs the timer interrupt handler could mark tasks
> runnable again, bump their priority, send SIGALRM, or whatever. Most

The timers that cause the problem are the ones that only run when the task is
active. These are the slice timer, the profile timer (ITIMER_PROF), the
execution limit timer and the settime timer that is relative to execution time
(ITIMER_VIRTUAL).

Again, we can colapse all these to one, but still it needs to be setup when the
task is switched to and removed when it is switched away.

> timers are deleted before they expire anyway, right? If that's true
> you definitely don't want to save and restore them on every context
> switch...

One of these timers is almost ALWAYS needed. And yes, mostly they do not
expire. That is usually good as an expire cost more than a removal (provided
you do not need to reprogram the timer as a result).

Timers that run on system time (ITIMER_REAL) stay in the list even when the task
is inactive.
>
>
>>In
>>the test I did, I reduced all these timers to one (I think I just
>>used the slice time, but this is not the thing to do if the user is
>>trying to do profiling. In any case, only one timer needs to be set
>>up, possibly some work needs to be done to find the min. of all the
>>execution time timers it has, but only that one needs to go in the
>>time list.).
>
>
> Or at least only the nearest one has to be programmed as the next
> interrupt, and before going back to sleep the kernel could check if any
> timers had expired while the last one was running (aren't missing
> jiffies accounted for this way too, but of course jiffies would go away
> in this scheme)?
>
>
>>BUT it needs to happen each context switch time and
>>thus adds overhead to the switch time. In my testing, the overhead
>>for this became higher than the ticked system overhead for the same
>>services at a relatively low context switch rate.
>
>
> Yeah, that does sound expensive.
>
>
>>From a systems
>>point of view, you just don't want to increase overhead when the load
>>goes up. This leads to fragile systems.
>>
>>Now, the question still remains, if a cpu in an SMP system is
>>sleeping because of VST, who or how is it to be wakened to responded
>>to increases in the system load. If all CPUs are sleeping, there is
>>some event (i.e. interrupt) that does this. I think, in an SMP
>>system, any awake CPUs should, during their load balancing, notice
>>that there are sleeping CPUs and wake them as the load increases.
>
>
> Sounds reasonable to me, should be as simple as a 'wake up and load
> balance' IPI, right?

I think there is already an IPI to tell another cpu that it has work. That
should be enough. Need to check, however. From the VST point of view, any
interrupt wake the cpu from the VST sleep, so it need not even target the
scheduler..

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-13 00:43:44

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thursday, May 12, 2005 3:15 pm, George Anzinger wrote:
> The timers that cause the problem are the ones that only run when the
> task is active. These are the slice timer, the profile timer
> (ITIMER_PROF), the execution limit timer and the settime timer that
> is relative to execution time (ITIMER_VIRTUAL).

ITIMER_PROF could simply be ignored if the task it corresponds to isn't
active when it fires, so it wouldn't incur any overhead.
ITIMER_VIRTUAL sounds like it would uglify things though, and of course
unused timer slice interrupts would have to be cleared out.

> Again, we can colapse all these to one, but still it needs to be
> setup when the task is switched to and removed when it is switched
> away.

Right, I see what you're saying now. It's not as simple as I had hoped.

> Timers that run on system time (ITIMER_REAL) stay in the list even
> when the task is inactive.

Right, they'll cause the task they're associated with to become runnable
again, or get a signal, or whatever.

> I think there is already an IPI to tell another cpu that it has work.
> That should be enough. Need to check, however. From the VST point
> of view, any interrupt wake the cpu from the VST sleep, so it need
> not even target the scheduler..

But in this case you probably want it to, so it can rebalance tasks to
the CPU that just woke up.

Jesse

2005-05-13 06:23:38

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thu, May 12, 2005 at 08:08:26PM +0200, Martin Schwidefsky wrote:
> I would prefer a solution where the busy CPU wakes up an idle CPU if the
> imbalance is too large. Any scheme that requires an idle CPU to poll at
> some intervals will have one of two problem: either the poll intervall
> is large then the imbalance will stay around for a long time, or the
> poll intervall is small then this will behave badly in a heavily
> virtualized environment with many images.

I guess all the discussions we are having boils down to this: what is the max
time one can afford to have an imbalanced system because of sleeping idle CPU
not participating in load balance? 10ms, 100ms, 1 sec or more?

Maybe the answer depends on how much imbalance it is that we are talking of
here. A high order of imbalance would mean that the sleeping idle CPUs have
to be woken up quickly, while a low order imbalance could mean that
we can let it sleep longer.

>From all the discussions we have been having, I think a watchdog
implementation makes more sense. Nick/Ingo, what do you think
should be our final decision on this?

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-13 06:27:09

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thu, May 12, 2005 at 10:59:59AM -0700, Jesse Barnes wrote:
> The latest patches seem to do tick skipping rather than wholesale
> ticklessness. Admittedly, the latter is a more invasive change, but one

If you are referring to my i386 patch, that was only a hack to test the
scheduler change!

> that may end up being simpler in the long run. But maybe George did a
> design like that in the past and rejected it?

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-13 06:31:15

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Thu, May 12, 2005 at 05:43:46PM -0700, Jesse Barnes wrote:
> But in this case you probably want it to, so it can rebalance tasks to
> the CPU that just woke up.

Usually the sleeping idle CPU is sent a resched IPI, which will cause it to
call schedule or idle_balance_retry/rebalance_tick (ref: my earlier patch) to
find out if it has do a load balance.

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-13 07:16:45

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Srivatsa Vaddagiri wrote:
> On Thu, May 12, 2005 at 08:08:26PM +0200, Martin Schwidefsky wrote:
>
>>I would prefer a solution where the busy CPU wakes up an idle CPU if the
>>imbalance is too large. Any scheme that requires an idle CPU to poll at
>>some intervals will have one of two problem: either the poll intervall
>>is large then the imbalance will stay around for a long time, or the
>>poll intervall is small then this will behave badly in a heavily
>>virtualized environment with many images.
>
>
> I guess all the discussions we are having boils down to this: what is the max
> time one can afford to have an imbalanced system because of sleeping idle CPU
> not participating in load balance? 10ms, 100ms, 1 sec or more?
>

Exactly. Any scheme that requires a busy CPU to poll at some intervals
will also have problems: you are putting more work on the busy CPUs,
there will be superfluous IPIs, and there will be cases where the
imbalance is allowed to get too large. Not really much different from
doing the work with the idle CPUs really, *except* that it will introduce
a lot of complexity that nobody has shown is needed.

> Maybe the answer depends on how much imbalance it is that we are talking of
> here. A high order of imbalance would mean that the sleeping idle CPUs have
> to be woken up quickly, while a low order imbalance could mean that
> we can let it sleep longer.
>
>>From all the discussions we have been having, I think a watchdog
> implementation makes more sense. Nick/Ingo, what do you think
> should be our final decision on this?
>

Well the complex solution won't go in until it is shown that the
simple version has fundamental failure cases - but I don't think
we need to make a final decision yet do we?

--
SUSE Labs, Novell Inc.

2005-05-13 08:05:12

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

* Nick Piggin <[email protected]> wrote:

> > From all the discussions we have been having, I think a watchdog
> > implementation makes more sense. Nick/Ingo, what do you think
> > should be our final decision on this?
>
> Well the complex solution won't go in until it is shown that the
> simple version has fundamental failure cases - but I don't think we
> need to make a final decision yet do we?

there's no need to make a final decision yet. But the more complex
watchdog solution does have the advantage of putting idle CPUs to sleep
immediately and perpetually.

the power equation is really easy: the implicit cost of a deep CPU sleep
is say 1-2 msecs. (that's how long it takes to shut the CPU and the bus
down, etc.) If we do an exponential backoff we periodically re-wake the
CPU fully up again - wasting 1-2msec (or more) more power. With the
watchdog solution we have more overhead on the busy CPU but it takes
_much_ less power for a truly idle CPU to be turned off. [the true
'effective cost' all depends on the scheduling pattern as well, but the
calculation before is still valid.] Whatever the algorithmic overhead of
the watchdog code, it's dwarved by the power overhead caused by false
idle-wakeups of CPUs under exponential backoff.

the watchdog solution - despite being more complex - is also more
orthogonal in that it does not change the balancing decisions at all -
they just get offloaded to another CPU. The exponential backoff OTOH
materially changes how we do SMP balancing - which might or might not
matter much, but it will always depend on circumstances. So in the long
run the watchdog solution is probably easier to control. (because it's
just an algorithm offload, not a material scheduling feature.)

so unless there are strong implementational arguments against the
watchdog solution, i definitely think it's the higher quality solution,
both in terms of power savings, and in terms of impact.

Ingo

2005-05-13 08:27:47

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

Ingo Molnar wrote:

> the power equation is really easy: the implicit cost of a deep CPU sleep
> is say 1-2 msecs. (that's how long it takes to shut the CPU and the bus
> down, etc.) If we do an exponential backoff we periodically re-wake the
> CPU fully up again - wasting 1-2msec (or more) more power. With the
> watchdog solution we have more overhead on the busy CPU but it takes
> _much_ less power for a truly idle CPU to be turned off. [the true
> 'effective cost' all depends on the scheduling pattern as well, but the
> calculation before is still valid.] Whatever the algorithmic overhead of
> the watchdog code, it's dwarved by the power overhead caused by false
> idle-wakeups of CPUs under exponential backoff.
>

Well, it really depends on how it is implemented, and what tradeoffs
you make.

Let's say that you don't start deep sleeping until you've backed off
to 64ms rebalancing. Now the CPU power consumption is reduced to less
than 2% of ideal.

Now we don't have to worry about uniprocessor, and SMP systems that
go *completely* idle can have mechanism to indefinitely deep sleep
all CPUs until there is real work.

What you're left with are SMP systems with *some* activity happening,
and of those, I bet most idle CPUs will have other reasons to be woken
up other than the scheduler tick anyway.

And don't forget that the watchdog approach can just as easily deep
sleep a CPU only to immediately wake it up again if it detects an
imbalance.

So in terms of real, system wide power savings, I'm guessing the
difference would really be immesurable.

And the CPU usage / wakeup cost arguments cut both ways. The busy
CPUs have to do extra work in the watchdog case.

> the watchdog solution - despite being more complex - is also more
> orthogonal in that it does not change the balancing decisions at all -
> they just get offloaded to another CPU. The exponential backoff OTOH
> materially changes how we do SMP balancing - which might or might not
> matter much, but it will always depend on circumstances. So in the long
> run the watchdog solution is probably easier to control. (because it's
> just an algorithm offload, not a material scheduling feature.)
>

Well so does the watchdog, really. But it's probably not like you have
to *really* tune sleep algorithms _exactly_ right, yeah? So long as you
get within even 5% of total theoretical power saving on SMP systems,
it's likely good enough.

> so unless there are strong implementational arguments against the
> watchdog solution, i definitely think it's the higher quality solution,
> both in terms of power savings, and in terms of impact.
>

I'm think power savings will be unmeasurable between the two approaches,
backoff will be quite a lot less complex, and have less impact on CPUs
that are busy doing real work.

Smells like a bakeoff coming up :)

2005-05-13 09:19:37

[permalink] [raw]

Subject: Re: [RFC] (How to) Let idle CPUs sleep

On Fri, May 13, 2005 at 08:29:17AM +0000, Nick Piggin wrote:
> And don't forget that the watchdog approach can just as easily deep
> sleep a CPU only to immediately wake it up again if it detects an
> imbalance.

I think we should increase the threshold beyond which the idle CPU
is woken up (more than the imbalance_pct that exists already). This
should justify waking up the CPU.

> And the CPU usage / wakeup cost arguments cut both ways. The busy
> CPUs have to do extra work in the watchdog case.

Maybe with a really smart watchdog solution, we can cut down this overhead.
I did think of other schemes - a dedicated CPU per node acting as watchdog
for that node and per-node wacthdog kernel threads? - to name a few. What I had
proposed was the best I thought. But maybe we can look at improving it
to see if the overhead concern you have can be reduced - meeting the interests
of both the worlds :)

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-13 09:33:56