Hello all,
This patch allows more aggressive idle balances, reducing idle time in
scenarios where there should not be any, i.e. where nr_running > nr_cpus. We
have seen this in a couple of online transaction workloads. Three areas are
targeted:
1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
the task->cpu is busy and the sibling is idle. This has been expanded to any
idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
then going up the domains as necessary.
2) can_migrate() is modified. Honestly, part of it looked kind of backwards.
We checked task_hot() if the cpu was idle. I would think we would do
that if it was -not- idle. Changes to can_migrate() are:
a) Same policy as before for running tasks and cpus_allowed: return 0
b) We omit the task_hot() check when the -whole- cpu is idle (not just a
sibling), when task_cpu(p) and this_cpu are siblings in the same core,
or when nr_balance_failed > cache_nice_tries
c) Finally, if this_cpu is busy we still do the task_hot() check
3) Add SD_BALANCE_NEWIDLE to SD_NODE_INIT. All this does is allow a newly
idle cpu a broader span of cpus, only if necessary, to pull a task from. We
still target the cpus closest to this_cpu, starting with this_cpu->sd, then
moving up the domains as needed. This is vital for Opteron, where each node
has only one cpu. idle_balance() never works there because it never considers
looking beyond cpu->sd, since cpu->sd->sd does not have
SD_BALANCE_NEWIDLE set.
So far we have seen a 3-5% improvement with these patches on online
transaction workloads and no regressions on SDET. Kenneth, I am particularly
interested in using this with your increased cache_hot_time value, where you
got your best throughput:
> cache_hot_time | workload throughput
> --------------------------------------
> 25ms - 111.1 (5% idle)
...but still had idle time. Do you think you could try these patches with
your 25ms cache_hot_time? I think your workload could benefit from both the
longer cache_hot_time for busy cpus and more aggressive idle balances,
hopefully driving your workload to 100% cpu utilization.
Any feedback greatly appreciated, thanks.
-Andrew Theurer
Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> So far we have seen a 3-5% improvement with these patches on online
> transaction workloads and no regressions on SDET. Kenneth, I am particularly
> interested in using this with your increased cache_hot_time value, where you
> got your best throughput:
>
> ...but still had idle time. Do you think you could try these patches with
> your 25ms cache_hot_time? I think your workload could benefit from both the
> longer cache_hot_time for busy cpus and more aggressive idle balances,
> hopefully driving your workload to 100% cpu utilization.
Looks interesting, I will queue this up on our benchmark setup.
- Ken
Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> This patch allows more aggressive idle balances, reducing idle time in
> scenarios where there should not be any, i.e. where nr_running > nr_cpus. We
> have seen this in a couple of online transaction workloads. Three areas are
> targeted:
>
> 1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
> the task->cpu is busy and the sibling is idle. This has been expanded to any
> idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
> then going up the domains as necessary.
It occurs to me that half of the patch is only applicable to HT, like the
change in wake_idle(). Also, do you really want to put that functionality in
wake_idle()? That seems to defeat the original intention of that function,
which, as far as I understand the code, only tries to wake up a sibling cpu.
My setup is 4-way SMP, no HT (4-way Itanium2 processor). Sorry, I won't be
able to tell you how this portion of the change affects the online
transaction workload.
- Ken
Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> This patch allows more aggressive idle balances, reducing idle time in
> scenarios where there should not be any, i.e. where nr_running > nr_cpus. We
> have seen this in a couple of online transaction workloads. Three areas are
> targeted:
>
> 1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
> the task->cpu is busy and the sibling is idle. This has been expanded to any
> idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
> then going up the domains as necessary.
Chen, Kenneth W wrote on Tuesday, November 02, 2004 2:35 PM
> It occurs to me that half of the patch is only applicable to HT, like the
> change in wake_idle(). Also, do you really want to put that functionality in
> wake_idle()? That seems to defeat the original intention of that function,
> which, as far as I understand the code, only tries to wake up a sibling cpu.
>
> My setup is 4-way SMP, no HT (4-way Itanium2 processor). Sorry, I won't be
> able to tell you how this portion of the change affects the online
> transaction workload.
The patch below moves that functionality into try_to_wake_up() directly. I'm
going to try this on my setup.
--- kernel/sched.c.orig 2004-11-02 13:35:33.000000000 -0800
+++ kernel/sched.c 2004-11-02 14:51:08.000000000 -0800
@@ -1059,13 +1059,21 @@ static int try_to_wake_up(task_t * p, un
 				schedstat_inc(sd, ttwu_wake_balance);
 				goto out_set_cpu;
 			}
+		} else if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(tmp, sd->span, cpu_online_map);
+			cpus_and(tmp, tmp, p->cpus_allowed);
+			for_each_cpu_mask(i, tmp) {
+				if (idle_cpu(i)) {
+					new_cpu = i;
+					goto out_set_cpu;
+				}
+			}
 		}
 	}
 
 	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
 out_set_cpu:
 	schedstat_inc(rq, ttwu_attempts);
-	new_cpu = wake_idle(new_cpu, p);
 	if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
 		schedstat_inc(rq, ttwu_moved);
 		set_task_cpu(p, new_cpu);
Chen, Kenneth W wrote:
>Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
>
>>This patch allows more aggressive idle balances, reducing idle time in
>>scenarios where there should not be any, i.e. where nr_running > nr_cpus. We
>>have seen this in a couple of online transaction workloads. Three areas are
>>targeted:
>>
>>1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
>>the task->cpu is busy and the sibling is idle. This has been expanded to any
>>idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
>>then going up the domains as necessary.
>>
>>
>
>It occurs to me that half of the patch is only applicable to HT, like the
>change in wake_idle(). Also, do you really want to put that functionality in
>wake_idle()? That seems to defeat the original intention of that function,
>which, as far as I understand the code, only tries to wake up a sibling cpu.
>
>
The patch (wake_idle()) should still make a difference on non-HT
systems. It is true that in the mainline code, only HT systems
benefited from wake_idle(). In fact, wake_idle() probably did nothing
at all for Itanium, but with this patch it will, since we scan for an
idle cpu up the domains that have the SD_WAKE_IDLE flag (and we added
that flag to SD_CPU_INIT and SD_NODE_INIT).
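As a rough illustration of that scan, here is a userspace sketch of walking
up the domains until an idle, allowed cpu turns up. It is not the patch:
struct fake_domain, FAKE_SD_WAKE_IDLE and wake_idle_sketch() are invented
names, plain bitmasks stand in for cpumask_t, and stopping at the first
domain without the flag is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_SD_WAKE_IDLE 0x1u	/* hypothetical flag value */

/* Hypothetical stand-in for a sched domain. */
struct fake_domain {
	unsigned int flags;
	unsigned long span;		/* bitmask of cpus in this domain */
	struct fake_domain *parent;	/* next wider domain, or NULL */
};

/*
 * Returns cpu itself if it is idle or if no idle cpu is found.
 * Otherwise returns the lowest-numbered idle, allowed cpu in the
 * narrowest flagged domain, so closer cpus win over distant ones.
 */
static int wake_idle_sketch(int cpu, const struct fake_domain *sd,
			    unsigned long idle_mask,
			    unsigned long allowed_mask)
{
	const int nbits = (int)(8 * sizeof(unsigned long));
	int i;

	assert(cpu >= 0 && cpu < nbits);

	if (idle_mask & (1UL << cpu))
		return cpu;

	for (; sd != NULL; sd = sd->parent) {
		unsigned long candidates;

		/* Assumption: do not look past an unflagged domain. */
		if (!(sd->flags & FAKE_SD_WAKE_IDLE))
			break;

		candidates = sd->span & idle_mask & allowed_mask;
		for (i = 0; i < nbits; i++)
			if (candidates & (1UL << i))
				return i;
	}
	return cpu;
}
```

With a two-cpu inner domain inside a four-cpu outer one, an idle cpu in the
inner domain is picked before one that is only in the outer domain, and
clearing the flag on the outer domain keeps the search inside the inner one.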
The patch's intention is to make sure we don't leave idle cpus idle.
This can happen because the rate at which tasks are activated and
deactivated is much higher than the idle balance interval, and because
the task_hot() check prevents migrations even when we do hit an idle
balance tick.
Honestly, I think you should be able to see the benefits from all
sections of this patch, even without HT.
-Andrew Theurer