Hi,
The following experiments were conducted on a two-socket dual-core
Intel processor based machine in order to understand the impact of
the sched_mc_power_savings scheduler heuristic.
Kernel linux-2.6.24-rc6:
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
tick-sched.c has been instrumented to collect idle entry and exit time
stamps.
Instrumentation patch:
Instrument tick-sched nohz code and generate time stamp trace data.
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---
kernel/time/tick-sched.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- linux-2.6.24-rc6.orig/kernel/time/tick-sched.c
+++ linux-2.6.24-rc6/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
#include <linux/profile.h>
#include <linux/sched.h>
#include <linux/tick.h>
+#include <linux/ktrace.h>
#include <asm/irq_regs.h>
@@ -200,7 +201,10 @@ void tick_nohz_stop_sched_tick(void)
 	if (ts->tick_stopped) {
 		delta = ktime_sub(now, ts->idle_entrytime);
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-	}
+		ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_INFO1,
+			    ktime_to_ns(now), ts->idle_calls);
				>>>>>>>>>>>>>>>> Tracepoint A
+	} else
+		ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick,
+			    KT_EVENT_FUNC_ENTER, ktime_to_ns(now), 0);
				>>>>>>>>>>>>>>>> Tracepoint B
 	ts->idle_entrytime = now;
 	ts->idle_calls++;
@@ -391,6 +395,8 @@ void tick_nohz_restart_sched_tick(void)
 		tick_do_update_jiffies64(now);
 		now = ktime_get();
 	}
+	ktrace_log2(KT_FUNC_tick_nohz_restart_sched_tick, KT_EVENT_FUNC_EXIT,
+		    ktime_to_ns(now), ts->idle_calls);
			>>>>>>>>>>>>>>>> Tracepoint C
 	local_irq_enable();
 }
The idle time collected is the time stamp at (C) minus the time stamp
at (B). This is the time interval between stopping ticks and
restarting ticks on an idle system.
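The pairing step above can be sketched as follows. This is a minimal
illustration, not the actual 1-sched-stats.py script: it assumes the
binary trace has already been decoded into (event, timestamp-in-ns)
tuples for one CPU, with "stop" corresponding to tracepoint (B) and
"restart" to tracepoint (C).

```python
# Sketch: compute idle intervals from decoded trace events for one CPU.
# Each event is a (kind, timestamp_ns) tuple; the format is hypothetical,
# chosen only to illustrate the (C) - (B) computation described above.

def idle_intervals(events):
    """Pair each tick-stop (B) with the next tick-restart (C)."""
    intervals = []
    stop_ns = None
    for kind, ts_ns in events:
        if kind == "stop":
            stop_ns = ts_ns
        elif kind == "restart" and stop_ns is not None:
            intervals.append(ts_ns - stop_ns)   # idle time = (C) - (B)
            stop_ns = None
    return intervals

trace = [("stop", 1_000_000), ("restart", 5_000_000),
         ("stop", 7_000_000), ("restart", 8_500_000)]
print(idle_intervals(trace))   # [4000000, 1500000]
```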
Complete patch series:
http://svaidy.googlepages.com/1-klog.patch
http://svaidy.googlepages.com/1-trace-sched.patch
Userspace program to extract trace data:
http://svaidy.googlepages.com/1-klog.c
Python script to post process binary trace data:
http://svaidy.googlepages.com/1-sched-stats.py
Gnuplot script that was used to generate the graphs:
http://svaidy.googlepages.com/1-multiplot.gp
The scheduler heuristic for multi-core systems,
/sys/devices/system/cpu/sched_mc_power_savings, should ideally extend
the CPU tickless idle time on at least a few CPUs in an SMP machine.
However, in the experiment it was found that turning on
sched_mc_power_savings only marginally increased the idle time, and
only on some of the CPUs.
Experiment 1:
-------------
Setup:
* The yum-updatesd and irqbalance daemons were stopped to reduce the
  idle wakeup rate
* All irqs manually routed to CPU0 only (hoping this will keep other
CPUs idle) (http://svaidy.googlepages.com/1-irq-config.txt)
* Powertop shows around 35 wakeups per second during idle
(http://svaidy.googlepages.com/1-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the
system in idle state
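The manual IRQ routing in the setup above can be sketched like this.
The /proc/irq/<n>/smp_affinity files are the standard procfs
interface; the helper names are ours, writing requires root, and some
IRQs cannot be re-routed on every machine.

```python
# Sketch: route all routable IRQs to CPU0 by writing a CPU bitmask to
# /proc/irq/<n>/smp_affinity (standard procfs interface; needs root).
import glob

def cpu_mask(*cpus):
    """Hex bitmask string accepted by /proc/irq/<n>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return "%x" % mask

def route_all_irqs_to(cpu):
    for path in glob.glob("/proc/irq/*/smp_affinity"):
        try:
            with open(path, "w") as f:
                f.write(cpu_mask(cpu))
        except OSError:
            pass    # some IRQs (e.g. per-CPU ones) cannot be re-routed

print(cpu_mask(0))      # "1" -> CPU0 only
print(cpu_mask(0, 1))   # "3" -> CPU0 and CPU1
```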
Results:
There are 4 PNG files plotting the idle time for each CPU in the
system.
Please get the graphs from the following URLs
http://svaidy.googlepages.com/1-idle-cpu0.png
http://svaidy.googlepages.com/1-idle-cpu1.png
http://svaidy.googlepages.com/1-idle-cpu2.png
http://svaidy.googlepages.com/1-idle-cpu3.png
Each PNG file has 4 graphs plotted that are relevant to one CPU:
* The top-right plot shows the idle time samples obtained during the
  experiment
* The top-left graph is a histogram of the top-right plot
* The bottom graphs correspond to the idle times when
  sched_mc_power_savings=1
Observations with sched_mc_power_savings=1:
* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Significant idle time improvement on CPU2
* However, significant idle time reduction on CPU3
Experiment 2:
-------------
Setup:
* USB stopped
* Most daemons like yum-updatesd, hal, autofs, syslog, crond, irqbalance,
sendmail, pcscd were stopped
* Interrupt routing left to default but irqbalance daemon stopped
* Powertop shows around 4 wakeups per second during idle
(http://svaidy.googlepages.com/2-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the
system in idle state
Results:
There are 4 PNG files plotting the idle time for each CPU in the
system.
http://svaidy.googlepages.com/2-idle-cpu0.png
http://svaidy.googlepages.com/2-idle-cpu1.png
http://svaidy.googlepages.com/2-idle-cpu2.png
http://svaidy.googlepages.com/2-idle-cpu3.png
The details of the plots are the same as in the previous experiment.
Observations with sched_mc_power_savings=1:
* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Good idle time improvement on CPU2 and CPU3
Please review the experiment and comment on how the effectiveness of
sched_mc_power_savings can be analysed.
A very low wakeup count of ~4 per second on this 4-CPU system gave
good idle time results when sched_mc_power_savings was enabled.
However, the results were not as expected even at the modest wakeup
count of ~35 per second.
Please let us know your comments and suggestions on the experiment or
results.
Do we have similar analysis and data on scheduler heuristics for power
savings?
Thanks,
Vaidy
On Tue, Jan 08, 2008 at 11:08:15PM +0530, Vaidyanathan Srinivasan wrote:
> Hi,
>
> The following experiments were conducted on a two socket dual core
> intel processor based machine in order to understand the impact of
> sched_mc_power_savings scheduler heuristics.
Thanks for these experiments. sched_mc_power_savings is mainly for
reasonably long-running (i.e., catchable by the periodic load
balancer) light loads (where the number of tasks < number of logical
cpus). This tunable will minimize the number of packages carrying
load, enabling the other, completely idle packages to go to the
deepest available P- and C-states.
>
> The scheduler heuristics for multi core system
> /sys/devices/system/cpu/sched_mc_power_savings should ideally extend
> the cpu tickless idle time atleast on few CPU in an SMP machine.
Not really. Ideally sched_mc_power_savings re-distributes the load (in
the scenarios mentioned above) to different logical cpus in the
system, so as to minimize the number of busy packages in the system.
> Experiment 1:
> -------------
...
> Observations with sched_mc_power_savings=1:
>
> * No major impact of sched_mc_power_savings on CPU0 and CPU1
> * Significant idle time improvement on CPU2
> * However, significant idle time reduction on CPU3
In your setup, CPU0 and 3 are the core siblings? If so, this is probably
expected. Previously there was some load distribution happening between
packages. With sched_mc_power_savings, load is now getting distributed
between cores in the same package. But on my systems, typically CPU0 and 2
are core siblings. I will let you confirm your system topology, before we
come to conclusions.
> Experiment 2:
> -------------
...
>
> Observations with sched_mc_power_savings=1:
>
> * No major impact of sched_mc_power_savings on CPU0 and CPU1
> * Good idle time improvement on CPU2 and CPU3
I would have expected results similar to experiment-1. The CPU3 data
seems almost the same with and without sched_mc_power_savings. At
least the data is not significantly different, as in other cases like
CPU2, for example.
> Please review the experiment and comment on how the effectiveness of
> sched_mc_power_savings can be analysed.
While we should see no change or a simple redistribution (to minimize
busy packages) in your experiments, for evaluating
sched_mc_power_savings we can also use a lightly loaded system (kernel
compilation or specjbb with 2 threads on a DP with dual-core, for
example) and see how the load is distributed with and without MC power
savings.
Similarly it will be interesting to see how this data varies with and without
tickless.
For some loads which can't be caught by the periodic load balancer, we
may not see any difference. But at least we should not see any
scheduling anomalies.
thanks,
suresh
* Siddha, Suresh B <[email protected]> [2008-01-08 13:24:00]:
> On Tue, Jan 08, 2008 at 11:08:15PM +0530, Vaidyanathan Srinivasan wrote:
> > Hi,
> >
> > The following experiments were conducted on a two socket dual core
> > intel processor based machine in order to understand the impact of
> > sched_mc_power_savings scheduler heuristics.
>
> Thanks for these experiments. sched_mc_power_savings is mainly for
> the reasonably long(which can be caught by periodic load balance) running
> light load(when number of tasks < number of logical cpus). This tunable
> will minimize the number of packages carrying load, enabling the other
> completely idle packages to go to the available deepest P and C states.
Hi Suresh,
Thanks for the explanation. I tried a few more experiments based on
your comments and found that with a long-running CPU-intensive job,
the heuristics did help keep the load on the same package.
> >
> > The scheduler heuristics for multi core system
> > /sys/devices/system/cpu/sched_mc_power_savings should ideally extend
> > the cpu tickless idle time atleast on few CPU in an SMP machine.
>
> Not really. Ideally sched_mc_power_savings re-distributes the load(in
> the scenarios mentioned above) to different logical cpus in the system, so
> as to minimize the busy packages in the system.
But the problem is that we will not be able to get longer idle times
and go to lower sleep states on other packages if we cannot move the
short bursts of daemon jobs to the busy packages.
We are looking at various means to increase uninterrupted CPU idle
times on at least some of the CPUs in an under-utilised SMP machine.
> > Experiment 1:
> > -------------
> ...
> > Observations with sched_mc_power_savings=1:
> >
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Significant idle time improvement on CPU2
> > * However, significant idle time reduction on CPU3
>
> In your setup, CPU0 and 3 are the core siblings? If so, this is probably
> expected. Previously there was some load distribution happening between
> packages. With sched_mc_power_savings, load is now getting distributed
> between cores in the same package. But on my systems, typically CPU0 and 2
> are core siblings. I will let you confirm your system topology, before we
> come to conclusions.
No; contrary to your observation, on my machine CPU0 and CPU1 are
siblings in the first package while CPU2 and CPU3 are in the second
package.
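The package siblings can be confirmed directly from sysfs; here is a
minimal sketch. The topology files are the standard sysfs interface
(availability depends on kernel configuration); the helper names are
ours.

```python
# Sketch: group CPUs by physical package using the sysfs topology files.
from collections import defaultdict
import glob
import re

def group_by_package(pairs):
    """Map package id -> sorted CPU ids, from (cpu, package) pairs."""
    packages = defaultdict(list)
    for cpu, pkg in pairs:
        packages[pkg].append(cpu)
    return {pkg: sorted(cpus) for pkg, cpus in packages.items()}

def read_topology():
    pairs = []
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id"):
        cpu = int(re.search(r"cpu(\d+)", path).group(1))
        with open(path) as f:
            pairs.append((cpu, int(f.read())))
    return group_by_package(pairs)

# On the machine described here this should report {0: [0, 1], 1: [2, 3]}.
print(read_topology())
```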
> > Experiment 2:
> > -------------
> ...
> >
> > Observations with sched_mc_power_savings=1:
> >
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Good idle time improvement on CPU2 and CPU3
>
> I would have expected similar results like in experiment-1. CPU3 data
> seems almost same with and without sched_mc_power_savings. Atleast the
> data is not significantly different as in other cases like CPU2 for ex.
There is a general improvement in idle time in this experiment
compared to the first experiment. So if sched_mc does not impact
short-running tasks (daemons and other event managers in an idle
system), then the observation in experiment-1 may just be a random
redistribution of tasks over time.
> > Please review the experiment and comment on how the effectiveness of
> > sched_mc_power_savings can be analysed.
>
> While we should see a no change or a simple redistribution(to minimize
> busy packages) in your experiments, for evaluating sched_mc_power_savings,
> we can also use some lightly loaded system (like kernel-compilation or
> specjbb with 2 threads on a DP with dual-core, for example and see how the
> load is distributed with and without MC power savings.)
I have tried running two CPU-intensive threads. With sched_mc turned
ON, they were scheduled on CPU0-1 or CPU2-3. With sched_mc turned
OFF, I could see that they were scheduled arbitrarily.
Later I added random usleep() calls to make them consume less CPU; in
that case the scheduling was very random. The two threads with random
sleeps were scheduled on any of the 4 CPUs, and the sched_mc parameter
had no effect in consolidating the threads onto one package. The added
sleeps make the threads run in short bursts rather than being CPU
intensive.
This completely validates your explanation that only 'long running'
jobs are consolidated.
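The load generator described above can be sketched as follows. This is
an illustration rather than the exact test program used: it uses two
processes (not threads, so each worker can occupy its own CPU), and
the sleepy flag toggles between the long-running case that sched_mc
consolidated and the short-burst case that it did not.

```python
# Sketch: two CPU-bound workers, optionally interleaved with random
# short sleeps so they stop looking "long running" to the load balancer.
import multiprocessing
import random
import time

def worker(seconds, sleepy):
    """Burn CPU for roughly `seconds`; with sleepy=True, run in bursts."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        x = 0
        for _ in range(100_000):              # busy loop
            x += 1
        if sleepy:
            time.sleep(random.uniform(0, 0.01))   # mimic usleep() bursts

if __name__ == "__main__":
    # sleepy=False: the CPU-intensive case; watch placement with top.
    procs = [multiprocessing.Process(target=worker, args=(5, False))
             for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```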
How do we take this technique to the next step where we can
consolidate short running jobs as well? Did you face any difficulty
biasing the CPU for short running jobs?
> Similarly it will be interesting to see how this data varies with and without
> tickless.
I am sure tickless has very significant impact on the idle time.
I will get you more data on this comparison.
> For some loads which can't be caught by periodic load balancer, we may
> not see any difference. But atleast we should not see any scheduling
> anomalies.
True, we did not see any scheduling anomalies yet.
Thanks,
Vaidy
* Vaidyanathan Srinivasan <[email protected]> wrote:
> How do we take this technique to the next step where we can
> consolidate short running jobs as well? Did you face any difficulty
> biasing the CPU for short running jobs?
are you sure your measurement tasks do not impact the measurement
workload? If you use something like 'top' then try running it reniced to
+19. (or perhaps even bound to a particular CPU, say #3, to make its
impact isolated)
Ingo
* Ingo Molnar <[email protected]> [2008-01-09 12:35:07]:
>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > How do we take this technique to the next step where we can
> > consolidate short running jobs as well? Did you face any difficulty
> > biasing the CPU for short running jobs?
>
> are you sure your measurement tasks do not impact the measurement
> workload? If you use something like 'top' then try running it reniced to
> +19. (or perhaps even bound to a particular CPU, say #3, to make its
> impact isolated)
Hi Ingo,
I will watch this during the experiments. I have been using the klog
application to dump relayfs data. I did run powertop and top as well;
I will bind them to certain CPUs and isolate their impact.
I believe the margin of error should be small since all the
measurement tasks sleep for long durations.
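The binding-and-renicing step can be sketched like this. It is
Linux-only (os.sched_setaffinity and os.nice are the standard Python
wrappers for the corresponding syscalls), and the helper name is ours.

```python
# Sketch: pin the current (measurement) process to a single CPU and
# renice it to +19 so it perturbs the measured workload as little as
# possible, per Ingo's suggestion above.
import os

def isolate_self(cpu, niceness=19):
    """Bind the current process to `cpu` and lower its priority."""
    os.sched_setaffinity(0, {cpu})      # pid 0 means "this process"
    current = os.nice(0)                # query current nice value
    if current < niceness:
        os.nice(niceness - current)     # raise nice value (unprivileged OK)
    return os.sched_getaffinity(0)

# The discussion above uses CPU #3 on a 4-CPU box; use the last CPU
# so the sketch also runs on smaller machines.
last_cpu = os.cpu_count() - 1
print(isolate_self(last_cpu))
```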
Thanks,
Vaidy
* Vaidyanathan Srinivasan <[email protected]> wrote:
> I will watch this during the experiments. I have been using klog
> application to dump relayfs data. I did run powertop and top as well,
> I will bind them to certain CPUs and isolate their impact.
>
> I believe the margin of error would be less since all the measurement
> tasks sleep for long duration.
ok, long duration ought to be enough.
i think a possible explanation of your observations would be this:
sleepy workloads are affected more by the wakeup logic, and most of
the power-savings logic works via runtime balancing.
So perhaps try to add some SD_POWERSAVINGS_BALANCE logic to
try_to_wake_up()? I think waking up on the same CPU where it went to
sleep is the most power-efficient approach in general. (or always waking
up where the wakee runs - this should be measured.) Right now
try_to_wake_up() tries to spread out load opportunistically, which is
throughput-maximizing but it's arguably not very power conscious.
Ingo