2008-12-18 17:52:56

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

Hi,

The existing power saving load balancer, CONFIG_SCHED_MC, attempts to run
the system's workload on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations. Thus
consolidating workloads onto fewer packages helps the other packages
stay idle and save power. The current implementation is very
conservative and does not work effectively across different workloads.
The tunable sched_mc_power_savings=n was initially proposed to
enable tuning of the power saving load balancer based on the system
configuration, workload characteristics and end-user requirements.

The power savings and performance of a given workload in an
under-utilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, with 0 being highest
performance with least power savings, and level 2 indicating maximum
power savings even at the cost of slight performance degradation.
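As a concrete illustration of the tunable interface, the level can be set
from user space by writing to the sysfs file. A minimal C sketch follows;
the helper name is illustrative and the path is parameterized so it can be
exercised against an ordinary file:

```c
#include <stdio.h>

/* Write a sched_mc_power_savings level (0, 1 or 2) to the given sysfs
 * path. Returns 0 on success, -1 on error. On a real system the path
 * is /sys/devices/system/cpu/sched_mc_power_savings. */
static int set_sched_mc_power_savings(const char *path, int level)
{
	FILE *f;

	if (level < 0 || level > 2)
		return -1;	/* only levels 0, 1 and 2 are defined */

	f = fopen(path, "w");
	if (!f)
		return -1;

	fprintf(f, "%d\n", level);
	fclose(f);
	return 0;
}
```

From a shell, `echo 2 > /sys/devices/system/cpu/sched_mc_power_savings`
achieves the same thing.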

Please refer to the following discussions and article for details.

[1]Making power policy just work
http://lwn.net/Articles/287924/

[2][RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/

v2: http://lwn.net/Articles/297306/
v3: http://lkml.org/lkml/2008/11/10/260
v4: http://lkml.org/lkml/2008/11/21/47
v5: http://lkml.org/lkml/2008/12/11/178
v6: http://lwn.net/Articles/311830/

The following patch series demonstrates the basic framework for
tunable sched_mc_power_savings.

This version of the patch incorporates comments and feedback received
on the previous post from Andrew Morton.

Changes from v6:
----------------
* Converted BALANCE_FOR_xx_POWER and related macros to inline functions
based on comments from Andrew and Ingo.
* Ran a basic kernbench test and did not see any performance variation
due to the changes.
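The macro-to-inline conversion mentioned above follows the usual pattern;
a generic sketch with hypothetical names rather than the actual
BALANCE_FOR_xx_POWER helpers:

```c
/* Before: a function-like macro -- no type checking, no real symbol
 * for debuggers, and the argument may be evaluated more than once. */
#define EXAMPLE_BALANCE_POWER(level)	(((level) > 0) ? 1 : 0)

/* After: a static inline function with identical semantics, but a
 * typed argument that is evaluated exactly once. */
static inline int example_balance_power(int level)
{
	return level > 0 ? 1 : 0;
}
```

With inlining the generated code is the same, which matches the report of
no performance variation in kernbench.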

Changes from v5:
----------------
* Fixed the sscanf bug and the check for (p->flags & PF_KTHREAD)
* Dropped the RFC prefix to indicate that the patch is ready for
testing and inclusion
* Patch series against 2.6.28-rc8 kernel

Changes from v4:
----------------

* Conditionally added the SD_BALANCE_NEWIDLE flag to the MC and CPU level
domains to help task consolidation when sched_mc > 0.
Removing this flag significantly reduces the number of load_balance
calls and considerably slows down task consolidation and, in effect,
power savings.

Ingo and Mike Galbraith removed this flag to improve performance for
select benchmarks. I have added the flag only when power savings
is requested, restoring full performance mode when sched_mc=0.

* Fixed a few whitespace issues
* Patch series against 2.6.28-rc8 kernel
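The conditional SD_BALANCE_NEWIDLE idea above can be sketched as follows.
The helper and the flag value are illustrative only; the real code sets
SD_BALANCE_NEWIDLE in the sched-domain initializers:

```c
#define SD_BALANCE_NEWIDLE_EX	0x02	/* illustrative flag value */

/* Include the newidle-balance flag in the MC/CPU domain flags only
 * when a power-savings level has been requested; with sched_mc=0 the
 * flag is dropped, restoring full-performance balancing behaviour. */
static inline int sd_power_saving_flags(int sched_mc_power_savings)
{
	return sched_mc_power_savings ? SD_BALANCE_NEWIDLE_EX : 0;
}
```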

Changes from v3:
----------------

* Fixed the locking code with double_lock_balance() in
active-balance-newidle.patch

* Moved sched_mc_preferred_wakeup_cpu to the root_domain structure so that
each partitioned sched domain gets an independent nominated cpu

* More comments in active-balance-newidle.patch

* Reverted the sched MC level and CPU level fine tuning done in
v2.6.28-rc4 for now. These affect consolidation since
SD_BALANCE_NEWIDLE is removed. I will rework the tuning in the next
iteration to selectively enable them at sched_mc=2

* Patch series on 2.6.28-rc6 kernel

Changes from v2:
----------------

* Fixed the locking order issue in active-balance new-idle
* Moved the wakeup biasing code to the wake_idle() function and
preserved the wake_affine function. The previous version would break
wake affine in order to aggressively consolidate tasks
* Removed the sched_mc_preferred_wakeup_cpu global variable, moved it
to doms_cur/dattr_cur, and added a per_cpu pointer to the appropriate
storage in the partitioned sched domain. This change is needed to
preserve functionality in case of partitioned sched domains
* Patch on 2.6.28-rc3 kernel

Results:
--------

Basic functionality of the code has not changed, and the power vs
performance benefits for kernbench are similar to the ones posted
earlier.

KERNBENCH Runs: make -j4 on an x86 8-core, dual-socket, quad-core-per-
package system

SchedMC   Run Time   Package Idle       Energy    Power
0         81.68      52.83%   54.71%    1.00x J   1.00y W
1         80.70      36.62%   70.11%    0.95x J   0.96y W
2         74.95      19.53%   85.92%    0.90x J   0.98y W

The results are marginally better than with the previous version of the
patch series, which could be within test variation.

Please feel free to test, and let me know your comments and feedback.
I will post more experimental results with various benchmarks.

Hi Ingo,

Please review and include in sched development tree for testing.

Thanks,
Vaidy

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

---

Gautham R Shenoy (1):
sched: Framework for sched_mc/smt_power_savings=N

Vaidyanathan Srinivasan (7):
sched: idle_balance() does not call load_balance_newidle()
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: convert BALANCE_FOR_xx_POWER to inline functions


include/linux/sched.h | 57 +++++++++++++++++++++++++----
include/linux/topology.h | 6 ++-
kernel/sched.c | 89 +++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 18 +++++++++
4 files changed, 153 insertions(+), 17 deletions(-)

--


2008-12-19 08:21:42

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Ingo Molnar <[email protected]> [2008-12-18 21:19:38]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > Hi,
> >
> > The existing power saving load balancer, CONFIG_SCHED_MC, attempts to run
> > the system's workload on a minimum number of CPU packages and tries
> > to keep the rest of the CPU packages idle for longer durations. Thus
> > consolidating workloads onto fewer packages helps the other packages
> > stay idle and save power. The current implementation is very
> > conservative and does not work effectively across different workloads.
> > The tunable sched_mc_power_savings=n was initially proposed to enable
> > tuning of the power saving load balancer based on the system
> > configuration, workload characteristics and end-user requirements.
> >
> > The power savings and performance of a given workload in an
> > under-utilised system can be controlled by writing 0, 1 or 2 to
> > /sys/devices/system/cpu/sched_mc_power_savings, with 0 being highest
> > performance with least power savings, and level 2 indicating maximum
> > power savings even at the cost of slight performance degradation.
> >
> > Please refer to the following discussions and article for details.
> >
> > [1]Making power policy just work
> > http://lwn.net/Articles/287924/
> >
> > [2][RFC v1] Tunable sched_mc_power_savings=n
> > http://lwn.net/Articles/287882/
> >
> > v2: http://lwn.net/Articles/297306/
> > v3: http://lkml.org/lkml/2008/11/10/260
> > v4: http://lkml.org/lkml/2008/11/21/47
> > v5: http://lkml.org/lkml/2008/12/11/178
> > v6: http://lwn.net/Articles/311830/
> >
> > The following patch series demonstrates the basic framework for
> > tunable sched_mc_power_savings.
> >
> > This version of the patch incorporates comments and feedback received
> > on the previous post from Andrew Morton.
> >
> > Changes from v6:
> > ----------------
> > * Converted BALANCE_FOR_xx_POWER and related macros to inline functions
> > based on comments from Andrew and Ingo.
> > * Ran a basic kernbench test and did not see any performance variation
> > due to the changes.
> >
> > Changes from v5:
> > ----------------
> > * Fixed the sscanf bug and the check for (p->flags & PF_KTHREAD)
> > * Dropped the RFC prefix to indicate that the patch is ready for
> > testing and inclusion
> > * Patch series against 2.6.28-rc8 kernel
>
> thanks, applied - and i started testing them. It needed some help here and
> there to resolve conflicts with pending cpumask changes. Could you please
> double-check the merged up end result in the latest scheduler devel tree:
>
> http://people.redhat.com/mingo/tip.git/README

Thanks Ingo. I will review and check the cpumask changes.
I will test the tip tree as well and provide feedback.

I look forward to test/bug reports from others on the patch
series and will help with fixes.

--Vaidy

2008-12-19 08:28:52

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Ingo Molnar <[email protected]> [2008-12-18 21:31:40]:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > thanks, applied - and i started testing them. [...]
>
> the sched.h bits needed the small fix below. (struct sched_domain is not
> defined on UP)

Thanks Ingo. I will watch out for such errors next time.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

--Vaidy

>
> Ingo
>
> ------------->
> From edb4c71953409c1deac1a80528ac0aa768762b33 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Thu, 18 Dec 2008 21:30:23 +0100
> Subject: [PATCH] sched: move test_sd_parent() to an SMP section of sched.h
>
> Impact: build fix
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> include/linux/sched.h | 18 +++++++++---------
> 1 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a933d9..e5f928a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -920,6 +920,15 @@ extern void partition_sched_domains(int ndoms_new, struct cpumask *doms_new,
> struct sched_domain_attr *dattr_new);
> extern int arch_reinit_sched_domains(void);
>
> +/* Test a flag in parent sched domain */
> +static inline int test_sd_parent(struct sched_domain *sd, int flag)
> +{
> + if (sd->parent && (sd->parent->flags & flag))
> + return 1;
> +
> + return 0;
> +}
> +
> #else /* CONFIG_SMP */
>
> struct sched_domain_attr;
> @@ -1431,15 +1440,6 @@ struct task_struct {
> #endif
> };
>
> -/* Test a flag in parent sched domain */
> -static inline int test_sd_parent(struct sched_domain *sd, int flag)
> -{
> - if (sd->parent && (sd->parent->flags & flag))
> - return 1;
> -
> - return 0;
> -}
> -
> /*
> * Priority of a process goes from 0..MAX_PRIO-1, valid RT
> * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH

2008-12-19 13:31:30

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Ingo Molnar <[email protected]> [2008-12-18 21:19:38]:

>
> * Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > Hi,
> >
> > The existing power saving load balancer, CONFIG_SCHED_MC, attempts to run
> > the system's workload on a minimum number of CPU packages and tries
> > to keep the rest of the CPU packages idle for longer durations. Thus
> > consolidating workloads onto fewer packages helps the other packages
> > stay idle and save power. The current implementation is very
> > conservative and does not work effectively across different workloads.
> > The tunable sched_mc_power_savings=n was initially proposed to enable
> > tuning of the power saving load balancer based on the system
> > configuration, workload characteristics and end-user requirements.
> >
> > The power savings and performance of a given workload in an
> > under-utilised system can be controlled by writing 0, 1 or 2 to
> > /sys/devices/system/cpu/sched_mc_power_savings, with 0 being highest
> > performance with least power savings, and level 2 indicating maximum
> > power savings even at the cost of slight performance degradation.
> >
> > Please refer to the following discussions and article for details.
> >
> > [1]Making power policy just work
> > http://lwn.net/Articles/287924/
> >
> > [2][RFC v1] Tunable sched_mc_power_savings=n
> > http://lwn.net/Articles/287882/
> >
> > v2: http://lwn.net/Articles/297306/
> > v3: http://lkml.org/lkml/2008/11/10/260
> > v4: http://lkml.org/lkml/2008/11/21/47
> > v5: http://lkml.org/lkml/2008/12/11/178
> > v6: http://lwn.net/Articles/311830/
> >
> > The following patch series demonstrates the basic framework for
> > tunable sched_mc_power_savings.
> >
> > This version of the patch incorporates comments and feedback received
> > on the previous post from Andrew Morton.
> >
> > Changes from v6:
> > ----------------
> > * Converted BALANCE_FOR_xx_POWER and related macros to inline functions
> > based on comments from Andrew and Ingo.
> > * Ran a basic kernbench test and did not see any performance variation
> > due to the changes.
> >
> > Changes from v5:
> > ----------------
> > * Fixed the sscanf bug and the check for (p->flags & PF_KTHREAD)
> > * Dropped the RFC prefix to indicate that the patch is ready for
> > testing and inclusion
> > * Patch series against 2.6.28-rc8 kernel
>
> thanks, applied - and i started testing them. It needed some help here and
> there to resolve conflicts with pending cpumask changes. Could you please
> double-check the merged up end result in the latest scheduler devel tree:
>
> http://people.redhat.com/mingo/tip.git/README

Hi Ingo,

I reviewed and tested the tip tree. The merge looks good. I did not
find any other cpumask API warnings after the one you fixed in commit
220e7f617826e0527bbc523ba859f6a4bae0bfe1

Results in tip at: 2.6.28-rc8-tip-sv-01393-gc79978a

SchedMC   Run Time   Package Idle       Energy    Power
0         80.50      51.47%   54.00%    1.00x J   1.00y W
1         78.89      43.37%   62.86%    0.95x J   0.97y W
2         75.37      30.28%   76.00%    0.91x J   0.98y W

Thanks,
Vaidy

2008-12-19 22:10:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

On Thu, 18 Dec 2008 23:26:22 +0530
Vaidyanathan Srinivasan <[email protected]> wrote:

> When system utilisation is low and more cpus are idle,
> a process waking up from sleep should prefer to
> wake up an idle cpu in a semi-idle cpu package (multi core
> package) rather than in a completely idle cpu package, which
> would waste power.
>
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
>
> This info could be stored in the appropriate sched_domain, but
> updating it in all copies of the sched_domain is not
> practical. Hence this information is stored in the root_domain
> struct, of which there is one copy per partitioned sched
> domain and which can be accessed from each cpu's runqueue.
>

kernel/sched.c: In function 'find_busiest_group':
kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type

Due to

first_cpu(group_leader->cpumask);

apparently because Rusty changed sched_group.cpumask into a plain old
array, and nobody tests their stuff against the tree into which it is
actually integrated :(

kernel/sched.c: In function 'schedule':
kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function

This warning is correct - the code is buggy.
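The class of bug gcc flags here reduces to the following pattern (an
illustrative reduction, not the actual sched.c code):

```c
/* 'active_balance' is assigned on only one path; when the condition
 * is false, the later read yields an indeterminate value. */
static int needs_active_balance_buggy(int imbalance)
{
	int active_balance;		/* BUG: not initialized */

	if (imbalance > 0)
		active_balance = 1;

	return active_balance;		/* garbage when imbalance <= 0 */
}

/* The fix: initialize at declaration so every path is well defined. */
static int needs_active_balance_fixed(int imbalance)
{
	int active_balance = 0;

	if (imbalance > 0)
		active_balance = 1;

	return active_balance;
}
```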

2008-12-19 22:26:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

On Fri, 19 Dec 2008 13:55:08 -0800
Andrew Morton <[email protected]> wrote:

> On Thu, 18 Dec 2008 23:26:22 +0530
> Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > When system utilisation is low and more cpus are idle,
> > a process waking up from sleep should prefer to
> > wake up an idle cpu in a semi-idle cpu package (multi core
> > package) rather than in a completely idle cpu package, which
> > would waste power.
> >
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> >
> > This info could be stored in the appropriate sched_domain, but
> > updating it in all copies of the sched_domain is not
> > practical. Hence this information is stored in the root_domain
> > struct, of which there is one copy per partitioned sched
> > domain and which can be accessed from each cpu's runqueue.
> >
>
> kernel/sched.c: In function 'find_busiest_group':
> kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
>
>
> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
>

I left those unfixed and...

[ 62.307607] initcall init_nfsd+0x0/0xe2 [nfsd] returned 0 after 251 usecs
[ 62.399426] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 74.009934] divide error: 0000 [#1] SMP
[ 74.010085] last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.1/irq
[ 74.010147] CPU 1
[ 74.010241] Modules linked in: nfsd auth_rpcgss exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac parport_pc lp parport snd_hda_codec_realtek sg snd_hda_intel floppy snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ide_cd_mod cdrom snd_pcm_oss snd_mixer_oss serio_raw snd_pcm button shpchp snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 74.012821] Pid: 0, comm: swapper Not tainted 2.6.28-rc9-mm1 #1
[ 74.012875] RIP: 0010:[<ffffffff802375d8>] [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
[ 74.012982] RSP: 0018:ffff88025e057d88 EFLAGS: 00010246
[ 74.013038] RAX: 0000000000000000 RBX: ffff880028063900 RCX: ffff880028058dc0
[ 74.013095] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
[ 74.013152] RBP: ffff88025e057dd8 R08: 0000000000000000 R09: 0000000000000000
[ 74.013209] R10: ffff880028067640 R11: 0000000000000022 R12: ffff880028063900
[ 74.013264] R13: ffffffff802374c5 R14: ffffffff8022c546 R15: ffffffff8079b4a0
[ 74.013321] FS: 0000000000000000(0000) GS:ffff88025e0385c0(0000) knlGS:0000000000000000
[ 74.013381] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 74.013435] CR2: 00007fdb795af8f0 CR3: 0000000000201000 CR4: 00000000000006e0
[ 74.013489] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 74.013545] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 74.013602] Process swapper (pid: 0, threadinfo ffff88025e050000, task ffff88025e03f870)
[ 74.013661] Stack:
[ 74.013708] 0000000000000000 ffff880028063a00 0000000000000000 0000000000000000
[ 74.013893] 0000000000000008 ffffffff8079b4a0 ffff880028063900 ffffffff802374c5
[ 74.013893] ffffffff8022c546 0000000000000030 ffff88025e057e08 ffffffff8022c530
[ 74.013893] Call Trace:
[ 74.013893] <IRQ> <0> [<ffffffff802374c5>] ? tg_shares_up+0x0/0x1f1
[ 74.013893] [<ffffffff8022c546>] ? tg_nop+0x0/0x8
[ 74.013893] [<ffffffff8022c530>] walk_tg_tree+0x5f/0x75
[ 74.013893] [<ffffffff8022c65c>] update_shares+0x46/0x4a
[ 74.013893] [<ffffffff80235fe4>] run_rebalance_domains+0x192/0x522
[ 74.013893] [<ffffffff802543eb>] ? getnstimeofday+0x3d/0x9e
[ 74.013893] [<ffffffff8052bd73>] ? _spin_unlock_irq+0x9/0xc
[ 74.013893] [<ffffffff8024057f>] __do_softirq+0xa3/0x164
[ 74.013893] [<ffffffff8020d0bc>] call_softirq+0x1c/0x28
[ 74.013893] [<ffffffff8020e19d>] do_softirq+0x31/0x73
[ 74.013893] [<ffffffff802402ca>] irq_exit+0x3f/0x41
[ 74.013893] [<ffffffff8021e71d>] smp_apic_timer_interrupt+0x94/0xad
[ 74.013893] [<ffffffff8020caf3>] apic_timer_interrupt+0x13/0x20
[ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
[ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
[ 74.013893] RSP <ffff88025e057d88>
[ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
[ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt

config: http://userweb.kernel.org/~akpm/config-akpm2.txt

I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().

2008-12-19 22:30:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu


* Andrew Morton <[email protected]> wrote:

> On Fri, 19 Dec 2008 13:55:08 -0800
> Andrew Morton <[email protected]> wrote:
>
> > On Thu, 18 Dec 2008 23:26:22 +0530
> > Vaidyanathan Srinivasan <[email protected]> wrote:
> >
> > > When system utilisation is low and more cpus are idle,
> > > a process waking up from sleep should prefer to
> > > wake up an idle cpu in a semi-idle cpu package (multi core
> > > package) rather than in a completely idle cpu package, which
> > > would waste power.
> > >
> > > Use the sched_mc balance logic in find_busiest_group() to
> > > nominate a preferred wakeup cpu.
> > >
> > > This info could be stored in the appropriate sched_domain, but
> > > updating it in all copies of the sched_domain is not
> > > practical. Hence this information is stored in the root_domain
> > > struct, of which there is one copy per partitioned sched
> > > domain and which can be accessed from each cpu's runqueue.
> > >
> >
> > kernel/sched.c: In function 'find_busiest_group':
> > kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
> >
> >
> > kernel/sched.c: In function 'schedule':
> > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> >
>
> I left those unfixed and...
>
> [ 62.307607] initcall init_nfsd+0x0/0xe2 [nfsd] returned 0 after 251 usecs
> [ 62.399426] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [ 74.009934] divide error: 0000 [#1] SMP
> [ 74.010085] last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.1/irq
> [ 74.010147] CPU 1
> [ 74.010241] Modules linked in: nfsd auth_rpcgss exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac parport_pc lp parport snd_hda_codec_realtek sg snd_hda_intel floppy snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ide_cd_mod cdrom snd_pcm_oss snd_mixer_oss serio_raw snd_pcm button shpchp snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
> [ 74.012821] Pid: 0, comm: swapper Not tainted 2.6.28-rc9-mm1 #1
> [ 74.012875] RIP: 0010:[<ffffffff802375d8>] [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> [ 74.012982] RSP: 0018:ffff88025e057d88 EFLAGS: 00010246
> [ 74.013038] RAX: 0000000000000000 RBX: ffff880028063900 RCX: ffff880028058dc0
> [ 74.013095] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
> [ 74.013152] RBP: ffff88025e057dd8 R08: 0000000000000000 R09: 0000000000000000
> [ 74.013209] R10: ffff880028067640 R11: 0000000000000022 R12: ffff880028063900
> [ 74.013264] R13: ffffffff802374c5 R14: ffffffff8022c546 R15: ffffffff8079b4a0
> [ 74.013321] FS: 0000000000000000(0000) GS:ffff88025e0385c0(0000) knlGS:0000000000000000
> [ 74.013381] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 74.013435] CR2: 00007fdb795af8f0 CR3: 0000000000201000 CR4: 00000000000006e0
> [ 74.013489] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 74.013545] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 74.013602] Process swapper (pid: 0, threadinfo ffff88025e050000, task ffff88025e03f870)
> [ 74.013661] Stack:
> [ 74.013708] 0000000000000000 ffff880028063a00 0000000000000000 0000000000000000
> [ 74.013893] 0000000000000008 ffffffff8079b4a0 ffff880028063900 ffffffff802374c5
> [ 74.013893] ffffffff8022c546 0000000000000030 ffff88025e057e08 ffffffff8022c530
> [ 74.013893] Call Trace:
> [ 74.013893] <IRQ> <0> [<ffffffff802374c5>] ? tg_shares_up+0x0/0x1f1
> [ 74.013893] [<ffffffff8022c546>] ? tg_nop+0x0/0x8
> [ 74.013893] [<ffffffff8022c530>] walk_tg_tree+0x5f/0x75
> [ 74.013893] [<ffffffff8022c65c>] update_shares+0x46/0x4a
> [ 74.013893] [<ffffffff80235fe4>] run_rebalance_domains+0x192/0x522
> [ 74.013893] [<ffffffff802543eb>] ? getnstimeofday+0x3d/0x9e
> [ 74.013893] [<ffffffff8052bd73>] ? _spin_unlock_irq+0x9/0xc
> [ 74.013893] [<ffffffff8024057f>] __do_softirq+0xa3/0x164
> [ 74.013893] [<ffffffff8020d0bc>] call_softirq+0x1c/0x28
> [ 74.013893] [<ffffffff8020e19d>] do_softirq+0x31/0x73
> [ 74.013893] [<ffffffff802402ca>] irq_exit+0x3f/0x41
> [ 74.013893] [<ffffffff8021e71d>] smp_apic_timer_interrupt+0x94/0xad
> [ 74.013893] [<ffffffff8020caf3>] apic_timer_interrupt+0x13/0x20
> [ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
> [ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> [ 74.013893] RSP <ffff88025e057d88>
> [ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
> [ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt
>
> config: http://userweb.kernel.org/~akpm/config-akpm2.txt
>
> I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().

hm, weird - Yinghai noticed this crash in -tip and we have that patch
zapped from all -*-next branches already.

Ingo

2008-12-19 22:33:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu


* Ingo Molnar <[email protected]> wrote:

> > [ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
> > [ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > [ 74.013893] RSP <ffff88025e057d88>
> > [ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > [ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> >
> > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> >
> > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
>
> hm, weird - Yinghai noticed this crash in -tip and we have that patch
> zapped from all -*-next branches already.

I guess you applied Ken's patch from email and had it in -mm? We have a
new version of that patch from Ken now based on Yinghai's bugreport - only
lightly tested for now.

Ingo

----------->
From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
From: Ken Chen <[email protected]>
Date: Fri, 19 Dec 2008 10:11:50 -0800
Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution

Impact: fix group scheduling behavior

While testing the CFS scheduler on the linux-2.6-tip tree:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

we found that a task which is pinned to a CPU can be starved relative to
its allocated fair share.

The per-cpu sched_entity load share calculation in tg_shares_up /
update_group_shares_cpu distributes the total task_group's share among all
CPUs in a given SD domain; this dilutes the task_group's per-cpu share
because it distributes share to CPUs that have no load. The trapped share
is then un-consumable, and it leads to fair share starvation on the
runnable CPU. Peter was right that the low level function still needs to
distinguish between a boosted share that has no load behind it and actual
tg share that should be distributed among the CPUs on which the tg is
running.

This patch adds that boost. We think the scheduler should spread at most
one times the tg shares across all the empty CPUs that have no load for
the specific task_group, in order to bound the maximum temporary boost
that a given task_group can have.

Signed-off-by: Ken Chen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 49 ++++++++++++++++++++++++++++++-------------------
1 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ae5ca3f..7d07c97 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1475,24 +1475,34 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
* Calculate and set the cpu's group shares.
*/
static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
+update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long boost,
unsigned long sd_shares, unsigned long sd_rq_weight)
{
- unsigned long shares;
+ unsigned long shares, raw_shares;
unsigned long rq_weight;

if (!tg->se[cpu])
return;

rq_weight = tg->cfs_rq[cpu]->rq_weight;
-
- /*
- * \Sum shares * rq_weight
- * shares = -----------------------
- * \Sum rq_weight
- *
- */
- shares = (sd_shares * rq_weight) / sd_rq_weight;
+ if (rq_weight && sd_rq_weight) {
+ /*
+ * \Sum shares * rq_weight
+ * shares = -----------------------
+ * \Sum rq_weight
+ *
+ */
+ raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
+ shares = raw_shares;
+ } else {
+ /*
+ * If there are currently no tasks on the cpu pretend there
+ * is one of average load so that when a new task gets to
+ * run here it will not get delayed by group starvation.
+ */
+ raw_shares = 0;
+ shares = boost;
+ }
shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);

if (abs(shares - tg->se[cpu]->load.weight) >
@@ -1501,7 +1511,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long flags;

spin_lock_irqsave(&rq->lock, flags);
- tg->cfs_rq[cpu]->shares = shares;
+ tg->cfs_rq[cpu]->shares = raw_shares;

__set_se_shares(tg->se[cpu], shares);
spin_unlock_irqrestore(&rq->lock, flags);
@@ -1517,18 +1527,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0;
unsigned long shares = 0;
+ unsigned long boost;
struct sched_domain *sd = data;
- int i;
+ int i, no_load_count = 0;

for_each_cpu(i, sched_domain_span(sd)) {
- /*
- * If there are currently no tasks on the cpu pretend there
- * is one of average load so that when a new task gets to
- * run here it will not get delayed by group starvation.
- */
weight = tg->cfs_rq[i]->load.weight;
if (!weight)
- weight = NICE_0_LOAD;
+ no_load_count++;

tg->cfs_rq[i]->rq_weight = weight;
rq_weight += weight;
@@ -1541,8 +1547,13 @@ static int tg_shares_up(struct task_group *tg, void *data)
if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
shares = tg->shares;

+ if (no_load_count)
+ boost = shares / no_load_count;
+ else
+ boost = shares / cpumask_weight(sched_domain_span(sd));
+
for_each_cpu(i, sched_domain_span(sd))
- update_group_shares_cpu(tg, i, shares, rq_weight);
+ update_group_shares_cpu(tg, i, boost, shares, rq_weight);

return 0;
}
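The per-cpu share calculation after this patch reduces to the following
simplified model of update_group_shares_cpu() (the clamping to
MIN_SHARES/MAX_SHARES is omitted):

```c
/* Simplified model of the patched calculation: a cpu with load gets
 * its exact proportional slice of the group's shares; a cpu with no
 * load gets only the bounded 'boost' instead of a diluted slice. */
static unsigned long cpu_group_shares(unsigned long sd_shares,
				      unsigned long rq_weight,
				      unsigned long sd_rq_weight,
				      unsigned long boost)
{
	if (rq_weight && sd_rq_weight)
		return sd_shares * rq_weight / sd_rq_weight;
	return boost;	/* idle cpu: bounded boost, no trapped share */
}
```

For example, with tg shares of 1024, two loaded cpus of rq_weight 1024
each (total 2048) and two idle cpus, each loaded cpu gets
1024*1024/2048 = 512 and each idle cpu gets only boost = 1024/2 = 512,
instead of trapping part of the loaded cpus' allocation.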

2008-12-19 22:39:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

On Fri, 19 Dec 2008 23:31:30 +0100
Ingo Molnar <[email protected]> wrote:

> * Ingo Molnar <[email protected]> wrote:
>
> > > [ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
> > > [ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > > [ 74.013893] RSP <ffff88025e057d88>
> > > [ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > > [ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> > >
> > > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> > >
> > > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
> >
> > hm, weird - Yinghai noticed this crash in -tip and we have that patch
> > zapped from all -*-next branches already.
>
> I guess you applied Ken's patch from email and had it in -mm?

Nope.

> We have a
> new version of that patch from Ken now based on Yinghai's bugreport - only
> lightly tested for now.
>
> Ingo
>
> ----------->
> From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
> From: Ken Chen <[email protected]>
> Date: Fri, 19 Dec 2008 10:11:50 -0800
> Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution

linux-next has:

commit 6eed5aaacae0ced9a43a923633befd2c695d6620
Author: Ken Chen <[email protected]>
Date: Mon Dec 15 23:37:50 2008 -0800

sched: fix uneven per-cpu task_group share distribution

so I guess we have a latency problem.

2008-12-19 22:56:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu


* Andrew Morton <[email protected]> wrote:

> On Fri, 19 Dec 2008 23:31:30 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > > [ 74.013893] <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45
> > > > [ 74.013893] RIP [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > > > [ 74.013893] RSP <ffff88025e057d88>
> > > > [ 74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > > > [ 74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> > > >
> > > > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> > > >
> > > > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
> > >
> > > hm, weird - Yinghai noticed this crash in -tip and we have that patch
> > > zapped from all -*-next branches already.
> >
> > I guess you applied Ken's patch from email and had it in -mm?
>
> Nope.
>
> > We have a
> > new version of that patch from Ken now based on Yinghai's bugreport - only
> > lightly tested for now.
> >
> > Ingo
> >
> > ----------->
> > From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
> > From: Ken Chen <[email protected]>
> > Date: Fri, 19 Dec 2008 10:11:50 -0800
> > Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution
>
> linux-next has:
>
> commit 6eed5aaacae0ced9a43a923633befd2c695d6620
> Author: Ken Chen <[email protected]>
> Date: Mon Dec 15 23:37:50 2008 -0800
>
> sched: fix uneven per-cpu task_group share distribution
>
> so I guess we have a latency problem.

ah, yes. I did this 22 hours ago:

| commit ea52f4ea5666ffed175ec41efac52b560818563d
| Author: Ingo Molnar <[email protected]>
| Date: Fri Dec 19 02:09:20 2008 +0100
|
| Revert "sched: fix uneven per-cpu task_group share distribution"
|
| This reverts commit 6eed5aaacae0ced9a43a923633befd2c695d6620.
|
| Causes crashes.

So should be fixed in the next linux-next iteration.

Ingo

2008-12-20 04:33:49

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

* Andrew Morton <[email protected]> [2008-12-19 13:55:08]:

> On Thu, 18 Dec 2008 23:26:22 +0530
> Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > When the system utilisation is low and more cpus are idle,
> > then the process waking up from sleep should prefer to
> > wakeup an idle cpu from semi-idle cpu package (multi core
> > package) rather than a completely idle cpu package which
> > would waste power.
> >
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> >
> > This info can be stored in an appropriate sched_domain, but
> > updating this info in all copies of sched_domain is not
> > practical. Hence this information is stored in root_domain
> > struct which is one copy per partitioned sched domain.
> > The root_domain can be accessed from each cpu's runqueue
> > and there is one copy per partitioned sched domain.
> >
>
> kernel/sched.c: In function 'find_busiest_group':
> kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
>
> Due to
>
> first_cpu(group_leader->cpumask);
>
> apparently because Rusty changed sched_group.cpumask into a plain old
> array and nobody tests their stuff against the tree into which it is
> actually integrated :(

Hi Andrew,

I agree. These are integration issues and I will test better next time.

There were two such bugs which Ingo fixed in the tip.

commit 220e7f617826e0527bbc523ba859f6a4bae0bfe1
Author: Ingo Molnar <[email protected]>
Date: Fri Dec 19 00:53:40 2008 +0100

sched: fix warning in kernel/sched.c

Impact: fix cpumask conversion bug

this warning:

kernel/sched.c: In function find_busiest_group:
kernel/sched.c:3429: warning: passing argument 1 of __first_cpu from incompatible pointer type

shows that we forgot to convert a new patch to the new cpumask APIs.

Signed-off-by: Ingo Molnar <[email protected]>


commit edb4c71953409c1deac1a80528ac0aa768762b33
Author: Ingo Molnar <[email protected]>
Date: Thu Dec 18 21:30:23 2008 +0100

sched: move test_sd_parent() to an SMP section of sched.h

Impact: build fix

Signed-off-by: Ingo Molnar <[email protected]>

I have tested the sched-tip with these fixes.

>
> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
>
> This warning is correct - the code is buggy.

Yes this is my code bug. I did not see the warning in sched.c. Is
there any build option that I need to pass in order to get -Wall
effect?

Here is the fix to initialise the active_balance=0.

Thanks for the detailed review. I will work to improve the quality of
my submission.

--Vaidy

sched: bug fix -- initialise active_balance variable

In load_balance_newidle() in kernel/sched.c, there is a potential use
of an uninitialised variable.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index e6a88bf..a21fe6d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3692,7 +3692,7 @@ redo:
}

if (!ld_moved) {
- int active_balance;
+ int active_balance = 0;

schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&

2008-12-20 04:45:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:

> >
> > kernel/sched.c: In function 'schedule':
> > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> >
> > This warning is correct - the code is buggy.
>
> Yes this is my code bug. I did not see the warning in sched.c. Is
> there any build option that I need to pass in order to get -Wall
> effect?

That was just with plain old kbuild: `make allmodconfig;make'.

That warning was produced by gcc-4.0.2. If you're using something more
recent then gcc has regressed.

2008-12-20 07:55:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu


* Andrew Morton <[email protected]> wrote:

> On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > >
> > > kernel/sched.c: In function 'schedule':
> > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > >
> > > This warning is correct - the code is buggy.
> >
> > Yes this is my code bug. I did not see the warning in sched.c. Is
> > there any build option that I need to pass in order to get -Wall
> > effect?
>
> That was just with plain old kbuild: `make allmodconfig;make'.
>
> That warning was produced by gcc-4.0.2. If you're using something more
> recent then gcc has regressed.

hm, it didn't trigger here with gcc 4.3 - and the condition is serious.
I've applied the fix, thanks guys!

Ingo

2008-12-20 10:04:45

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

* Andrew Morton <[email protected]> [2008-12-19 20:44:55]:

> On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > >
> > > kernel/sched.c: In function 'schedule':
> > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > >
> > > This warning is correct - the code is buggy.
> >
> > Yes this is my code bug. I did not see the warning in sched.c. Is
> > there any build option that I need to pass in order to get -Wall
> > effect?
>
> That was just with plain old kbuild: `make allmodconfig;make'.
>
> That warning was produced by gcc-4.0.2. If you're using something more
> recent then gcc has regressed.

This is an interesting problem. I am unable to get that warning in the
following GCC versions even with a combination of the following
cmdline options: -Wall -Wextra -Wuninitialized

gcc version 4.1.2 20070502
gcc version 4.2.3

I will look for older gcc and check this out.

--Vaidy

2008-12-20 10:35:39

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

* Vaidyanathan Srinivasan <[email protected]> [2008-12-20 15:32:28]:

> * Andrew Morton <[email protected]> [2008-12-19 20:44:55]:
>
> > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:
> >
> > > >
> > > > kernel/sched.c: In function 'schedule':
> > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > >
> > > > This warning is correct - the code is buggy.
> > >
> > > Yes this is my code bug. I did not see the warning in sched.c. Is
> > > there any build option that I need to pass in order to get -Wall
> > > effect?
> >
> > That was just with plain old kbuild: `make allmodconfig;make'.
> >
> > That warning was produced by gcc-4.0.2. If you're using something more
> > recent then gcc has regressed.
>
> This is an interesting problem. I am unable to get that warning in the
> following GCC versions even with a combination of the following
> cmdline options: -Wall -Wextra -Wuninitialized
>
> gcc version 4.1.2 20070502
> gcc version 4.2.3
>
> I will look for older gcc and check this out.

I am able to get the above warning in
gcc version 3.4.6 (Debian 3.4.6-9)
with default kbuild, no additional params.

I did not get the warning with
gcc version 4.1.3 20080623.

Hence some change in GCC has suppressed the uninitialized-variable
warning.

--Vaidy

2008-12-20 10:53:58

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

* Vaidyanathan Srinivasan <[email protected]> [2008-12-20 16:06:02]:

> * Vaidyanathan Srinivasan <[email protected]> [2008-12-20 15:32:28]:
>
> > * Andrew Morton <[email protected]> [2008-12-19 20:44:55]:
> >
> > > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:
> > >
> > > > >
> > > > > kernel/sched.c: In function 'schedule':
> > > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > > >
> > > > > This warning is correct - the code is buggy.
> > > >
> > > > Yes this is my code bug. I did not see the warning in sched.c. Is
> > > > there any build option that I need to pass in order to get -Wall
> > > > effect?
> > >
> > > That was just with plain old kbuild: `make allmodconfig;make'.
> > >
> > > That warning was produced by gcc-4.0.2. If you're using something more
> > > recent then gcc has regressed.
> >
> > This is an interesting problem. I am unable to get that warning in the
> > following GCC versions even with a combination of the following
> > cmdline options: -Wall -Wextra -Wuninitialized
> >
> > gcc version 4.1.2 20070502
> > gcc version 4.2.3
> >
> > I will look for older gcc and check this out.
>
> I am able to get the above warning in
> gcc version 3.4.6 (Debian 3.4.6-9)
> with default kbuild, no additional params.
>
> Did not get the warning in
> gcc version 4.1.3 20080623
>
> Hence there has been some change in GCC that prevented the
> uninitialized warning.

Related GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=179
Claimed to have been fixed in GCC 4.4, but I don't know if we are ready
to test a kernel compile with GCC 4.4.

I have tested up to gcc version 4.3.2 (Debian 4.3.2-1) and the bug
is not fixed: I do not get the warning from sched.c.

--Vaidy

2008-12-21 08:47:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu


* Vaidyanathan Srinivasan <[email protected]> wrote:

> * Vaidyanathan Srinivasan <[email protected]> [2008-12-20 16:06:02]:
>
> > * Vaidyanathan Srinivasan <[email protected]> [2008-12-20 15:32:28]:
> >
> > > * Andrew Morton <[email protected]> [2008-12-19 20:44:55]:
> > >
> > > > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <[email protected]> wrote:
> > > >
> > > > > >
> > > > > > kernel/sched.c: In function 'schedule':
> > > > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > > > >
> > > > > > This warning is correct - the code is buggy.
> > > > >
> > > > > Yes this is my code bug. I did not see the warning in sched.c. Is
> > > > > there any build option that I need to pass in order to get -Wall
> > > > > effect?
> > > >
> > > > That was just with plain old kbuild: `make allmodconfig;make'.
> > > >
> > > > That warning was produced by gcc-4.0.2. If you're using something more
> > > > recent then gcc has regressed.
> > >
> > > This is an interesting problem. I am unable to get that warning in the
> > > following GCC versions even with a combination of the following
> > > cmdline options: -Wall -Wextra -Wuninitialized
> > >
> > > gcc version 4.1.2 20070502
> > > gcc version 4.2.3
> > >
> > > I will look for older gcc and check this out.
> >
> > I am able to get the above warning in
> > gcc version 3.4.6 (Debian 3.4.6-9)
> > with default kbuild, no additional params.
> >
> > Did not get the warning in
> > gcc version 4.1.3 20080623
> >
> > Hence there has been some change in GCC that prevented the
> > uninitialized warning.
>
> Related GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=179 Claimed
> to have been fixed in GCC 4.4, but I don't know if we are ready to test
> kernel compile with GCC 4.4.

4.4 is still way too noisy on x86 so not really practical.

> I have tested till gcc version 4.3.2 (Debian 4.3.2-1) and the bug is not
> fixed. I do not get the warning from sched.c.

neither did i get that warning. But Andrew did so thankfully, so we've got
the fix :) We'll always need variance in testing.

Ingo

2008-12-29 23:44:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

Hi, Vaidyanathan.
It's a very late response. :(

> Results:
> --------
>
> Basic functionality of the code has not changed and the power vs
> performance benefits for kernbench are similar to the ones posted
> earlier.
>
> KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> package system
>
> SchedMC Run Time Package Idle Energy Power
> 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> 2 74.95 19.53% 85.92% 0.90x J 0.98y W
>
> The results are marginally better than the previous version of the
> patch series which could be within the test variation.
>
> Please feel free to test, and let me know your comments and feedback.
> I will post more experimental results with various benchmarks.

Your result is very interesting:
level 2 is both faster and more power efficient.

What is the major contributor to the lower run time at level 2?
I think it is that less time is lost to cache bouncing than before.
Is that right?

I want to test SCHED_MC, but I don't know what you used to benchmark power.
How do I get the data about 'Package, Idle, Energy, Power'?

--
Kind regards,
MinChan Kim

2008-12-30 02:48:34

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* MinChan Kim <[email protected]> [2008-12-30 08:43:58]:

> Hi, Vaidyanathan.
> It's very late reponse. :(
>
> > Results:
> > --------
> >
> > Basic functionality of the code has not changed and the power vs
> > performance benefits for kernbench are similar to the ones posted
> > earlier.
> >
> > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > package system
> >
> > SchedMC Run Time Package Idle Energy Power
> > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
> >
> > The results are marginally better than the previous version of the
> > patch series which could be within the test variation.
> >
> > Please feel free to test, and let me know your comments and feedback.
> > I will post more experimental results with various benchmarks.
>
> Your result is very interesting.
> level 2 is more fast and efficient of power.
>
> What's major contributor to use less time in level 2?
> I think it's cache bounce is less time than old.
> Is right ?
>

Yes, correct

> I want to test SCHED_MC but I don't know what you use to benchmark about power.
> How do I get the data about 'Package, Idle, Energy, Power'?
>

Note, it is Package Idle (for both packages) on an x86-64 8-core,
dual-socket, quad-core box. It is not Package, Idle.

For Energy and Power you need a means of measuring power like a meter.

> --
> Kinds regards,
> MinChan Kim
>

--
Balbir

2008-12-30 06:22:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n


* Balbir Singh <[email protected]> wrote:

> > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > package system
> > >
> > > SchedMC Run Time Package Idle Energy Power
> > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W

> > Your result is very interesting.
> > level 2 is more fast and efficient of power.
> >
> > What's major contributor to use less time in level 2?
> > I think it's cache bounce is less time than old.
> > Is right ?
>
> Yes, correct

yes, i too noticed that runtime improved so dramatically: +7.5% on
kernbench is a _very_ big deal.

So i wanted to ask you to re-test whether this speedup is reproducible,
and if yes, please check a few other workloads (for example sysbench on
postgresql / mysqld) and send a patch that changes the
sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).

Sidenote, _please_ fix your mail client to not mess up the Cc: list via:

Mail-Followup-To: MinChan Kim <[email protected]>,
Vaidyanathan Srinivasan <[email protected]>,
Linux Kernel <[email protected]>,
Suresh B Siddha <[email protected]>,
Venkatesh Pallipadi <[email protected]>,
Peter Zijlstra <[email protected]>,
Ingo Molnar <[email protected]>, Dipankar Sarma <[email protected]>,
Vatsa <[email protected]>, Gautham R Shenoy <[email protected]>,
Andi Kleen <[email protected]>,
David Collier-Brown <[email protected]>,
Tim Connors <[email protected]>,
Max Krasnyansky <[email protected]>,
Gregory Haskins <[email protected]>,
Pavel Machek <[email protected]>,
Andrew Morton <[email protected]>

Ingo

2008-12-30 06:44:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Ingo Molnar <[email protected]> [2008-12-30 07:21:39]:

>
> * Balbir Singh <[email protected]> wrote:
>
> > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > package system
> > > >
> > > > SchedMC Run Time Package Idle Energy Power
> > > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
>
> > > Your result is very interesting.
> > > level 2 is more fast and efficient of power.
> > >
> > > What's major contributor to use less time in level 2?
> > > I think it's cache bounce is less time than old.
> > > Is right ?
> >
> > Yes, correct
>
> yes, i too noticed that runtime improved so dramatically: +7.5% on
> kernbench is a _very_ big deal.
>

Yes, it is. I think one way to identify it on the x86 box would be to
use the performance counter patches; I'll see if I can get it working.

> So i wanted to ask you to re-test whether this speedup is reproducible,
> and if yes, please check a few other workloads (for example sysbench on
> postgresql / mysqld) and send a patch that changes the
> sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
>

We'll do that. Vaidyanathan is on vacation and will be back in the
first week of January; we'll get back with more tests and results.
From code review, it would seem that the benefits are going to be
workload specific, but the results will show for certain.

> Sidenote, _please_ fix your mail client to not mess up the Cc: list via:
>

Thanks, mutt has this turned on by default; I am turning it off and
hope that the followup-to goes away.


--
Balbir

2008-12-30 07:20:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n


* Balbir Singh <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2008-12-30 07:21:39]:
>
> >
> > * Balbir Singh <[email protected]> wrote:
> >
> > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > package system
> > > > >
> > > > > SchedMC Run Time Package Idle Energy Power
> > > > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
> >
> > > > Your result is very interesting.
> > > > level 2 is more fast and efficient of power.
> > > >
> > > > What's major contributor to use less time in level 2?
> > > > I think it's cache bounce is less time than old.
> > > > Is right ?
> > >
> > > Yes, correct
> >
> > yes, i too noticed that runtime improved so dramatically: +7.5% on
> > kernbench is a _very_ big deal.
> >
>
> Yes, it is, I think one way to identify it on the x86 box would be to
> use the performance counter patches, I'll see it I can get it working.

yeah, good idea, and it's easy to get perf-counters working: just boot
tip/master on a Core2 (or later) Intel CPU [see below about how to use
perfcounters on other CPUs], and pick up timec.c:

http://redhat.com/~mingo/perfcounters/timec.c

then run a kernel build like this:

$ ./timec -e -5,-4,-3,0,1,2,3,4,5 make -j16 bzImage

that's all. You should get an array of metrics like this:

[...]
Kernel: arch/x86/boot/bzImage is ready (#28)

Performance counter stats for 'make':

628315.871980 task clock ticks (msecs)

42330 CPU migrations (events)
124980 context switches (events)
18698292 pagefaults (events)
1351875946010 CPU cycles (events)
1121901478363 instructions (events)
10654788968 cache references (events)
633581867 cache misses (events)
247508675357 branches (events)
21567244144 branch misses (events)

Wall-clock time elapsed: 118348.109066 msecs

I'd guess the CPU migrations, context-switches, cycles+instructions
counters are the most interesting ones. I.e. on a Core2 a good metric is
probably:

$ ./timec -e -5,-4,0,1 make -j16 bzImage

[ If you see any weirdnesses in the numbers, then you should first suspect
some perf-counters bug. Please report any problems to us! ]

If your systems are not Core2 based then patches to extend perfcounters
support to those CPUs are welcome as well ;-)

NOTE: the sw counters (migrations, per task cost) are available on all
CPUs. I.e. you can always do:

$ ./timec -e -5,-4,-3 make -j16 bzImage

regardless of CPU type.

And note that the wall-clock and task-clock results in this case will be
far more accurate than normal 'time make -j16 bzImage' output.

Ingo

2008-12-30 17:28:27

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Balbir Singh <[email protected]> [2008-12-30 08:18:19]:

> * MinChan Kim <[email protected]> [2008-12-30 08:43:58]:
>
> > Hi, Vaidyanathan.
> > It's very late reponse. :(
> >
> > > Results:
> > > --------
> > >
> > > Basic functionality of the code has not changed and the power vs
> > > performance benefits for kernbench are similar to the ones posted
> > > earlier.
> > >
> > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > package system
> > >
> > > SchedMC Run Time Package Idle Energy Power
> > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
> > >
> > > The results are marginally better than the previous version of the
> > > patch series which could be within the test variation.
> > >
> > > Please feel free to test, and let me know your comments and feedback.
> > > I will post more experimental results with various benchmarks.
> >
> > Your result is very interesting.
> > level 2 is more fast and efficient of power.
> >
> > What's major contributor to use less time in level 2?
> > I think it's cache bounce is less time than old.
> > Is right ?
> >
>
> Yes, correct
>
> > I want to test SCHED_MC but I don't know what you use to benchmark about power.
> > How do I get the data about 'Package, Idle, Energy, Power'?
> >
>
> Note, it is Package Idle (for both packages), it is a x86-64 8 core,
> dual socket, quad core box. It is not Package, Idle.
>
> For Energy and Power you need a means of measuring power like a meter.
>

Hi MinChan,

Thank you for your interest in sched_mc power saving feature. As
Balbir has mentioned, you will need a power measurement infrastructure
like an external power meter.

Laptops provide battery discharge-rate measurements, which are a good
approximation of power consumption. But that is not helpful for testing
sched_mc, since we need a multi-socket, multi-core system to get any
power saving benefit from these enhancements.

The 'package idle' information comes from /proc/stat by adding up the
idle times from various logical CPUs belonging to a single physical
package. All logical CPUs belonging to a single physical package can
be identified from /proc/cpuinfo or
/sys/devices/system/cpu/cpu<n>/topology/physical_package_id

--Vaidy

2008-12-30 18:12:12

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Ingo Molnar <[email protected]> [2008-12-30 07:21:39]:

>
> * Balbir Singh <[email protected]> wrote:
>
> > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > package system
> > > >
> > > > SchedMC Run Time Package Idle Energy Power
> > > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
>
> > > Your result is very interesting.
> > > level 2 is more fast and efficient of power.
> > >
> > > What's major contributor to use less time in level 2?
> > > I think it's cache bounce is less time than old.
> > > Is right ?
> >
> > Yes, correct
>
> yes, i too noticed that runtime improved so dramatically: +7.5% on
> kernbench is a _very_ big deal.
>
> So i wanted to ask you to re-test whether this speedup is reproducible,
> and if yes, please check a few other workloads (for example sysbench on
> postgresql / mysqld) and send a patch that changes the
> sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).

The speedup for kernbench is reproducible. I will post a more
detailed test report on kernbench soon. The power/performance benefit
is for an under-utilised run of kernbench (nr_cpus/2), which is
ideal for demonstrating the power savings feature. I will also try
sysbench and post updated results a little later. As Balbir mentioned,
I am on vacation and traveling. I will post more benchmark results as
soon as possible.

--Vaidy

2008-12-18 17:53:38

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 2/8] sched: Framework for sched_mc/smt_power_savings=N

From: Gautham R Shenoy <[email protected]>

Currently the sched_mc/smt_power_savings variable is a boolean,
which either enables or disables topology-based power savings.
This patch extends the variable from a boolean to a multivalued
one, so that the value determines how aggressively we perform
power-savings balancing at the appropriate sched domain, based
on topology.

A tunable with multiple power-saving levels lets the end user
match the required power savings vs performance trade-off to
the system configuration and workload.

This version makes the sched_mc_power_savings global variable
take more values (0,1,2). Later versions can have a single
tunable called sched_power_savings instead of
sched_{mc,smt}_power_savings.

Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
Acked-by: Balbir Singh <[email protected]>
---

include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 17 ++++++++++++++---
2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 311122f..5fe906b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,6 +764,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */

+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * cpu package for power savings
+ */
+ MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
extern int sched_mc_power_savings, sched_smt_power_savings;

static inline int sd_balance_for_mc_power(void)
diff --git a/kernel/sched.c b/kernel/sched.c
index e4bb1dd..16897ab 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7879,14 +7879,25 @@ int arch_reinit_sched_domains(void)
static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
{
int ret;
+ unsigned int level = 0;

- if (buf[0] != '0' && buf[0] != '1')
+ if (sscanf(buf, "%u", &level) != 1)
+ return -EINVAL;
+
+ /*
+ * level is unsigned, so there is no need to check for
+ * level < POWERSAVINGS_BALANCE_NONE, which is 0.
+ * What happens on a 0- or 1-byte write --
+ * do we need to check count as well?
+ */
+
+ if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
return -EINVAL;

if (smt)
- sched_smt_power_savings = (buf[0] == '1');
+ sched_smt_power_savings = level;
else
- sched_mc_power_savings = (buf[0] == '1');
+ sched_mc_power_savings = level;

ret = arch_reinit_sched_domains();

2008-12-18 17:53:52

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 3/8] sched: favour lower logical cpu number for sched_mc balance

If two groups have identical load, prefer to move load to the lower
logical cpu number rather than, as the present logic does, to the
higher logical number.

find_busiest_group() looks for a group_leader that has spare capacity
to take more tasks and free up an appropriately least-loaded group. If
there is a tie and the load is equal, the group with the higher logical
number is currently favoured. This conflicts with the user-space
irqbalance daemon, which moves interrupts to lower logical numbers when
system utilisation is very low.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
Acked-by: Balbir Singh <[email protected]>
---

kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 16897ab..0b9bbbd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3264,7 +3264,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3280,7 +3280,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;

2008-12-18 17:54:34

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages

The preferred wakeup cpu (from a semi-idle package) was
nominated in find_busiest_group() in the previous patch. Use
this information, stored in sched_mc_preferred_wakeup_cpu, in
wake_idle() to bias task wakeups when the following conditions
are satisfied:
- The present cpu that is trying to wakeup the process is
idle and waking the target process on this cpu will
potentially wakeup a completely idle package
- The previous cpu on which the target process ran is
also idle and hence selecting the previous cpu may
wakeup a semi idle cpu package
- The task being woken up is allowed to run in the
nominated cpu (cpu affinity and restrictions)

Basically if both the current cpu and the previous cpu on
which the task ran are idle, select the nominated cpu from the
semi idle cpu package for running the new task that is waking up.

Cache hotness is considered since the actual biasing happens
in wake_idle() only if the application is cache cold.

This technique will effectively consolidate short running bursty jobs
in a mostly idle system.

Wakeup biasing for power savings gets automatically disabled if
system utilisation increases, because the probability of finding
both this_cpu and prev_cpu idle decreases.
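The conditions above can be condensed into a single predicate. The sketch below is illustrative only: `struct wtask`, `bias_wakeup` and the bitmask affinity model are simplified stand-ins for the kernel's task_struct, cpumask and idle_cpu() machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Condensed sketch of the gate added to wake_idle(): bias the wakeup
 * to the nominated cpu only when sched_mc=2, both prev_cpu and this_cpu
 * are idle, the task is a user task (has an mm, not a kthread), and its
 * affinity mask allows the nominated cpu. */
#define POWERSAVINGS_BALANCE_WAKEUP 2

struct wtask { bool has_mm; bool is_kthread; unsigned long cpus_allowed; };

static bool bias_wakeup(int sched_mc, bool prev_cpu_idle, bool this_cpu_idle,
                        const struct wtask *p, unsigned int chosen_cpu)
{
    return sched_mc >= POWERSAVINGS_BALANCE_WAKEUP &&
           prev_cpu_idle && this_cpu_idle &&
           p->has_mm && !p->is_kthread &&
           (p->cpus_allowed & (1UL << chosen_cpu));
}
```

Because every clause must hold, the bias naturally switches itself off as utilisation rises and idle cpus become rare.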

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched_fair.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 98345e4..c82386c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1027,6 +1027,24 @@ static int wake_idle(int cpu, struct task_struct *p)
cpumask_t tmp;
struct sched_domain *sd;
int i;
+ unsigned int chosen_wakeup_cpu;
+ int this_cpu;
+
+ /*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ */
+
+ this_cpu = smp_processor_id();
+ chosen_wakeup_cpu =
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+ idle_cpu(cpu) && idle_cpu(this_cpu) &&
+ p->mm && !(p->flags & PF_KTHREAD) &&
+ cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
+ return chosen_wakeup_cpu;

/*
* If it is idle, then it is the best cpu to run this task.

2008-12-18 17:53:20

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions

BALANCE_FOR_MC_POWER and similar macros defined in sched.h are
not constants; they contain condition checks and a significant
amount of code that is not suitable to be contained in a macro.
There could also be side effects on the expressions passed to
some of them, such as test_sd_parent().

This patch converts all complex macros related to power savings
balance to inline functions.
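The side-effect hazard is easy to demonstrate. The sketch below is a simplified stand-alone illustration (not the kernel's definitions): the macro form evaluates its `sd` argument twice, while the inline function evaluates it exactly once.

```c
#include <assert.h>

struct dom { int flags; struct dom *parent; };

/* Macro form, mirroring the old test_sd_parent(): note that `sd` is
 * expanded, and hence evaluated, twice. */
#define TEST_SD_PARENT_MACRO(sd, flag) \
    (((sd)->parent && ((sd)->parent->flags & (flag))) ? 1 : 0)

/* Inline form: the argument expression is evaluated exactly once. */
static inline int test_sd_parent_inline(struct dom *sd, int flag)
{
    if (sd->parent && (sd->parent->flags & flag))
        return 1;
    return 0;
}

static int lookups;

/* An argument expression with a visible side effect. */
static struct dom *get_dom(struct dom *d)
{
    lookups++;
    return d;
}
```

Passing `get_dom(&d)` to the macro bumps the counter twice; the inline function bumps it once, which is why the conversion is more than cosmetic.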

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 33 ++++++++++++++++++++++++---------
include/linux/topology.h | 4 ++--
2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..311122f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,15 +764,23 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */

-#define BALANCE_FOR_MC_POWER \
- (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
+extern int sched_mc_power_savings, sched_smt_power_savings;
+
+static inline int sd_balance_for_mc_power(void)
+{
+ if (sched_smt_power_savings)
+ return SD_POWERSAVINGS_BALANCE;

-#define BALANCE_FOR_PKG_POWER \
- ((sched_mc_power_savings || sched_smt_power_savings) ? \
- SD_POWERSAVINGS_BALANCE : 0)
+ return 0;
+}

-#define test_sd_parent(sd, flag) ((sd->parent && \
- (sd->parent->flags & flag)) ? 1 : 0)
+static inline int sd_balance_for_package_power(void)
+{
+ if (sched_mc_power_savings | sched_smt_power_savings)
+ return SD_POWERSAVINGS_BALANCE;
+
+ return 0;
+}


struct sched_group {
@@ -1358,6 +1366,15 @@ struct task_struct {
struct list_head *scm_work_list;
};

+/* Test a flag in parent sched domain */
+static inline int test_sd_parent(struct sched_domain *sd, int flag)
+{
+ if (sd->parent && (sd->parent->flags & flag))
+ return 1;
+
+ return 0;
+}
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2215,8 +2232,6 @@ __trace_special(void *__tr, void *__data,
extern long sched_setaffinity(pid_t pid, const cpumask_t *new_mask);
extern long sched_getaffinity(pid_t pid, cpumask_t *mask);

-extern int sched_mc_power_savings, sched_smt_power_savings;
-
extern void normalize_rt_tasks(void);

#ifdef CONFIG_GROUP_SCHED
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 117f1b7..7f10439 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,7 +125,7 @@ void arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | BALANCE_FOR_MC_POWER, \
+ | sd_balance_for_mc_power(),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -150,7 +150,7 @@ void arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | BALANCE_FOR_PKG_POWER,\
+ | sd_balance_for_package_power(),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

2008-12-18 17:54:53

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 6/8] sched: activate active load balancing in new idle cpus

Active load balancing is a process by which the migration thread
is woken up on the target CPU in order to pull a currently
running task on another package into this newly idle
package.

This method is already in use by normal load_balance();
this patch extends it to newly idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short running
daemon jobs in an almost idle system.

A side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. The number of failed
balance iterations before triggering may need tuning.
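The gating in the diff below can be reduced to a few lines. This sketch is an illustrative reduction only (the function name and counter handling are stand-ins for the kernel's sd->nr_balance_failed logic): active balancing fires only at sched_mc=2 and only after a couple of failed balance attempts, which limits the ping-ponging mentioned above.

```c
#include <assert.h>

#define POWERSAVINGS_BALANCE_WAKEUP 2

/* Return 1 when a newly idle cpu should wake the migration thread:
 * require the aggressive tunable level, and back off until the domain
 * has failed to balance twice. */
static int should_active_balance(int sched_mc, int *nr_balance_failed)
{
    if (sched_mc < POWERSAVINGS_BALANCE_WAKEUP)
        return 0;
    if ((*nr_balance_failed)++ < 2)  /* back off on the first failures */
        return 0;
    return 1;
}
```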

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3415fa3..f82a7a0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3692,10 +3692,64 @@ redo:
}

if (!ld_moved) {
+ int active_balance = 0;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 2)
+ return -1;
+
+ /*
+ * The only task running in a non-idle cpu can be moved to this
+ * cpu in an attempt to completely free up the other CPU
+ * package. The same method used to move a task in load_balance()
+ * has been extended for load_balance_newidle() to speed up
+ * consolidation at sched_mc=POWERSAVINGS_BALANCE_WAKEUP (2)
+ *
+ * The package power saving logic comes from
+ * find_busiest_group(). If there is no imbalance, then
+ * f_b_g() will return NULL. However when sched_mc={1,2} then
+ * f_b_g() will select a group from which a running task may be
+ * pulled to this cpu in order to make the other package idle.
+ * If there is no opportunity to make a package idle and if
+ * there is no imbalance, then f_b_g() will return NULL and no
+ * action will be taken in load_balance_newidle().
+ *
+ * Under normal task pull operation due to imbalance, there
+ * will be more than one task in the source run queue and
+ * move_tasks() will succeed. ld_moved will be true and this
+ * active balance code will not be triggered.
+ */
+
+ /* Lock busiest in correct order while this_rq is held */
+ double_lock_balance(this_rq, busiest);
+
+ /*
+ * don't kick the migration_thread, if the curr
+ * task on busiest cpu can't be moved to this_cpu
+ */
+ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ double_unlock_balance(this_rq, busiest);
+ all_pinned = 1;
+ return ld_moved;
+ }
+
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ active_balance = 1;
+ }
+
+ double_unlock_balance(this_rq, busiest);
+ if (active_balance)
+ wake_up_process(busiest->migration_thread);
+
} else
sd->nr_balance_failed = 0;

2008-12-18 17:54:18

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

When system utilisation is low and more cpus are idle,
a process waking up from sleep should prefer to
wake up an idle cpu in a semi-idle cpu package (multi core
package) rather than one in a completely idle cpu package,
which would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This info could be stored in the appropriate sched_domain, but
updating it in all copies of sched_domain is not
practical. Hence this information is stored in the root_domain
struct, of which there is one copy per partitioned sched
domain and which can be accessed from each cpu's runqueue.

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0b9bbbd..3415fa3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,14 @@ struct root_domain {
#ifdef CONFIG_SMP
struct cpupri cpupri;
#endif
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+ /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be
+ * used when most cpus are idle in the system indicating overall very
+ * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ */
+ unsigned int sched_mc_preferred_wakeup_cpu;
+#endif
};

/*
@@ -3407,6 +3415,10 @@ out_balanced:

if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ first_cpu(group_leader->cpumask);
+ }
return group_min;
}
#endif

2008-12-18 17:55:39

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 7/8] sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0

Add the SD_BALANCE_NEWIDLE flag at the MC level and CPU level
if sched_mc is set. This helps power savings and
will not affect performance when sched_mc=0.

Ingo and Mike Galbraith have optimised the SD flags by
removing SD_BALANCE_NEWIDLE at MC and CPU level. This
helps performance but hurts power savings since this
slows down task consolidation by reducing the number
of times load_balance is run.

sched: fine-tune SD_MC_INIT
commit 14800984706bf6936bbec5187f736e928be5c218
Author: Mike Galbraith <[email protected]>
Date: Fri Nov 7 15:26:50 2008 +0100

sched: re-tune balancing -- revert
commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
Author: Ingo Molnar <[email protected]>
Date: Wed Nov 5 16:52:08 2008 +0100

This patch selectively enables SD_BALANCE_NEWIDLE flag
only when sched_mc is set to 1 or 2. This helps power savings
by task consolidation and also does not hurt performance at
sched_mc=0 where all power saving optimisations are turned off.
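The flag composition is simple to model. The sketch below is illustrative only: the SD_* constants are made-up values and `cpu_domain_flags()` is a stand-in for the SD_CPU_INIT initializer, showing how SD_BALANCE_NEWIDLE is OR'ed in only when a power savings tunable is non-zero.

```c
#include <assert.h>

#define SD_BALANCE_NEWIDLE      0x02   /* illustrative value */
#define SD_POWERSAVINGS_BALANCE 0x100  /* illustrative value */

static int sched_mc_power_savings, sched_smt_power_savings;

static inline int sd_power_saving_flags(void)
{
    if (sched_mc_power_savings | sched_smt_power_savings)
        return SD_BALANCE_NEWIDLE;
    return 0;
}

static inline int sd_balance_for_package_power(void)
{
    if (sched_mc_power_savings | sched_smt_power_savings)
        return SD_POWERSAVINGS_BALANCE;
    return 0;
}

/* Stand-in for the flag composition done in the SD_CPU_INIT macro. */
static int cpu_domain_flags(void)
{
    return sd_balance_for_package_power() | sd_power_saving_flags();
}
```

With both tunables at 0, the default flags are unchanged, which is how sched_mc=0 keeps its original performance behaviour.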

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

include/linux/sched.h | 13 +++++++++++++
include/linux/topology.h | 6 ++++--
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5fe906b..07348ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,19 @@ static inline int sd_balance_for_package_power(void)
return 0;
}

+/*
+ * Optimise SD flags for power savings:
+ * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings.
+ * Keep default SD flags if sched_{smt,mc}_power_saving=0
+ */
+
+static inline int sd_power_saving_flags(void)
+{
+ if (sched_mc_power_savings | sched_smt_power_savings)
+ return SD_BALANCE_NEWIDLE;
+
+ return 0;
+}

struct sched_group {
struct sched_group *next; /* Must be a circular list */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7f10439..1628b43 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,7 +125,8 @@ void arch_update_cpu_topology(void);
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
- | sd_balance_for_mc_power(),\
+ | sd_balance_for_mc_power()\
+ | sd_power_saving_flags(),\
.last_balance = jiffies, \
.balance_interval = 1, \
}
@@ -150,7 +151,8 @@ void arch_update_cpu_topology(void);
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE \
- | sd_balance_for_package_power(),\
+ | sd_balance_for_package_power()\
+ | sd_power_saving_flags(),\
.last_balance = jiffies, \
.balance_interval = 1, \
}

2008-12-18 17:55:15

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()

load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
is set at the higher level domain (3-CPU) but not in the low level
domain (2-MC).

pulled_task is initialised to -1 and checked for non-zero,
which is always true if the lowest level sched_domain does not
have the SD_BALANCE_NEWIDLE flag set.

Trivial fix to initialise pulled_task to zero.
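The failure mode is easy to reproduce in miniature. The sketch below is a hypothetical reduction of the idle_balance() domain walk (names and the flag value are stand-ins): with pulled_task initialised to -1, the non-zero check breaks out of the loop at the first domain lacking SD_BALANCE_NEWIDLE, so the higher level domain is never reached.

```c
#include <assert.h>

#define SD_BALANCE_NEWIDLE 0x02  /* illustrative value */

/* Walk the domain flags bottom-up; return how many domains we visited. */
static int walk_domains(const int *flags, int n, int pulled_task_init)
{
    int pulled_task = pulled_task_init;
    int visited = 0;

    for (int i = 0; i < n; i++) {
        visited++;
        if (flags[i] & SD_BALANCE_NEWIDLE)
            pulled_task = 1;  /* pretend load_balance_newidle() pulled one */
        if (pulled_task)      /* checked for non-zero: -1 also breaks here */
            break;
    }
    return visited;
}
```

Initialising to 0 lets the walk continue past flag-less low level domains, which is exactly what the one-line fix restores.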

This patch has been queued for 2.6.29
http://lkml.org/lkml/2008/12/8/213

Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
---

kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f82a7a0..e6a88bf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3773,7 +3773,7 @@ out_balanced:
static void idle_balance(int this_cpu, struct rq *this_rq)
{
struct sched_domain *sd;
- int pulled_task = -1;
+ int pulled_task = 0;
unsigned long next_balance = jiffies + HZ;
cpumask_t tmpmask;

2008-12-18 18:12:52

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages

* Vaidyanathan Srinivasan <[email protected]> [2008-12-18 23:26:29]:

> Preferred wakeup cpu (from a semi idle package) has been
> nominated in find_busiest_group() in the previous patch. Use
> this information in sched_mc_preferred_wakeup_cpu in function
> wake_idle() to bias task wakeups if the following conditions
> are satisfied:
> - The present cpu that is trying to wakeup the process is
> idle and waking the target process on this cpu will
> potentially wakeup a completely idle package
> - The previous cpu on which the target process ran is
> also idle and hence selecting the previous cpu may
> wakeup a semi idle cpu package
> - The task being woken up is allowed to run in the
> nominated cpu (cpu affinity and restrictions)
>
> Basically if both the current cpu and the previous cpu on
> which the task ran is idle, select the nominated cpu from semi
> idle cpu package for running the new task that is waking up.
>
> Cache hotness is considered since the actual biasing happens
> in wake_idle() only if the application is cache cold.
>
> This technique will effectively move short running bursty jobs in
> a mostly idle system.
>
> Wakeup biasing for power savings gets automatically disabled if
> system utilisation increases due to the fact that the probability
> of finding both this_cpu and prev_cpu idle decreases.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched_fair.c | 18 ++++++++++++++++++
> 1 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 98345e4..c82386c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1027,6 +1027,24 @@ static int wake_idle(int cpu, struct task_struct *p)
> cpumask_t tmp;
> struct sched_domain *sd;
> int i;
> + unsigned int chosen_wakeup_cpu;
> + int this_cpu;
> +
> + /*
> + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> + * are idle and this is not a kernel thread and this task's affinity
> + * allows it to be moved to preferred cpu, then just move!
> + */
> +
> + this_cpu = smp_processor_id();
> + chosen_wakeup_cpu =
> + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> +
> + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> + idle_cpu(cpu) && idle_cpu(this_cpu) &&
> + p->mm && !(p->flags & PF_KTHREAD) &&
> + cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> + return chosen_wakeup_cpu;
>
> /*
> * If it is idle, then it is the best cpu to run this task.

Acked-by: Balbir Singh <[email protected]>

--
Balbir

2008-12-18 18:12:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu

* Vaidyanathan Srinivasan <[email protected]> [2008-12-18 23:26:22]:

> When the system utilisation is low and more cpus are idle,
> then the process waking up from sleep should prefer to
> wakeup an idle cpu from semi-idle cpu package (multi core
> package) rather than a completely idle cpu package which
> would waste power.
>
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
>
> This info can be sored in appropriate sched_domain, but
> updating this info in all copies of sched_domain is not
> practical. Hence this information is stored in root_domain
> struct which is one copy per partitioned sched domain.
> The root_domain can be accessed from each cpu's runqueue
> and there is one copy per partitioned sched domain.
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 0b9bbbd..3415fa3 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -493,6 +493,14 @@ struct root_domain {
> #ifdef CONFIG_SMP
> struct cpupri cpupri;
> #endif
> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> + /*
> + * Preferred wake up cpu nominated by sched_mc balance that will be
> + * used when most cpus are idle in the system indicating overall very
> + * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
> + */
> + unsigned int sched_mc_preferred_wakeup_cpu;
> +#endif
> };
>
> /*
> @@ -3407,6 +3415,10 @@ out_balanced:
>
> if (this == group_leader && group_leader != group_min) {
> *imbalance = min_load_per_task;
> + if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
> + cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> + first_cpu(group_leader->cpumask);
> + }
> return group_min;
> }
> #endif

Acked-by: Balbir Singh <[email protected]>

--
Balbir

2008-12-18 18:13:34

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()

* Vaidyanathan Srinivasan <[email protected]> [2008-12-18 23:26:59]:

> load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
> is set at higher level domain (3-CPU) and not in low level domain
> (2-MC).
>
> pulled_task is initialised to -1 and checked for non-zero which
> is always true if the lowest level sched_domain does not have
> SD_BALANCE_NEWIDLE flag set.
>
> Trivial fix to initialise pulled_task to zero.
>
> This patch has been queued for 2.6.29
> http://lkml.org/lkml/2008/12/8/213
>
> Signed-off-by: Vaidyanathan Srinivasan <[email protected]>
> ---
>
> kernel/sched.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f82a7a0..e6a88bf 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3773,7 +3773,7 @@ out_balanced:
> static void idle_balance(int this_cpu, struct rq *this_rq)
> {
> struct sched_domain *sd;
> - int pulled_task = -1;
> + int pulled_task = 0;
> unsigned long next_balance = jiffies + HZ;
> cpumask_t tmpmask;
>

Isn't this already in Ingo's patch queue?

--
Balbir

2008-12-18 20:20:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n


* Vaidyanathan Srinivasan <[email protected]> wrote:

> Hi,
>
> The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run
> the workload in the system on minimum number of CPU packages and tries
> to keep rest of the CPU packages idle for longer duration. Thus
> consolidating workloads to fewer packages help other packages to be in
> idle state and save power. The current implementation is very
> conservative and does not work effectively across different workloads.
> Initial idea of tunable sched_mc_power_savings=n was proposed to enable
> tuning of the power saving load balancer based on the system
> configuration, workload characteristics and end user requirements.
>
> The power savings and performance of the given workload in an under
> utilised system can be controlled by setting values of 0, 1 or 2 to
> /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest
> performance and least power savings and level 2 indicating maximum power
> savings even at the cost of slight performance degradation.
>
> Please refer to the following discussions and article for details.
>
> [1]Making power policy just work
> http://lwn.net/Articles/287924/
>
> [2][RFC v1] Tunable sched_mc_power_savings=n
> http://lwn.net/Articles/287882/
>
> v2: http://lwn.net/Articles/297306/
> v3: http://lkml.org/lkml/2008/11/10/260
> v4: http://lkml.org/lkml/2008/11/21/47
> v5: http://lkml.org/lkml/2008/12/11/178
> v6: http://lwn.net/Articles/311830/
>
> The following series of patch demonstrates the basic framework for
> tunable sched_mc_power_savings.
>
> This version of the patch incorporates comments and feedback received
> on the previous post from Andrew Morton.
>
> Changes form v6:
> ----------------
> * Convert BALANCE_FOR_xx_POWER and related macros to inline functions
> based on comments from Andrew and Ingo.
> * Ran basic kernelbench test and did not see any performance variation
> due to the changes.
>
> Changes form v5:
> ---------------
> * Fixed the sscanf bug and checking for (p->flags & PF_KTHREAD)
> * Dropped the RFC prefix to indicate that the patch is ready for
> testing and inclusion
> * Patch series against 2.6.28-rc8 kernel

thanks, applied - and i started testing them. It needed some help here and
there to resolve conflicts with pending cpumask changes. Could you please
double-check the merged up end result in the latest scheduler devel tree:

http://people.redhat.com/mingo/tip.git/README

thanks,

Ingo

2008-12-18 20:18:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()


* Balbir Singh <[email protected]> wrote:

> > static void idle_balance(int this_cpu, struct rq *this_rq)
> > {
> > struct sched_domain *sd;
> > - int pulled_task = -1;
> > + int pulled_task = 0;
> > unsigned long next_balance = jiffies + HZ;
> > cpumask_t tmpmask;
> >
>
> Isn't this already in Ingo's patch queue?

yep - i left it out.

Ingo

2008-12-18 20:32:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n


* Ingo Molnar <[email protected]> wrote:

> thanks, applied - and i started testing them. [...]

the sched.h bits needed the small fix below. (struct sched_domain is not
defined on UP)

Ingo

------------->
From edb4c71953409c1deac1a80528ac0aa768762b33 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Thu, 18 Dec 2008 21:30:23 +0100
Subject: [PATCH] sched: move test_sd_parent() to an SMP section of sched.h

Impact: build fix

Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a933d9..e5f928a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -920,6 +920,15 @@ extern void partition_sched_domains(int ndoms_new, struct cpumask *doms_new,
struct sched_domain_attr *dattr_new);
extern int arch_reinit_sched_domains(void);

+/* Test a flag in parent sched domain */
+static inline int test_sd_parent(struct sched_domain *sd, int flag)
+{
+ if (sd->parent && (sd->parent->flags & flag))
+ return 1;
+
+ return 0;
+}
+
#else /* CONFIG_SMP */

struct sched_domain_attr;
@@ -1431,15 +1440,6 @@ struct task_struct {
#endif
};

-/* Test a flag in parent sched domain */
-static inline int test_sd_parent(struct sched_domain *sd, int flag)
-{
- if (sd->parent && (sd->parent->flags & flag))
- return 1;
-
- return 0;
-}
-
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH

2009-01-02 07:23:27

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Vaidyanathan Srinivasan <[email protected]> [2008-12-30 23:37:22]:

> * Ingo Molnar <[email protected]> [2008-12-30 07:21:39]:
>
> >
> > * Balbir Singh <[email protected]> wrote:
> >
> > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > package system
> > > > >
> > > > > SchedMC Run Time Package Idle Energy Power
> > > > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
> >
> > > > Your result is very interesting.
> > > > level 2 is more fast and efficient of power.
> > > >
> > > > What's major contributor to use less time in level 2?
> > > > I think it's cache bounce is less time than old.
> > > > Is right ?
> > >
> > > Yes, correct
> >
> > yes, i too noticed that runtime improved so dramatically: +7.5% on
> > kernbench is a _very_ big deal.
> >
> > So i wanted to ask you to re-test whether this speedup is reproducible,
> > and if yes, please check a few other workloads (for example sysbench on
> > postgresql / mysqld) and send a patch that changes the
> > sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
>
> The speedup for kernbench is reproducible. I will post a more
> detailed test report on kernbench soon. The power-performance benefit
> is for an under-utilised system (nr_cpus/2) run of kernbench which is
> very ideal to demonstrate the power savings feature. I will also try
> sysbench and update results little later. As Balbir mentioned, I am
> on vacation and traveling. I will post more benchmark results as soon
> as possible.

Hi Ingo,

I ran my kernbench test setup over multiple iterations with several
thread counts. The performance will almost max out at 8-10 threads of
kernbench since the system has total of 8 logical CPUs.

The power saving benefit is expected to be maximal around 30-50%
system utilisation (3-5 threads).

I have posted the normalised results in various tables below.
Apparently the sched_mc=2 setting is helping in the 8 threads case
also, due to better cache line sharing. I will work with Balbir to
see if we can prove this assertion using performance counters. I
will start looking at sysbench and try to get benchmark results
and power/energy data in similar configurations.

--Vaidy

All these tests were done on sched-tip at
commit: fe235c6b2a4b9eb11909fe40249abcce67f7d45d
uname -a is 2.6.28-rc8-tip-sv-05664-gfe235c6-dirty

Kernbench was run on the 2.6.28 source tree while the previous tests
used the 2.6.25 kernel, which had far less stuff to compile in defconfig :)

System configuration is x86 quad core dual socket system
with 8 logical CPUs

Each result is an average of 5 iterations.

Kernbench threads = 2

SchedMC Run Time Package Idle(%) Energy(J) Power(W)
0 217.37 73.30 76.93 1.00 1.00
1 204.68 50.86 99.91 0.95 1.01
2 204.49 50.69 99.93 0.95 1.01

Kernbench threads = 4

SchedMC Run Time Package Idle(%) Energy(J) Power(W)
0 109.11 51.99 54.91 1.00 1.00
1 109.71 35.86 71.88 0.96 0.96
2 102.95 25.02 82.47 0.92 0.97

Kernbench threads = 6

SchedMC Run Time Package Idle(%) Energy(J) Power(W)
0 74.77 35.29 36.37 1.00 1.00
1 74.54 28.83 40.90 0.99 0.99
2 73.38 21.09 47.29 0.98 1.00

Kernbench threads = 8

SchedMC Run Time Package Idle(%) Energy(J) Power(W)
0 63.18 26.23 28.67 1.00 1.00
1 56.89 19.85 22.38 0.93 1.02
2 55.27 17.65 19.84 0.91 1.03


Relative measures for sched_mc=1 compared to baseline sched_mc=0

Threads=> 2 4 6 8 10 12
KernbenchTime 0.94 1.01 1.00 0.90 0.93 0.95
Energy 0.95 0.96 0.99 0.93 0.95 0.97
Power 1.01 0.96 0.99 1.02 1.02 1.01

Relative measures for sched_mc=2 compared to baseline sched_mc=0

Threads=> 2 4 6 8 10 12
KernbenchTime 0.94 0.94 0.98 0.87 0.92 0.94
Energy 0.95 0.92 0.98 0.91 0.94 0.96
Power 1.01 0.97 1.00 1.03 1.02 1.02

2009-01-02 22:16:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n


* Vaidyanathan Srinivasan <[email protected]> wrote:

> * Vaidyanathan Srinivasan <[email protected]> [2008-12-30 23:37:22]:
>
> > * Ingo Molnar <[email protected]> [2008-12-30 07:21:39]:
> >
> > >
> > > * Balbir Singh <[email protected]> wrote:
> > >
> > > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > > package system
> > > > > >
> > > > > > SchedMC Run Time Package Idle Energy Power
> > > > > > 0 81.68 52.83% 54.71% 1.00x J 1.00y W
> > > > > > 1 80.70 36.62% 70.11% 0.95x J 0.96y W
> > > > > > 2 74.95 19.53% 85.92% 0.90x J 0.98y W
> > >
> > > > > Your result is very interesting.
> > > > > level 2 is more fast and efficient of power.
> > > > >
> > > > > What's major contributor to use less time in level 2?
> > > > > I think it's cache bounce is less time than old.
> > > > > Is right ?
> > > >
> > > > Yes, correct
> > >
> > > yes, i too noticed that runtime improved so dramatically: +7.5% on
> > > kernbench is a _very_ big deal.
> > >
> > > So i wanted to ask you to re-test whether this speedup is reproducible,
> > > and if yes, please check a few other workloads (for example sysbench on
> > > postgresql / mysqld) and send a patch that changes the
> > > sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
> >
> > The speedup for kernbench is reproducible. I will post a more
> > detailed test report on kernbench soon. The power-performance benefit
> > is for an under-utilised system (nr_cpus/2) run of kernbench which is
> > very ideal to demonstrate the power savings feature. I will also try
> > sysbench and update results little later. As Balbir mentioned, I am
> > on vacation and traveling. I will post more benchmark results as soon
> > as possible.
>
> Hi Ingo,
>
> I ran my kernbench test setup over multiple iterations with several
> thread counts. The performance will almost max out at 8-10 threads of
> kernbench since the system has total of 8 logical CPUs.
>
> The power saving benefits is expected to be maximum around 30-50%
> system utilisation (3-5 threads).
>
> I have posted the normalised results in various tables below.
> Apparently the sched_mc=2 settings is helping at 8 threads case also
> due to better cache line sharing. I will work with Balbir to see if
> we can prove this assertion using performance counters. I will start
> looking at sysbench and try to get benchmark results and power/energy
> data in similar configurations.
>
> --Vaidy
>
> All these tests were done on sched-tip at
> commit: fe235c6b2a4b9eb11909fe40249abcce67f7d45d
> uname -a is 2.6.28-rc8-tip-sv-05664-gfe235c6-dirty
>
> Kernbench was run on source 2.6.28 while the previous tests used
> 2.6.25 kernel which had far less stuff to compile in defconfig :)
>
> System configuration is x86 quad core dual socket system
> with 8 logical CPUs
>
> Each result is an average of 5 iterations.
>
> Kernbench threads = 2
>
> SchedMC Run Time Package Idle(%) Energy(J) Power(W)
> 0 217.37 73.30 76.93 1.00 1.00
> 1 204.68 50.86 99.91 0.95 1.01
> 2 204.49 50.69 99.93 0.95 1.01
>
> Kernbench threads = 4
>
> SchedMC Run Time Package Idle(%) Energy(J) Power(W)
> 0 109.11 51.99 54.91 1.00 1.00
> 1 109.71 35.86 71.88 0.96 0.96
> 2 102.95 25.02 82.47 0.92 0.97
>
> Kernbench threads = 6
>
> SchedMC Run Time Package Idle(%) Energy(J) Power(W)
> 0 74.77 35.29 36.37 1.00 1.00
> 1 74.54 28.83 40.90 0.99 0.99
> 2 73.38 21.09 47.29 0.98 1.00
>
> Kernbench threads = 8
>
> SchedMC Run Time Package Idle(%) Energy(J) Power(W)
> 0 63.18 26.23 28.67 1.00 1.00
> 1 56.89 19.85 22.38 0.93 1.02
> 2 55.27 17.65 19.84 0.91 1.03

looks very promising.

> Relative measures for sched_mc=1 compared to baseline sched_mc=0
>
> Threads=> 2 4 6 8 10 12
> KernbenchTime 0.94 1.01 1.00 0.90 0.93 0.95
> Energy 0.95 0.96 0.99 0.93 0.95 0.97
> Power 1.01 0.96 0.99 1.02 1.02 1.01
>
> Relative measures for sched_mc=2 compared to baseline sched_mc=0
>
> Threads=> 2 4 6 8 10 12
> KernbenchTime 0.94 0.94 0.98 0.87 0.92 0.94
> Energy 0.95 0.92 0.98 0.91 0.94 0.96
> Power 1.01 0.97 1.00 1.03 1.02 1.02

That's icing on the cake.

Mike, would you be interested in having a look at sched_mc=2 as a
kernel-wide default - and give it your blessing if you find it to be a net
improvement for the various performance and interactivity tests you do?

Ingo

2009-01-03 07:29:42

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:

> Mike, would you be interesting in having a look at sched_mc=2 as a
> kernel-wide default - and give it your blessing if you find it to be a net
> improvement for the various performance and interactivity tests you do?

Sure.

-Mike

2009-01-03 10:13:44

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-03 08:29:25]:

> On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
>
> > Mike, would you be interesting in having a look at sched_mc=2 as a
> > kernel-wide default - and give it your blessing if you find it to be a net
> > improvement for the various performance and interactivity tests you do?
>
> Sure.

Thanks Mike and Ingo. I will be glad to help with tests and benchmarks
on the platforms that I have access to.

I am presently working on sysbench.

--Vaidy

2009-01-03 11:22:56

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> * Mike Galbraith <[email protected]> [2009-01-03 08:29:25]:
>
> > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> >
> > > Mike, would you be interesting in having a look at sched_mc=2 as a
> > > kernel-wide default - and give it your blessing if you find it to be a net
> > > improvement for the various performance and interactivity tests you do?
> >
> > Sure.
>
> Thanks Mike and Ingo. I will be glad to help with test and benchmarks
> on the platforms that I have access.
>
> I am presently working on sysbench.

The testing I can do is rather severely limited since I have only one
Q6600. I butchered mc_capable() to use what I can though, ie see if
SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp. I think that's
about all I can do on my wee box.

-Mike

2009-01-04 15:00:32

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Sat, 2009-01-03 at 12:22 +0100, Mike Galbraith wrote:
> On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> > * Mike Galbraith <[email protected]> [2009-01-03 08:29:25]:
> >
> > > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> > >
> > > > Mike, would you be interesting in having a look at sched_mc=2 as a
> > > > kernel-wide default - and give it your blessing if you find it to be a net
> > > > improvement for the various performance and interactivity tests you do?
> > >
> > > Sure.
> >
> > Thanks Mike and Ingo. I will be glad to help with test and benchmarks
> > on the platforms that I have access.
> >
> > I am presently working on sysbench.
>
> The testing I can do is rather severely limited since I have only one
> Q6600. I butchered mc_capable() to use what I can though, ie see if
> SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp. I think that's
> about all I can do on my wee box.

I do not see any difference for tbench; results are within jitter. I do
for mysql+oltp, and the test results are fairly strange.

If you take a peek at the attached chart: the 2.6.26.8 data is with
scalability backports/fixes. That's where our 29-to-be should be.
Domain tunings in this kernel are identical to 28/29 stock as well.

Note that there is no knee at 8 clients. If I turn SD_BALANCE_NEWIDLE
on in 26+backports, peak drops a tad, and that knee re-appears, just as
before we turned the thing off. Throughput after the knee also drops a
tad, nearly to the point where tip+sched_mc=2 now _comes up to_, and it
definitely is SD_BALANCE_NEWIDLE making the difference. IOW, what used
to be a loser, and still is a loser in 26+fixes, is now a winner in tip
after the knee which should not be there. Seems something else has
changed, re-introducing the knee, and cutting throughput a tad.

(The hefty difference at the very end I knew about. SD_BALANCE_NEWIDLE
helps considerably when mysql is very thoroughly jammed up on itself)

When I look at that chart, it looks almost like SD_BALANCE_NEWIDLE is
partially offsetting some other problem.

I haven't done any interactivity testing yet.

-Mike


Attachments:
zzz.jpg (74.20 kB)

2009-01-04 18:23:21

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-04 16:00:00]:

> On Sat, 2009-01-03 at 12:22 +0100, Mike Galbraith wrote:
> > On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> > > * Mike Galbraith <[email protected]> [2009-01-03 08:29:25]:
> > >
> > > > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> > > >
> > > > > Mike, would you be interesting in having a look at sched_mc=2 as a
> > > > > kernel-wide default - and give it your blessing if you find it to be a net
> > > > > improvement for the various performance and interactivity tests you do?
> > > >
> > > > Sure.
> > >
> > > Thanks Mike and Ingo. I will be glad to help with test and benchmarks
> > > on the platforms that I have access.
> > >
> > > I am presently working on sysbench.
> >
> > The testing I can do is rather severely limited since I have only one
> > Q6600. I butchered mc_capable() to use what I can though, ie see if
> > SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp. I think that's
> > about all I can do on my wee box.
>
> I do not see any difference for tbench, results are within jitter. I do
> for mysql+oltp, and the test results are fairly strange.
>
> If you take a peek at the attached chart: the 2.6.26.8 data is with
> scalability backports/fixes. That's where our 29-to-be should be.
> Domain tunings in this kernel are identical to 28/29 stock as well.
>
> Note that there is no knee at 8 clients. If I turn SD_BALANCE_NEWIDLE
> on in 26+backports, peak drops a tad, and that knee re-appears, just as
> before we turned the thing off. Throughput after the knee also drops a
> tad, nearly to the point where tip+sched_mc=2 now _comes up to_, and it
> definitely is SD_BALANCE_NEWIDLE making the difference. IOW, what used
> to be a loser, and still is a loser in 26+fixes, is now a winner in tip
> after the knee which should not be there. Seems something else has
> changed, re-introducing the knee, and cutting throughput a tad.
>
> (The hefty difference at the very end I knew about. SD_BALANCE_NEWIDLE
> helps considerably when mysql is very thoroughly jammed up on itself)
>
> When I look at that chart, it looks almost like SD_BALANCE_NEWIDLE is
> partially offsetting some other problem.
>
> I haven't done any interactivity testing yet.

Hi Mike,

Thanks for the detailed test report.

I am new to sysbench. I just started a few OLTP runs with pgsql. In
your graph you are plotting and comparing read/write-per-sec and not
transactions-per-sec. Both parameters vary in a similar manner.
Can you please give me some background on using the
read/write-per-sec result for comparison?

I assume you have run the above tests on the Q6600 box, which has a
single quad-core package consisting of two dual-core CPUs. Can you
please let me know the sched_domain tree that was built by hacking
mc_capable()? The effect of sched_mc={1,2} depends on the sched
groups that were built and their flags.

As you have mentioned in your data, sched_mc=2 helps recover some
performance mainly because of NEWIDLE balance.

You have mentioned clients in the x-axis of the graph, what is their
relation to the number of threads?

Please feel free to point me to any previous discussion on sysbench
where the above questions have been discussed.

Thanks,
Vaidy

2009-01-04 19:53:10

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Sun, 2009-01-04 at 23:49 +0530, Vaidyanathan Srinivasan wrote:

> I am new to sysbench. I just started few OLTP runs with pgsql. In
> your graph you are plotting and comparing read/write-per-sec and not
> transactions-per-sec. Both the parameter vary in a similar manner.
> Can you please let me know some background on using the
> read/write-per-sec result for comparison.

It means nothing. One result is the same as any other. I prefer low
level read/write but someone else may prefer higher level transactions.

> I assume you have run the above tests on Q6600 box that has single
> quad core package that consist of two dual core CPUs.

Yes. I can go down from there, but not up. Oh darn.

> Can you please
> let me know the sched_domain tree that was build by hacking
> mc_capable(). The effect of sched_mc={1,2} depends on the
> sched groups that was build and their flags.

Hm. I don't know what you're asking. I hacked mc_capable only so I
could have the tweakable handy in case there was something I didn't know
about yet (having _just_ jumped in with both feet;)

> As you have mentioned in your data, sched_mc=2 helps recover some
> performance mainly because of NEWIDLE balance.

Yes, odd.

> You have mentioned clients in the x-axis of the graph, what is their
> relation to the number of threads?

I have 4 cores, and 2 caches. Any relationship is all about cache.

> Please feel free to point me to any previous discussion on sysbench
> where the above questions have been discussed.

I haven't seen such.

> Thanks,
> Vaidy

Cheers,

-Mike

2009-01-05 03:23:18

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-04 20:52:49]:

> On Sun, 2009-01-04 at 23:49 +0530, Vaidyanathan Srinivasan wrote:
>
> > I am new to sysbench. I just started few OLTP runs with pgsql. In
> > your graph you are plotting and comparing read/write-per-sec and not
> > transactions-per-sec. Both the parameter vary in a similar manner.
> > Can you please let me know some background on using the
> > read/write-per-sec result for comparison.
>
> It means nothing. One result is the same as any other. I prefer low
> level read/write but someone else may prefer higher level transactions.
>
> > I assume you have run the above tests on Q6600 box that has single
> > quad core package that consist of two dual core CPUs.
>
> Yes. I can go down from there, but not up. Oh darn.
>
> > Can you please
> > let me know the sched_domain tree that was build by hacking
> > mc_capable(). The effect of sched_mc={1,2} depends on the
> > sched groups that was build and their flags.
>
> Hm. I don't know what you're asking. I hacked mc_capable only so I
> could have the tweakable handy in case there was something I didn't know
> about yet (having _just_ jumped in with both feet;)

When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
(dmesg)

CPU0 attaching sched-domain:
domain 0: span 0,2-4 level MC
groups: 0 2 3 4
domain 1: span 0-7 level CPU
groups: 0,2-4 1,5-7
CPU1 attaching sched-domain:
domain 0: span 1,5-7 level MC
groups: 1 5 6 7
domain 1: span 0-7 level CPU
groups: 1,5-7 0,2-4

This tree clearly depicts what the scheduler thinks of the system.
Correct sched domain and groups setup is critical for performance and
powersavings. When you hack mc_capable(), the domain at the MC level
should span the two cores sharing an L2 cache, and the CPU level
should have two groups, each with two CPUs, representing the two
dual-core CPUs.

The above example is from my dual socket quad core system.

> > As you have mentioned in your data, sched_mc=2 helps recover some
> > performance mainly because of NEWIDLE balance.
>
> Yes, odd.
>
> > You have mentioned clients in the x-axis of the graph, what is their
> > relation to the number of threads?
>
> I have 4 cores, and 2 caches. Any relationship is all about cache.

I was actually asking about software threads specified in the sysbench
benchmark. You have run up to 256 clients on a 4-core box; does
that mean sysbench had 256 worker threads?

Thanks for the clarifications.

--Vaidy

2009-01-05 04:40:48

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Mon, 2009-01-05 at 08:50 +0530, Vaidyanathan Srinivasan wrote:
> When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
> (dmesg)

Oh, that. I'm dense <thwack>

[ 0.476050] CPU0 attaching sched-domain:
[ 0.476052] domain 0: span 0-1 level MC
[ 0.476054] groups: 0 1
[ 0.476057] domain 1: span 0-3 level CPU
[ 0.476058] groups: 0-1 2-3
[ 0.476062] CPU1 attaching sched-domain:
[ 0.476064] domain 0: span 0-1 level MC
[ 0.476065] groups: 1 0
[ 0.476067] domain 1: span 0-3 level CPU
[ 0.476069] groups: 0-1 2-3
[ 0.476072] CPU2 attaching sched-domain:
[ 0.476073] domain 0: span 2-3 level MC
[ 0.476075] groups: 2 3
[ 0.476077] domain 1: span 0-3 level CPU
[ 0.476078] groups: 2-3 0-1
[ 0.476081] CPU3 attaching sched-domain:
[ 0.476083] domain 0: span 2-3 level MC
[ 0.476084] groups: 3 2
[ 0.476086] domain 1: span 0-3 level CPU
[ 0.476088] groups: 2-3 0-1

2.6.26.8
[ 0.524043] CPU0 attaching sched-domain:
[ 0.524045] domain 0: span 0-1
[ 0.524046] groups: 0 1
[ 0.524049] domain 1: span 0-3
[ 0.524051] groups: 0-1 2-3
[ 0.524054] CPU1 attaching sched-domain:
[ 0.524055] domain 0: span 0-1
[ 0.524056] groups: 1 0
[ 0.524059] domain 1: span 0-3
[ 0.524060] groups: 0-1 2-3
[ 0.524063] CPU2 attaching sched-domain:
[ 0.524064] domain 0: span 2-3
[ 0.524065] groups: 2 3
[ 0.524068] domain 1: span 0-3
[ 0.524069] groups: 2-3 0-1
[ 0.524072] CPU3 attaching sched-domain:
[ 0.524073] domain 0: span 2-3
[ 0.524075] groups: 3 2
[ 0.524077] domain 1: span 0-3
[ 0.524078] groups: 2-3 0-1

> I was actually asking about software threads specified in the sysbench
> benchmark. Your have run almost 256 clients on a 4 core box, does
> that mean sysbench had 256 worker threads?

Yes.

-Mike

2009-01-05 06:37:18

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

Virgin 28 does not have a knee, but grows one when NEWIDLE is enabled.
The throughput curve is essentially the mc=2 curve with the knee
smoothed away, and a higher peak. The very end is considerably lower;
i.e. it only helps the extremely contested end, and not nearly as much.
The whole curve is roughly a point under the 26.8 curve (which could
easily be config differences etc, but the shape is right).

I'll rummage around.

-Mike

2009-01-05 15:20:55

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:

> I'll rummage around.

Seems to be about the only thing it could be, load balancing inflicting
injury on very sensitive mysql+oltp pairs.

-Mike

2009-01-06 09:31:57

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Mon, 2009-01-05 at 16:19 +0100, Mike Galbraith wrote:
> On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:
>
> > I'll rummage around.
>
> Seems to be about the only thing it could be, load balancing inflicting
> injury on very sensitive mysql+oltp pairs.

BTW, I verified this. Reverting all load-balancing changes fully
restored mysql+oltp peak, and brought mid-range throughput to the same
level as sched_mc=2 except at the log-jam end. (haven't looked at
vmark, though I'd expect it to be hurting a bit too, it's affinity
sensitive as well)

I expected sched_mc=2 to help an nfs mount kbuild, and it did, quite a
bit. I first tried an nfs4 mount, but after a while, the odd ipv6 80%
idle problem came back, so I reverted to nfs3. Full build time there
went from 4m25s to 4m2s. A nice improvement.

I haven't noticed anything on the interactivity front.

Personally, I'd go for sched_mc=2 as default. I value the fork/exec
load much more than sensitive benchmarks, though what hurts mysql+oltp
will certainly hurt others as well. We have a bit of conflict between
keeping CPUs busy and affinity cost. Something to work on.

-Mike

2009-01-06 14:52:14

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-05 05:40:16]:

> On Mon, 2009-01-05 at 08:50 +0530, Vaidyanathan Srinivasan wrote:
> > When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
> > (dmesg)
>
> Oh, that. I'm dense <thwack>
>
> [ 0.476050] CPU0 attaching sched-domain:
> [ 0.476052] domain 0: span 0-1 level MC
> [ 0.476054] groups: 0 1
> [ 0.476057] domain 1: span 0-3 level CPU
> [ 0.476058] groups: 0-1 2-3
> [ 0.476062] CPU1 attaching sched-domain:
> [ 0.476064] domain 0: span 0-1 level MC
> [ 0.476065] groups: 1 0
> [ 0.476067] domain 1: span 0-3 level CPU
> [ 0.476069] groups: 0-1 2-3
> [ 0.476072] CPU2 attaching sched-domain:
> [ 0.476073] domain 0: span 2-3 level MC
> [ 0.476075] groups: 2 3
> [ 0.476077] domain 1: span 0-3 level CPU
> [ 0.476078] groups: 2-3 0-1
> [ 0.476081] CPU3 attaching sched-domain:
> [ 0.476083] domain 0: span 2-3 level MC
> [ 0.476084] groups: 3 2
> [ 0.476086] domain 1: span 0-3 level CPU
> [ 0.476088] groups: 2-3 0-1

Hi Mike,

This seems to be correct for the configuration. I hope this would be
the same for sched_mc=1 and sched_mc=2, since you have hacked
mc_capable().

By default, all 4 cores would form one group at the CPU level at
sched_mc={1,2}, so that packages are clearly identified in the
CPU-level sched groups.


> 2.6.26.8
> [ 0.524043] CPU0 attaching sched-domain:
> [ 0.524045] domain 0: span 0-1
> [ 0.524046] groups: 0 1
> [ 0.524049] domain 1: span 0-3
> [ 0.524051] groups: 0-1 2-3
> [ 0.524054] CPU1 attaching sched-domain:
> [ 0.524055] domain 0: span 0-1
> [ 0.524056] groups: 1 0
> [ 0.524059] domain 1: span 0-3
> [ 0.524060] groups: 0-1 2-3
> [ 0.524063] CPU2 attaching sched-domain:
> [ 0.524064] domain 0: span 2-3
> [ 0.524065] groups: 2 3
> [ 0.524068] domain 1: span 0-3
> [ 0.524069] groups: 2-3 0-1
> [ 0.524072] CPU3 attaching sched-domain:
> [ 0.524073] domain 0: span 2-3
> [ 0.524075] groups: 3 2
> [ 0.524077] domain 1: span 0-3
> [ 0.524078] groups: 2-3 0-1
>
> > I was actually asking about software threads specified in the sysbench
> > benchmark. Your have run almost 256 clients on a 4 core box, does
> > that mean sysbench had 256 worker threads?
>
> Yes.

Let me try similar experiments on my dual socket quad core system.
I was limiting the threads to 8, assuming that the system would max
out by then.

Thanks for the updates.

--Vaidy

2009-01-06 15:04:32

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-06 10:31:37]:

> On Mon, 2009-01-05 at 16:19 +0100, Mike Galbraith wrote:
> > On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:
> >
> > > I'll rummage around.
> >
> > Seems to be about the only thing it could be, load balancing inflicting
> > injury on very sensitive mysql+oltp pairs.
>
> BTW, I verified this. Reverting all load-balancing changes fully
> restored mysql+oltp peak, and brought mid-range throughput to the same
> level as sched_mc=2 except at the log-jam end. (haven't looked at
> vmark, though I'd expect it to be hurting a bit too, it's affinity
> sensitive as well)
>
> I expected sched_mc=2 to help an nfs mount kbuild, and it did, quite a
> bit. I first tried an nfs4 mount, but after a while, the odd ipv6 80%
> idle problem came back, so I reverted to nfs3. Full built time there
> went from 4m25s to 4m2s. A nice improvement.
>
> I haven't noticed anything on the interactivity front.
>
> Personally, I'd go for sched_mc=2 as default. I value the fork/exec
> load much more than sensitive benchmarks, though what hurts mysql+oltp
> will certainly hurt others as well. We have a bit of conflict between
> keeping CPUs busy and affinity cost. Something to work on.

Hi Mike,

Thanks for the detailed benchmark reports. Glad to hear that
sched_mc=2 is helping in most scenarios. Though we would be tempted to
make it the default, I would still like to default to zero in order to
provide baseline performance. I would expect end users to flip the
settings to sched_mc=2 if it helps their workload in terms of
performance and/or power savings.

The fact that sched_mc=2 provides performance and/or power saving
benefits is a good justification to include the new code and tunable.

The benefits from the sched_mc=2 setting vary widely based on
workload and system configuration. Hence, in my opinion, I would not
want to change the default to 2 at this time, until there is more
widespread use of the tunable under various workloads and system
configurations.

--Vaidy

2009-01-06 17:48:52

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Tue, 2009-01-06 at 20:37 +0530, Vaidyanathan Srinivasan wrote:

> Hi Mike,
>
> Thanks for the detailed benchmark reports. Glad to hear that
> sched_mc=2 is helping in most scenarios. Though we would be tempted to
> make it default, I would still like to default to zero in order to
> provide base line performance. I would expect end users to flip the
> settings to sched_mc=2 if it helps their workload in terms of
> performance and/or power savings.

The mysql+oltp peak loss is there either way, but with 2, mid-range
throughput is roughly at the 2.6.28 baseline. High end (yawn) is
better, and the nfs kbuild performs better than baseline.

Baseline performance, at least wrt mysql+oltp, doesn't seem to be an
option. Not my call. More testing and more testers required I suppose.

-Mike

2009-01-06 18:46:10

by Balbir Singh

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-06 18:48:37]:

> On Tue, 2009-01-06 at 20:37 +0530, Vaidyanathan Srinivasan wrote:
>
> > Hi Mike,
> >
> > Thanks for the detailed benchmark reports. Glad to hear that
> > sched_mc=2 is helping in most scenarios. Though we would be tempted to
> > make it default, I would still like to default to zero in order to
> > provide base line performance. I would expect end users to flip the
> > settings to sched_mc=2 if it helps their workload in terms of
> > performance and/or power savings.
>
> The mysql+oltp peak loss is there either way, but with 2, mid range
> throughput is ~28 baseline. High end (yawn) is better, and the nfs
> kbuild performs better than baseline.
>
> Baseline performance, at least wrt mysql+oltp doesn't seem to be an
> option. Not my call. More testing and more testers required I suppose.

Yes, more testing is definitely due. I'd like to hear from people
with larger and newer boxes as well before I would be comfortable
making sched_mc=2 the default.

--
Balbir

2009-01-07 08:59:30

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

I have a couple questions if you don't mind.

In wake_idle() there's an odd looking check for kthreads. Why does it
matter if a kbuild task wakes kernel or userland helper thread?

I also don't see why sched_mc overrides domain tunings. You can turn
NEWIDLE off and sched_mc remains as set, so it's a one-way override. If
NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
sched_mc to 0 if the user explicitly disables NEWIDLE.

-Mike

2009-01-07 11:24:08

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-07 09:59:15]:

> I have a couple questions if you don't mind.
>
> In wake_idle() there's an odd looking check for kthreads. Why does it
> matter if a kbuild task wakes kernel or userland helper thread?

Only user space tasks/threads should be biased at wakeup and
consolidated for the following reasons:

* Kernel threads generally perform housekeeping tasks, and their rate
of execution and CPU utilisation is far less than that of user tasks
in an under-utilised system.
* Biasing at wakeup helps consolidate bursty user space tasks. Long
running user space tasks will be consolidated by the load balancer.
kthreads are not bursty; they are generally very short running.

> I also don't see why sched_mc overrides domain tunings. You can turn
> NEWIDLE off and sched_mc remains as set, so it's a one-way override. If
> NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> sched_mc to 0 if the user explicitly disables NEWIDLE.

SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings, while
it is not mandated for sched_mc=0. We can remove NEWIDLE balance at
sched_mc=0 if that helps baseline performance. Ingo had identified
that removing NEWIDLE balance helps performance in certain benchmarks.

I do not get what you meant by NEWIDLE off. Is there
a separate user space control to independently set or reset
SD_BALANCE_NEWIDLE at a given sched_domain level?

Thanks for the detailed review.

--Vaidy

2009-01-07 14:36:43

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Wed, 2009-01-07 at 16:56 +0530, Vaidyanathan Srinivasan wrote:
> * Mike Galbraith <[email protected]> [2009-01-07 09:59:15]:
>
> > I have a couple questions if you don't mind.
> >
> > In wake_idle() there's an odd looking check for kthreads. Why does it
> > matter if a kbuild task wakes kernel or userland helper thread?
>
> Only user space tasks/threads should be biased at wakeup and
> consolidated for the following reasons:
>
> * kernel threads generally perform housekeeping tasks and its rate of
> execution and cpu utilisation is far less than that of user task in
> an under utilised system.
> * Biasing at wakeup help consolidate bursty user space tasks. Long
> running user space tasks will be consolidated by load balancer.
> kthreads are not bursty, they are generally very short running.

Rummaging around in an nfs mount is bursty, and the nfs threads are part
of the burst. No big deal, but I think it's illogical to differentiate.

> > I also don't see why sched_mc overrides domain tunings. You can turn
> > NEWIDLE off and sched_mc remains as set, so it's a one-way override. If
> > NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> > sched_mc to 0 if the user explicitly disables NEWIDLE.
>
> SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings while

That's a buglet then, because you can have sched_mc=2 and NEWIDLE off.

> not mandated for sched_mc=0. We can remove NEWIDLE balance at
> sched_mc=0 if that helps baseline performance. Ingo had identified
> that removing NEWIDLE balance help performance in certain benchmarks.

It used to help mysql+oltp low end throughput for one. efbe027 seems to
have done away with that though (not that alone).

> I do not get what you have mentioned by NEWIDLE off? Is there
> a separate user space control to independently set or reset
> SD_BALANCE_NEWIDLE at a give sched_domain level?

Yes. /proc/sys/kernel/sched_domain/cpuN/domainN/*

> Thanks for the detailed review.

You're very welcome.

-Mike

> --Vaidy

2009-01-07 15:32:33

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-07 15:36:25]:

> On Wed, 2009-01-07 at 16:56 +0530, Vaidyanathan Srinivasan wrote:
> > * Mike Galbraith <[email protected]> [2009-01-07 09:59:15]:
> >
> > > I have a couple questions if you don't mind.
> > >
> > > In wake_idle() there's an odd looking check for kthreads. Why does it
> > > matter if a kbuild task wakes kernel or userland helper thread?
> >
> > Only user space tasks/threads should be biased at wakeup and
> > consolidated for the following reasons:
> >
> > * kernel threads generally perform housekeeping tasks and its rate of
> > execution and cpu utilisation is far less than that of user task in
> > an under utilised system.
> > * Biasing at wakeup help consolidate bursty user space tasks. Long
> > running user space tasks will be consolidated by load balancer.
> > kthreads are not bursty, they are generally very short running.
>
> Rummaging around in an nfs mount is bursty, and the nfs threads are part
> of the burst. No big deal, but I think it's illogical to differentiate.

nfs threads are a really good example. I missed that. I will play
around with nfs threads and try to optimise the condition.

> > > I also don't see why sched_mc overrides domain tunings. You can turn
> > > NEWIDLE off and sched_mc remains as set, so it's a one-way override. If
> > > NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> > > sched_mc to 0 if the user explicitly disables NEWIDLE.
> >
> > SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings while
>
> That's a buglet then, because you can have sched_mc=2 and NEWIDLE off.

Tweaking the sched domain parameters affects the default power saving
heuristics anyway. Protecting against the flags alone will not help.
Your point is valid; I will think about this scenario and see how we
can protect the behavior.

> > not mandated for sched_mc=0. We can remove NEWIDLE balance at
> > sched_mc=0 if that helps baseline performance. Ingo had identified
> > that removing NEWIDLE balance help performance in certain benchmarks.
>
> It used to help mysql+oltp low end throughput for one. efbe027 seems to
> have done away with that though (not that alone).
>
> > I do not get what you have mentioned by NEWIDLE off? Is there
> > a separate user space control to independently set or reset
> > SD_BALANCE_NEWIDLE at a give sched_domain level?
>
> Yes. /proc/sys/kernel/sched_domain/cpuN/domainN/*

Changing flags through this interface gives complete low-level
control. End users who want to tweak individual sched domain
parameters are either experimenting or optimising for their specific
workload.

The defaults should generally work across common workloads. Users can
further tweak these low-level settings. We can clearly document the
relations so that we know what to expect.

sched_mc=2 depends on NEWIDLE but the wakeup biasing part will be
enabled even without this flag. The power savings will be marginally
better than sched_mc=1 without NEWIDLE. But the end user can have
that setup if they want to.

You have pointed out a scenario where users can turn off NEWIDLE and
still expect to use sched_mc=2. The scenario is valid and the user
should be allowed to do so.

--Vaidy

2009-01-08 08:07:03

by Mike Galbraith

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Wed, 2009-01-07 at 21:05 +0530, Vaidyanathan Srinivasan wrote:

> sched_mc=2 depends on NEWIDLE but the wakeup biasing part will be
> enabled even without this flag. The power savings will be marginally
> better than sched_mc=1 without NEWIDLE. But the end user can have
> that setup if they want to.

That implies to me that there should exist a flag SD_WAKEUP_BIAS or
whatnot. All settings should be visible. Currently, sched_mc > 0 turns
on NEWIDLE, which is visible, but 2 turns on this 'invisible to the
ultimate low-level control interface' thing.

-Mike

2009-01-08 17:44:20

by Vaidyanathan Srinivasan

Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

* Mike Galbraith <[email protected]> [2009-01-08 09:06:48]:

> On Wed, 2009-01-07 at 21:05 +0530, Vaidyanathan Srinivasan wrote:
>
> > sched_mc=2 depends on NEWIDLE but the wakeup biasing part will be
> > enabled even without this flag. The power savings will be marginally
> > better than sched_mc=1 without NEWIDLE. But the end user can have
> > that setup if they want to.
>
> That implies to me that there should exist a flag SD_WAKEUP_BIAS or
> whatnot. All settings should be visible. Currently, sched_mc > 0 turns
> on NEWIDLE, which is visible, but 2 turns on this 'invisible to the
> ultimate low-level control interface' thing.

Having a feature flag for each power saving balance feature is possible
but would be overkill, since independent control of features at each
sched domain level may not yield useful configurations. The power
saving heuristic that gets enabled at sched_mc=1 is primarily
controlled by SD_POWERSAVE_BALANCE, but it has side effects and can be
rendered ineffective by changing other SD flags.

We did consider having an individual SD flag for each of the power
savings functions, but that quickly led to too many possible
configurations, most of them incorrect or ineffective. In order
to improve the usability of the power saving features, we have
proposed enhancing the existing tunable with monotonically increasing
power-savings vs performance tradeoff levels.

I agree with the 'visibility' point you have made about the low-level
interface. It would be good to have the flags show up there, but would
end users like system administrators want to know about and tune these
flags? It makes sense for the current set of flags, which depict
important scheduler features, but the power-savings-related flags only
further qualify or emphasise existing functions and control points in
the scheduler.

In my opinion, adding flags for each power saving feature would add
unnecessary complexity and may not be that useful to end users.

--Vaidy

2009-01-09 06:00:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n

On Thu, 2009-01-08 at 23:16 +0530, Vaidyanathan Srinivasan wrote:

> I agree with the 'visibility' to low level interface that you have
> pointed out. It will be good to have flags showup at the low level
> interface, but do end users like system administrators would want to
> know and tune these flags?

IMHO there is no difference between these tunings and any other tuning;
the unwary admin can shoot himself in the foot in any number of ways.
The monotonic interface setting sensible flags is fine, but invisible
flags are not.

We may have to agree to disagree.

-Mike