2010-04-09 06:21:29

by Michael Neuling

Subject: [PATCH 0/5] sched: asymmetrical packing for POWER7 SMT4

This patch series implements asymmetric SMT packing which ensures
consistently good performance on POWER7.

I'm no scheduler expert, so these probably need some review. It works
for my simple test case on a POWER7 with SMT4: I can now run a number
of CPU-bound processes and they are pulled down to the lower threads,
giving the best performance consistently.
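
To make the intended placement concrete, here is an illustrative
sketch (not code from this series; the function name is made up) of
which threads we want busy on one SMT4 core for a given number of
runnable tasks. Packing onto the lowest-numbered threads leaves the
higher ones idle so they can cede and the core can drop SMT mode:

/* Illustration only: preferred busy-thread mask for one SMT4 core. */
static unsigned int preferred_busy_mask(unsigned int nr_tasks)
{
	if (nr_tasks >= 4)
		return 0xf;			/* threads 0-3 busy: SMT4 */
	return (1u << nr_tasks) - 1;		/* e.g. 2 tasks -> 0x3: SMT2 */
}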

Suresh: some of this touches code you changed recently, so you may
want to test in your environment.

The 1st patch is a fix for SMT4 in general.
The 2nd adds the actual asymmetrical packing infrastructure.
The 3rd adds the powerpc specific hooks for POWER7.
The 4th and 5th are fixes required for packing to work.


2010-04-09 06:21:27

by Michael Neuling

Subject: [PATCH 3/5] powerpc: enabled asymmetric SMT scheduling on POWER7

The POWER7 core has dynamic SMT mode switching which is controlled by
the hypervisor. There are 3 SMT modes:
SMT1 uses thread 0
SMT2 uses threads 0 & 1
SMT4 uses threads 0, 1, 2 & 3
When in any particular SMT mode, all threads have the same performance
as each other (ie. at any moment in time, all threads perform the same).

The SMT mode switching works such that when linux has threads 2 & 3 idle
and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the
idle loop and the hypervisor will automatically switch to SMT2 for that
core (independent of other cores). The opposite is not true, so if
threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.

Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
into SMT1 mode.

If we can get the core into a lower SMT mode (SMT1 is best), the threads
will perform better (since they share less core resources). Hence when
we have idle threads, we want them to be the higher ones.

This adds a feature bit for asymmetric packing to powerpc and then
enables it on POWER7.

Signed-off-by: Michael Neuling <[email protected]>

---

arch/powerpc/include/asm/cputable.h | 3 ++-
arch/powerpc/kernel/process.c | 13 +++++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)

Index: linux-2.6-ozlabs/arch/powerpc/include/asm/cputable.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/cputable.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/cputable.h
@@ -195,6 +195,7 @@ extern const char *powerpc_base_platform
#define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000)
#define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000)
#define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000)
+#define CPU_FTR_ASYM_SMT LONG_ASM_CONST(0x0100000000000000)

#ifndef __ASSEMBLY__

@@ -409,7 +410,7 @@ extern const char *powerpc_base_platform
CPU_FTR_MMCRA | CPU_FTR_SMT | \
CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
- CPU_FTR_DSCR | CPU_FTR_SAO)
+ CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT)
#define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
Index: linux-2.6-ozlabs/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/process.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/process.c
@@ -1265,3 +1265,16 @@ unsigned long randomize_et_dyn(unsigned

return ret;
}
+
+unsigned char asym_smt_print = 1;
+int arch_sd_sibiling_asym_packing(void)
+{
+ if (cpu_has_feature(CPU_FTR_ASYM_SMT)){
+ if (asym_smt_print) {
+ pr_info("Enabling Asymetric SMT scheduling\n");
+ asym_smt_print = 0;
+ }
+ return SD_ASYM_PACKING;
+ }
+ return 0;
+}

2010-04-09 06:21:31

by Michael Neuling

Subject: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

With the asymmetric packing infrastructure, fix_small_imbalance is
causing idle higher threads to pull tasks off lower threads.

This is being caused by an off-by-one error.

Signed-off-by: Michael Neuling <[email protected]>
---
I'm not sure this is the right fix but without it, higher threads pull
tasks off the lower threads, then the packing pulls it back down, etc
etc and tasks bounce around constantly.

---

kernel/sched_fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
* SCHED_LOAD_SCALE;
scaled_busy_load_per_task /= sds->busiest->cpu_power;

- if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
+ if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
(scaled_busy_load_per_task * imbn)) {
*imbalance = sds->busiest_load_per_task;
return;

2010-04-09 06:21:24

by Michael Neuling

Subject: [PATCH 2/5] sched: add asymmetric packing option for sibling domain

Some CPUs perform better when tasks are run on lower thread numbers.
In the case of POWER7, when higher threads are idled, the core can run
in lower SMT modes and hence perform better.

This creates a new sd flag to prefer lower threads.

Based heavily on patch from Peter Zijlstra.

Signed-off-by: Michael Neuling <[email protected]>
---
Peter: Since this is based mainly off your initial patch, it should
have your signed-off-by too, but I didn't want to add without your
permission. Can I add it?

---

include/linux/sched.h | 4 ++
include/linux/topology.h | 1
kernel/sched_fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 65 insertions(+), 4 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -799,7 +799,7 @@ enum cpu_idle_type {
#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */

enum powersavings_balance_level {
@@ -834,6 +834,8 @@ static inline int sd_balance_for_package
return SD_PREFER_SIBLING;
}

+extern int __weak arch_sd_sibiling_asym_packing(void);
+
/*
* Optimise SD flags for power savings:
* SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
Index: linux-2.6-ozlabs/include/linux/topology.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/topology.h
+++ linux-2.6-ozlabs/include/linux/topology.h
@@ -102,6 +102,7 @@ int arch_update_cpu_topology(void);
| 1*SD_SHARE_PKG_RESOURCES \
| 0*SD_SERIALIZE \
| 0*SD_PREFER_SIBLING \
+ | arch_sd_sibiling_asym_packing() \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2493,6 +2493,31 @@ static inline void update_sg_lb_stats(st
}

/**
+ * update_sd_pick_busiest - return 1 on busiest
+ */
+static int update_sd_pick_busiest(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ struct sched_group *sg,
+ struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity)
+ return 1;
+
+ if (sgs->group_imb)
+ return 1;
+
+ if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
+ if (!sds->busiest)
+ return 1;
+
+ if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
+ return 1;
+ }
+
+ return 0;
+}
+
+/**
* update_sd_lb_stats - Update sched_group's statistics for load balancing.
* @sd: sched_domain whose statistics are to be updated.
* @this_cpu: Cpu for which load balance is currently performed.
@@ -2546,9 +2571,8 @@ static inline void update_sd_lb_stats(st
sds->this = group;
sds->this_nr_running = sgs.sum_nr_running;
sds->this_load_per_task = sgs.sum_weighted_load;
- } else if (sgs.avg_load > sds->max_load &&
- (sgs.sum_nr_running > sgs.group_capacity ||
- sgs.group_imb)) {
+ } else if (sgs.avg_load >= sds->max_load &&
+ update_sd_pick_busiest(sd, sds, group, &sgs)) {
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
@@ -2562,6 +2586,36 @@ static inline void update_sd_lb_stats(st
} while (group != sd->groups);
}

+int __weak arch_sd_sibiling_asym_packing(void)
+{
+ return 0*SD_ASYM_PACKING;
+}
+
+/**
+ * check_asym_packing - Check to see if we the group is packed into
+ * the sched doman
+ */
+static int check_asym_packing(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ int this_cpu, unsigned long *imbalance)
+{
+ int busiest_cpu;
+
+ if (!(sd->flags & SD_ASYM_PACKING))
+ return 0;
+
+ if (!sds->busiest)
+ return 0;
+
+ busiest_cpu = group_first_cpu(sds->busiest);
+ if (this_cpu > busiest_cpu)
+ return 0;
+
+ *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+ SCHED_LOAD_SCALE);
+ return 1;
+}
+
/**
* fix_small_imbalance - Calculate the minor imbalance that exists
* amongst the groups of a sched_domain, during
@@ -2754,6 +2808,10 @@ find_busiest_group(struct sched_domain *
if (!(*balance))
goto ret;

+ if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
+ check_asym_packing(sd, &sds, this_cpu, imbalance))
+ return sds.busiest;
+
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;

2010-04-09 06:21:55

by Michael Neuling

Subject: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance()

need_active_balance() gates the asymmetric packing balance due to its
power save logic, but for packing we don't care about power saving.

This marks the type of balance we are attempting to perform from
f_b_g() and stops the need_active_balance() power save logic from
gating a balance in the asymmetric packing case.

Signed-off-by: Michael Neuling <[email protected]>

---

kernel/sched_fair.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -91,6 +91,13 @@ const_debug unsigned int sysctl_sched_mi

static const struct sched_class fair_sched_class;

+enum balance_type {
+ BALANCE_NONE = 0,
+ BALANCE_LOAD,
+ BALANCE_POWER,
+ BALANCE_PACKING
+};
+
/**************************************************************
* CFS operations on generic schedulable entities:
*/
@@ -2783,7 +2790,8 @@ static inline void calculate_imbalance(s
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
- int *sd_idle, const struct cpumask *cpus, int *balance)
+ int *sd_idle, const struct cpumask *cpus, int *balance,
+ enum balance_type *bt)
{
struct sd_lb_stats sds;

@@ -2808,6 +2816,7 @@ find_busiest_group(struct sched_domain *
if (!(*balance))
goto ret;

+ *bt = BALANCE_PACKING;
if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
check_asym_packing(sd, &sds, this_cpu, imbalance))
return sds.busiest;
@@ -2828,6 +2837,7 @@ find_busiest_group(struct sched_domain *

/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
+ *bt = BALANCE_LOAD;
return sds.busiest;

out_balanced:
@@ -2835,10 +2845,12 @@ out_balanced:
* There is no obvious imbalance. But check if we can do some balancing
* to save power.
*/
+ *bt = BALANCE_POWER;
if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
return sds.busiest;
ret:
*imbalance = 0;
+ *bt = BALANCE_NONE;
return NULL;
}

@@ -2899,9 +2911,10 @@ find_busiest_queue(struct sched_group *g
/* Working cpumask for load_balance and load_balance_newidle. */
static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
+static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
+ enum balance_type *bt)
{
- if (idle == CPU_NEWLY_IDLE) {
+ if (idle == CPU_NEWLY_IDLE && *bt != BALANCE_PACKING) {
/*
* The only task running in a non-idle cpu can be moved to this
* cpu in an attempt to completely freeup the other CPU
@@ -2946,6 +2959,7 @@ static int load_balance(int this_cpu, st
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
+ enum balance_type bt;

cpumask_copy(cpus, cpu_active_mask);

@@ -2964,7 +2978,7 @@ static int load_balance(int this_cpu, st
redo:
update_shares(sd);
group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
- cpus, balance);
+ cpus, balance, &bt);

if (*balance == 0)
goto out_balanced;
@@ -3018,7 +3032,7 @@ redo:
schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;

- if (need_active_balance(sd, sd_idle, idle)) {
+ if (need_active_balance(sd, sd_idle, idle, &bt)) {
raw_spin_lock_irqsave(&busiest->lock, flags);

/* don't kick the migration_thread, if the curr

2010-04-09 06:21:22

by Michael Neuling

Subject: [PATCH 1/5] sched: fix capacity calculations for SMT4

When calculating capacity we use the following calculation:

capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);

In SMT2, power will be 1178/2 (provided we are not scaling power with
freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
being 1.

With SMT4 however, power will be 1178/4, hence capacity will end up as
0.

Fix this by ensuring capacity is always at least 1 after this
calculation.

Signed-off-by: Michael Neuling <[email protected]>
---
I'm not sure this is the correct fix but this works for me.
Original post here:
http://lkml.org/lkml/2010/3/30/884
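
For reference, a quick userspace check of the rounding described above
(sketch only; 1178 is the example cpu_power figure from the changelog
and the DIV_ROUND_CLOSEST definition mirrors the kernel's simple form):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, divisor)	(((x) + ((divisor) / 2)) / (divisor))

int main(void)
{
	unsigned long smt2_power = 1178 / 2;	/* 589 per thread */
	unsigned long smt4_power = 1178 / 4;	/* 294 per thread */

	/* prints "SMT2 capacity: 1, SMT4 capacity: 0" */
	printf("SMT2 capacity: %lu, SMT4 capacity: %lu\n",
	       DIV_ROUND_CLOSEST(smt2_power, SCHED_LOAD_SCALE),
	       DIV_ROUND_CLOSEST(smt4_power, SCHED_LOAD_SCALE));
	return 0;
}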

---

kernel/sched_fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
}

capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ capacity = max(capacity, 1UL);

if (tmp->flags & SD_POWERSAVINGS_BALANCE)
nr_running /= 2;
@@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st

sgs->group_capacity =
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+ sgs->group_capacity = max(sgs->group_capacity, 1UL);
}

/**
@@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g

for_each_cpu(i, sched_group_cpus(group)) {
unsigned long power = power_of(i);
- unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ unsigned long capacity;
unsigned long wl;

+ capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
+
if (!cpumask_test_cpu(i, cpus))
continue;

2010-04-09 06:48:51

by Michael Neuling

Subject: Re: [PATCH 3/5] powerpc: enabled asymmetric SMT scheduling on POWER7



In message <[email protected]> you wrote:
> The POWER7 core has dynamic SMT mode switching which is controlled by
> the hypervisor. There are 3 SMT modes:
> SMT1 uses thread 0
> SMT2 uses threads 0 & 1
> SMT4 uses threads 0, 1, 2 & 3
> When in any particular SMT mode, all threads have the same performance
> as each other (ie. at any moment in time, all threads perform the same).
>
> The SMT mode switching works such that when linux has threads 2 & 3 idle
> and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the
> idle loop and the hypervisor will automatically switch to SMT2 for that
> core (independent of other cores). The opposite is not true, so if
> threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.
>
> Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
> into SMT1 mode.
>
> If we can get the core into a lower SMT mode (SMT1 is best), the threads
> will perform better (since they share less core resources). Hence when
> we have idle threads, we want them to be the higher ones.
>
> This adds a feature bit for asymmetric packing to powerpc and then
> enables it on POWER7.
>
> Signed-off-by: Michael Neuling <[email protected]>
>
> ---
>
> arch/powerpc/include/asm/cputable.h | 3 ++-
> arch/powerpc/kernel/process.c | 13 +++++++++++++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-ozlabs/arch/powerpc/include/asm/cputable.h
> ===================================================================
> --- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/cputable.h
> +++ linux-2.6-ozlabs/arch/powerpc/include/asm/cputable.h
> @@ -195,6 +195,7 @@ extern const char *powerpc_base_platform
> #define CPU_FTR_SAO LONG_ASM_CONST(0x0020000000000000)
> #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x0040000000000000)
> #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x0080000000000000)
> +#define CPU_FTR_ASYM_SMT LONG_ASM_CONST(0x0100000000000000)
>
> #ifndef __ASSEMBLY__
>
> @@ -409,7 +410,7 @@ extern const char *powerpc_base_platform
> CPU_FTR_MMCRA | CPU_FTR_SMT | \
> CPU_FTR_COHERENT_ICACHE | CPU_FTR_LOCKLESS_TLBIE | \
> CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
> - CPU_FTR_DSCR | CPU_FTR_SAO)
> + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT)
> #define CPU_FTRS_CELL (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
> CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
> CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
> Index: linux-2.6-ozlabs/arch/powerpc/kernel/process.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/arch/powerpc/kernel/process.c
> +++ linux-2.6-ozlabs/arch/powerpc/kernel/process.c
> @@ -1265,3 +1265,16 @@ unsigned long randomize_et_dyn(unsigned
>
> return ret;
> }
> +
> +unsigned char asym_smt_print = 1;
> +int arch_sd_sibiling_asym_packing(void)
> +{
> + if (cpu_has_feature(CPU_FTR_ASYM_SMT)){
> + if (asym_smt_print) {
> + pr_info("Enabling Asymetric SMT scheduling\n");
> + asym_smt_print = 0;
> + }

Oops... before anyone else laughs at me, I'm changing this to use
printk_once.
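
Roughly what the printk_once variant would look like (untested sketch,
keeping the rest of the hunk as posted):

int arch_sd_sibiling_asym_packing(void)
{
	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
		return SD_ASYM_PACKING;
	}
	return 0;
}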

Mikey

> + return SD_ASYM_PACKING;
> + }
> + return 0;
> +}

2010-04-13 12:29:38

by Peter Zijlstra

Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> When calculating capacity we use the following calculation:
>
> capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
>
> In SMT2, power will be 1178/2 (provided we are not scaling power with
> freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> being 1.
>
> With SMT4 however, power will be 1178/4, hence capacity will end up as
> 0.
>
> Fix this by ensuring capacity is always at least 1 after this
> calculation.
>
> Signed-off-by: Michael Neuling <[email protected]>
> ---
> I'm not sure this is the correct fix but this works for me.

Right, so I suspect this will indeed break some things.

We initially allowed 0 capacity for when a cpu is consumed by an RT task
and there simply isn't much capacity left, in that case you really want
to try and move load to your sibling cpus if possible.

However you're right that this goes awry in your case.

One thing to look at is whether that 15% increase is indeed
representative for the power7 cpu; it having 4 SMT threads seems to
suggest there were significant gains, otherwise they'd not have wasted
the silicon.

(The broken x86 code was meant to actually compute the SMT gain, so that
we'd not have to guess the 15%)

Now, increasing this will only marginally fix the issue, since if you
end up with 512 per thread it only takes a very tiny amount of RT
workload to drop below and end up at 0 again.
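
A quick check of that cliff with the same rounding (sketch only, the
macro copied in its simple form):

#include <assert.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, divisor)	(((x) + ((divisor) / 2)) / (divisor))

int main(void)
{
	assert(DIV_ROUND_CLOSEST(512UL, SCHED_LOAD_SCALE) == 1);	/* full 512/thread */
	assert(DIV_ROUND_CLOSEST(511UL, SCHED_LOAD_SCALE) == 0);	/* any RT usage at all */
	return 0;
}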

One thing we could look at is using the cpu base power to compute
capacity from. We'd have to add another field to sched_group and store
power before we do the scale_rt_power() stuff.

Thoughts?


> kernel/sched_fair.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
> }
>
> capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> + capacity = max(capacity, 1UL);
>
> if (tmp->flags & SD_POWERSAVINGS_BALANCE)
> nr_running /= 2;
> @@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
>
> sgs->group_capacity =
> DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
> + sgs->group_capacity = max(sgs->group_capacity, 1UL);
> }
>
> /**
> @@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
>
> for_each_cpu(i, sched_group_cpus(group)) {
> unsigned long power = power_of(i);
> - unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> + unsigned long capacity;
> unsigned long wl;
>
> + capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
> +
> if (!cpumask_test_cpu(i, cpus))
> continue;
>

2010-04-13 12:30:12

by Peter Zijlstra

Subject: Re: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance()

On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> need_active_balance() gates the asymmetric packing based due to power
> save logic, but for packing we don't care.

This explanation lacks a how/why.

So the problem is that need_active_balance() ends up returning false and
prevents the active balance from pulling a task to a lower available SMT
sibling?

> This marks the type of balanace we are attempting to do perform from
> f_b_g() and stops need_active_balance() power save logic gating a
> balance in the asymmetric packing case.

At the very least this wants more comments in the code. I'm not really
charmed by having to add yet another variable to pass around that mess,
but I can't seem to come up with something cleaner either.

> Signed-off-by: Michael Neuling <[email protected]>
>
> ---
>
> kernel/sched_fair.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -91,6 +91,13 @@ const_debug unsigned int sysctl_sched_mi
>
> static const struct sched_class fair_sched_class;
>
> +enum balance_type {
> + BALANCE_NONE = 0,
> + BALANCE_LOAD,
> + BALANCE_POWER,
> + BALANCE_PACKING
> +};
> +
> /**************************************************************
> * CFS operations on generic schedulable entities:
> */
> @@ -2783,7 +2790,8 @@ static inline void calculate_imbalance(s
> static struct sched_group *
> find_busiest_group(struct sched_domain *sd, int this_cpu,
> unsigned long *imbalance, enum cpu_idle_type idle,
> - int *sd_idle, const struct cpumask *cpus, int *balance)
> + int *sd_idle, const struct cpumask *cpus, int *balance,
> + enum balance_type *bt)
> {
> struct sd_lb_stats sds;
>
> @@ -2808,6 +2816,7 @@ find_busiest_group(struct sched_domain *
> if (!(*balance))
> goto ret;
>
> + *bt = BALANCE_PACKING;
> if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
> check_asym_packing(sd, &sds, this_cpu, imbalance))
> return sds.busiest;
> @@ -2828,6 +2837,7 @@ find_busiest_group(struct sched_domain *
>
> /* Looks like there is an imbalance. Compute it */
> calculate_imbalance(&sds, this_cpu, imbalance);
> + *bt = BALANCE_LOAD;
> return sds.busiest;
>
> out_balanced:
> @@ -2835,10 +2845,12 @@ out_balanced:
> * There is no obvious imbalance. But check if we can do some balancing
> * to save power.
> */
> + *bt = BALANCE_POWER;
> if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
> return sds.busiest;
> ret:
> *imbalance = 0;
> + *bt = BALANCE_NONE;
> return NULL;
> }
>
> @@ -2899,9 +2911,10 @@ find_busiest_queue(struct sched_group *g
> /* Working cpumask for load_balance and load_balance_newidle. */
> static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
>
> -static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
> +static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
> + enum balance_type *bt)
> {
> - if (idle == CPU_NEWLY_IDLE) {
> + if (idle == CPU_NEWLY_IDLE && *bt != BALANCE_PACKING) {
> /*
> * The only task running in a non-idle cpu can be moved to this
> * cpu in an attempt to completely freeup the other CPU
> @@ -2946,6 +2959,7 @@ static int load_balance(int this_cpu, st
> struct rq *busiest;
> unsigned long flags;
> struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
> + enum balance_type bt;
>
> cpumask_copy(cpus, cpu_active_mask);
>
> @@ -2964,7 +2978,7 @@ static int load_balance(int this_cpu, st
> redo:
> update_shares(sd);
> group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
> - cpus, balance);
> + cpus, balance, &bt);
>
> if (*balance == 0)
> goto out_balanced;
> @@ -3018,7 +3032,7 @@ redo:
> schedstat_inc(sd, lb_failed[idle]);
> sd->nr_balance_failed++;
>
> - if (need_active_balance(sd, sd_idle, idle)) {
> + if (need_active_balance(sd, sd_idle, idle, &bt)) {
> raw_spin_lock_irqsave(&busiest->lock, flags);
>
> /* don't kick the migration_thread, if the curr

2010-04-13 12:29:39

by Peter Zijlstra

Subject: Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> With the asymmetric packing infrastructure, fix_small_imbalance is
> causing idle higher threads to pull tasks off lower threads.
>
> This is being caused by an off-by-one error.
>
> Signed-off-by: Michael Neuling <[email protected]>
> ---
> I'm not sure this is the right fix but without it, higher threads pull
> tasks off the lower threads, then the packing pulls it back down, etc
> etc and tasks bounce around constantly.

Would help if you expand upon the why/how it manages to get pulled up.

I can't immediately spot anything wrong with the patch, but then that
isn't my favourite piece of code either.. Suresh, any comments?

> ---
>
> kernel/sched_fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6-ozlabs/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> +++ linux-2.6-ozlabs/kernel/sched_fair.c
> @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
> * SCHED_LOAD_SCALE;
> scaled_busy_load_per_task /= sds->busiest->cpu_power;
>
> - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
> (scaled_busy_load_per_task * imbn)) {
> *imbalance = sds->busiest_load_per_task;
> return;

2010-04-13 12:30:08

by Peter Zijlstra

Subject: Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain

On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> Peter: Since this is based mainly off your initial patch, it should
> have your signed-off-by too, but I didn't want to add without your
> permission. Can I add it?

Of course! :-)

This thing does need a better changelog though, and maybe a larger
comment with check_asym_packing(), explaining why and what we're doing
and what we're assuming (that lower cpu number also means lower thread
number).

2010-04-14 01:32:22

by Suresh Siddha

Subject: Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

On Tue, 2010-04-13 at 05:29 -0700, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > With the asymmetric packing infrastructure, fix_small_imbalance is
> > causing idle higher threads to pull tasks off lower threads.
> >
> > This is being caused by an off-by-one error.
> >
> > Signed-off-by: Michael Neuling <[email protected]>
> > ---
> > I'm not sure this is the right fix but without it, higher threads pull
> > tasks off the lower threads, then the packing pulls it back down, etc
> > etc and tasks bounce around constantly.
>
> Would help if you expand upon the why/how it manages to get pulled up.
>
> I can't immediately spot anything wrong with the patch, but then that
> isn't my favourite piece of code either.. Suresh, any comments?
>

Sorry, I didn't pay much attention to this patchset. But based on the
comments from Michael and looking at the patches, it has SMT/MC
implications. I will review and run some tests and get back in a day.

As far as this particular patch is concerned, the original code comes
from Ingo's original CFS code commit (dd41f596) and the below hunk
pretty much explains what the change is about.

- if (max_load - this_load >= busiest_load_per_task * imbn) {
+ if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+ busiest_load_per_task * imbn) {

So the below proposed change will probably break what the above
mentioned commit was trying to achieve, which is: for fairness reasons
we were bouncing the small extra load (between the max_load and
this_load) around.

> > ---
> >
> > kernel/sched_fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux-2.6-ozlabs/kernel/sched_fair.c
> > ===================================================================
> > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> > +++ linux-2.6-ozlabs/kernel/sched_fair.c
> > @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
> > * SCHED_LOAD_SCALE;
> > scaled_busy_load_per_task /= sds->busiest->cpu_power;
> >
> > - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> > + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
> > (scaled_busy_load_per_task * imbn)) {
> > *imbalance = sds->busiest_load_per_task;
> > return;
>

thanks,
suresh

2010-04-14 04:28:39

by Michael Neuling

Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

In message <1271161766.4807.1280.camel@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > When calculating capacity we use the following calculation:
> >
> > capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
> >
> > In SMT2, power will be 1178/2 (provided we are not scaling power with
> > freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
> > being 1.
> >
> > With SMT4 however, power will be 1178/4, hence capacity will end up as
> > 0.
> >
> > Fix this by ensuring capacity is always at least 1 after this
> > calculation.
> >
> > Signed-off-by: Michael Neuling <[email protected]>
> > ---
> > I'm not sure this is the correct fix but this works for me.
>
> Right, so I suspect this will indeed break some things.
>
> We initially allowed 0 capacity for when a cpu is consumed by an RT task
> and there simply isn't much capacity left, in that case you really want
> to try and move load to your sibling cpus if possible.

Changing the CPU power based on what tasks are running on them seems a
bit wrong to me. Shouldn't we keep those concepts separate?

> However you're right that this goes awry in your case.
>
> One thing to look at is if that 15% increase is indeed representative
> for the power7 cpu, it having 4 SMT threads seems to suggest there was
> significant gains, otherwise they'd not have wasted the silicon.

There are certainly, for most workloads, per core gains for SMT4 over
SMT2 on P7. My kernels certainly compile faster and that's the only
workload anyone who matters cares about.... ;-)

> (The broken x86 code was meant to actually compute the SMT gain, so that
> we'd not have to guess the 15%)
>
> Now, increasing this will only marginally fix the issue, since if you
> end up with 512 per thread it only takes a very tiny amount of RT
> workload to drop below and end up at 0 again.

I initially tried to make smt_gain programmable and at 2048 per core
(512 per thread), the packing became unstable, as you intimated.

> One thing we could look at is using the cpu base power to compute
> capacity from. We'd have to add another field to sched_group and store
> power before we do the scale_rt_power() stuff.

Separating capacity from what RT tasks are running seems like a good
idea to me.

This would fix the RT issue, but it's not clear to me how you are
suggesting fixing the rounding down to 0 SMT4 issue. Are you suggesting
we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate
the RT tasks out from capacity and keep the max(1, capacity) that I've
added? Or something else?

Would another possibility be changing capacity to a scaled value (like
cpu_power is now) rather than the small integer it is now? For
example, a scaled capacity of 1024 would be equivalent to a capacity of
1 now. This might enable us to handle partial capacities better? We'd
probably have to scale a bunch of nr_running too.

Mikey

2010-04-14 06:09:34

by Michael Neuling

Subject: Re: [PATCH 2/5] sched: add asymmetric packing option for sibling domain

In message <1271161767.4807.1281.camel@twins> you wrote:
> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > Peter: Since this is based mainly off your initial patch, it should
> > have your signed-off-by too, but I didn't want to add without your
> > permission. Can I add it?
>
> Of course! :-)
>
> This thing does need a better changelog though, and maybe a larger
> comment with check_asym_packing(), explaining why and what we're doing
> and what we're assuming (that lower cpu number also means lower thread
> number).
>

OK, updated patch below...

Mikey


[PATCH 2/5] sched: add asymmetric group packing option for sibling domain

Check to see if the group is packed in a sched domain.

This is primarily intended to be used at the sibling level. Some cores
like POWER7 prefer to use lower numbered SMT threads. In the case of
POWER7, it can move to lower SMT modes only when higher threads are
idle. When in lower SMT modes, the threads will perform better since
they share less core resources. Hence when we have idle threads, we
want them to be the higher ones.

This adds a hook into f_b_g() called check_asym_packing() to check the
packing. This packing function is run on idle threads. It checks to
see if the busiest CPU in this domain (core in the P7 case) has a
higher CPU number than the one the packing function is being run on.
If it does, it calculates the imbalance and returns the busier,
higher-numbered group as the busiest to f_b_g(). Here we are assuming
a lower CPU number will be equivalent to a lower SMT thread number.

It also creates a new SD_ASYM_PACKING flag to enable this feature at
any scheduler domain level.

It also creates an arch hook to enable this feature at the sibling
level. The default function doesn't enable this feature.

Based heavily on patch from Peter Zijlstra.

Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 4 +-
include/linux/topology.h | 1
kernel/sched_fair.c | 93 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 94 insertions(+), 4 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -799,7 +799,7 @@ enum cpu_idle_type {
#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */

enum powersavings_balance_level {
@@ -834,6 +834,8 @@ static inline int sd_balance_for_package
return SD_PREFER_SIBLING;
}

+extern int __weak arch_sd_sibiling_asym_packing(void);
+
/*
* Optimise SD flags for power savings:
* SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
Index: linux-2.6-ozlabs/include/linux/topology.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/topology.h
+++ linux-2.6-ozlabs/include/linux/topology.h
@@ -102,6 +102,7 @@ int arch_update_cpu_topology(void);
| 1*SD_SHARE_PKG_RESOURCES \
| 0*SD_SERIALIZE \
| 0*SD_PREFER_SIBLING \
+ | arch_sd_sibiling_asym_packing() \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2493,6 +2493,39 @@ static inline void update_sg_lb_stats(st
}

/**
+ * update_sd_pick_busiest - return 1 on busiest group
+ * @sd: sched_domain whose statistics are to be checked
+ * @sds: sched_domain statistics
+ * @sg: sched_group candidate to be checked for being the busiest
+ * @sds: sched_group statistics
+ *
+ * This returns 1 for the busiest group. If asymmetric packing is
+ * enabled and we already have a busiest, but this candidate group has
+ * a higher cpu number than the current busiest, pick this sg.
+ */
+static int update_sd_pick_busiest(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ struct sched_group *sg,
+ struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity)
+ return 1;
+
+ if (sgs->group_imb)
+ return 1;
+
+ if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
+ if (!sds->busiest)
+ return 1;
+
+ if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
+ return 1;
+ }
+
+ return 0;
+}
+
+/**
* update_sd_lb_stats - Update sched_group's statistics for load balancing.
* @sd: sched_domain whose statistics are to be updated.
* @this_cpu: Cpu for which load balance is currently performed.
@@ -2546,9 +2579,8 @@ static inline void update_sd_lb_stats(st
sds->this = group;
sds->this_nr_running = sgs.sum_nr_running;
sds->this_load_per_task = sgs.sum_weighted_load;
- } else if (sgs.avg_load > sds->max_load &&
- (sgs.sum_nr_running > sgs.group_capacity ||
- sgs.group_imb)) {
+ } else if (sgs.avg_load >= sds->max_load &&
+ update_sd_pick_busiest(sd, sds, group, &sgs)) {
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
@@ -2562,6 +2594,57 @@ static inline void update_sd_lb_stats(st
} while (group != sd->groups);
}

+int __weak arch_sd_sibiling_asym_packing(void)
+{
+ return 0*SD_ASYM_PACKING;
+}
+
+/**
+ * check_asym_packing - Check to see if the group is packed into the
+ * sched doman.
+ *
+ * This is primarily intended to used at the sibling level. Some
+ * cores like POWER7 prefer to use lower numbered SMT threads. In the
+ * case of POWER7, it can move to lower SMT modes only when higher
+ * threads are idle. When in lower SMT modes, the threads will
+ * perform better since they share less core resources. Hence when we
+ * have idle threads, we want them to be the higher ones.
+ *
+ * This packing function is run on idle threads. It checks to see if
+ * the busiest CPU in this domain (core in the P7 case) has a higher
+ * CPU number than the packing function is being run on. Here we are
+ * assuming lower CPU number will be equivalent to lower a SMT thread
+ * number.
+ *
+ * @sd: The sched_domain whose packing is to be checked.
+ * @sds: Statistics of the sched_domain which is to be packed
+ * @this_cpu: The cpu at whose sched_domain we're performing load-balance.
+ * @imbalance: returns amount of imbalanced due to packing.
+ *
+ * Returns 1 when packing is required and a task should be moved to
+ * this CPU. The amount of the imbalance is returned in *imbalance.
+ */
+static int check_asym_packing(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ int this_cpu, unsigned long *imbalance)
+{
+ int busiest_cpu;
+
+ if (!(sd->flags & SD_ASYM_PACKING))
+ return 0;
+
+ if (!sds->busiest)
+ return 0;
+
+ busiest_cpu = group_first_cpu(sds->busiest);
+ if (this_cpu > busiest_cpu)
+ return 0;
+
+ *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->cpu_power,
+ SCHED_LOAD_SCALE);
+ return 1;
+}
+
/**
* fix_small_imbalance - Calculate the minor imbalance that exists
* amongst the groups of a sched_domain, during
@@ -2754,6 +2837,10 @@ find_busiest_group(struct sched_domain *
if (!(*balance))
goto ret;

+ if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
+ check_asym_packing(sd, &sds, this_cpu, imbalance))
+ return sds.busiest;
+
if (!sds.busiest || sds.busiest_nr_running == 0)
goto out_balanced;

2010-04-15 04:15:16

by Michael Neuling

Subject: Re: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance()

> On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > need_active_balance() gates the asymmetric packing based due to power
> > save logic, but for packing we don't care.
>
> This explanation lacks a how/why.
>
> So the problem is that need_active_balance() ends up returning false and
> prevents the active balance from pulling a task to a lower available SMT
> sibling?

Correct. I've put a more detailed description in the patch below.

> > This marks the type of balanace we are attempting to do perform from
> > f_b_g() and stops need_active_balance() power save logic gating a
> > balance in the asymmetric packing case.
>
> At the very least this wants more comments in the code.

Sorry again for the lackluster comments. I've updated this patch also.

> I'm not really charmed by having to add yet another variable to pass
> around that mess, but I can't seem to come up with something cleaner
> either.

Yeah, the current code only ever reads the balance type in the !=
BALANCE_PACKING check, so a full enum might be overkill, but I thought
it might come in useful for someone else.

Updated patch below.

Mikey


[PATCH 4/5] sched: fix need_active_balance() from preventing asymmetric packing

need_active_balance() prevents a task being pulled onto a newly idle
package in an attempt to completely free it so it can be powered down.
Hence it returns false to load_balance() and prevents the active
balance from occurring.

Unfortunately, when asymmetric packing is enabled at the sibling level
this power save logic is preventing the packing balance from moving a
task to a lower idle thread. At the sibling level SD_SHARE_CPUPOWER
and parent(SD_POWERSAVINGS_BALANCE) are enabled and the domain is also
non-idle (since we have at least 1 task we are trying to move down).
Hence the following code prevents an active balance from occurring:

if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return 0;

To fix this, this patch classifies the type of balance we are
attempting to perform into none, load, power and packing, based on
which check finds the busiest group in f_b_g(). This classification is
then used
by need_active_balance() to prevent the above power saving logic from
stopping a balance due to asymmetric packing. This ensures tasks can
be correctly moved down to lower sibling threads.

Signed-off-by: Michael Neuling <[email protected]>
---

kernel/sched_fair.c | 35 ++++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -91,6 +91,14 @@ const_debug unsigned int sysctl_sched_mi

static const struct sched_class fair_sched_class;

+/* Enum to classify the type of balance we are attempting to perform */
+enum balance_type {
+ BALANCE_NONE = 0,
+ BALANCE_LOAD,
+ BALANCE_POWER,
+ BALANCE_PACKING
+};
+
/**************************************************************
* CFS operations on generic schedulable entities:
*/
@@ -2803,16 +2811,19 @@ static inline void calculate_imbalance(s
* @cpus: The set of CPUs under consideration for load-balancing.
* @balance: Pointer to a variable indicating if this_cpu
* is the appropriate cpu to perform load balancing at this_level.
+ * @bt: returns the type of imbalance found
*
* Returns: - the busiest group if imbalance exists.
* - If no imbalance and user has opted for power-savings balance,
* return the least loaded group whose CPUs can be
* put to idle by rebalancing its tasks onto our group.
+ * - *bt classifies the type of imbalance found
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
unsigned long *imbalance, enum cpu_idle_type idle,
- int *sd_idle, const struct cpumask *cpus, int *balance)
+ int *sd_idle, const struct cpumask *cpus, int *balance,
+ enum balance_type *bt)
{
struct sd_lb_stats sds;

@@ -2837,6 +2848,7 @@ find_busiest_group(struct sched_domain *
if (!(*balance))
goto ret;

+ *bt = BALANCE_PACKING;
if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) &&
check_asym_packing(sd, &sds, this_cpu, imbalance))
return sds.busiest;
@@ -2857,6 +2869,7 @@ find_busiest_group(struct sched_domain *

/* Looks like there is an imbalance. Compute it */
calculate_imbalance(&sds, this_cpu, imbalance);
+ *bt = BALANCE_LOAD;
return sds.busiest;

out_balanced:
@@ -2864,10 +2877,12 @@ out_balanced:
* There is no obvious imbalance. But check if we can do some balancing
* to save power.
*/
+ *bt = BALANCE_POWER;
if (check_power_save_busiest_group(&sds, this_cpu, imbalance))
return sds.busiest;
ret:
*imbalance = 0;
+ *bt = BALANCE_NONE;
return NULL;
}

@@ -2928,9 +2943,18 @@ find_busiest_queue(struct sched_group *g
/* Working cpumask for load_balance and load_balance_newidle. */
static DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
+static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle,
+ enum balance_type *bt)
{
- if (idle == CPU_NEWLY_IDLE) {
+ /*
+ * The powersave code will stop a task being moved in an
+ * attempt to freeup CPU package wich could be powered
+ * down. In the case where we are attempting to balance due to
+ * asymmetric packing at the sibling level, we don't care
+ * about power save. Hence prevent powersave stopping a
+ * balance trigged by packing.
+ */
+ if (idle == CPU_NEWLY_IDLE && *bt != BALANCE_PACKING) {
/*
* The only task running in a non-idle cpu can be moved to this
* cpu in an attempt to completely freeup the other CPU
@@ -2975,6 +2999,7 @@ static int load_balance(int this_cpu, st
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
+ enum balance_type bt;

cpumask_copy(cpus, cpu_active_mask);

@@ -2993,7 +3018,7 @@ static int load_balance(int this_cpu, st
redo:
update_shares(sd);
group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
- cpus, balance);
+ cpus, balance, &bt);

if (*balance == 0)
goto out_balanced;
@@ -3047,7 +3072,7 @@ redo:
schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;

- if (need_active_balance(sd, sd_idle, idle)) {
+ if (need_active_balance(sd, sd_idle, idle, &bt)) {
raw_spin_lock_irqsave(&busiest->lock, flags);

/* don't kick the migration_thread, if the curr

2010-04-15 05:06:35

by Michael Neuling

Subject: Re: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing

In message <[email protected]> you wrote:
> On Tue, 2010-04-13 at 05:29 -0700, Peter Zijlstra wrote:
> > On Fri, 2010-04-09 at 16:21 +1000, Michael Neuling wrote:
> > > With the asymmetric packing infrastructure, fix_small_imbalance is
> > > causing idle higher threads to pull tasks off lower threads.
> > >
> > > This is being caused by an off-by-one error.
> > >
> > > Signed-off-by: Michael Neuling <[email protected]>
> > > ---
> > > I'm not sure this is the right fix but without it, higher threads pull
> > > tasks off the lower threads, then the packing pulls it back down, etc
> > > etc and tasks bounce around constantly.
> >
> > Would help if you expand upon the why/how it manages to get pulled up.
> >
> > I can't immediately spot anything wrong with the patch, but then that
> > isn't my favourite piece of code either.. Suresh, any comments?
> >
>
> Sorry didn't pay much attention to this patchset. But based on the
> comments from Michael and looking at this patchset, it has SMT/MC
> implications. I will review and run some tests and get back in a day.
>
> As far as this particular patch is concerned, original code is coming
> from Ingo's original CFS code commit (dd41f596) and the below hunk
> pretty much explains what the change is about.
>
> - if (max_load - this_load >= busiest_load_per_task * imbn) {
> + if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> + busiest_load_per_task * imbn) {
>
> So the below proposed change will probably break what the above
> mentioned commit was trying to achieve, which is: for fairness reasons
> we were bouncing the small extra load (between the max_load and
> this_load) around.

Actually, you can drop this patch.

In the process of clarifying why it was needed for the changelog, I
discovered I don't actually need it.

Sorry about that.

Mikey

>
> > > ---
> > >
> > > kernel/sched_fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > Index: linux-2.6-ozlabs/kernel/sched_fair.c
> > > ===================================================================
> > > --- linux-2.6-ozlabs.orig/kernel/sched_fair.c
> > > +++ linux-2.6-ozlabs/kernel/sched_fair.c
> > > @@ -2652,7 +2652,7 @@ static inline void fix_small_imbalance(s
> > > * SCHED_LOAD_SCALE;
> > > scaled_busy_load_per_task /= sds->busiest->cpu_power;
> > >
> > > - if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
> > > + if (sds->max_load - sds->this_load + scaled_busy_load_per_task >
> > > (scaled_busy_load_per_task * imbn)) {
> > > *imbalance = sds->busiest_load_per_task;
> > > return;
> >
>
> thanks,
> suresh
>

2010-04-16 13:58:43

by Peter Zijlstra

Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:

> > Right, so I suspect this will indeed break some things.
> >
> > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > and there simply isn't much capacity left, in that case you really want
> > to try and move load to your sibling cpus if possible.
>
> Changing the CPU power based on what tasks are running on them seems a
> bit wrong to me. Shouldn't we keep those concepts separate?

Well the thing cpu_power represents is a ratio of compute capacity
available to this cpu as compared to other cpus. By normalizing the
runqueue weights with this we end up with a fair balance.

The thing to realize here is that this is solely about SCHED_NORMAL
tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
fairness and available compute capacity.

So if we were to ignore RT tasks, you'd end up with a situation where,
assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
would end up on the cpu the RT tasks would be running on would not run
as fast -- is that fair?

Since RT tasks do not have a weight (FIFO/RR have no limit at all,
DEADLINE would have something equivalent to a max weight), it is
impossible to account them in the normal weight sense.

Therefore the current model takes them into account by lowering the
compute capacity according to their (avg) cpu usage. So if the RT task
would consume 66% cputime, we'd end up with a situation where the cpu
running the RT task would get 1 NORMAL task, and other cpu would have
the remaining 3, that way they'd all get 33% cpu.

> > However you're right that this goes awry in your case.
> >
> > One thing to look at is if that 15% increase is indeed representative
> > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > significant gains, otherwise they'd not have wasted the silicon.
>
> There are certainly, for most workloads, per core gains for SMT4 over
> SMT2 on P7. My kernels certainly compile faster and that's the only
> workload anyone who matters cares about.... ;-)

For sure ;-)

Are there any numbers available on how much they gain? It might be worth
to stick in real numbers instead of this alleged 15%.

> > One thing we could look at is using the cpu base power to compute
> > capacity from. We'd have to add another field to sched_group and store
> > power before we do the scale_rt_power() stuff.
>
> Separating capacity from what RT tasks are running seems like a good
> idea to me.

Well, per the above we cannot fully separate them.

> This would fix the RT issue, but it's not clear to me how you are
> suggesting fixing the rounding down to 0 SMT4 issue. Are you suggesting
> we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate
> the RT tasks out from capacity and keep the max(1, capacity) that I've
> added? Or something else?

I would think that 4 SMT threads are still slower than two full cores,
right? So cpu_power=2048 would not be appropriate.

> Would another possibility be changing capacity a scaled value (like
> cpu_power is now) rather than a small integer as it is now. For
> example, a scaled capacity of 1024 would be equivalent to a capacity of
> 1 now. This might enable us to handle partial capacities better? We'd
> probably have to scale a bunch of nr_running too.

Right, so my proposal was to scale down the capacity divider (currently
1024) to whatever would be the base capacity for that cpu. Trouble
seems to be that that makes group capacity a lot more complex, as you
would end up needing to average the base capacity over all the cpus.


Hrmm, my brain seems muddled but I might have another solution, let me
ponder this for a bit..


2010-04-18 21:34:21

by Michael Neuling

Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4


In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
>
> > > Right, so I suspect this will indeed break some things.
> > >
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> >
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me. Shouldn't we keep those concepts separate?
>
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
>
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
>
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
>
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
>
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and other cpu would have
> the remaining 3, that way they'd all get 33% cpu.

Thanks for the explanation. Your last example makes perfect sense to
me.

>
> > > However you're right that this goes awry in your case.
> > >
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> >
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7. My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
>
> For sure ;-)
>
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.

I can get some gain numbers, but obviously the workload makes a huge
difference. From a scheduler perspective, I assume an
average/representative gain is best rather than an optimistic or
pessimistic one?

We'll have different gains for SMT2 and SMT4, so we could change the
gain dynamically based on which SMT mode we are in. Does that seem like
something we should add as an arch hook?

> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> >
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
>
> Well, per the above we cannot fully separate them.

Yep.

>
> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue. Are you suggesting
> > we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added? Or something else?
>
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.

Yes, for an average workload 4 SMT threads are slower than 2 full
cores.

>
> > Would another possibility be changing capacity to a scaled value (like
> > cpu_power is now) rather than the small integer it is now? For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now. This might enable us to handle partial capacities better? We'd
> > probably have to scale a bunch of nr_running too.
>
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that this makes group capacity a lot more complex, as you would
> end up needing to average the base capacity of all the cpus.
>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Great!

Mikey

2010-04-19 14:49:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > Are there any numbers available on how much they gain? It might be worth
> > to stick in real numbers instead of this alleged 15%.
>
> I can get some gain numbers, but obviously the workload makes a huge
> difference. From a scheduler perspective, I assume an
> average/representative gain is best rather than an optimistic or
> pessimistic one?

Yeah, average would be best.

> We'll have different gains for SMT2 and SMT4, so we could change the
> gain dynamically based on which SMT mode we are in. Does that seem like
> something we should add as an arch hook?

That's the sort of thing you can use arch_scale_smt_power() for. But be
wary not to fall into the same trap I did with x86, where I confused
actual gain with capacity (when idle the actual gain is 0, but the
capacity is not).
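As a rough illustration of the dynamic-gain idea, a kernel-side override
of arch_scale_smt_power() might look something like the sketch below.
The gain numbers and the smt_mode_of() helper are made up for
illustration and are not part of any posted patch; per the caveat above,
whatever is returned should reflect capacity, not the instantaneous gain.

/*
 * Hypothetical sketch only: pick a per-SMT-mode gain instead of the
 * default fixed smt_gain.  smt_mode_of() and the gain values are
 * invented for illustration.
 */
unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
{
	unsigned long weight = cpumask_weight(sched_domain_span(sd));
	unsigned long smt_gain;

	switch (smt_mode_of(cpu)) {		/* hypothetical helper */
	case 2:
		smt_gain = 1130;		/* assumed ~10% gain in SMT2 */
		break;
	case 4:
		smt_gain = 1280;		/* assumed ~25% gain in SMT4 */
		break;
	default:
		smt_gain = SCHED_LOAD_SCALE;	/* SMT1: no gain */
	}

	/* per-thread share of the core's (scaled) power */
	return smt_gain / weight;
}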



2010-04-19 20:45:48

by Michael Neuling

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

In message <1271688543.1488.253.camel@laptop> you wrote:
> On Mon, 2010-04-19 at 07:34 +1000, Michael Neuling wrote:
> > > Are there any numbers available on how much they gain? It might be worth
> > > to stick in real numbers instead of this alleged 15%.
> >
> > I can get some gain numbers, but obviously the workload makes a huge
> > difference. From a scheduler perspective, I assume an
> > average/representative gain is best rather than an optimistic or
> > pessimistic one?
>
> Yeah, average would be best.

Ok.

> > We'll have different gains for SMT2 and SMT4, so we could change the
> > gain dynamically based on which SMT mode we are in. Does that seem like
> > something we should add as an arch hook?
>
> That's the sort of thing you can use arch_scale_smt_power() for. But be
> wary not to fall into the same trap I did with x86, where I confused
> actual gain with capacity (when idle the actual gain is 0, but the
> capacity is not).

Oops, yes of course :-)

<from before>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..

Let me know if/when you come up with this solution, or if I can help.

Mikey

2010-04-30 17:28:49

by Michael Neuling

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

In message <1271426308.1674.429.camel@laptop> you wrote:
> On Wed, 2010-04-14 at 14:28 +1000, Michael Neuling wrote:
>
> > > Right, so I suspect this will indeed break some things.
> > >
> > > We initially allowed 0 capacity for when a cpu is consumed by an RT task
> > > and there simply isn't much capacity left, in that case you really want
> > > to try and move load to your sibling cpus if possible.
> >
> > Changing the CPU power based on what tasks are running on them seems a
> > bit wrong to me. Shouldn't we keep those concepts separate?
>
> Well the thing cpu_power represents is a ratio of compute capacity
> available to this cpu as compared to other cpus. By normalizing the
> runqueue weights with this we end up with a fair balance.
>
> The thing to realize here is that this is solely about SCHED_NORMAL
> tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
> fairness and available compute capacity.
>
> So if we were to ignore RT tasks, you'd end up with a situation where,
> assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
> load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
> would end up on the cpu the RT tasks would be running on would not run
> as fast -- is that fair?
>
> Since RT tasks do not have a weight (FIFO/RR have no limit at all,
> DEADLINE would have something equivalent to a max weight), it is
> impossible to account them in the normal weight sense.
>
> Therefore the current model takes them into account by lowering the
> compute capacity according to their (avg) cpu usage. So if the RT task
> would consume 66% cputime, we'd end up with a situation where the cpu
> running the RT task would get 1 NORMAL task, and the other cpu would
> have the remaining 3; that way they'd all get 33% cpu.
>
> > > However you're right that this goes awry in your case.
> > >
> > > One thing to look at is if that 15% increase is indeed representative
> > > for the power7 cpu, it having 4 SMT threads seems to suggest there was
> > > significant gains, otherwise they'd not have wasted the silicon.
> >
> > There are certainly, for most workloads, per core gains for SMT4 over
> > SMT2 on P7. My kernels certainly compile faster and that's the only
> > workload anyone who matters cares about.... ;-)
>
> For sure ;-)
>
> Are there any numbers available on how much they gain? It might be worth
> to stick in real numbers instead of this alleged 15%.
>
> > > One thing we could look at is using the cpu base power to compute
> > > capacity from. We'd have to add another field to sched_group and store
> > > power before we do the scale_rt_power() stuff.
> >
> > Separating capacity from what RT tasks are running seems like a good
> > idea to me.
>
> Well, per the above we cannot fully separate them.
>
> > This would fix the RT issue, but it's not clear to me how you are
> > suggesting fixing the rounding down to 0 SMT4 issue. Are you suggesting
> > we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate
> > the RT tasks out from capacity and keep the max(1, capacity) that I've
> > added? Or something else?
>
> I would think that 4 SMT threads are still slower than two full cores,
> right? So cpu_power=2048 would not be appropriate.
>
> > Would another possibility be changing capacity to a scaled value (like
> > cpu_power is now) rather than the small integer it is now? For
> > example, a scaled capacity of 1024 would be equivalent to a capacity of
> > 1 now. This might enable us to handle partial capacities better? We'd
> > probably have to scale a bunch of nr_running too.
>
> Right, so my proposal was to scale down the capacity divider (currently
> 1024) to whatever would be the base capacity for that cpu. Trouble seems
> to be that this makes group capacity a lot more complex, as you would
> end up needing to average the base capacity of all the cpus.
>
>
> Hrmm, my brain seems muddled but I might have another solution, let me
> ponder this for a bit..
>

Peter,

Did you manage to get anywhere on this capacity issue?

Mikey

2010-06-01 22:53:04

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

* Peter Zijlstra <[email protected]> [2010-05-31 10:33:16]:

> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> >
> >
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> >
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.
>
> Say the group contains all the SMT siblings, then use the group capacity
> (usually larger than 1024) and then distribute the capacity over the
> group members, preferring CPUs with higher individual cpu_power over
> those with less.
>
> So suppose you've got 4 siblings with cpu_power=294 each, then we assign
> capacity 1 to the first member, and the remaining 153 is insufficient,
> and thus we stop and the rest lives with 0 capacity.
>
> Now take the example that the first sibling would be running a heavy RT
> load, and its cpu_power would be reduced to say, 50, then we still got
> nearly 933 left over the others, which is still sufficient for one
> capacity, but because the first sibling is low, we'll assign it 0 and
> instead assign 1 to the second, again, leaving the third and fourth 0.
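To make the distribution idea just quoted concrete, here is a minimal
userspace sketch. It assumes group capacity is simply the sum of the
sibling cpu_power values, rounded to the closest SCHED_LOAD_SCALE
multiple, with each resulting unit handed to the strongest siblings
first; it is not code from any posted patch.

/*
 * Siblings {50, 294, 294, 294}: group power is 932, which rounds to one
 * unit of capacity, and that unit goes to a strong sibling rather than
 * the RT-loaded one, giving {0,1,0,0}.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))
#define NR_SIBLINGS		4

int main(void)
{
	unsigned long power[NR_SIBLINGS] = { 50, 294, 294, 294 }; /* sibling0 runs heavy RT */
	int capacity[NR_SIBLINGS] = { 0 }, taken[NR_SIBLINGS] = { 0 };
	unsigned long total = 0, units;
	int i, j;

	for (i = 0; i < NR_SIBLINGS; i++)
		total += power[i];				/* 932 for this example */
	units = DIV_ROUND_CLOSEST(total, SCHED_LOAD_SCALE);	/* 1 unit of capacity */

	while (units--) {					/* strongest sibling first */
		int best = -1;

		for (j = 0; j < NR_SIBLINGS; j++)
			if (!taken[j] && (best < 0 || power[j] > power[best]))
				best = j;
		if (best < 0)
			break;
		taken[best] = 1;
		capacity[best] = 1;
	}

	for (i = 0; i < NR_SIBLINGS; i++)			/* prints {0,1,0,0} */
		printf("sibling%d: power=%lu capacity=%d\n", i, power[i], capacity[i]);
	return 0;
}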

Hi Peter,

Thanks for the suggestion.

> If the group were a core group, the total would be much higher and we'd
> likely end up assigning 1 to each before we'd run out of capacity.

This is a tricky case because we are depending upon
DIV_ROUND_CLOSEST to decide whether to flag capacity as 0 or 1. We
will not have any task movement until capacity is depleted to quite a
low value by an RT task. Having a threshold to flag 0/1 instead of
DIV_ROUND_CLOSEST, just as you suggested for the power savings
case, may help here as well to move tasks to other idle cores.

> For power savings, we can lower the threshold and maybe use the maximal
> individual cpu_power in the group to base 1 capacity from.
>
> So, suppose the second example, where sibling0 has 50 and the others
> have 294, you'd end up with a capacity distribution of: {0,1,1,1}.

One challenge here is that if RT tasks run on more than one thread in
this group, we will have slightly different cpu powers. Arranging
them from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity alongside
cpu_power to simplify these thresholds? Whenever we need to scale
group load with cpu power we can take the product of cpu_power and
scale_rt_power, but in the cases where we compute capacity we can
mark a 0 or 1 just based on whether scale_rt_power was less than
SCHED_LOAD_SCALE or not. Alternatively, we can keep cpu_power as
a product of all scaling factors as it is today but also save the
component scale factors, such as scale_rt_power() and
arch_scale_freq_power(), so that they can be used in load balance
decisions.

Basically, in power save balance we would give all threads a capacity
of '1' unless their cpu_power was reduced due to an RT task. Similarly,
in the non-power-save case we can flag 1,0,0,0 unless the first thread
had RT scaling during the last interval.

I am suggesting that we distinguish reductions in cpu_power due to
architectural (hardware DVFS) reasons from those due to RT tasks, so
that it is easy to decide whether moving tasks to a sibling thread or
core can help.
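A minimal sketch of what keeping the component scale factors could look
like follows; it is purely hypothetical and not from any posted patch
(the struct and helper names are invented for illustration).

/*
 * Hypothetical: keep cpu_power as the product of all factors, but also
 * remember the individual components so balance decisions can tell an
 * RT-induced reduction from a frequency (DVFS) one.
 */
struct cpu_power_info {
	unsigned int	cpu_power;	/* product of all factors, as today */
	unsigned int	power_freq;	/* arch_scale_freq_power() component */
	unsigned int	power_rt;	/* scale_rt_power() component */
};

/* flag a thread as 0-capacity only when RT, not DVFS, ate the power */
static inline int rt_reduced(const struct cpu_power_info *p)
{
	return p->power_rt < SCHED_LOAD_SCALE;
}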

--Vaidy

2010-06-03 08:57:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

On Wed, 2010-06-02 at 04:22 +0530, Vaidyanathan Srinivasan wrote:

> > If the group were a core group, the total would be much higher and we'd
> > likely end up assigning 1 to each before we'd run out of capacity.
>
> This is a tricky case because we are depending upon
> DIV_ROUND_CLOSEST to decide whether to flag capacity as 0 or 1. We
> will not have any task movement until capacity is depleted to quite a
> low value by an RT task. Having a threshold to flag 0/1 instead of
> DIV_ROUND_CLOSEST, just as you suggested for the power savings
> case, may help here as well to move tasks to other idle cores.

Right, well we could put the threshold higher than 50%, say 90% or
so.

> > For power savings, we can lower the threshold and maybe use the maximal
> > individual cpu_power in the group to base 1 capacity from.
> >
> > So, suppose the second example, where sibling0 has 50 and the others
> > have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
>
> One challenge here is that if RT tasks run on more than one thread in
> this group, we will have slightly different cpu powers. Arranging
> them from max to min and having a cutoff threshold should work.

Right, like the 90% above.

> Should we keep the RT scaling as a separate entity alongside
> cpu_power to simplify these thresholds? Whenever we need to scale
> group load with cpu power we can take the product of cpu_power and
> scale_rt_power, but in the cases where we compute capacity we can
> mark a 0 or 1 just based on whether scale_rt_power was less than
> SCHED_LOAD_SCALE or not. Alternatively, we can keep cpu_power as
> a product of all scaling factors as it is today but also save the
> component scale factors, such as scale_rt_power() and
> arch_scale_freq_power(), so that they can be used in load balance
> decisions.

Right, so the question is: do we only care about RT, or should capacity
reflect the full asymmetric MP case?

I don't quite see why RT is any different from the other scale factors;
if someone pegged one core at half the frequency of the others you'd
still want it to get 0 capacity so that we only try to populate it on
overload.

> Basically, in power save balance we would give all threads a capacity
> of '1' unless their cpu_power was reduced due to an RT task. Similarly,
> in the non-power-save case we can flag 1,0,0,0 unless the first thread
> had RT scaling during the last interval.
>
> I am suggesting that we distinguish reductions in cpu_power due to
> architectural (hardware DVFS) reasons from those due to RT tasks, so
> that it is easy to decide whether moving tasks to a sibling thread or
> core can help.

For power savings such a special heuristic _might_ make sense.

2010-06-07 15:07:19

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

On Mon, May 31, 2010 at 10:33:16AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 15:58 +0200, Peter Zijlstra wrote:
> >
> >
> > Hrmm, my brain seems muddled but I might have another solution, let me
> > ponder this for a bit..
> >
>
> Right, so the thing I was thinking about is taking the group capacity
> into account when determining the capacity for a single cpu.

Peter,
We are exploring an alternate solution that seems to be working as
expected: basically, allow a capacity of 1 for SMT threads provided there
is no significant influence from RT tasks or frequency scaling. Note that
at the core level capacity is unchanged, so this only affects how tasks
are distributed within a core.

Mike Neuling should post an updated patchset containing this patch
(with more comments added, of course!).


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU.
*/
- unsigned int cpu_power;
+ unsigned int cpu_power, cpu_power_orig;

/*
* The CPUs this group covers.
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sche
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;

- if (sched_feat(ARCH_POWER))
- power *= arch_scale_freq_power(sd, cpu);
- else
- power *= default_scale_freq_power(sd, cpu);
-
- power >>= SCHED_LOAD_SHIFT;
-
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
if (sched_feat(ARCH_POWER))
power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sche
power >>= SCHED_LOAD_SHIFT;
}

+ sdg->cpu_power_orig = power;
+
+ if (sched_feat(ARCH_POWER))
+ power *= arch_scale_freq_power(sd, cpu);
+ else
+ power *= default_scale_freq_power(sd, cpu);
+
+ power >>= SCHED_LOAD_SHIFT;
+
power *= scale_rt_power(cpu);
power >>= SCHED_LOAD_SHIFT;

@@ -2333,6 +2335,22 @@ static void update_group_power(struct sc
sdg->cpu_power = power;
}

+static inline int
+rt_freq_influence(struct sched_group *group, struct sched_domain *sd)
+{
+ if (sd->child)
+ return 1;
+
+ /*
+ * Check to see if the final cpu power was reduced by more
+ * than 10% by frequency or rt tasks
+ */
+ if (group->cpu_power * 100 < group->cpu_power_orig * 90)
+ return 1;
+
+ return 0;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @sd: The sched_domain whose statistics are to be updated.
@@ -2426,6 +2444,8 @@ static inline void update_sg_lb_stats(st

sgs->group_capacity =
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+ if (!sgs->group_capacity && !rt_freq_influence(group, sd))
+ sgs->group_capacity = 1;
}

/**
@@ -2725,7 +2745,8 @@ ret:
*/
static struct rq *
find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
- unsigned long imbalance, const struct cpumask *cpus)
+ unsigned long imbalance, const struct cpumask *cpus,
+ struct sched_domain *sd)
{
struct rq *busiest = NULL, *rq;
unsigned long max_load = 0;
@@ -2736,6 +2757,9 @@ find_busiest_queue(struct sched_group *g
unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
unsigned long wl;

+ if (!capacity && !rt_freq_influence(group, sd))
+ capacity = 1;
+
if (!cpumask_test_cpu(i, cpus))
continue;

@@ -2852,7 +2876,7 @@ redo:
goto out_balanced;
}

- busiest = find_busiest_queue(group, idle, imbalance, cpus);
+ busiest = find_busiest_queue(group, idle, imbalance, cpus, sd);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;