Subject: sched: avoid large irq-latencies in smp-balancing
SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs
are large this can cause large irq-latencies. Limit the nr of iterations on
each run.
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Peter Williams <[email protected]>
---
kernel/sched.c | 10 ++++++++--
kernel/sysctl.c | 8 ++++++++
2 files changed, 16 insertions(+), 2 deletions(-)
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -474,6 +474,12 @@ const_debug unsigned int sysctl_sched_fe
#define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
/*
+ * Number of tasks to iterate in a single balance run.
+ * Limited because this is done with IRQs disabled.
+ */
+const_debug unsigned int sysctl_sched_nr_migrate = 32;
+
+/*
* For kernel-internal use: high-speed (but slightly incorrect) per-cpu
* clock constructed from sched_clock():
*/
@@ -2237,7 +2243,7 @@ balance_tasks(struct rq *this_rq, int th
enum cpu_idle_type idle, int *all_pinned,
int *this_best_prio, struct rq_iterator *iterator)
{
- int pulled = 0, pinned = 0, skip_for_load;
+ int loops = 0, pulled = 0, pinned = 0, skip_for_load;
struct task_struct *p;
long rem_load_move = max_load_move;
@@ -2251,7 +2257,7 @@ balance_tasks(struct rq *this_rq, int th
*/
p = iterator->start(iterator->arg);
next:
- if (!p)
+ if (!p || loops++ > sysctl_sched_nr_migrate)
goto out;
/*
* To help distribute high priority tasks accross CPUs we don't
Index: linux-2.6-2/kernel/sysctl.c
===================================================================
--- linux-2.6-2.orig/kernel/sysctl.c
+++ linux-2.6-2/kernel/sysctl.c
@@ -298,6 +298,14 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_nr_migrate",
+ .data = &sysctl_sched_nr_migrate,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#endif
{
.ctl_name = CTL_UNNUMBERED,
Bah, missed a hunk
---
Subject: sched: avoid large irq-latencies in smp-balancing
SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs
are large this can cause large irq-latencies. Limit the nr of iterations on
each run.
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Peter Williams <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched.c | 15 ++++++++++-----
kernel/sysctl.c | 8 ++++++++
3 files changed, 19 insertions(+), 5 deletions(-)
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -474,6 +474,12 @@ const_debug unsigned int sysctl_sched_fe
#define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
/*
+ * Number of tasks to iterate in a single balance run.
+ * Limited because this is done with IRQs disabled.
+ */
+const_debug unsigned int sysctl_sched_nr_migrate = 32;
+
+/*
* For kernel-internal use: high-speed (but slightly incorrect) per-cpu
* clock constructed from sched_clock():
*/
@@ -2237,7 +2243,7 @@ balance_tasks(struct rq *this_rq, int th
enum cpu_idle_type idle, int *all_pinned,
int *this_best_prio, struct rq_iterator *iterator)
{
- int pulled = 0, pinned = 0, skip_for_load;
+ int loops = 0, pulled = 0, pinned = 0, skip_for_load;
struct task_struct *p;
long rem_load_move = max_load_move;
@@ -2251,10 +2257,10 @@ balance_tasks(struct rq *this_rq, int th
*/
p = iterator->start(iterator->arg);
next:
- if (!p)
+ if (!p || loops++ > sysctl_sched_nr_migrate)
goto out;
/*
- * To help distribute high priority tasks accross CPUs we don't
+ * To help distribute high priority tasks across CPUs we don't
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
@@ -2271,8 +2277,7 @@ next:
rem_load_move -= p->se.load.weight;
/*
- * We only want to steal up to the prescribed number of tasks
- * and the prescribed amount of weighted load.
+ * We only want to steal up to the prescribed amount of weighted load.
*/
if (rem_load_move > 0) {
if (p->prio < *this_best_prio)
Index: linux-2.6-2/kernel/sysctl.c
===================================================================
--- linux-2.6-2.orig/kernel/sysctl.c
+++ linux-2.6-2/kernel/sysctl.c
@@ -298,6 +298,14 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_nr_migrate",
+ .data = &sysctl_sched_nr_migrate,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#endif
{
.ctl_name = CTL_UNNUMBERED,
Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -1466,6 +1466,7 @@ extern unsigned int sysctl_sched_batch_w
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
+extern unsigned int sysctl_sched_nr_migrate;
#endif
extern unsigned int sysctl_sched_compat_yield;
> On Wed, 07 Nov 2007 13:17:00 +0100 Peter Zijlstra <[email protected]> wrote:
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "sched_nr_migrate",
> + .data = &sysctl_sched_nr_migrate,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
This (and all the other stuff in that table) should be described in
Documentation/, please.
It would be nice if sched_nr_migrate didn't exist, really. It's hard to
imagine anyone wanting to tweak it, apart from developers.
On Wed, 2007-11-07 at 10:27 -0800, Andrew Morton wrote:
> > On Wed, 07 Nov 2007 13:17:00 +0100 Peter Zijlstra <[email protected]> wrote:
> > + {
> > + .ctl_name = CTL_UNNUMBERED,
> > + .procname = "sched_nr_migrate",
> > + .data = &sysctl_sched_nr_migrate,
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec,
> > + },
>
> This (and all the other stuff in that table) should be described in
> Documentation/, please.
>
> It would be nice if sched_nr_migrate didn't exist, really. It's hard to
> imagine anyone wanting to tweak it, apart from developers.
Right, most of these SCHED_DEBUG sysctls are only interesting to
developers, although stuff like sched_latency might be interesting to
some users.
I'll try and write up a documentation thingy before .24 hits the street.
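Roughly something like this, as a first stab (my wording; final placement
in Documentation/ to be decided):

sched_nr_migrate
	Maximum number of tasks that one SMP balance run will iterate over
	with IRQs disabled. Larger values let the balancer spread load
	better; smaller values bound the IRQs-off latency. Default: 32.
	Only tunable with CONFIG_SCHED_DEBUG, like the rest of that block.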
On Wed, Nov 07, 2007 at 10:27:25AM -0800, Andrew Morton wrote:
> > On Wed, 07 Nov 2007 13:17:00 +0100 Peter Zijlstra <[email protected]> wrote:
> > + {
> > + .ctl_name = CTL_UNNUMBERED,
> > + .procname = "sched_nr_migrate",
> > + .data = &sysctl_sched_nr_migrate,
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec,
> > + },
>
> This (and all the other stuff in that table) should be described in
> Documentation/, please.
>
> It would be nice if sched_nr_migrate didn't exist, really. It's hard to
> imagine anyone wanting to tweak it, apart from developers.
I'm not so sure about that. It is a tunable for RT. That is, we can tweak
this value to be smaller if we don't like the latencies it gives us.
This is one of those things that sacrifices performance for latency.
The higher the number, the better it can spread tasks around, but it
also causes larger latencies.
I've just included this patch in 2.6.23.1-rt11 and it brought an
unbounded latency down to just 42us (previously we got into the
milliseconds!).
Perhaps when this feature matures we can settle on a well-defined value
that is good for everyone. But until then, I recommend keeping this a
tunable.
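For anyone who wants to play with it, a minimal userspace sketch for
reading the knob (assuming it appears as /proc/sys/kernel/sched_nr_migrate,
per the kern_table entry above):

#include <stdio.h>

int main(void)
{
	unsigned int val;
	FILE *f = fopen("/proc/sys/kernel/sched_nr_migrate", "r");

	if (!f || fscanf(f, "%u", &val) != 1) {
		fprintf(stderr, "could not read sched_nr_migrate\n");
		return 1;
	}
	fclose(f);
	printf("sched_nr_migrate = %u\n", val);
	return 0;
}

Echoing a smaller value into the same file is then the obvious way to
trade throughput for latency on -rt.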
Acked-by: Steven Rostedt <[email protected]>
-- Steve
On Wed, 2007-11-07 at 17:10 -0500, Steven Rostedt wrote:
> >
> > It would be nice if sched_nr_migrate didn't exist, really. It's hard to
> > imagine anyone wanting to tweak it, apart from developers.
>
> I'm not so sure about that. It is a tunable for RT. That is, we can tweak
> this value to be smaller if we don't like the latencies it gives us.
>
> This is one of those things that sacrifices performance for latency.
> The higher the number, the better it can spread tasks around, but it
> also causes larger latencies.
>
> I've just included this patch in 2.6.23.1-rt11 and it brought an
> unbounded latency down to just 42us (previously we got into the
> milliseconds!).
>
> Perhaps when this feature matures we can settle on a well-defined
> value that is good for everyone. But until then, I recommend keeping
> this a tunable.
Why not use the latency-expectation infrastructure?
Iterate under the lock only for as long as the system-wide latency
expectation is respected.
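Roughly like this, as a userspace sketch of the idea (not the posted
patch; the 50us budget is made up for illustration):

#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	const long long budget_ns = 50000;	/* hypothetical 50us budget */
	long long start = now_ns();
	int pulled = 0;

	/* stand-in for the pull loop in balance_tasks(): stop when a
	 * time budget, rather than a fixed task count, is exhausted */
	while (now_ns() - start < budget_ns)
		pulled++;			/* "migrate" one task */

	printf("pulled %d tasks within %lldns\n", pulled, budget_ns);
	return 0;
}

A count bound is simpler to reason about, but a time bound would track
the actual latency goal more directly.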
- Eric
Peter Zijlstra wrote:
> Bah, missed a hunk
>
> ---
> Subject: sched: avoid large irq-latencies in smp-balancing
>
> SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs
> are large this can cause large irq-latencies. Limit the nr of iterations on
> each run.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> CC: Peter Williams <[email protected]>
Tested-by: Gregory Haskins <[email protected]> (as part of 23.1-rt11)
Peter Zijlstra <[email protected]> writes:
> @@ -2237,7 +2243,7 @@ balance_tasks(struct rq *this_rq, int th
> enum cpu_idle_type idle, int *all_pinned,
> int *this_best_prio, struct rq_iterator *iterator)
> {
> - int pulled = 0, pinned = 0, skip_for_load;
> + int loops = 0, pulled = 0, pinned = 0, skip_for_load;
> struct task_struct *p;
> long rem_load_move = max_load_move;
>
> @@ -2251,10 +2257,10 @@ balance_tasks(struct rq *this_rq, int th
> */
> p = iterator->start(iterator->arg);
> next:
> - if (!p)
> + if (!p || loops++ > sysctl_sched_nr_migrate)
> goto out;
Looks to me like an off-by-one thingy. If sysctl_sched_nr_migrate is zero,
we can still migrate one task. Feature or bug?
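A stand-alone toy that shows the boundary behaviour (the check is copied
from the patch, everything else is made up):

#include <stdio.h>

int main(void)
{
	unsigned int sysctl_sched_nr_migrate = 0;	/* the edge case */
	int loops = 0, migrated = 0, i;

	for (i = 0; i < 5; i++) {			/* pretend 5 tasks on the rq */
		if (loops++ > sysctl_sched_nr_migrate)	/* the patch's test */
			break;
		migrated++;
	}
	printf("limit 0 -> migrated %d\n", migrated);	/* prints: migrated 1 */
	return 0;
}

More generally, limit N lets N+1 tasks through; s/>/>=/ would make the
name exact.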
/dm
On Thursday 08 November 2007 15:37, Gregory Haskins wrote:
> Peter Zijlstra wrote:
> > Bah, missed a hunk
> >
> > ---
> > Subject: sched: avoid large irq-latencies in smp-balancing
> >
> > SMP balancing is done with IRQs disabled and can iterate the full rq.
> > When rqs are large this can cause large irq-latencies. Limit the nr of
> > iterations on each run.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > CC: Peter Williams <[email protected]>
>
> Tested-by: Gregory Haskins <[email protected]> (as part of 23.1-rt11)
OK, but let's not make this the default for mainline just yet, please.
There are so many possibilities for performance regressions in 2.6.24
already that we should go a little bit slower for now, IMO.