2011-04-02 10:31:32

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] scheduler fixes

Linus,

Please pull the latest sched-fixes-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus

Thanks,

Ingo

------------------>
Borislav Petkov (1):
sched, doc: Beef up load balancing description

Dario Faggioli (1):
sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks

Sisir Koppaka (1):
sched: Fix rebalance interval calculation


Documentation/scheduler/sched-domains.txt | 32 ++++++++++++++++++++--------
kernel/sched.c | 11 ++++++++++
kernel/sched_fair.c | 5 ++-
3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt
index 373ceac..b7ee379 100644
--- a/Documentation/scheduler/sched-domains.txt
+++ b/Documentation/scheduler/sched-domains.txt
@@ -1,8 +1,7 @@
-Each CPU has a "base" scheduling domain (struct sched_domain). These are
-accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain
+Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
-MUST be NULL terminated, and domain structures should be per-CPU as they
-are locklessly updated.
+MUST be NULL terminated, and domain structures should be per-CPU as they are
+locklessly updated.

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of it child's span (this restriction could
@@ -26,11 +25,26 @@ is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

-In kernel/sched.c, rebalance_tick is run periodically on each CPU. This
-function takes its CPU's base sched domain and checks to see if has reached
-its rebalance interval. If so, then it will run load_balance on that domain.
-rebalance_tick then checks the parent sched_domain (if it exists), and the
-parent of the parent and so forth.
+In kernel/sched.c, trigger_load_balance() is run periodically on each CPU
+through scheduler_tick(). It raises a softirq after the next regularly scheduled
+rebalancing event for the current runqueue has arrived. The actual load
+balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
+in softirq context (SCHED_SOFTIRQ).
+
+The latter function takes two arguments: the current CPU and whether it was idle
+at the time the scheduler_tick() happened and iterates over all sched domains
+our CPU is on, starting from its base domain and going up the ->parent chain.
+While doing that, it checks to see if the current domain has exhausted its
+rebalance interval. If so, it runs load_balance() on that domain. It then checks
+the parent sched_domain (if it exists), and the parent of the parent and so
+forth.
+
+Initially, load_balance() finds the busiest group in the current sched domain.
+If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
+that group. If it manages to find such a runqueue, it locks both our initial
+CPU's runqueue and the newly found busiest one and starts moving tasks from it
+to our runqueue. The exact number of tasks amounts to an imbalance previously
+computed while iterating over this sched domain's groups.

*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
diff --git a/kernel/sched.c b/kernel/sched.c
index f592ce6..a884551 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5011,6 +5011,17 @@ recheck:
return -EINVAL;
}

+ /*
+ * If not changing anything there's no need to proceed further:
+ */
+ if (unlikely(policy == p->policy && (!rt_policy(policy) ||
+ param->sched_priority == p->rt_priority))) {
+
+ __task_rq_unlock(rq);
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+ return 0;
+ }
+
#ifdef CONFIG_RT_GROUP_SCHED
if (user) {
/*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3f7ec9e..c7ec5c8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -22,6 +22,7 @@

#include <linux/latencytop.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -3850,8 +3851,8 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
interval = msecs_to_jiffies(interval);
if (unlikely(!interval))
interval = 1;
- if (interval > HZ*NR_CPUS/10)
- interval = HZ*NR_CPUS/10;
+ if (interval > HZ*num_online_cpus()/10)
+ interval = HZ*num_online_cpus()/10;

need_serialize = sd->flags & SD_SERIALIZE;


2011-04-04 15:46:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] scheduler fixes

So I pulled this, but I think this:

On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <[email protected]> wrote:
>
> - ? ? ? ? ? ? ? if (interval > HZ*NR_CPUS/10)
> - ? ? ? ? ? ? ? ? ? ? ? interval = HZ*NR_CPUS/10;
> + ? ? ? ? ? ? ? if (interval > HZ*num_online_cpus()/10)
> + ? ? ? ? ? ? ? ? ? ? ? interval = HZ*num_online_cpus()/10;

is a horrible patch.

Think about what that expands to. It's going to potentially be two
function calls. And the code is just ugly.

So please, when doing search-and-replace changes like this, just clean
up the code at the same time. Back when it was about pure constants,
there was only a typing/ugly overhead from duplicating the constant,
but the compiler would see a single constant.

Now it's _possible_ that the compiler could do the analysis and fold
it all back to a single thing. But it's unlikely to happen except for
configurations that end up making it all trivial.

So just add something like a

int max_interval = HZ*num_online_cpus()/10;

possibly even with a comment about _why_ that is the max interval allowed. Ok?

Linus

2011-04-04 16:08:50

by Sisir Koppaka

[permalink] [raw]
Subject: Re: [GIT PULL] scheduler fixes

On Mon, Apr 4, 2011 at 9:15 PM, Linus Torvalds
<[email protected]> wrote:
> So I pulled this, but I think this:
>
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <[email protected]> wrote:
>>
>> - ? ? ? ? ? ? ? if (interval > HZ*NR_CPUS/10)
>> - ? ? ? ? ? ? ? ? ? ? ? interval = HZ*NR_CPUS/10;
>> + ? ? ? ? ? ? ? if (interval > HZ*num_online_cpus()/10)
>> + ? ? ? ? ? ? ? ? ? ? ? interval = HZ*num_online_cpus()/10;
>
> is a horrible patch.
>
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
>
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
>
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
>
> So just add something like a
>
> ? int max_interval = HZ*num_online_cpus()/10;
>
> possibly even with a comment about _why_ that is the max interval allowed. ?Ok?
>
> ? ? ? ? ? ? ? ? ? ? ? ?Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

Ok sure. You're right, I'll resend the patch.
Sisir

2011-04-04 16:15:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [GIT PULL] scheduler fixes

On Mon, 4 Apr 2011 08:45:34 -0700 Linus Torvalds <[email protected]> wrote:

> So I pulled this, but I think this:
>
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <[email protected]> wrote:
> >
> > - if (interval > HZ*NR_CPUS/10)
> > - interval = HZ*NR_CPUS/10;
> > + if (interval > HZ*num_online_cpus()/10)
> > + interval = HZ*num_online_cpus()/10;
>
> is a horrible patch.
>
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
>
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
>
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
>
> So just add something like a
>
> int max_interval = HZ*num_online_cpus()/10;
>
> possibly even with a comment about _why_ that is the max interval allowed. Ok?
>

max() conveniently avoids the double-evaluation:

#define max(x, y) ({ \
typeof(x) _max1 = (x); \
typeof(y) _max2 = (y); \
(void) (&_max1 == &_max2); \
_max1 > _max2 ? _max1 : _max2; })

2011-04-04 19:19:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [GIT PULL] scheduler fixes


* Linus Torvalds <[email protected]> wrote:

> So I pulled this, but I think this:
>
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <[email protected]> wrote:
> >
> > - ? ? ? ? ? ? ? if (interval > HZ*NR_CPUS/10)
> > - ? ? ? ? ? ? ? ? ? ? ? interval = HZ*NR_CPUS/10;
> > + ? ? ? ? ? ? ? if (interval > HZ*num_online_cpus()/10)
> > + ? ? ? ? ? ? ? ? ? ? ? interval = HZ*num_online_cpus()/10;
>
> is a horrible patch.

Indeed, will fix it.

Thanks,

Ingo

2011-04-05 08:14:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] scheduler fixes

On Mon, 2011-04-04 at 08:45 -0700, Linus Torvalds wrote:
> So I pulled this, but I think this:
>
> On Sat, Apr 2, 2011 at 3:31 AM, Ingo Molnar <[email protected]> wrote:
> >
> > - if (interval > HZ*NR_CPUS/10)
> > - interval = HZ*NR_CPUS/10;
> > + if (interval > HZ*num_online_cpus()/10)
> > + interval = HZ*num_online_cpus()/10;
>
> is a horrible patch.
>
> Think about what that expands to. It's going to potentially be two
> function calls. And the code is just ugly.
>
> So please, when doing search-and-replace changes like this, just clean
> up the code at the same time. Back when it was about pure constants,
> there was only a typing/ugly overhead from duplicating the constant,
> but the compiler would see a single constant.
>
> Now it's _possible_ that the compiler could do the analysis and fold
> it all back to a single thing. But it's unlikely to happen except for
> configurations that end up making it all trivial.
>
> So just add something like a
>
> int max_interval = HZ*num_online_cpus()/10;
>
> possibly even with a comment about _why_ that is the max interval allowed. Ok?

How about something like the below?

---
Subject: sched: Clean up load-balance interval calculation

Instead of the possible multiple-evaluation of num_online_cpus() avoid
it all together in the normal case since its implemented with a hamming
weight function over a cpu bitmask which can be darn expensive for those
with big iron.

Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched.c | 3 +++
kernel/sched_fair.c | 16 ++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a884551..17b4d22 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6331,6 +6331,9 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
#endif
}
+
+ update_max_interval();
+
return NOTIFY_OK;
}

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c7ec5c8..a14a04e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3820,6 +3820,17 @@ void select_nohz_load_balancer(int stop_tick)

static DEFINE_SPINLOCK(balancing);

+static unsigned long max_load_balance_interval = HZ/10;
+
+/*
+ * Scale the max load_balance interval with the number of CPUs in the system.
+ * This trades load-balance latency on larger machines for less cross talk.
+ */
+static void update_max_interval(void)
+{
+ max_load_balance_interval = HZ*num_online_cpus()/10;
+}
+
/*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@@ -3849,10 +3860,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)

/* scale ms to jiffies */
interval = msecs_to_jiffies(interval);
- if (unlikely(!interval))
- interval = 1;
- if (interval > HZ*num_online_cpus()/10)
- interval = HZ*num_online_cpus()/10;
+ interval = clamp(interval, 1UL, max_load_balance_interval);

need_serialize = sd->flags & SD_SERIALIZE;

2011-04-05 08:49:40

by Peter Zijlstra

[permalink] [raw]
Subject: [tip:sched/urgent] sched: Clean up rebalance_domains() load-balance interval calculation

Commit-ID: 49c022e657fbe661460d191fbe776a387132e2b3
Gitweb: http://git.kernel.org/tip/49c022e657fbe661460d191fbe776a387132e2b3
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 5 Apr 2011 10:14:25 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 5 Apr 2011 10:29:36 +0200

sched: Clean up rebalance_domains() load-balance interval calculation

Instead of the possible multiple-evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case since it's implemented with a Hamming weight
function over a cpu bitmask which can be darn expensive for those
with big iron.

This also makes it cleaner, smaller and documents the code.

Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 3 +++
kernel/sched_fair.c | 16 ++++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a884551..17b4d22 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6331,6 +6331,9 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
#endif
}
+
+ update_max_interval();
+
return NOTIFY_OK;
}

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c7ec5c8..80ecd09 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3820,6 +3820,17 @@ void select_nohz_load_balancer(int stop_tick)

static DEFINE_SPINLOCK(balancing);

+static unsigned long __read_mostly max_load_balance_interval = HZ/10;
+
+/*
+ * Scale the max load_balance interval with the number of CPUs in the system.
+ * This trades load-balance latency on larger machines for less cross talk.
+ */
+static void update_max_interval(void)
+{
+ max_load_balance_interval = HZ*num_online_cpus()/10;
+}
+
/*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@@ -3849,10 +3860,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)

/* scale ms to jiffies */
interval = msecs_to_jiffies(interval);
- if (unlikely(!interval))
- interval = 1;
- if (interval > HZ*num_online_cpus()/10)
- interval = HZ*num_online_cpus()/10;
+ interval = clamp(interval, 1UL, max_load_balance_interval);

need_serialize = sd->flags & SD_SERIALIZE;