2009-11-26 16:10:58

by Christian Ehrhardt

Subject: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Hi everybody,
while testing different scheduler tunables I came across the function
sched_init_granularity, which rescales the values of
sysctl_sched_min_granularity, sysctl_sched_latency and
sysctl_sched_wakeup_granularity based on the number of cpus online at
boot time. While one could argue that the 1 + ilog2(num_online_cpus())
factor is wrong or suboptimal, I want to avoid that discussion (at
least in this thread :-)).
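
For reference, the boot-time scaling in kernel/sched.c currently looks
roughly like this (abbreviated):

        static inline void sched_init_granularity(void)
        {
                unsigned int factor = 1 + ilog2(num_online_cpus());
                const unsigned long limit = 200000000;

                sysctl_sched_min_granularity *= factor;
                if (sysctl_sched_min_granularity > limit)
                        sysctl_sched_min_granularity = limit;

                /* sysctl_sched_latency gets the same clamp;
                 * sysctl_sched_wakeup_granularity and
                 * sysctl_sched_shares_ratelimit are only multiplied */
        }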

What I consider more important at the moment is that there is no hook to
recalculate these values when cpus are hot added or removed.
As an example, someone could boot a machine with one online cpu and get
the low, unscaled defaults; later on, driven by load, the system
activates more and more processors. The system could therefore end up
with a large number of cpus but scheduler tunables that were never recalculated.

I'm looking forward to any other solution approaches that come up,
so the following is just a suggestion.
We might store the corresponding 1-cpu values in hidden variables and
rescale the effective ones on every cpu add/remove.
Additionally we would need some logic to update the corresponding
1-cpu values every time a user sets new values via the proc interface.
I already thought about potential rounding errors in the suggested
solution, but those seem better than systems being a whole factor off
the values they should have.
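
To illustrate the rounding concern with a made-up example (the tunables
are stored in nanoseconds):

        user writes sched_latency = 14 ms with 4 cpus online (factor 3)
        stored 1-cpu value:   14000000 / 3     = 4666666 ns
        rescaled to 16 cpus:  4666666 * 5      = 23333330 ns
        exact value:          14000000 * 5 / 3 = 23333333 ns

So the rescaling error is a few nanoseconds, while a value that is never
rescaled can be off by the full 1 + ilog2(ncpus) factor.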

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization


2009-11-26 16:19:53

by Peter Zijlstra

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

On Thu, 2009-11-26 at 17:10 +0100, Christian Ehrhardt wrote:

> What I consider more important at the moment is that there is no hook to
> recalculate these values when cpus are hot added or removed.
> As an example, someone could boot a machine with one online cpu and get
> the low, unscaled defaults; later on, driven by load, the system
> activates more and more processors. The system could therefore end up
> with a large number of cpus but scheduler tunables that were never recalculated.

This is virt junk that's playing dumb games with hotplug, isn't it?

Normal machines simply don't change their number of cpus; if they
hotplug, it's usually for things like suspend or actual replacement of a
faulty piece of kit, in which case there's little point in adjusting
things.

Aside from that, we probably should put an upper limit in place, as I
guess large cpu count machines get silly large values.

2009-11-26 16:22:36

by Peter Zijlstra

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

On Thu, 2009-11-26 at 17:10 +0100, Christian Ehrhardt wrote:
> We might store the corresponding 1-cpu values in hidden variables and
> rescale the effective ones on every cpu add/remove.
> Additionally we would need some logic to update the corresponding
> 1-cpu values every time a user sets new values via the proc interface.

If you're going to do something like that, setting it based on the
number of cpus in the root_domain might be a better solution.

2009-11-26 16:25:46

by Christian Ehrhardt

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 17:10 +0100, Christian Ehrhardt wrote:
>
>
>> What I consider more important at the moment is that there is no hook to
>> recalculate these values when cpus are hot added or removed.
>> As an example, someone could boot a machine with one online cpu and get
>> the low, unscaled defaults; later on, driven by load, the system
>> activates more and more processors. The system could therefore end up
>> with a large number of cpus but scheduler tunables that were never recalculated.
>>
>
> This is virt junk that's playing dumb games with hotplug, isn't it?
>
Sort of; it's on s390, which does that all the time. By default there
is a daemon that activates/deactivates cpus according to load, to cover
load peaks but also to save virtualization overhead.
> Normal machines simply don't change their number of cpus; if they
> hotplug, it's usually for things like suspend or actual replacement of a
> faulty piece of kit, in which case there's little point in adjusting
> things.
>
>
What is still "normal" today? You can't get s390 without virtualization,
so I would consider it normal and a real use case for us :-)
> Aside from that, we probably should put an upper limit in place, as I
> guess large cpu count machines get silly large values
I agree with that, but there is already an upper limit of
200,000,000 in the code - well, we might discuss whether that is too low/high.
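
For comparison, with the current defaults that limit is practically
never reached, while a cap on the cpu count would bite much earlier:

        200 ms limit on sched_latency (5 ms default) -> factor 40 -> 2^39 online cpus
        cap at  8 cpus -> factor 1 + ilog2(8)  = 4 -> sched_latency 20 ms, min_granularity 4 ms
        cap at 16 cpus -> factor 1 + ilog2(16) = 5 -> sched_latency 25 ms, min_granularity 5 ms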


--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

2009-11-26 16:28:17

by Peter Zijlstra

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

On Thu, 2009-11-26 at 17:25 +0100, Christian Ehrhardt wrote:
> > Aside from that, we probably should put an upper limit in place, as I
> > guess large cpu count machines get silly large values
> I agree with that, but there is already an upper limit of
> 200,000,000 in the code - well, we might discuss whether that is too low/high.

Yeah, I think we should cap it around the 8-16 CPUs.

2009-11-26 16:31:57

by Christian Ehrhardt

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 17:25 +0100, Christian Ehrhardt wrote:
>
>>> Aside from that, we probably should put an upper limit in place, as I
>>> guess large cpu count machines get silly large values
>>>
>> I agree with that, but there is already an upper limit of
>> 200,000,000 in the code - well, we might discuss whether that is too low/high.
>>
>
> Yeah, I think we should cap it around the 8-16 CPUs.
>
>
Ok for me. Driven by that finding I think I have to measure different
kinds of scaling anyway, but as usual that takes some time :-/
At least too much time for the discussion & solution of this bug, I guess.

The question for now is what we do on cpu hot add/remove.
Would hooking somewhere in kernel/cpu.c be the right approach? I'm not
quite sure about my own suggestion yet :-).

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

2009-11-26 16:45:50

by Peter Zijlstra

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

On Thu, 2009-11-26 at 17:31 +0100, Christian Ehrhardt wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-11-26 at 17:25 +0100, Christian Ehrhardt wrote:
> >
> >>> Aside from that, we probably should put an upper limit in place, as I
> >>> guess large cpu count machines get silly large values
> >>>
> >> I agree with that, but there is already an upper limit of
> >> 200,000,000 in the code - well, we might discuss whether that is too low/high.
> >>
> >
> > Yeah, I think we should cap it around the 8-16 CPUs.
> >
> >
> Ok for me. Driven by that finding I think I have to measure different
> kinds of scaling anyway, but as usual that takes some time :-/
> At least too much time for the discussion & solution of this bug, I guess.
>
> The question for now is what we do on cpu hot add/remove.
> Would hooking somewhere in kernel/cpu.c be the right approach? I'm not
> quite sure about my own suggestion yet :-).

Something like the below might work I suppose, just needs a cleanup and
such.


diff --git a/kernel/sched.c b/kernel/sched.c
index 0cbf2ef..210365f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -814,6 +814,7 @@ const_debug unsigned int sysctl_sched_nr_migrate = 32;
* default: 0.25ms
*/
unsigned int sysctl_sched_shares_ratelimit = 250000;
+unsigned int default_sysctl_sched_shares_ratelimit = 250000;

/*
* Inject some fuzzyness into changing the per-cpu group shares
@@ -1810,6 +1811,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
#endif

static void calc_load_account_active(struct rq *this_rq);
+static void update_sysctl(void);

#include "sched_stats.h"
#include "sched_idletask.c"
@@ -7019,22 +7021,24 @@ cpumask_var_t nohz_cpu_mask;
*
* This idea comes from the SD scheduler of Con Kolivas:
*/
-static inline void sched_init_granularity(void)
+#define SET_SYSCTL(name) \
+ sysctl_##name = (factor) * default_sysctl_##name
+
+static void update_sysctl(void)
{
- unsigned int factor = 1 + ilog2(num_online_cpus());
+ unsigned int cpus = min_t(unsigned int, num_active_cpus(), 8);
+ unsigned int factor = 1 + ilog2(cpus);
const unsigned long limit = 200000000;

- sysctl_sched_min_granularity *= factor;
- if (sysctl_sched_min_granularity > limit)
- sysctl_sched_min_granularity = limit;
-
- sysctl_sched_latency *= factor;
- if (sysctl_sched_latency > limit)
- sysctl_sched_latency = limit;
-
- sysctl_sched_wakeup_granularity *= factor;
+ SET_SYSCTL(sched_min_granularity);
+ SET_SYSCTL(sched_latency);
+ SET_SYSCTL(sched_wakeup_granularity);
+ SET_SYSCTL(sched_shares_ratelimit);
+}

- sysctl_sched_shares_ratelimit *= factor;
+static inline void sched_init_granularity(void)
+{
+ update_sysctl();
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0ff21af..4d429b8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -35,12 +35,14 @@
* run vmstat and monitor the context-switches (cs) field)
*/
unsigned int sysctl_sched_latency = 5000000ULL;
+unsigned int default_sysctl_sched_latency = 5000000ULL;

/*
* Minimal preemption granularity for CPU-bound tasks:
* (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
unsigned int sysctl_sched_min_granularity = 1000000ULL;
+unsigned int default_sysctl_sched_min_granularity = 1000000ULL;

/*
* is kept at sysctl_sched_latency / sysctl_sched_min_granularity
@@ -70,6 +72,7 @@ unsigned int __read_mostly sysctl_sched_compat_yield;
* have immediate wakeup/sleep latencies.
*/
unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
+unsigned int default_sysctl_sched_wakeup_granularity = 1000000UL;

const_debug unsigned int sysctl_sched_migration_cost = 500000UL;

@@ -1905,6 +1908,17 @@ move_one_task_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,

return 0;
}
+
+static void rq_online_fair(struct rq *rq)
+{
+ update_sysctl();
+}
+
+static void rq_offline_fair(struct rq *rq)
+{
+ update_sysctl();
+}
+
#endif /* CONFIG_SMP */

/*
@@ -2052,6 +2066,8 @@ static const struct sched_class fair_sched_class = {

.load_balance = load_balance_fair,
.move_one_task = move_one_task_fair,
+ .rq_online = rq_online_fair,
+ .rq_offline = rq_offline_fair,
#endif

.set_curr_task = set_curr_task_fair,

2009-11-26 18:39:03

by Christian Ehrhardt

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Peter Zijlstra wrote:
> On Thu, 2009-11-26 at 17:31 +0100, Christian Ehrhardt wrote:
>
>> [...]
>> The question for now is what we do on cpu hot add/remove.
>> Would hooking somewhere in kernel/cpu.c be the right approach? I'm not
>> quite sure about my own suggestion yet :-).
>>
>
> Something like the below might work I suppose, just needs a cleanup and
> such.
>
>

Looks very promising. I did not expect it would be so easy to hook up to
the hotplug events, but you're absolutely right, the scheduler already
has hooks for that with rq_online/offline.
From looking at the patch alone I expect it will lose user updates made
via sysctl. It might just need some feedback from the sysctl writer
functions that sets the default values to setval/(1 + ilog2(ncpus));
that includes renaming "default" to "normalized" or something like that.
But I'll test this patch in depth tomorrow morning anyway and give more
detailed feedback.
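
Something along these lines is what I have in mind for the writer side
(untested sketch, the handler name is made up; today these tunables just
go through proc_dointvec_minmax in kernel/sysctl.c):

        /* untested sketch: wrap the sysctl write so the normalized (1-cpu)
         * copies follow what the user wrote instead of staying at the
         * compile-time defaults */
        static int sched_tunable_update_handler(struct ctl_table *table, int write,
                        void __user *buffer, size_t *lenp, loff_t *ppos)
        {
                int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
                unsigned int factor = 1 + ilog2(num_online_cpus());

                if (ret || !write)
                        return ret;

                default_sysctl_sched_latency = sysctl_sched_latency / factor;
                default_sysctl_sched_min_granularity =
                                sysctl_sched_min_granularity / factor;
                default_sysctl_sched_wakeup_granularity =
                                sysctl_sched_wakeup_granularity / factor;

                return 0;
        }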

Thanks a lot!

> [...]


--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

2009-11-26 18:53:45

by Peter Zijlstra

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

On Thu, 2009-11-26 at 19:39 +0100, Christian Ehrhardt wrote:
> Looks very promising. I did not expect it would be so easy to hook up to
> the hotplug events, but you're absolutely right, the scheduler already
> has hooks for that with rq_online/offline.
> From looking at the patch alone I expect it will lose user updates made
> via sysctl. It might just need some feedback from the sysctl writer
> functions that sets the default values to setval/(1 + ilog2(ncpus));
> that includes renaming "default" to "normalized" or something like that.
> But I'll test this patch in depth tomorrow morning anyway and give more
> detailed feedback.

Yeah, it needs better integration with the sysctl bits and such, but it
shows the general shape and form of things.


2009-12-03 09:12:39

by Pavel Machek

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Hi!

> > >>> Aside from that, we probably should put an upper limit in place, as I
> > >>> guess large cpu count machines get silly large values
> > >>>
> > >> I agree with that, but there is already an upper limit of
> > >> 200,000,000 in the code - well, we might discuss whether that is too low/high.
> > >>
> > >
> > > Yeah, I think we should cap it around the 8-16 CPUs.
> > >
> > >
> > Ok for me. Driven by that finding I think I have to measure different
> > kinds of scaling anyway, but as usual that takes some time :-/
> > At least too much time for the discussion & solution of this bug, I guess.
> >
> > The question for now is what we do on cpu hot add/remove.
> > Would hooking somewhere in kernel/cpu.c be the right approach? I'm not
> > quite sure about my own suggestion yet :-).
>
> Something like the below might work I suppose, just needs a cleanup and
> such.

I see a rather fundamental problem: what if the user wants to override
those values, and wants them to stay that way?

If you do this, suspend/resume will put the old values back AFAICT.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-03 09:31:11

by Christian Ehrhardt

Subject: Re: Missing recalculation of scheduler tunables in case of cpu hot add/remove

Pavel Machek wrote:
> Hi!
>
>
>>>>>> Aside from that, we probably should put an upper limit in place, as I
>>>>>> guess large cpu count machines get silly large values
>>>>>>
>>>>>>
>>>>> I agree with that, but there is already an upper limit of
>>>>> 200,000,000 in the code - well, we might discuss whether that is too low/high.
>>>>>
>>>>>
>>>> Yeah, I think we should cap it around the 8-16 CPUs.
>>>>
>>>>
>>>>
>>> Ok for me. Driven by that finding I think I have to measure different
>>> kinds of scaling anyway, but as usual that takes some time :-/
>>> At least too much time for the discussion & solution of this bug, I guess.
>>>
>>> The question for now is what we do on cpu hot add/remove.
>>> Would hooking somewhere in kernel/cpu.c be the right approach? I'm not
>>> quite sure about my own suggestion yet :-).
>>>
>> Something like the below might work I suppose, just needs a cleanup and
>> such.
>>
>
> I see a rather fundamental problem: what if the user wants to override
> those values, and wants them to stay that way?
Yep, a fundamental problem, but fortunately solved already ;-)

See the series "[PATCH 0/3] fix rescaling of scheduler tunables v2"
posted after this discussion.
That is exactly what patch #2 is about: giving users the choice to
either keep things constant (scaling=none) or have them rescaled
dynamically (log or linear), just as it is done at boot time.

As I consider it a bug to miss the updates, the current patch
initializes it with scaling=log so that runtime and boot behave the
same way. I could do an update that better preserves the existing
interface by initializing it with "scaling=none", which would reflect
by default the behavior of the current code that misses rescaling
completely.
Comments welcome.
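
For reference, the knob in patch #2 looks roughly like this (paraphrased
from the description above; see the posted series for the real thing):

        enum sched_tunable_scaling {
                SCHED_TUNABLESCALING_NONE,      /* keep boot/user values as they are */
                SCHED_TUNABLESCALING_LOG,       /* factor = 1 + ilog2(ncpus), as at boot today */
                SCHED_TUNABLESCALING_LINEAR,    /* factor = ncpus */
        };

        static unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;

        static unsigned int get_update_sysctl_factor(void)
        {
                unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8);

                switch (sysctl_sched_tunable_scaling) {
                case SCHED_TUNABLESCALING_NONE:
                        return 1;
                case SCHED_TUNABLESCALING_LINEAR:
                        return cpus;
                case SCHED_TUNABLESCALING_LOG:
                default:
                        return 1 + ilog2(cpus);
                }
        }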

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization