2004-11-15 22:39:14

by Darren Hart

[permalink] [raw]
Subject: [patch] scheduler: rebalance_tick interval update

The current rebalance_tick() code advances each sched_domain's
last_balance field by interval after performing a load_balance. If
interval is 10, this has the effect of saying: we want to run
load_balance at time = 10, 20, 30, 40, etc... If, for example,
last_balance=10 and for some reason rebalance_tick can't be run until
30, load_balance will be called and last_balance will be updated to 20,
causing it to call load_balance again immediately the next time it is
called, since the interval is 10 and we are already at >30. It seems to
me that it would make much more sense for last_balance to be assigned
jiffies after a load_balance, so that the meaning of last_balance is
more exact: "this domain was last balanced at jiffies" rather than "we
last handled the balance we were supposed to do at 20, at some
indeterminate time". The following patch makes this change.

Signed-off-by: Darren Hart <[email protected]>
---

diff -purN -X /home/mbligh/.diff.exclude /home/linux/views/linux-2.6.10-rc1-mm5/kernel/sched.c linux-2.6.10-rc1-mm5-jni/kernel/sched.c
--- /home/linux/views/linux-2.6.10-rc1-mm5/kernel/sched.c 2004-11-11 05:33:53.000000000 -0800
+++ linux-2.6.10-rc1-mm5-jni/kernel/sched.c 2004-11-15 11:28:16.000000000 -0800
@@ -2201,7 +2201,7 @@ static void rebalance_tick(int this_cpu,
 				/* We've pulled tasks over so no longer idle */
 				idle = NOT_IDLE;
 			}
-			sd->last_balance += interval;
+			sd->last_balance = jiffies;
 		}
 	}
 }
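
For illustration only, here is a minimal user-space sketch (not kernel
code; the constants and the simulated lost-tick window are made up)
contrasting how the two update policies behave when the tick is delayed:

	/*
	 * "jiffies" is just a simulated counter, INTERVAL matches the
	 * 10-tick example above, and the tick handler is assumed to be
	 * blocked from 11..29.
	 */
	#include <stdio.h>

	#define INTERVAL 10

	int main(void)
	{
		unsigned long jiffies;
		unsigned long last_plus = 0;    /* current: last_balance += interval */
		unsigned long last_jiffies = 0; /* patch:   last_balance = jiffies   */

		for (jiffies = 0; jiffies <= 60; jiffies++) {
			if (jiffies > 10 && jiffies < 30)
				continue;       /* rebalance_tick could not run */

			if (jiffies - last_plus >= INTERVAL) {
				printf("jiffy %2lu: balance (+= interval)\n", jiffies);
				last_plus += INTERVAL;
			}
			if (jiffies - last_jiffies >= INTERVAL) {
				printf("jiffy %2lu: balance (= jiffies)\n", jiffies);
				last_jiffies = jiffies;
			}
		}
		return 0;
	}

With += interval the balance that was due at 20 is "made up" with
back-to-back balances at 30 and 31; with = jiffies there is a single
balance at 30 and the next one at 40.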



2004-11-16 01:17:13

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update



Darren Hart wrote:

>The current rebalance_tick() code advances each sched_domain's
>last_balance field by interval after performing a load_balance. If
>interval is 10, this has the effect of saying: we want to run
>load_balance at time = 10, 20, 30, 40, etc... If, for example,
>last_balance=10 and for some reason rebalance_tick can't be run until
>30, load_balance will be called and last_balance will be updated to 20,
>causing it to call load_balance again immediately the next time it is
>called, since the interval is 10 and we are already at >30. It seems to
>me that it would make much more sense for last_balance to be assigned
>jiffies after a load_balance, so that the meaning of last_balance is
>more exact: "this domain was last balanced at jiffies" rather than "we
>last handled the balance we were supposed to do at 20, at some
>indeterminate time". The following patch makes this change.
>
>

Hi Darren,

This is how I first implemented it... but I think this will cause
rebalance points of each processor to tend to become synchronised
(rather than staggered) as ticks get lost.

2004-11-16 01:53:23

by Matthew Dobson

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

On Mon, 2004-11-15 at 17:17, Nick Piggin wrote:
> Darren Hart wrote:
>
> >The current rebalance_tick() code advances each sched_domain's
> >last_balance field by interval after performing a load_balance. If
> >interval is 10, this has the effect of saying: we want to run
> >load_balance at time = 10, 20, 30, 40, etc... If, for example,
> >last_balance=10 and for some reason rebalance_tick can't be run until
> >30, load_balance will be called and last_balance will be updated to 20,
> >causing it to call load_balance again immediately the next time it is
> >called, since the interval is 10 and we are already at >30. It seems to
> >me that it would make much more sense for last_balance to be assigned
> >jiffies after a load_balance, so that the meaning of last_balance is
> >more exact: "this domain was last balanced at jiffies" rather than "we
> >last handled the balance we were supposed to do at 20, at some
> >indeterminate time". The following patch makes this change.
> >
> >
>
> Hi Darren,
>
> This is how I first implemented it... but I think this will cause
> rebalance points of each processor to tend to become synchronised
> (rather than staggered) as ticks get lost.


But isn't that what this is supposed to stop:

unsigned long j = jiffies + CPU_OFFSET(this_cpu);
....
if (j - sd->last_balance >= interval) {
	if (load_balance(this_cpu, this_rq, sd, idle)) {
		/* We've pulled tasks over so no longer idle */
		idle = NOT_IDLE;
	}
	sd->last_balance += interval;
}

The CPU_OFFSET() macro is designed to spread out the balancing so they
don't all occur at the same time, no?
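
(For reference, as far as I can tell the macro in the 2.6.10-era
sched.c is just a fixed per-CPU phase offset, something like the
following; quoted from memory, so treat it as approximate:)

	/* Each CPU's view of "j" is shifted by a constant fraction of HZ,
	 * which staggers the points at which the interval check fires. */
	#define CPU_OFFSET(cpu)		(HZ * cpu / NR_CPUS)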

-Matt

2004-11-16 02:18:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update



Matthew Dobson wrote:

>On Mon, 2004-11-15 at 17:17, Nick Piggin wrote:
>
>>Darren Hart wrote:
>>
>>
>>>The current rebalance_tick() code advances each sched_domain's
>>>last_balance field by interval after performing a load_balance. If
>>>interval is 10, this has the effect of saying: we want to run
>>>load_balance at time = 10, 20, 30, 40, etc... If, for example,
>>>last_balance=10 and for some reason rebalance_tick can't be run until
>>>30, load_balance will be called and last_balance will be updated to 20,
>>>causing it to call load_balance again immediately the next time it is
>>>called, since the interval is 10 and we are already at >30. It seems to
>>>me that it would make much more sense for last_balance to be assigned
>>>jiffies after a load_balance, so that the meaning of last_balance is
>>>more exact: "this domain was last balanced at jiffies" rather than "we
>>>last handled the balance we were supposed to do at 20, at some
>>>indeterminate time". The following patch makes this change.
>>>
>>>
>>>
>>Hi Darren,
>>
>>This is how I first implemented it... but I think this will cause
>>rebalance points of each processor to tend to become synchronised
>>(rather than staggered) as ticks get lost.
>>
>
>
>But isn't that what this is supposed to stop:
>
> unsigned long j = jiffies + CPU_OFFSET(this_cpu);
>....
> if (j - sd->last_balance >= interval) {
> 	if (load_balance(this_cpu, this_rq, sd, idle)) {
> 		/* We've pulled tasks over so no longer idle */
> 		idle = NOT_IDLE;
> 	}
> 	sd->last_balance += interval;
> }
>
>The CPU_OFFSET() macro is designed to spread out the balancing so they
>don't all occur at the same time, no?
>
>

Yes, but if you balance n ticks since the last _rebalance_, then things will
be able to drift. Let's say 2 CPUs, they balance at 10 jiffies intervals,
5 jiffies apart:

jiffy   CPU0                    CPU1
0       rebalance (next, 10)
5                               rebalance (next, 15)
10      rebalance (next, 20)
15                              rebalance can't be run until
                                30 as per Darren's example.
20      rebalance (next, 30)
30      rebalance (next, 40)    rebalance (next, 40)

So CPU0 and CPU1 are now synchronised. Having the next balance be calculated
from the _current_ time leaves you open to all sorts of these drift issues.

Another example, in some ticks, a CPU won't see the updated 'jiffies', other
times it will (at least on Altix systems, this can happen).
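
A quick user-space sketch of that drift (the program, constants and
lost-tick window are made up purely for illustration; it is not the
kernel code):

	/*
	 * Two CPUs, 10-jiffy interval, CPU1 staggered by 5 jiffies (a
	 * stand-in for CPU_OFFSET), and CPU1 unable to run its tick from
	 * 15..29.  With "last_balance = jiffies" both CPUs end up
	 * rebalancing on the same jiffy from 30 onwards.
	 */
	#include <stdio.h>

	#define INTERVAL 10
	#define NCPUS    2

	int main(void)
	{
		unsigned long jiffies, last[NCPUS] = { 0, 0 };
		unsigned long offset[NCPUS] = { 0, 5 };
		int cpu;

		for (jiffies = 0; jiffies <= 60; jiffies++) {
			for (cpu = 0; cpu < NCPUS; cpu++) {
				unsigned long j = jiffies + offset[cpu];

				if (cpu == 1 && jiffies >= 15 && jiffies < 30)
					continue;       /* lost ticks */

				if (j - last[cpu] >= INTERVAL) {
					printf("jiffy %2lu: CPU%d rebalance\n",
					       jiffies, cpu);
					last[cpu] = j;  /* "= jiffies" policy */
				}
			}
		}
		return 0;
	}

CPU0 fires at 10, 20, 30, ...; CPU1 fires at 5 and then, once the lost
ticks end, at 30, 40, 50 - right on top of CPU0.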

2004-11-16 02:27:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update



Nick Piggin wrote:

>
> Another example, in some ticks, a CPU won't see the updated 'jiffies',
> other
> times it will (at least on Altix systems, this can happen).
>
>

Note that if you didn't want to have this rash of balancing attempted after
a CPU wasn't able to run the rebalance for a long time, the solution would
be to keep adding the balance interval until it becomes greater than the
current jiffies.

I actually prefer it to try to make up the lost balances, just from the
perspective of gathering scheduler statistics. I don't suspect it happens
enough to justify adding the extra logic - Darren, are you actually seeing
problems?
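
Something along these lines, say (just a sketch against the
rebalance_tick() code quoted earlier in the thread, not compile-tested):

	if (j - sd->last_balance >= interval) {
		if (load_balance(this_cpu, this_rq, sd, idle)) {
			/* We've pulled tasks over so no longer idle */
			idle = NOT_IDLE;
		}
		sd->last_balance += interval;
		/* skip over any intervals we missed entirely, rather
		 * than replaying them on the following ticks */
		while (j - sd->last_balance >= interval)
			sd->last_balance += interval;
	}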

2004-11-16 03:50:19

by Darren Hart

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

On Tue, 2004-11-16 at 13:27 +1100, Nick Piggin wrote:
>
> Nick Piggin wrote:
>
> >
> > Another example, in some ticks, a CPU won't see the updated 'jiffies',
> > other
> > times it will (at least on Altix systems, this can happen).
> >
> >
>
> Note that if you didn't want to have this rash of balancing attempted after
> a CPU wasn't able to run the rebalance for a long time, the solution would
> be to keep adding the balance interval until it becomes greater than the
> current jiffies.

As I mentioned in my last post, I don't think the "synchronized
rebalancing" is a real concern since the interval isn't likely to be the
same and the CPU_OFFSET macro is already in place to prevent this "rash
of balancing" (nice term :-).

> I actually prefer it to try to make up the lost balances, just from the
> perspective of gathering scheduler statistics.

IMO, scheduler statistics alone are not worth running load_balance()
for. (And running it two or three times in a row is clearly not
accomplishing anything.)

> I don't suspect it happens
> enough to justify adding the extra logic - Darren, are you actually seeing
> problems?

Not seeing any obvious problems, but the existing logic seems incorrect
to me (and the name last_balance is currently misleading). Running
load_balance() multiple times in order to catch up seems wasteful to me
as well. The current code says something like: run load_balance() 10
times in a second. If the second is almost up and you have only run it
6 times, it will run it 4 times in a row, which just seems wrong to me.


--
Darren Hart <[email protected]>

2004-11-16 03:53:58

by Darren Hart

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

On Tue, 2004-11-16 at 13:17 +1100, Nick Piggin wrote:
>
> Matthew Dobson wrote:
>
> >On Mon, 2004-11-15 at 17:17, Nick Piggin wrote:
> >
> >>Darren Hart wrote:
> >>
> >>
> >>>The current rebalance_tick() code advances each sched_domain's
> >>>last_balance field by interval after performing a load_balance. If
> >>>interval is 10, this has the effect of saying: we want to run
> >>>load_balance at time = 10, 20, 30, 40, etc... If, for example,
> >>>last_balance=10 and for some reason rebalance_tick can't be run until
> >>>30, load_balance will be called and last_balance will be updated to 20,
> >>>causing it to call load_balance again immediately the next time it is
> >>>called, since the interval is 10 and we are already at >30. It seems to
> >>>me that it would make much more sense for last_balance to be assigned
> >>>jiffies after a load_balance, so that the meaning of last_balance is
> >>>more exact: "this domain was last balanced at jiffies" rather than "we
> >>>last handled the balance we were supposed to do at 20, at some
> >>>indeterminate time". The following patch makes this change.
> >>>
> >>>
> >>>
> >>Hi Darren,
> >>
> >>This is how I first implemented it... but I think this will cause
> >>rebalance points of each processor to tend to become synchronised
> >>(rather than staggered) as ticks get lost.
> >>
> >
> >
> >But isn't that what this is supposed to stop:
> >
> > unsigned long j = jiffies + CPU_OFFSET(this_cpu);
> >....
> > if (j - sd->last_balance >= interval) {
> > 	if (load_balance(this_cpu, this_rq, sd, idle)) {
> > 		/* We've pulled tasks over so no longer idle */
> > 		idle = NOT_IDLE;
> > 	}
> > 	sd->last_balance += interval;
> > }
> >
> >The CPU_OFFSET() macro is designed to spread out the balancing so they
> >don't all occur at the same time, no?
> >
> >
>
> Yes, but if you balance n ticks since the last _rebalance_, then things will
> be able to drift. Let's say 2 CPUs, they balance at 10 jiffies intervals,
> 5 jiffies apart:
>
> jiffy   CPU0                    CPU1
> 0       rebalance (next, 10)
> 5                               rebalance (next, 15)
> 10      rebalance (next, 20)
> 15                              rebalance can't be run until
>                                 30 as per Darren's example.
> 20      rebalance (next, 30)
> 30      rebalance (next, 40)    rebalance (next, 40)
>

This example didn't take into account:
unsigned long j = jiffies + CPU_OFFSET(this_cpu);
Which, even if the last_balance values were equal and the intervals were the same
(unlikely, since each CPU has its own domain and the intervals are
adjusted independently?), would prevent them from both running on the
same timer tick. So I don't think this example is accurate. On the
other hand, even if it was valid, I would prefer this to running the
load_balance code on the same CPU and domain several ticks in a row
(which obviously accomplishes nothing).


--
Darren Hart <[email protected]>

2004-11-16 07:00:39

by Rick Lindsley

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

> Yes, but if you balance n ticks since the last _rebalance_, then things will
> be able to drift. Let's say 2 CPUs, they balance at 10 jiffies intervals,
> 5 jiffies apart:

[example deleted]

First, why wasn't the rebalance run when it was supposed to be run?

The answer isn't really all that important. Regardless of the answer,
if this can happen once, I presume it can happen more than once. Despite
our best efforts, rebalancing ran in the "wrong timeslot". Aside from the
CPU_OFFSET math, it would seem to me that this inexplicable delay, itself,
introduces enough variance to make any hope of keeping the cpus strictly
unsynchronized (!!) something of a pipe dream. We start them out of sync, and
know that they'll drift. We can correct them, but they'll drift again.
Why not just acknowledge that behavior, instead of putting effort into
(unsuccessfully) countering it?

Second, it's of marginal usefulness anyway when you consider that
rebalance_tick() is only called with HZ granularity, except for
unpredictable times from fork(). You're going to be doubling up no matter
what you do once you get more than 16 siblings (see SD_SIBLING_INIT:
busy_factor * max_interval).

Lastly, why do we care? To turn the question around, were we seeing load
balancing colliding without this attempt at forced spacing? And if so,
what was the effect (and patterns) of those collisions? Having a variable
named last_balance which represents when we last *would have liked to
have* balanced seems far less useful than knowing when we really did.

Rick

2004-11-16 15:57:58

by Darren Hart

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

> >This example didn't take into account:
> > unsigned long j = jiffies + CPU_OFFSET(this_cpu);
> >Which, even if the last_balance values were equal and the intervals were the same
> >(unlikely, since each CPU has its own domain and the intervals are
> >adjusted independently?),
> >
>
> That is correct.
>
> > would prevent them from both running on the
> >same timer tick. So I don't think this example is accurate. On the
> >other hand, even if it was valid, I would prefer this to running the
> >load_balance code on the same CPU and domain several ticks in a row
> >(which obviously accomplishes nothing).
> >
>
> It is valid because the CPU_OFFSET basically just adds a constant amount
> to each CPU's 'jiffies'. My example took that into account already and
> used the absolute jiffies value. If that doesn't convince you, pretend
> that the delay were to occur to CPU0, whose CPU_OFFSET == 0.


Heh, I thought of exactly that case when I rolled out of bed this
morning. Funny how that happens sometimes, you can be sitting in front
of the code and make what you think is a perfectly valid analysis and
then wake up the next morning with no context at all and realize you
were wrong. That happens to me anyway :-), I can't be the only one...

Nevertheless, that is a pretty unlikely corner case. They double up
only when last_balance+interval is the same and one of the CPUs involved
is CPU 0.

>
>
--
Darren Hart <[email protected]>

2004-11-18 16:23:55

by Darren Hart

[permalink] [raw]
Subject: Re: [patch] scheduler: rebalance_tick interval update

On Tue, 2004-11-16 at 07:56 -0800, Darren Hart wrote:
> > >This example didn't take into account:
> > > unsigned long j = jiffies + CPU_OFFSET(this_cpu);
> > >Which, even if the last_balance values were equal and the intervals were the same
> > >(unlikely, since each CPU has its own domain and the intervals are
> > >adjusted independently?),
> > >
> >
> > That is correct.
> >
> > > would prevent them from both running on the
> > >same timer tick. So I don't think this example is accurate. On the
> > >other hand, even if it was valid, I would prefer this to running the
> > >load_balance code on the same CPU and domain several ticks in a row
> > >(which obviously accomplishes nothing).
> > >
> >
> > It is valid because the CPU_OFFSET basically just adds a constant amount
> > to each CPU's 'jiffies'. My example took that into account already and
> > used the absolute jiffies value. If that doesn't convince you, pretend
> > that the delay were to occur to CPU0, whose CPU_OFFSET == 0.
>
>
> Heh, I thought of exactly that case when I rolled out of bed this
> morning. Funny how that happens sometimes, you can be sitting in front
> of the code and make what you think is a perfectly valid analysis and
> then wake up the next morning with no context at all and realize you
> were wrong. That happens to me anyway :-), I can't be the only one...
>
> Nevertheless, that is a pretty unlikely corner case. They double up
> only when last_balance+interval is the same and one of the CPUs involved
> is CPU 0.
>


Hey Nick,

I haven't heard anything on this thread in a couple of days; do you
agree we should make the change? I feel there are legitimate reasons to
use =jiffies instead of +=interval for the ->last_balance assignment.
I have already elaborated on these and provided counterarguments to the
ones you posted, which I think make a strong case. If you still oppose
the change, what would I need to provide to convince you?


--
Darren Hart <[email protected]>