2003-08-24 17:17:30

by Erich Focht

Subject: [patch 2.6.0t4] 1 cpu/node scheduler fix

This is the 1 cpu/node fix of the NUMA scheduler rewritten for the new
cpumask handling. The previous version was a bit too aggressive with
cross node balancing so I changed the default timings a bit such that
the behavior is very similar to the old one.

Here is what the patch does:
- Links the frequency of cross-node balances to the number of failed
local balance attempts. This simplifies the code by removing the
overly rigid dependency of cross-node balancing on the timer
interrupts.

- Fixes the 1 CPU/node issue, i.e. eliminates local balance attempts
for nodes which have only one CPU. This can happen on any NUMA
platform (I'm playing around with a 2 CPU/node box that has a flaky
CPU, so I sometimes have a node with only one CPU) and is a major
issue on AMD64.

- Makes the cross-node balance frequency tunable by the parameter
NUMA_FACTOR_BONUS. Its default setting is such that the scheduler
behaves like before: cross node balance every 5 local node balances on
an idle CPU, every 2 local node balances on a busy CPU. This parameter
should be tuned for each platform depending on its NUMA factor.
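
A minimal sketch of the counting scheme described in the list above, as
a standalone C model and not the code from the patch. The struct, the
function names and the two rate constants are hypothetical; the values
5 and 2 are the defaults quoted above, and how NUMA_FACTOR_BONUS maps
onto them in the real patch is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the default rates quoted in the mail. */
#define IDLE_NODE_BALANCE_RATE 5   /* failed local balances before going cross-node, idle CPU */
#define BUSY_NODE_BALANCE_RATE 2   /* same, busy CPU */

enum balance_scope { BALANCE_LOCAL, BALANCE_CROSS_NODE };

/* Toy per-CPU bookkeeping; not a kernel structure. */
struct node_state {
    int cpus_in_node;   /* online CPUs in this CPU's node */
    int failed_local;   /* consecutive failed local balance attempts */
};

/* Pick the scope of the next balance attempt. */
static enum balance_scope next_scope(const struct node_state *ns, bool idle)
{
    int limit = idle ? IDLE_NODE_BALANCE_RATE : BUSY_NODE_BALANCE_RATE;

    assert(ns->cpus_in_node >= 1);

    /* A node with a single CPU has nothing to balance locally. */
    if (ns->cpus_in_node == 1)
        return BALANCE_CROSS_NODE;

    return ns->failed_local >= limit ? BALANCE_CROSS_NODE : BALANCE_LOCAL;
}

/* Update the failure counter after an attempt. */
static void record_result(struct node_state *ns, enum balance_scope scope,
                          bool moved_task)
{
    if (moved_task || scope == BALANCE_CROSS_NODE)
        ns->failed_local = 0;   /* success, or a cross-node try: start over */
    else
        ns->failed_local++;     /* another failed local attempt */
}
```

In this model the timer tick and the going-idle path would both call
the same pair of functions, which is the "single point of entry" idea:
the caller only supplies the idle flag.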

This simple patch is not meant as opposition to Andrew's attempt to
NUMAize the whole scheduler. That one will definitely make NUMA
coders' lives easier but I fear that it is a bit too complex for
2.6. The attached small incremental change is sufficient to solve the
main problem. Besides, the change of the cross-node scheduling is
compatible with Andrew's scheduler structure. I really don't like the
timer-based cross-node balancing because it is too inflexible (no way
to have different balancing intervals for each node) and I'd really
like to get back to just one single point of entry for load balancing:
the routine load_balance(), no matter whether we balance inside a
timer interrupt or while the CPU is going idle.

Erich



Attachments:
(No filename) (1.79 kB)
1cpufix-2.6.0t4.patch (5.15 kB)

2003-08-25 08:15:50

by Ingo Molnar

Subject: Re: [patch 2.6.0t4] 1 cpu/node scheduler fix


On Sun, 24 Aug 2003, Erich Focht wrote:

> This simple patch is not meant as opposition to Andrew's attempt to
> NUMAize the whole scheduler. That one will definitely make NUMA coders'
> lives easier but I fear that it is a bit too complex for 2.6. The
> attached small incremental change is sufficient to solve the main
> problem. Besides, the change of the cross-node scheduling is compatible
> with Andrew's scheduler structure. I really don't like the timer-based
> cross-node balancing because it is too inflexible (no way to have
> different balancing intervals for each node) and I'd really like to get
> back to just one single point of entry for load balancing: the routine
> load_balance(), no matter whether we balance inside a timer interrupt or
> while the CPU is going idle.

your patch clearly simplifies things. Would you mind also extending the
rebalance-frequency based balancing to the SMP scheduler, to see what
effect that has? I.e. remove much of the 'tick' component from the SMP
scheduler as well, and make it purely frequency based.

I'm still afraid of balancing artifacts if we lose track of time (which
the tick thing is about, and which cache affinity is a function of), but
maybe not. It would certainly unify things more. If it doesn't work out
then we can still do your stuff for the cross-node balancing only.

Ingo

2003-08-25 15:55:04

by Andrew Theurer

Subject: Re: [patch 2.6.0t4] 1 cpu/node scheduler fix

> This simple patch is not meant as opposition to Andrew's attempt to
> NUMAize the whole scheduler. That one will definitely make NUMA
> coders' lives easier but I fear that it is a bit too complex for
> 2.6.

I would agree, it's probably too much to change this late in 2.6. Eventually
(2.7) I think we should revisit this and try for a more unified approach.

> The attached small incremental change is sufficient to solve the
> main problem. Besides, the change of the cross-node scheduling is
> compatible with Andrew's scheduler structure. I really don't like the
> timer-based cross-node balancing because it is too inflexible (no way
> to have different balancing intervals for each node) and I'd really
> like to get back to just one single point of entry for load balancing:
> the routine load_balance(), no matter whether we balance inside a
> timer interrupt or while the CPU is going idle.

Looks good to me. One other thing your patch fixes: once in a while we
called load_balance in rebalance_tick with the wrong 'idle' value.
Occasionally, on a rebalance tick that was both an idle_node and an
idle_cpu tick, the idle cpu would [possibly] pull a task and become
non-idle; we would then call load_balance again, still with idle=1,
for the intranode balance.
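
The stale 'idle' value described above can be illustrated with a small
standalone C model. Everything in it is hypothetical (toy_cpu,
toy_balance and the two tick functions are not kernel code); it only
shows why sampling the idle state once per tick differs from sampling
it before each balance call.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU's run queue; not the kernel's runqueue. */
struct toy_cpu {
    int nr_running;
};

/*
 * Toy balancer: pulls one task from the source if it has any, and
 * counts how many times it was invoked with the idle flag set.
 */
static void toy_balance(struct toy_cpu *cpu, int *source_tasks, bool idle,
                        int *idle_calls)
{
    assert(cpu->nr_running >= 0 && *source_tasks >= 0);

    if (idle)
        (*idle_calls)++;
    if (*source_tasks > 0) {
        (*source_tasks)--;
        cpu->nr_running++;
    }
}

/* Idle state sampled once: the second call may see a stale flag. */
static int tick_stale_idle(struct toy_cpu *cpu, int *remote, int *local)
{
    int idle_calls = 0;
    bool idle = (cpu->nr_running == 0);

    toy_balance(cpu, remote, idle, &idle_calls);  /* cross-node pass */
    toy_balance(cpu, local, idle, &idle_calls);   /* intranode pass */
    return idle_calls;
}

/* Idle state re-evaluated before every call. */
static int tick_fresh_idle(struct toy_cpu *cpu, int *remote, int *local)
{
    int idle_calls = 0;

    toy_balance(cpu, remote, cpu->nr_running == 0, &idle_calls);
    toy_balance(cpu, local, cpu->nr_running == 0, &idle_calls);
    return idle_calls;
}
```

With one task available on each source and an initially empty CPU, the
stale variant runs both passes as "idle", while the fresh variant runs
only the first one that way.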

2003-08-25 17:50:13

by Martin J. Bligh

Subject: Re: [patch 2.6.0t4] 1 cpu/node scheduler fix

> This is the 1 cpu/node fix of the NUMA scheduler rewritten for the new
> cpumask handling. The previous version was a bit too aggressive with
> cross node balancing so I changed the default timings a bit such that
> the behavior is very similar to the old one.
>
> Here is what the patch does:
> - Links the frequency of cross-node balances to the number of failed
> local balance attempts. This simplifies the code by removing the
> overly rigid dependency of cross-node balancing on the timer
> interrupts.
>
> - Fixes the 1 CPU/node issue, i.e. eliminates local balance attempts
> for nodes which have only one CPU. This can happen on any NUMA
> platform (I'm playing around with a 2 CPU/node box that has a flaky
> CPU, so I sometimes have a node with only one CPU) and is a major
> issue on AMD64.
>
> - Makes the cross-node balance frequency tunable by the parameter
> NUMA_FACTOR_BONUS. Its default setting is such that the scheduler
> behaves like before: cross node balance every 5 local node balances on
> an idle CPU, every 2 local node balances on a busy CPU. This parameter
> should be tuned for each platform depending on its NUMA factor.

This seems to clear up the low end stuff I was seeing before - thanks.

Did you (or anyone else) get a chance to test this on AMD? Would
be nice to confirm that's fixed ...

M.

2003-09-02 10:58:52

by Erich Focht

Subject: Re: [patch 2.6.0t4] 1 cpu/node scheduler fix

Hi Ingo,

thanks for the reply and sorry for the late reaction. I've been away
for a week and (unexpectedly) didn't have email/internet connectivity
during this time.

On Monday 25 August 2003 10:13, Ingo Molnar wrote:
> On Sun, 24 Aug 2003, Erich Focht wrote:
> > different balancing intervals for each node) and I'd really like to get
> > back to just one single point of entry for load balancing: the routine
> > load_balance(), no matter whether we balance inside a timer interrupt or
> > while the CPU is going idle.
>
> your patch clearly simplifies things. Would you mind also extending the
> rebalance-frequency based balancing to the SMP scheduler, to see what
> effect that has? I.e. remove much of the 'tick' component from the SMP
> scheduler as well, and make it purely frequency based.

Actually I didn't remove the tick-based load-balancing, just
simplified it and wanted to have only one load_balance() call instead of
separate load_balance(), node_balance(),
load_balance_in_timer_interrupt_when_idle() and
load_balance_in_timer_interrupt_when_busy().

The current change keeps track of the number of failed load_balance
attempts and does cross-node balancing after a certain number of
failed attempts. If I understand you correctly you'd like to test this
concept also on a lower level? So keep track of the number of
schedules and context switches and call the SMP load balancer after a
certain number of schedules? This could work, though on idle CPUs I'm
more comfortable with the timer tick... The logic of cpu_idle() would
need a change, too. Do you expect an impact on the latency issue?
Would you mind doing these changes in two steps: first the simple NUMA
one, later the radical SMP change?

> I'm still afraid of balancing artifacts if we lose track of time (which
> the tick thing is about, and which cache affinity is a function of), but
> maybe not. It would certainly unify things more. If it doesn't work out
> then we can still do your stuff for the cross-node balancing only.

With the small fix for NUMA there's no problem because the
CAN_MIGRATE_TASK macro really compares times/jiffies to get an
idea of the cache coolness. This wouldn't change even with the
elimination of the timer-based load_balance calls.
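
A sketch of the kind of time comparison meant here, written as a
standalone C function and not copied from the kernel's
CAN_MIGRATE_TASK. The struct, the field names, the cache_decay_ticks
parameter and the rule that an idle CPU skips the coolness test are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy task descriptor; field names are illustrative only. */
struct toy_task {
    unsigned long last_run;      /* tick count when the task last ran */
    bool running;                /* currently executing on some CPU */
    unsigned long cpus_allowed;  /* bitmask of CPUs the task may use */
};

/*
 * Decide whether task p may be pulled to this_cpu at time `now`.
 * The task counts as cache-cool once more than cache_decay_ticks
 * have passed since it last ran. Unsigned subtraction keeps the
 * comparison correct across tick counter wraparound.
 */
static bool can_migrate(const struct toy_task *p, unsigned long now,
                        unsigned long cache_decay_ticks, bool idle,
                        int this_cpu)
{
    bool cache_cool;
    bool allowed;

    assert(this_cpu >= 0 && this_cpu < (int)(8 * sizeof(unsigned long)));

    cache_cool = (now - p->last_run) > cache_decay_ticks;
    allowed = ((p->cpus_allowed >> this_cpu) & 1UL) != 0;

    /* Assumption: an idle CPU takes work even if it is still cache-hot. */
    return (idle || cache_cool) && !p->running && allowed;
}
```

The point of the sketch is that the decision depends on elapsed time
since the task last ran, not on how the balance call was triggered, so
it behaves the same whether the caller is a timer tick or a failure
counter.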

Regards,
Erich