Hi,
after talking to several people at OLS about the current NUMA
scheduler the conclusion was:
(1) it sucks (for particular workloads),
(2) on x86_64 (embarassingly simple NUMA) it's useless, goto (1).
Fact is that the current separation of local and global balancing,
where global balancing is done only in the timer interrupt at a fixed
rate is way too unflexible. A CPU going idle inside a well balanced
node will stay idle for a while even if there's a lot of work to
do. Especially in the corner case of one CPU per node this is
condemning that CPU to idleness for at least 5 ms. So x86_64 platforms
(but not only those!) suffer and whish to switch off the NUMA
scheduler while keeping NUMA memory management on.
The attached patch is a simple solution which
- solves the 1 CPU / node problem,
- lets other systems behave (almost) as before,
- opens the way to other optimisations like multi-level node
hierarchies (by tuning the retry rate)
- simpifies the NUMA scheduler and deletes more lines of code than it
adds.
The timer interrupt based global rebalancing might appear to be a
simple and good idea but it takes the scheduler a lot of
flexibility. In the patch the global rebalancing is done after a
certain number of failed attempts to locally balance. The number of
attempts is proportional to the number of CPUs in the current
node. For only 1 CPU in the current node the scheduler doesn't even
try to balance locally, it wouldn't make sense anyway. Of course one
could instead set IDLE_NODE_REBALANCE_TICK = IDLE_REBALANCE_TICK, but
this is more ugly (IMHO) and only helps when all nodes have 1 CPU /
node.
Please consider this for inclusion.
Thanks,
Erich
> after talking to several people at OLS about the current NUMA
> scheduler the conclusion was:
> (1) it sucks (for particular workloads),
> (2) on x86_64 (embarassingly simple NUMA) it's useless, goto (1).
I really feel there's no point in a NUMA scheduler for the Hammer
style architectures. A config option to turn it off would seem like
a simpler way to go, unless people can see some advantage of the
full NUMA code?
The interesting thing is probably whether we want balance on exec
or not ... but that probably applies to UMA SMP as well ...
> Fact is that the current separation of local and global balancing,
> where global balancing is done only in the timer interrupt at a fixed
> rate is way too unflexible. A CPU going idle inside a well balanced
> node will stay idle for a while even if there's a lot of work to
> do. Especially in the corner case of one CPU per node this is
> condemning that CPU to idleness for at least 5 ms.
Surely it'll hit the idle local balancer and rebalance within the node?
Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
> So x86_64 platforms
> (but not only those!) suffer and whish to switch off the NUMA
> scheduler while keeping NUMA memory management on.
Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED)
... I'll feed that upstream this week.
> The attached patch is a simple solution which
> - solves the 1 CPU / node problem,
> - lets other systems behave (almost) as before,
> - opens the way to other optimisations like multi-level node
> hierarchies (by tuning the retry rate)
> - simpifies the NUMA scheduler and deletes more lines of code than it
> adds.
Looks simple, I'll test it out.
> The timer interrupt based global rebalancing might appear to be a
> simple and good idea but it takes the scheduler a lot of
> flexibility. In the patch the global rebalancing is done after a
> certain number of failed attempts to locally balance. The number of
> attempts is proportional to the number of CPUs in the current
> node.
Seems like a good plan.
M.
On Monday 28 July 2003 21:55, Martin J. Bligh wrote:
> > after talking to several people at OLS about the current NUMA
> > scheduler the conclusion was:
> > (1) it sucks (for particular workloads),
> > (2) on x86_64 (embarassingly simple NUMA) it's useless, goto (1).
>
> I really feel there's no point in a NUMA scheduler for the Hammer
> style architectures. A config option to turn it off would seem like
> a simpler way to go, unless people can see some advantage of the
> full NUMA code?
But the Hammer is a NUMA architecture and a working NUMA scheduler
should be flexible enough to deal with it. And: the corner case of 1
CPU per node is possible also on any other NUMA platform, when in some
of the nodes (or even just one) only one CPU is configured in. Solving
that problem automatically gives the Hammer what it needs.
> > Fact is that the current separation of local and global balancing,
> > where global balancing is done only in the timer interrupt at a fixed
> > rate is way too unflexible. A CPU going idle inside a well balanced
> > node will stay idle for a while even if there's a lot of work to
> > do. Especially in the corner case of one CPU per node this is
> > condemning that CPU to idleness for at least 5 ms.
>
> Surely it'll hit the idle local balancer and rebalance within the node?
> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
No, no, the local load balancer just doesn't make sense now if you
have one cpu per node. It is called although it will never find
another cpu in the own cell to steal from. There just is nothing
else... (or did I misunderstand your comment?)
> > So x86_64 platforms
> > (but not only those!) suffer and whish to switch off the NUMA
> > scheduler while keeping NUMA memory management on.
>
> Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED)
> ... I'll feed that upstream this week.
That's one way, but the proposed patch just solves the problem (in a
more general way, also for other NUMA cases). If you deconfigure NUMA
for a NUMA platform, we'll have problems switching it back on when
adding smarter things like node affine or homenode extensions.
> > The attached patch is a simple solution which
> > - solves the 1 CPU / node problem,
> > - lets other systems behave (almost) as before,
> > - opens the way to other optimisations like multi-level node
> > hierarchies (by tuning the retry rate)
> > - simpifies the NUMA scheduler and deletes more lines of code than it
> > adds.
>
> Looks simple, I'll test it out.
Great! Thanks!
Erich
> But the Hammer is a NUMA architecture and a working NUMA scheduler
> should be flexible enough to deal with it. And: the corner case of 1
> CPU per node is possible also on any other NUMA platform, when in some
> of the nodes (or even just one) only one CPU is configured in. Solving
> that problem automatically gives the Hammer what it needs.
OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if
you really want to say all that ;-)
The question is "does Hammer benefit from the additional complexity"?
I'm guessing not ... if so, then yes, it's worth fixing. If not, it
would seem better to just leave it at SMP for the scheduler stuff.
Simpler, more shared code with common systems.
>> > Fact is that the current separation of local and global balancing,
>> > where global balancing is done only in the timer interrupt at a fixed
>> > rate is way too unflexible. A CPU going idle inside a well balanced
>> > node will stay idle for a while even if there's a lot of work to
>> > do. Especially in the corner case of one CPU per node this is
>> > condemning that CPU to idleness for at least 5 ms.
>>
>> Surely it'll hit the idle local balancer and rebalance within the node?
>> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
>
> No, no, the local load balancer just doesn't make sense now if you
> have one cpu per node. It is called although it will never find
> another cpu in the own cell to steal from. There just is nothing
> else... (or did I misunderstand your comment?)
Right, I realise that the 1 cpu per node case is broken. However, doesn't
this also affect the multi-cpu per node case in the manner I suggested
above? So even if we turn off NUMA scheduler for Hammer, this still
needs fixing, right?
> That's one way, but the proposed patch just solves the problem (in a
> more general way, also for other NUMA cases). If you deconfigure NUMA
> for a NUMA platform, we'll have problems switching it back on when
> adding smarter things like node affine or homenode extensions.
Yeah, maybe. I'd rather have the code in hand that needs it, but still ...
if Andi's happy that fixes it, then we're OK.
M.
On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > should be flexible enough to deal with it. And: the corner case of 1
> > CPU per node is possible also on any other NUMA platform, when in some
> > of the nodes (or even just one) only one CPU is configured in. Solving
> > that problem automatically gives the Hammer what it needs.
I am going to ask a silly question, do we have any data showing this really is
a problem on AMD? I would think, even if we have an idle cpu, sometimes a
little delay on task migration (on NUMA) may not be a bad thing. If it is
too long, can we just make the rebalance ticks arch specific?
I'd much rather have info related to the properties of the NUMA arch than
something that makes decisions based on nr_cpus_node(). For example, we may
want to inter-node balance as much or more often on ppc64 than even AMD, but
it has 8 cpus per node. On this patch it would has the lowest inter-node
balance frequency, even though it probably has one of the lowest latencies
between nodes and highest throughput interconnects.
> OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if
> you really want to say all that ;-)
>
> The question is "does Hammer benefit from the additional complexity"?
> I'm guessing not ... if so, then yes, it's worth fixing. If not, it
> would seem better to just leave it at SMP for the scheduler stuff.
> Simpler, more shared code with common systems.
>
> >> > Fact is that the current separation of local and global balancing,
> >> > where global balancing is done only in the timer interrupt at a fixed
> >> > rate is way too unflexible. A CPU going idle inside a well balanced
> >> > node will stay idle for a while even if there's a lot of work to
> >> > do. Especially in the corner case of one CPU per node this is
> >> > condemning that CPU to idleness for at least 5 ms.
> >>
> >> Surely it'll hit the idle local balancer and rebalance within the node?
> >> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
> >
> > No, no, the local load balancer just doesn't make sense now if you
> > have one cpu per node. It is called although it will never find
> > another cpu in the own cell to steal from. There just is nothing
> > else... (or did I misunderstand your comment?)
>
> Right, I realise that the 1 cpu per node case is broken. However, doesn't
> this also affect the multi-cpu per node case in the manner I suggested
> above? So even if we turn off NUMA scheduler for Hammer, this still
> needs fixing, right?
Maybe so, but if we start making idle rebalance more aggressive, I think we
would need to make CAN_MIGRATE more restrictive, taking memory placement of
the tasks in to account. On AMD with interleaved memory allocation, tasks
would move very easily, since their memory is spread out anyway. On "home
node" or node-local policy, we may not move a task (or maybe not on the first
attempt), even if there is an idle cpu in another node.
>> That's one way, but the proposed patch just solves the problem (in a
>> more general way, also for other NUMA cases). If you deconfigure NUMA
>> for a NUMA platform, we'll have problems switching it back on when
>> adding smarter things like node affine or homenode extensions.
>
>Yeah, maybe. I'd rather have the code in hand that needs it, but still ...
>if Andi's happy that fixes it, then we're OK.
Personally, I'd like to see all systems use NUMA sched, non NUMA systems being
a single node (no policy difference from non-numa sched), allowing us to
remove all NUMA ifdefs. I think the code would be much more readable.
-Andrew Theurer
On Tuesday 29 July 2003 04:24, Andrew Theurer wrote:
> On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > > should be flexible enough to deal with it. And: the corner case of 1
> > > CPU per node is possible also on any other NUMA platform, when in some
> > > of the nodes (or even just one) only one CPU is configured in. Solving
> > > that problem automatically gives the Hammer what it needs.
>
> I am going to ask a silly question, do we have any data showing this really
> is a problem on AMD? I would think, even if we have an idle cpu, sometimes
> a little delay on task migration (on NUMA) may not be a bad thing. If it
> is too long, can we just make the rebalance ticks arch specific?
The fact that global rebalances are done only in the timer interrupt
is simply bad! It complicates rebalance_tick() and wastes the
opportunity to get feedback from the failed local balance attempts.
If you want data supporting my assumptions: Ted Ts'o's talk at OLS
shows the necessity to rebalance ASAP (even in try_to_wake_up). There
are plenty of arguments towards this, starting with the steal delay
parameter scans from the early days of multi-queue schedulers (Davide
Libenzi), over my experiments with NUMA schedulers and the observation
of Andi Kleen that on Opteron you better run from the wrong CPU than
wait (if the tasks returns to the right cpu, all's fine anyway).
> I'd much rather have info related to the properties of the NUMA arch than
> something that makes decisions based on nr_cpus_node(). For example, we
> may want to inter-node balance as much or more often on ppc64 than even
> AMD, but it has 8 cpus per node. On this patch it would has the lowest
> inter-node balance frequency, even though it probably has one of the lowest
> latencies between nodes and highest throughput interconnects.
We can still discuss on the formula. Currently there's a bug in the
scheduler and the corner case of 1 cpu/node is just broken. The
function local_balance_retries(attempts, cpus_in_this_node) must
return 0 for cpus_in_this_node=1 and should return larger values for
larger cpus_in_this_node. To have an upper limit is fine, but maybe
not necessary.
Regarding the ppc64 interconnect, I'm glad that you said "probably"
and "one of the ...". No need to point you to better ones ;-)
> > Right, I realise that the 1 cpu per node case is broken. However, doesn't
> > this also affect the multi-cpu per node case in the manner I suggested
> > above? So even if we turn off NUMA scheduler for Hammer, this still
> > needs fixing, right?
>
> Maybe so, but if we start making idle rebalance more aggressive, I think we
> would need to make CAN_MIGRATE more restrictive, taking memory placement of
> the tasks in to account. On AMD with interleaved memory allocation, tasks
> would move very easily, since their memory is spread out anyway. On "home
> node" or node-local policy, we may not move a task (or maybe not on the
> first attempt), even if there is an idle cpu in another node.
Aehm, that's another story and I'm sure we will fix that too. There
are a few ideas around. But you shouldn't expect to solve all problems
at once, after all optimal NUMA scheduling can still be considered a
research area.
> Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> being a single node (no policy difference from non-numa sched), allowing us
> to remove all NUMA ifdefs. I think the code would be much more readable.
:-) Then you expect that everybody who reads the scheduler code knows
what NUMA stands for and what it means? Pretty optimistic, but yes,
I'd like that, too.
Erich
On Monday 28 July 2003 22:37, Martin J. Bligh wrote:
> > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > should be flexible enough to deal with it. And: the corner case of 1
> > CPU per node is possible also on any other NUMA platform, when in some
> > of the nodes (or even just one) only one CPU is configured in. Solving
> > that problem automatically gives the Hammer what it needs.
>
> OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if
> you really want to say all that ;-)
And with the fix posted it will just be called "numa-scheduler". It's
a simplification, as you noticed.
> The question is "does Hammer benefit from the additional complexity"?
> I'm guessing not ... if so, then yes, it's worth fixing. If not, it
> would seem better to just leave it at SMP for the scheduler stuff.
> Simpler, more shared code with common systems.
Hammer will just go for the other nodes, as it should, when the own
node is free. Then you can benefit of NUMA improvements in the load
calculation and later of smarter distribution across the
nodes. Reverting to SMP is a quick fix but for improvements we'll have
to get back to NUMA. So why not fix it now and stay with NUMA.
> > No, no, the local load balancer just doesn't make sense now if you
> > have one cpu per node. It is called although it will never find
> > another cpu in the own cell to steal from. There just is nothing
> > else... (or did I misunderstand your comment?)
>
> Right, I realise that the 1 cpu per node case is broken. However, doesn't
> this also affect the multi-cpu per node case in the manner I suggested
> above? So even if we turn off NUMA scheduler for Hammer, this still
> needs fixing, right?
Yes.
> > That's one way, but the proposed patch just solves the problem (in a
> > more general way, also for other NUMA cases). If you deconfigure NUMA
> > for a NUMA platform, we'll have problems switching it back on when
> > adding smarter things like node affine or homenode extensions.
>
> Yeah, maybe. I'd rather have the code in hand that needs it, but still ...
> if Andi's happy that fixes it, then we're OK.
http://marc.theaimsgroup.com/?l=linux-kernel&m=105854610325226&w=2
Andi mentioned in his talk to follow that path, too. We'll have to
experiment a while with the ideas we currently have to get the initial
load balancing/ dynamic homenode issue solved, there's no "golden way"
right now. I'll post the stuff I mentioned in my talk soon, but want
to gain some experience with it, first.
Regards,
Erich
On Tuesday 29 July 2003 05:08, Erich Focht wrote:
> On Tuesday 29 July 2003 04:24, Andrew Theurer wrote:
> > On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > > > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > > > should be flexible enough to deal with it. And: the corner case of 1
> > > > CPU per node is possible also on any other NUMA platform, when in
> > > > some of the nodes (or even just one) only one CPU is configured in.
> > > > Solving that problem automatically gives the Hammer what it needs.
> >
> > I am going to ask a silly question, do we have any data showing this
> > really is a problem on AMD? I would think, even if we have an idle cpu,
> > sometimes a little delay on task migration (on NUMA) may not be a bad
> > thing. If it is too long, can we just make the rebalance ticks arch
> > specific?
>
> The fact that global rebalances are done only in the timer interrupt
> is simply bad!
Even with this patch it still seems that most balances are still timer based,
because we still call load_balance in rebalance_tick. Granted, we may
inter-node balance more often, well, maybe less often since
node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the advantage
of doing this at idle, but idle only, that's why I'd would be more inclined a
only a much more aggressive idle rebalance.
> It complicates rebalance_tick() and wastes the
> opportunity to get feedback from the failed local balance attempts.
What does "failed" really mean? To me, when *busiest=null, that means we
passed, the node itself is probably balanced, and there's nothing to do. It
gives no indication at all of the global load [im]balance. Shouldn't the
thing we are looking for is the imbalance among node_nr_running[]? Would it
make sense to go forward with a global balance based on that only?
> If you want data supporting my assumptions: Ted Ts'o's talk at OLS
> shows the necessity to rebalance ASAP (even in try_to_wake_up).
If this is the patch I am thinking of, it was the (attached) one I sent them,
which did a light "push" rebalance at try_to_wake_up. Calling load_balance
at try_to_wake_up seems very heavy-weight. This patch only looks for an idle
cpu (within the same node) to wake up on before task activation, only if the
task_rq(p)->nr_running is too long. So, yes, I do believe this can be
important, but I think it's only called for when we have an idle cpu.
> There
> are plenty of arguments towards this, starting with the steal delay
> parameter scans from the early days of multi-queue schedulers (Davide
> Libenzi), over my experiments with NUMA schedulers and the observation
> of Andi Kleen that on Opteron you better run from the wrong CPU than
> wait (if the tasks returns to the right cpu, all's fine anyway).
>
> > I'd much rather have info related to the properties of the NUMA arch than
> > something that makes decisions based on nr_cpus_node(). For example, we
> > may want to inter-node balance as much or more often on ppc64 than even
> > AMD, but it has 8 cpus per node. On this patch it would has the lowest
> > inter-node balance frequency, even though it probably has one of the
> > lowest latencies between nodes and highest throughput interconnects.
>
> We can still discuss on the formula. Currently there's a bug in the
> scheduler and the corner case of 1 cpu/node is just broken. The
> function local_balance_retries(attempts, cpus_in_this_node) must
> return 0 for cpus_in_this_node=1 and should return larger values for
> larger cpus_in_this_node. To have an upper limit is fine, but maybe
> not necessary.
>
> Regarding the ppc64 interconnect, I'm glad that you said "probably"
> and "one of the ...". No need to point you to better ones ;-)
OK, we wont get into a pissing match :) I just wanted to base the scheduler
decisions on data related to the hardware NUMA properties, not the cpu count.
> > > Right, I realise that the 1 cpu per node case is broken. However,
> > > doesn't this also affect the multi-cpu per node case in the manner I
> > > suggested above? So even if we turn off NUMA scheduler for Hammer, this
> > > still needs fixing, right?
> >
> > Maybe so, but if we start making idle rebalance more aggressive, I think
> > we would need to make CAN_MIGRATE more restrictive, taking memory
> > placement of the tasks in to account. On AMD with interleaved memory
> > allocation, tasks would move very easily, since their memory is spread
> > out anyway. On "home node" or node-local policy, we may not move a task
> > (or maybe not on the first attempt), even if there is an idle cpu in
> > another node.
>
> Aehm, that's another story and I'm sure we will fix that too. There
> are a few ideas around. But you shouldn't expect to solve all problems
> at once, after all optimal NUMA scheduling can still be considered a
> research area.
>
> > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > being a single node (no policy difference from non-numa sched), allowing
> > us to remove all NUMA ifdefs. I think the code would be much more
> > readable.
> >
> :-) Then you expect that everybody who reads the scheduler code knows
>
> what NUMA stands for and what it means? Pretty optimistic, but yes,
> I'd like that, too.
Yes, at some point we have to. We cannot have two different schedulers. Non
numa should have the exact same scheduling policy as a numa system with one
node. I don't know if that's acceptable for 2.6, but I really want to go for
that in 2.7.
-Andrew Theurer
>> I am going to ask a silly question, do we have any data showing this really
>> is a problem on AMD? I would think, even if we have an idle cpu, sometimes
>> a little delay on task migration (on NUMA) may not be a bad thing. If it
>> is too long, can we just make the rebalance ticks arch specific?
>
> The fact that global rebalances are done only in the timer interrupt
> is simply bad! It complicates rebalance_tick() and wastes the
> opportunity to get feedback from the failed local balance attempts.
The whole point of the NUMA scheduler is to rebalance globally less
often. Now I'd agree that in the idle case we want to be more agressive,
as your patch does, but we need to be careful not to end up in a cacheline
thrashing war.
Aiming for perfect balance is not always a good idea - the expense of both
the calculation and the migrations has to be taken into account. For some
arches, it's less important than others ... one thing we're missing is
a per arch "node-balance-agressiveness" factor.
Having said that, the NUMA scheduler is clearly broken on Hammer at the
moment, from Andi's observations.
> If you want data supporting my assumptions: Ted Ts'o's talk at OLS
> shows the necessity to rebalance ASAP (even in try_to_wake_up). There
> are plenty of arguments towards this, starting with the steal delay
> parameter scans from the early days of multi-queue schedulers (Davide
> Libenzi), over my experiments with NUMA schedulers and the observation
> of Andi Kleen that on Opteron you better run from the wrong CPU than
> wait (if the tasks returns to the right cpu, all's fine anyway).
That's a drastic oversimplification. It may be better in some
circumstances, on some benchmarks. For now, let's just get your patch
tested on Hammer, and see if it works better to have the NUMA scheduler
on than off after your patch ...
M.
On Mon, Jul 28, 2003 at 10:18:19PM +0200, Erich Focht wrote:
> > > So x86_64 platforms
> > > (but not only those!) suffer and whish to switch off the NUMA
> > > scheduler while keeping NUMA memory management on.
> >
> > Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED)
> > ... I'll feed that upstream this week.
>
> That's one way, but the proposed patch just solves the problem (in a
> more general way, also for other NUMA cases). If you deconfigure NUMA
> for a NUMA platform, we'll have problems switching it back on when
> adding smarter things like node affine or homenode extensions.
That's one important point IMHO. Currently Opteron does not really
need the NUMA scheduler, but it will be in future with such extensions.
This means it would be better if the current scheduler supports
it already so that it can be easily extended.
Thanks for the patch.
-Andi
On Tuesday 29 July 2003 15:33, Andrew Theurer wrote:
> > The fact that global rebalances are done only in the timer interrupt
> > is simply bad!
>
> Even with this patch it still seems that most balances are still timer
> based, because we still call load_balance in rebalance_tick. Granted, we
> may inter-node balance more often, well, maybe less often since
> node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the advantage
> of doing this at idle, but idle only, that's why I'd would be more inclined
> a only a much more aggressive idle rebalance.
Without this patch the probability to globally balance when going idle
is 0. Now it is 1/global_balance_rate(cpus_in_this_node) . This is
tunable and we can make this probability depending on the node load
imbalance. I'll try that in the next version, it sounds like a good
idea. Also changing this probability by some factor could give us a
way to handle the differences between the platforms.
Balancing globally only when idle isn't a good idea as long as we
don't have multiple steals per balance attempt. Even then, tasks don't
live all the same time, so you easilly end up with one node overloaded
and others just busy and not able to steal from the busiest node.
> > It complicates rebalance_tick() and wastes the
> > opportunity to get feedback from the failed local balance attempts.
>
> What does "failed" really mean? To me, when *busiest=null, that means we
> passed, the node itself is probably balanced, and there's nothing to do.
> It gives no indication at all of the global load [im]balance. Shouldn't
> the thing we are looking for is the imbalance among node_nr_running[]?
> Would it make sense to go forward with a global balance based on that only?
I'll try to include the node imbalance in the global balance rate
calculation, let's see how it works. Just wanted to fix one issue with
my patch, now it looks like it provides some simple ways to solve
other issues as well... Thanks for the idea!
Regards,
Erich
On Wednesday 30 July 2003 10:23, Erich Focht wrote:
> On Tuesday 29 July 2003 15:33, Andrew Theurer wrote:
> > > The fact that global rebalances are done only in the timer interrupt
> > > is simply bad!
> >
> > Even with this patch it still seems that most balances are still timer
> > based, because we still call load_balance in rebalance_tick. Granted, we
> > may inter-node balance more often, well, maybe less often since
> > node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the
> > advantage of doing this at idle, but idle only, that's why I'd would be
> > more inclined a only a much more aggressive idle rebalance.
>
> Without this patch the probability to globally balance when going idle
> is 0. Now it is 1/global_balance_rate(cpus_in_this_node) . This is
> tunable and we can make this probability depending on the node load
> imbalance. I'll try that in the next version, it sounds like a good
> idea. Also changing this probability by some factor could give us a
> way to handle the differences between the platforms.
>
> Balancing globally only when idle isn't a good idea as long as we
> don't have multiple steals per balance attempt. Even then, tasks don't
> live all the same time, so you easilly end up with one node overloaded
> and others just busy and not able to steal from the busiest node.
I think I would agree with this statements. Yes, your patch does fix the
"going idle" AMD issue, and yes, we don't want to balance globally on idle by
default. However I do still think there may be a better way to solve this.
I see this as "range" problem. When balancing, the NUMA scheduler's tactic is
to limit the range of cpus to find a busiest queue. First we start with
within a node, then once in a while we go to the next "level", and look at
cpus in all nodes in that "level". Someday we may even go up another "level"
and include even more cpus in our range. We limit how often we go up a
level, as to help memory locality of the running tasks. This is obviously
well known to you, since you wrote most of this :) What I would like to
propose is the following:
For load_balance, each architecture tells the scheduler what the range is,
where the range starts, and where it ends. For ex: Summit starts at level1,
the cpus in a node, and ends at level2, cpus in all level1 nodes. AMD starts
at level2, all the cpus in all level1 nodes and also ends at the level2.
8-way AMD would go to level3 :) NEC (I assume) would start at the level1,
and end at level3, the "super" nodes. This should solve the AMD problem very
easily.
As for idle balances, we may be able to go a step further: follow the range
rules, but do a more aggressive/frequent search. On busy rebalance, I'd
prefer to still uphold the node_rebalance_ticks, but on idle, I would think,
since we are idle, some more time spent on a thorough search would be OK: We
first start at the lower range and try to find something to steal (always
favor stealing a smaller range first), then as needed increase the range,
looking for a cpu to steal from. Do not use a node_idle_rebalance interval,
just keep increasing the range until we are successful.
I'll see if I can get something coded up here. I don't think these concepts
would be too difficult to add.
-Andrew Theurer
you're using node_to_cpu_mask for ia64 ... others were using
node_to_cpumask (1 less "_"), so this doesn't build ...
M.
On Thursday 31 July 2003 17:05, Martin J. Bligh wrote:
> you're using node_to_cpu_mask for ia64 ... others were using
> node_to_cpumask (1 less "_"), so this doesn't build ...
Ooops, you're right, of course. Sorry about this mistake :-(
Erich
> On Thursday 31 July 2003 17:05, Martin J. Bligh wrote:
>> you're using node_to_cpu_mask for ia64 ... others were using
>> node_to_cpumask (1 less "_"), so this doesn't build ...
>
> Ooops, you're right, of course. Sorry about this mistake :-(
np - was easy to fix up ;-) I did run some benchmarks on it ...
low end SDET seemed highly variable, but otherwise looked OK.
If I have only 4 tasks running on a 16x (4x4), what's the rate
limit on the idle cpus trying to steal now?
M.
On Friday 01 August 2003 02:26, Martin J. Bligh wrote:
> np - was easy to fix up ;-) I did run some benchmarks on it ...
> low end SDET seemed highly variable, but otherwise looked OK.
> If I have only 4 tasks running on a 16x (4x4), what's the rate
> limit on the idle cpus trying to steal now?
the rate is 3ms, that's a bit too fast, maybe. And the busy rate is
200ms (instead of 400). I made some experiments with 2 CPUs per node
and 1 ms seems to be definitely too fast. I'll tune the formula a
bit...
Regards,
Erich
On Mon, 28 Jul 2003, Andrew Theurer wrote:
> Personally, I'd like to see all systems use NUMA sched, non NUMA systems being
> a single node (no policy difference from non-numa sched), allowing us to
> remove all NUMA ifdefs. I think the code would be much more readable.
That sounds like a great idea, but I'm not sure it could be realized short
of a major rewrite. Look how hard Ingo and Con are working just to get a
single node doing a good job with interactive and throughput tradeoffs.
Once they get a good handle on identifying process behaviour, and I
believe they will, that information could be used in improving NUMA
performance, by sending not just 'a job" but "the right job" if it exists.
I'm sure there are still a few graduate theses possible on the topic!
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> On Mon, 28 Jul 2003, Andrew Theurer wrote:
> > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > being a single node (no policy difference from non-numa sched), allowing
> > us to remove all NUMA ifdefs. I think the code would be much more
> > readable.
>
> That sounds like a great idea, but I'm not sure it could be realized short
> of a major rewrite. Look how hard Ingo and Con are working just to get a
> single node doing a good job with interactive and throughput tradeoffs.
Actually it's not too bad. Attached is a patch to do it. It also does
multi-level node support and makes all the load balance routines
runqueue-centric instead of cpu-centric, so adding something like shared
runqueues (for HT) should be really easy. Hmm, other things: inter-node
balance intervals are now arch specific (AMD is "1"). The default busy/idle
balance timers of 200/1 are not arch specific, but I'm thinking they should
be. And for non-numa, the scheduling policy is the same as it was with
vanilla O(1).
> Once they get a good handle on identifying process behaviour, and I
> believe they will, that information could be used in improving NUMA
> performance, by sending not just 'a job" but "the right job" if it exists.
> I'm sure there are still a few graduate theses possible on the topic!
I do agree, but I think _what_ we pick is definitely 2.7 work. All I'd like
to see for 2.6 is a very nice, unified framework to build on.
As for 1cpu/node case I think this second patch (on top of first) will take
care of it; however I have not had a chance to test on AMD yet.
-Andrew Theurer
Andrew Theurer wrote:
>On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
>
>>On Mon, 28 Jul 2003, Andrew Theurer wrote:
>>
>>>Personally, I'd like to see all systems use NUMA sched, non NUMA systems
>>>being a single node (no policy difference from non-numa sched), allowing
>>>us to remove all NUMA ifdefs. I think the code would be much more
>>>readable.
>>>
>>That sounds like a great idea, but I'm not sure it could be realized short
>>of a major rewrite. Look how hard Ingo and Con are working just to get a
>>single node doing a good job with interactive and throughput tradeoffs.
>>
>
>Actually it's not too bad. Attached is a patch to do it. It also does
>multi-level node support and makes all the load balance routines
>runqueue-centric instead of cpu-centric, so adding something like shared
>runqueues (for HT) should be really easy. Hmm, other things: inter-node
>balance intervals are now arch specific (AMD is "1"). The default busy/idle
>balance timers of 200/1 are not arch specific, but I'm thinking they should
>be. And for non-numa, the scheduling policy is the same as it was with
>vanilla O(1).
>
I'm not saying you're wrong, but do you have some numbers where this
helps? ie. two architectures that need very different balance numbers.
And what is the reason for making AMD's balance interval 1?
Also, things like nr_running_inc are supposed to be very fast. I am
a bit worried to see a loop and CPU shared atomics in there.
node_2_node is an odd sounding conversion too ;)
BTW. you should be CC'ing Ingo if you have any intention of scheduler
stuff getting into 2.6.
On Friday 22 August 2003 17:56, Nick Piggin wrote:
> Andrew Theurer wrote:
> >On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> >>On Mon, 28 Jul 2003, Andrew Theurer wrote:
> >>>Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> >>>being a single node (no policy difference from non-numa sched), allowing
> >>>us to remove all NUMA ifdefs. I think the code would be much more
> >>>readable.
> >>
> >>That sounds like a great idea, but I'm not sure it could be realized
> >> short of a major rewrite. Look how hard Ingo and Con are working just to
> >> get a single node doing a good job with interactive and throughput
> >> tradeoffs.
> >
> >Actually it's not too bad. Attached is a patch to do it. It also does
> >multi-level node support and makes all the load balance routines
> >runqueue-centric instead of cpu-centric, so adding something like shared
> >runqueues (for HT) should be really easy. Hmm, other things: inter-node
> >balance intervals are now arch specific (AMD is "1"). The default
> > busy/idle balance timers of 200/1 are not arch specific, but I'm thinking
> > they should be. And for non-numa, the scheduling policy is the same as
> > it was with vanilla O(1).
>
> I'm not saying you're wrong, but do you have some numbers where this
> helps? ie. two architectures that need very different balance numbers.
> And what is the reason for making AMD's balance interval 1?
AMD is 1 because there's no need to balance within a node, so I want the
inter-node balance frequency to be as often as it was with just O(1). This
interval would not work well with other NUMA boxes, so that's the main reason
to have arch specific intervals. And, as a general guideline, boxes with
different local-remote latency ratios will probably benefit from different
inter-node balance intervals. I don't know what these ratios are, but I'd
like the kernel to have the ability to change for one arch and not affect
another.
> Also, things like nr_running_inc are supposed to be very fast. I am
> a bit worried to see a loop and CPU shared atomics in there.
That has concerned me, too. So far I haven't been able to see a measurable
difference either way (within noise level), but it's possible. The other
alternative is to sum up node load at sched_best_cpu and find_busiest_node.
> node_2_node is an odd sounding conversion too ;)
I just went off the toplogy already there, so I left it.
> BTW. you should be CC'ing Ingo if you have any intention of scheduler
> stuff getting into 2.6.
OK, thanks!
Andrew Theurer wrote:
>On Friday 22 August 2003 17:56, Nick Piggin wrote:
>
>>Andrew Theurer wrote:
>>
>>>On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
>>>
>>>>On Mon, 28 Jul 2003, Andrew Theurer wrote:
>>>>
>>>>>Personally, I'd like to see all systems use NUMA sched, non NUMA systems
>>>>>being a single node (no policy difference from non-numa sched), allowing
>>>>>us to remove all NUMA ifdefs. I think the code would be much more
>>>>>readable.
>>>>>
>>>>That sounds like a great idea, but I'm not sure it could be realized
>>>>short of a major rewrite. Look how hard Ingo and Con are working just to
>>>>get a single node doing a good job with interactive and throughput
>>>>tradeoffs.
>>>>
>>>Actually it's not too bad. Attached is a patch to do it. It also does
>>>multi-level node support and makes all the load balance routines
>>>runqueue-centric instead of cpu-centric, so adding something like shared
>>>runqueues (for HT) should be really easy. Hmm, other things: inter-node
>>>balance intervals are now arch specific (AMD is "1"). The default
>>>busy/idle balance timers of 200/1 are not arch specific, but I'm thinking
>>>they should be. And for non-numa, the scheduling policy is the same as
>>>it was with vanilla O(1).
>>>
>>I'm not saying you're wrong, but do you have some numbers where this
>>helps? ie. two architectures that need very different balance numbers.
>>And what is the reason for making AMD's balance interval 1?
>>
>
>AMD is 1 because there's no need to balance within a node, so I want the
>inter-node balance frequency to be as often as it was with just O(1). This
>interval would not work well with other NUMA boxes, so that's the main reason
>to have arch specific intervals.
>
OK, I misread the patch. IIRC AMD has 1 CPU per node? If so, why doesn't
this simply prevent balancing within a node?
> And, as a general guideline, boxes with
>different local-remote latency ratios will probably benefit from different
>inter-node balance intervals. I don't know what these ratios are, but I'd
>like the kernel to have the ability to change for one arch and not affect
>another.
>
I fully appreciate there are huge differences... I am curious if
you can see much improvements in practice.
>
>>Also, things like nr_running_inc are supposed to be very fast. I am
>>a bit worried to see a loop and CPU shared atomics in there.
>>
>
>That has concerned me, too. So far I haven't been able to see a measurable
>difference either way (within noise level), but it's possible. The other
>alternative is to sum up node load at sched_best_cpu and find_busiest_node.
>
Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)
>
>>node_2_node is an odd sounding conversion too ;)
>>
>
>I just went off the toplogy already there, so I left it.
>
>
>>BTW. you should be CC'ing Ingo if you have any intention of scheduler
>>stuff getting into 2.6.
>>
>
>OK, thanks!
>
>
Good luck with it. Definitely some good ideas.
On Sat, Aug 23, 2003 at 10:29:21AM +1000, Nick Piggin wrote:
> Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)
What scheduler benchmark?
-- wli
>> node_2_node is an odd sounding conversion too ;)
>
> I just went off the toplogy already there, so I left it.
I thought we changed that to parent_node or something a while back?
Yes, node_2_node is rather nondescriptive ...
Matt?
M.
William Lee Irwin III wrote:
>On Sat, Aug 23, 2003 at 10:29:21AM +1000, Nick Piggin wrote:
>
>>Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)
>>
>
>What scheduler benchmark?
>
>
I don't know! I don't care about high end scheduler performance.
Volanomark? Kernbench? SDET? Whatever you care about.
> >AMD is 1 because there's no need to balance within a node, so I want the
> >inter-node balance frequency to be as often as it was with just O(1).
> > This interval would not work well with other NUMA boxes, so that's the
> > main reason to have arch specific intervals.
>
> OK, I misread the patch. IIRC AMD has 1 CPU per node? If so, why doesn't
> this simply prevent balancing within a node?
Yes, one cpu/node. Oh, it does prevent it, but with the current intervals, we
end up not really balancing as often (since we need a inter-node balance),
and when we call load_balance in schedule when idle, we don't balance at all
since it's only a node local balance.
> > And, as a general guideline, boxes with
> >different local-remote latency ratios will probably benefit from different
> >inter-node balance intervals. I don't know what these ratios are, but I'd
> >like the kernel to have the ability to change for one arch and not affect
> >another.
>
> I fully appreciate there are huge differences... I am curious if
> you can see much improvements in practice.
I think AMD would be the first good test. Maybe Andi has some results on
numasched vs O(1), which would be a good indication.