2007-08-01 17:53:51

by Martin Bligh

Subject: Re: [rfc] balance-on-fork NUMA placement

Nick Piggin wrote:
> On Tue, Jul 31, 2007 at 11:14:08AM +0200, Andi Kleen wrote:
>> On Tuesday 31 July 2007 07:41, Nick Piggin wrote:
>>
>>> I haven't given this idea testing yet, but I just wanted to get some
>>> opinions on it first. NUMA placement still isn't ideal (eg. tasks with
>>> a memory policy will not do any placement, and process migrations of
>>> course will leave the memory behind...), but it does give a bit more
>>> chance for the memory controllers and interconnects to get evenly
>>> loaded.
>> I didn't think slab honored mempolicies by default?
>> At least you seem to need to set special process flags.
>>
>>> NUMA balance-on-fork code is in a good position to allocate all of a new
>>> process's memory on a chosen node. However, it really only starts
>>> allocating on the correct node after the process starts running.
>>>
>>> task and thread structures, stack, mm_struct, vmas, page tables etc. are
>>> all allocated on the parent's node.
>> The page tables should be only allocated when the process runs; except
>> for the PGD.
>
> We certainly used to copy all page tables on fork. Not any more, but we
> must still copy anonymous page tables.

This topic seems to come up periodically, ever since we first introduced
the NUMA scheduler, and every time we decide it's a bad idea. What's
changed? What workloads does this improve (aside from some artificial
benchmark like stream)?

To repeat the conclusions of last time ... the primary problem is that
99% of the time, we exec after we fork, and it makes that fork/exec
cycle slower, not faster, so exec is generally a much better time to do
this. There's no good predictor of whether we'll exec after fork, unless
one has magically appeared since late 2.5.x ?

M.


2007-08-01 18:39:37

by Lee Schermerhorn

Subject: Re: [rfc] balance-on-fork NUMA placement

On Wed, 2007-08-01 at 10:53 -0700, Martin Bligh wrote:
> Nick Piggin wrote:
> > On Tue, Jul 31, 2007 at 11:14:08AM +0200, Andi Kleen wrote:
> >> On Tuesday 31 July 2007 07:41, Nick Piggin wrote:
> >>
> >>> I haven't given this idea testing yet, but I just wanted to get some
> >>> opinions on it first. NUMA placement still isn't ideal (eg. tasks with
> >>> a memory policy will not do any placement, and process migrations of
> >>> course will leave the memory behind...), but it does give a bit more
> >>> chance for the memory controllers and interconnects to get evenly
> >>> loaded.
> >> I didn't think slab honored mempolicies by default?
> >> At least you seem to need to set special process flags.
> >>
> >>> NUMA balance-on-fork code is in a good position to allocate all of a new
> >>> process's memory on a chosen node. However, it really only starts
> >>> allocating on the correct node after the process starts running.
> >>>
> >>> task and thread structures, stack, mm_struct, vmas, page tables etc. are
> >>> all allocated on the parent's node.
> >> The page tables should be only allocated when the process runs; except
> >> for the PGD.
> >
> > We certainly used to copy all page tables on fork. Not any more, but we
> > must still copy anonymous page tables.
>
> This topic seems to come up periodically, ever since we first introduced
> the NUMA scheduler, and every time we decide it's a bad idea. What's
> changed? What workloads does this improve (aside from some artificial
> benchmark like stream)?
>
> To repeat the conclusions of last time ... the primary problem is that
> 99% of the time, we exec after we fork, and it makes that fork/exec
> cycle slower, not faster, so exec is generally a much better time to do
> this. There's no good predictor of whether we'll exec after fork, unless
> one has magically appeared since late 2.5.x ?
>

As Nick points out, one reason to balance on fork() rather than exec()
is that with balance on exec you already have the new task's kernel
structs allocated on the "wrong" node. However, as you point out, this
slows down the fork/exec cycle. This is especially noticeable on larger
node-count systems in, e.g., shell scripts that spawn a lot of short
lived child processes. "Back in the day", we got bitten by this on the
Alpha EV7 [a.k.a. Marvel] platform with just ~64 nodes--small compared
to, say, the current Altix platform.

On the other hand, if you're launching a few larger, long-lived
applications with any significant %-age of system time, you might want
to consider spreading them out across nodes and having their warmer
kernel data structures close to them. A dilemma.

Altho' I was no longer working on this platform when this issue came up,
I believe that the kernel developers came up with something along these
lines:

+ define a "credit" member of the "task" struct, initialized to, say,
zero.

+ when "credit" is zero, or below some threshold, balance on fork--i.e.,
spread out the load--otherwise fork "locally" and decrement credit
[maybe not < 0].

+ when reaping dead children, if the poor thing's cpu utilization is
below some threshold, give the parent some credit. [blood money?]

And so forth. Initial forks will balance. If the children refuse to
die, forks will continue to balance. If the parent starts seeing short
lived children, fork()s will eventually start to stay local.
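
A rough sketch of that credit scheme, just to make the idea concrete
(the struct, field names and thresholds below are invented for
illustration; this is not the actual code from that OS, nor a
proposed Linux patch):

#include <stdbool.h>

#define FORK_CREDIT_MAX    8    /* cap on accumulated credit */
#define SHORT_LIVED_TICKS  10   /* below this, a child counts as short lived */

struct task {
        int fork_credit;            /* starts at 0, so initial forks balance */
        unsigned long cpu_ticks;    /* cpu time consumed by this task */
};

/* At fork: no credit means spread the child out across nodes;
 * otherwise fork locally and spend one unit of credit. */
static bool fork_should_balance(struct task *parent)
{
        if (parent->fork_credit <= 0)
                return true;            /* balance across nodes */
        parent->fork_credit--;          /* stay on the local node */
        return false;
}

/* At reap: a short-lived, low-cpu child refunds credit to the parent,
 * so a parent spawning many such children eventually stops balancing. */
static void reap_child(struct task *parent, const struct task *child)
{
        if (child->cpu_ticks < SHORT_LIVED_TICKS &&
            parent->fork_credit < FORK_CREDIT_MAX)
                parent->fork_credit++;
}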

I believe that this solved the pathological behavior we were seeing with
shell scripts taking way longer on the larger, supposedly more powerful,
platforms.

Of course, that OS could migrate the equivalent of task structs and
kernel stack [the old Unix user struct that was traditionally swappable,
so fairly easy to migrate]. On Linux, all bets are off, once the
scheduler starts migrating tasks away from the node that contains their
task struct, ... [Remember Eric Focht's "NUMA Affine Scheduler" patch
with its "home node"?]

Lee

2007-08-01 22:52:22

by Martin Bligh

Subject: Re: [rfc] balance-on-fork NUMA placement


>> This topic seems to come up periodically, ever since we first introduced
>> the NUMA scheduler, and every time we decide it's a bad idea. What's
>> changed? What workloads does this improve (aside from some artificial
>> benchmark like stream)?
>>
>> To repeat the conclusions of last time ... the primary problem is that
>> 99% of the time, we exec after we fork, and it makes that fork/exec
>> cycle slower, not faster, so exec is generally a much better time to do
>> this. There's no good predictor of whether we'll exec after fork, unless
>> one has magically appeared since late 2.5.x ?
>>
>
> As Nick points out, one reason to balance on fork() rather than exec()
> is that with balance on exec you already have the new task's kernel
> structs allocated on the "wrong" node. However, as you point out, this
> slows down the fork/exec cycle. This is especially noticeable on larger
> node-count systems in, e.g., shell scripts that spawn a lot of short
> lived child processes. "Back in the day", we got bitten by this on the
> Alpha EV7 [a.k.a. Marvel] platform with just ~64 nodes--small compared
> to, say, the current Altix platform.
>
> On the other hand, if you're launching a few larger, long-lived
> applications with any significant %-age of system time, you might want
> to consider spreading them out across nodes and having their warmer
> kernel data structures close to them. A dilemma.
>
> Altho' I was no longer working on this platform when this issue came up,
> I believe that the kernel developers came up with something along these
> lines:
>
> + define a "credit" member of the "task" struct, initialized to, say,
> zero.
>
> + when "credit" is zero, or below some threshold, balance on fork--i.e.,
> spread out the load--otherwise fork "locally" and decrement credit
> [maybe not < 0].
>
> + when reaping dead children, if the poor thing's cpu utilization is
> below some threshold, give the parent some credit. [blood money?]
>
> And so forth. Initial forks will balance. If the children refuse to
> die, forks will continue to balance. If the parent starts seeing short
> lived children, fork()s will eventually start to stay local.

Fork without exec is much rarer than fork with exec. Optimising for
the uncommon case is the Wrong Thing to Do (tm). What we decided
the last time(s) this came up was to allow userspace to pass
a hint in if they wanted to fork and not exec.

> I believe that this solved the pathological behavior we were seeing with
> shell scripts taking way longer on the larger, supposedly more powerful,
> platforms.
>
> Of course, that OS could migrate the equivalent of task structs and
> kernel stack [the old Unix user struct that was traditionally swappable,
> so fairly easy to migrate]. On Linux, all bets are off, once the
> scheduler starts migrating tasks away from the node that contains their
> task struct, ... [Remember Eric Focht's "NUMA Affine Scheduler" patch
> with its "home node"?]

Task migration doesn't work well at all without userspace hints.
SGI tried for ages (with IRIX) and failed. There's long discussions
of all of these things back in the days when we merged the original
NUMA scheduler in late 2.5 ...

2007-08-02 01:36:43

by Nick Piggin

Subject: Re: [rfc] balance-on-fork NUMA placement

On Wed, Aug 01, 2007 at 03:52:11PM -0700, Martin Bligh wrote:
>
> >And so forth. Initial forks will balance. If the children refuse to
> >die, forks will continue to balance. If the parent starts seeing short
> >lived children, fork()s will eventually start to stay local.
>
> Fork without exec is much rarer than fork with exec. Optimising for
> the uncommon case is the Wrong Thing to Do (tm). What we decided

It's only the wrong thing to do if it hurts the common case too
much. Considering we _already_ balance on exec, then adding another
balance on fork is not going to introduce some order of magnitude
problem -- at worst it would be 2x but it really isn't too slow
anyway (at least nobody complained when we added it).

One place where we found it helps is clone for threads.

If we didn't do such a bad job at keeping tasks together with their
local memory, then we might indeed reduce some of the balance-on-crap
and increase the aggressiveness of periodic balancing.

Considering we _already_ balance on fork/clone, I don't know what
your argument against this patch is? Doing the balance earlier
and allocating more stuff on the local node is surely not a bad
idea.


> the last time(s) this came up was to allow userspace to pass
> a hint in if they wanted to fork and not exec.
>
> >I believe that this solved the pathological behavior we were seeing with
> >shell scripts taking way longer on the larger, supposedly more powerful,
> >platforms.
> >
> >Of course, that OS could migrate the equivalent of task structs and
> >kernel stack [the old Unix user struct that was traditionally swappable,
> >so fairly easy to migrate]. On Linux, all bets are off, once the
> >scheduler starts migrating tasks away from the node that contains their
> >task struct, ... [Remember Eric Focht's "NUMA Affine Scheduler" patch
> >with its "home node"?]
>
> Task migration doesn't work well at all without userspace hints.
> SGI tried for ages (with IRIX) and failed. There's long discussions
> of all of these things back in the days when we merged the original
> NUMA scheduler in late 2.5 ...

Task migration? Automatic memory migration you mean? I think it deserves
another look regardless of what SGI could or could not do, and Lee and I
are slowly getting things in place. We'll see what happens...

2007-08-02 14:50:28

by Lee Schermerhorn

Subject: Re: [rfc] balance-on-fork NUMA placement

On Wed, 2007-08-01 at 15:52 -0700, Martin Bligh wrote:
> >> This topic seems to come up periodically, ever since we first introduced
> >> the NUMA scheduler, and every time we decide it's a bad idea. What's
> >> changed? What workloads does this improve (aside from some artificial
> >> benchmark like stream)?
> >>
> >> To repeat the conclusions of last time ... the primary problem is that
> >> 99% of the time, we exec after we fork, and it makes that fork/exec
> >> cycle slower, not faster, so exec is generally a much better time to do
> >> this. There's no good predictor of whether we'll exec after fork, unless
> >> one has magically appeared since late 2.5.x ?
> >>
> >
> > As Nick points out, one reason to balance on fork() rather than exec()
> > is that with balance on exec you already have the new task's kernel
> > structs allocated on the "wrong" node. However, as you point out, this
> > slows down the fork/exec cycle. This is especially noticeable on larger
> > node-count systems in, e.g., shell scripts that spawn a lot of short
> > lived child processes. "Back in the day", we got bitten by this on the
> > Alpha EV7 [a.k.a. Marvel] platform with just ~64 nodes--small compared
> > to, say, the current Altix platform.
> >
> > On the other hand, if you're launching a few larger, long-lived
> > applications with any significant %-age of system time, you might want
> > to consider spreading them out across nodes and having their warmer
> > kernel data structures close to them. A dilemma.
> >
> > Altho' I was no longer working on this platform when this issue came up,
> > I believe that the kernel developers came up with something along these
> > lines:
> >
> > + define a "credit" member of the "task" struct, initialized to, say,
> > zero.
> >
> > + when "credit" is zero, or below some threshold, balance on fork--i.e.,
> > spread out the load--otherwise fork "locally" and decrement credit
> > [maybe not < 0].
> >
> > + when reaping dead children, if the poor thing's cpu utilization is
> > below some threshold, give the parent some credit. [blood money?]
> >
> > And so forth. Initial forks will balance. If the children refuse to
> > die, forks will continue to balance. If the parent starts seeing short
> > lived children, fork()s will eventually start to stay local.
>
> Fork without exec is much rarer than fork with exec. Optimising for
> the uncommon case is the Wrong Thing to Do (tm). What we decided
> the last time(s) this came up was to allow userspace to pass
> a hint in if they wanted to fork and not exec.

I understand. Again, as Nick mentioned, at exec time, you use the
existing task struct, kernel stack, ... which might [probably will?] end
up on the wrong node. If the task uses a significant amount of system
time, this can hurt performance/scalability. And, for short lived, low
cpu usage tasks, such as you can get with shell scripts, you might not
even want to balance at exec time.

I agree with your assertion regarding optimizing for uncommon cases.
The mechanism I described [probably poorly, memory fades and it was only
a "hallway conversation" with the person who implemented it--in response
to a customer complaint] attempted to detect situations where local vs
balanced fork would be beneficial. I will note, however, that when
balancing, we did look across the entire system. Linux scheduling
domains have the intermediate "node" level that constrains this balancing
to a subset of the system.

I'm not suggesting we submit this, nor am I particularly interested in
investigating it myself. Just pointing out a solution to a workload
scalability issue on an existing, albeit dated, numa platform.

>
> > I believe that this solved the pathological behavior we were seeing with
> > shell scripts taking way longer on the larger, supposedly more powerful,
> > platforms.
> >
> > Of course, that OS could migrate the equivalent of task structs and
> > kernel stack [the old Unix user struct that was traditionally swappable,
> > so fairly easy to migrate]. On Linux, all bets are off, once the
> > scheduler starts migrating tasks away from the node that contains their
> > task struct, ... [Remember Eric Focht's "NUMA Affine Scheduler" patch
> > with its "home node"?]
>
> Task migration doesn't work well at all without userspace hints.
> SGI tried for ages (with IRIX) and failed. There's long discussions
> of all of these things back in the days when we merged the original
> NUMA scheduler in late 2.5 ...

I'm not one to cast aspersions on the IRIX engineers. However, as I
recall [could be wrong here], they were trying to use hardware counters
to predict what pages to migrate. On the same OS discussed above, we
found that automatic, lazy migration of pages worked very well for some
workloads.

I have patches and data [presented at LCA 2007] that shows, on a heavily
loaded 4-node, 16-cpu ia64 numa platform, ~14% reduction in real time
for a kernel build [make -j 32] and something like 22% reduction in
system time and 4% reduction in user time. This with automatic, lazy
migration enabled vs not, on the same build of a 2.6.19-rc6-mm? kernel.
I'll also note that the reduction in system time was in spite of the
cost of the auto/lazy page migration whenever the tasks migrated to a
different node.

Later,
Lee

2007-08-02 18:33:51

by Martin Bligh

Subject: Re: [rfc] balance-on-fork NUMA placement

Nick Piggin wrote:
> On Wed, Aug 01, 2007 at 03:52:11PM -0700, Martin Bligh wrote:
>>> And so forth. Initial forks will balance. If the children refuse to
>>> die, forks will continue to balance. If the parent starts seeing short
>>> lived children, fork()s will eventually start to stay local.
>> Fork without exec is much rarer than fork with exec. Optimising for
>> the uncommon case is the Wrong Thing to Do (tm). What we decided
>
> It's only the wrong thing to do if it hurts the common case too
> much. Considering we _already_ balance on exec, then adding another
> balance on fork is not going to introduce some order of magnitude
> problem -- at worst it would be 2x but it really isn't too slow
> anyway (at least nobody complained when we added it).
>
> One place where we found it helps is clone for threads.
>
> If we didn't do such a bad job at keeping tasks together with their
> local memory, then we might indeed reduce some of the balance-on-crap
> and increase the aggressiveness of periodic balancing.
>
> Considering we _already_ balance on fork/clone, I don't know what
> your argument against this patch is? Doing the balance earlier
> and allocating more stuff on the local node is surely not a bad
> idea.

I don't know who turned that on ;-( I suspect nobody bothered
actually measuring it at the time though, or used some crap
benchmark like stream to do so. It should get reverted.

2007-08-03 00:20:26

by Nick Piggin

Subject: Re: [rfc] balance-on-fork NUMA placement

On Thu, Aug 02, 2007 at 11:33:39AM -0700, Martin Bligh wrote:
> Nick Piggin wrote:
> >On Wed, Aug 01, 2007 at 03:52:11PM -0700, Martin Bligh wrote:
> >>>And so forth. Initial forks will balance. If the children refuse to
> >>>die, forks will continue to balance. If the parent starts seeing short
> >>>lived children, fork()s will eventually start to stay local.
> >>Fork without exec is much rarer than fork with exec. Optimising for
> >>the uncommon case is the Wrong Thing to Do (tm). What we decided
> >
> >It's only the wrong thing to do if it hurts the common case too
> >much. Considering we _already_ balance on exec, then adding another
> >balance on fork is not going to introduce some order of magnitude
> >problem -- at worst it would be 2x but it really isn't too slow
> >anyway (at least nobody complained when we added it).
> >
> >One place where we found it helps is clone for threads.
> >
> >If we didn't do such a bad job at keeping tasks together with their
> >local memory, then we might indeed reduce some of the balance-on-crap
> >and increase the aggressiveness of periodic balancing.
> >
> >Considering we _already_ balance on fork/clone, I don't know what
> >your argument against this patch is? Doing the balance earlier
> >and allocating more stuff on the local node is surely not a bad
> >idea.
>
> I don't know who turned that on ;-( I suspect nobody bothered
> actually measuring it at the time though, or used some crap
> benchmark like stream to do so. It should get reverted.

So you have numbers to show it hurts? I tested some things where it
is not supposed to help, and it didn't make any difference. Nobody
else noticed either.

If the cost of doing the double balance is _really_ that painful,
then we could skip balance-on-exec for domains with balance-on-fork
set.
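
For illustration only, here is a minimal stand-alone sketch of that
suggestion. The flag values and struct are simplified stand-ins for
the kernel's sched_domain flags (SD_BALANCE_FORK / SD_BALANCE_EXEC),
not the real definitions:

#define SD_BALANCE_EXEC  0x1
#define SD_BALANCE_FORK  0x2

struct sched_domain_stub {
        unsigned int flags;
};

/* If a domain already balances at fork (and so placed the task and its
 * kernel structs then), skip the redundant balance at exec. */
static int should_balance_on_exec(const struct sched_domain_stub *sd)
{
        if (sd->flags & SD_BALANCE_FORK)
                return 0;
        return !!(sd->flags & SD_BALANCE_EXEC);
}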

2007-08-03 20:16:35

by Suresh Siddha

Subject: Re: [rfc] balance-on-fork NUMA placement

On Fri, Aug 03, 2007 at 02:20:10AM +0200, Nick Piggin wrote:
> On Thu, Aug 02, 2007 at 11:33:39AM -0700, Martin Bligh wrote:
> > Nick Piggin wrote:
> > >On Wed, Aug 01, 2007 at 03:52:11PM -0700, Martin Bligh wrote:
> > >>>And so forth. Initial forks will balance. If the children refuse to
> > >>>die, forks will continue to balance. If the parent starts seeing short
> > >>>lived children, fork()s will eventually start to stay local.
> > >>Fork without exec is much rarer than fork with exec. Optimising for
> > >>the uncommon case is the Wrong Thing to Do (tm). What we decided
> > >
> > >It's only the wrong thing to do if it hurts the common case too
> > >much. Considering we _already_ balance on exec, then adding another
> > >balance on fork is not going to introduce some order of magnitude
> > >problem -- at worst it would be 2x but it really isn't too slow
> > >anyway (at least nobody complained when we added it).
> > >
> > >One place where we found it helps is clone for threads.
> > >
> > >If we didn't do such a bad job at keeping tasks together with their
> > >local memory, then we might indeed reduce some of the balance-on-crap
> > >and increase the aggressiveness of periodic balancing.
> > >
> > >Considering we _already_ balance on fork/clone, I don't know what
> > >your argument against this patch is? Doing the balance earlier
> > >and allocating more stuff on the local node is surely not a bad
> > >idea.
> >
> > I don't know who turned that on ;-( I suspect nobody bothered
> > actually measuring it at the time though, or used some crap
> > benchmark like stream to do so. It should get reverted.
>
> So you have numbers to show it hurts? I tested some things where it
> is not supposed to help, and it didn't make any difference. Nobody
> else noticed either.
>
> If the cost of doing the double balance is _really_ that painful,
> then we could skip balance-on-exec for domains with balance-on-fork
> set.

Nick, even if it is not painful, can we skip balance-on-exec if
balance-on-fork is set? There is no need for double balance, right?

Especially with the optimization you are trying to do with this patch,
balance-on-exec may lead to a wrong decision, making this optimization
not work as expected.

or perhaps do balance-on-fork based on clone_flags..
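
Purely as a sketch of what I mean here (CLONE_VM is the real clone
flag that marks a thread sharing the parent's mm; the policy itself
is just an illustration, not tested code):

#include <linux/sched.h>        /* CLONE_VM */

/* Balance at fork/clone only for threads, where it was seen to help;
 * leave a plain fork for balance-on-exec, since it will most likely
 * exec right away. */
static int balance_new_task_at_fork(unsigned long clone_flags)
{
        return (clone_flags & CLONE_VM) != 0;
}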

thanks,
suresh

2007-08-06 01:20:37

by Nick Piggin

Subject: Re: [rfc] balance-on-fork NUMA placement

On Fri, Aug 03, 2007 at 01:10:13PM -0700, Suresh B wrote:
> On Fri, Aug 03, 2007 at 02:20:10AM +0200, Nick Piggin wrote:
> > On Thu, Aug 02, 2007 at 11:33:39AM -0700, Martin Bligh wrote:
> > > Nick Piggin wrote:
> > > >On Wed, Aug 01, 2007 at 03:52:11PM -0700, Martin Bligh wrote:
> > > >>>And so forth. Initial forks will balance. If the children refuse to
> > > >>>die, forks will continue to balance. If the parent starts seeing short
> > > >>>lived children, fork()s will eventually start to stay local.
> > > >>Fork without exec is much rarer than fork with exec. Optimising for
> > > >>the uncommon case is the Wrong Thing to Do (tm). What we decided
> > > >
> > > >It's only the wrong thing to do if it hurts the common case too
> > > >much. Considering we _already_ balance on exec, then adding another
> > > >balance on fork is not going to introduce some order of magnitude
> > > >problem -- at worst it would be 2x but it really isn't too slow
> > > >anyway (at least nobody complained when we added it).
> > > >
> > > >One place where we found it helps is clone for threads.
> > > >
> > > >If we didn't do such a bad job at keeping tasks together with their
> > > >local memory, then we might indeed reduce some of the balance-on-crap
> > > >and increase the aggressiveness of periodic balancing.
> > > >
> > > >Considering we _already_ balance on fork/clone, I don't know what
> > > >your argument against this patch is? Doing the balance earlier
> > > >and allocating more stuff on the local node is surely not a bad
> > > >idea.
> > >
> > > I don't know who turned that on ;-( I suspect nobody bothered
> > > actually measuring it at the time though, or used some crap
> > > benchmark like stream to do so. It should get reverted.
> >
> > So you have numbers to show it hurts? I tested some things where it
> > is not supposed to help, and it didn't make any difference. Nobody
> > else noticed either.
> >
> > If the cost of doing the double balance is _really_ that painful,
> > then we could skip balance-on-exec for domains with balance-on-fork
> > set.
>
> Nick, even if it is not painful, can we skip balance-on-exec if
> balance-on-fork is set? There is no need for double balance, right?

I guess we could. There is no need for the double balance if the exec
happens immediately after the fork, which is surely the common case. I
think there can be some other weird cases (eg multi-threaded code) that
do funny things though...


> Especially with the optimization you are trying to do with this patch,
> balance-on-exec may lead to a wrong decision, making this optimization
> not work as expected.

That's true.