Date: Wed, 28 May 2003 10:13:26 -0700
From: "Martin J. Bligh"
To: Erich Focht
cc: Andi Kleen, LSE, linux-kernel
Subject: Re: [Lse-tech] Node affine NUMA scheduler extension
Message-ID: <26980000.1054142004@[10.10.2.4]>
In-Reply-To: <200305281902.35511.efocht@hpce.nec.com>
References: <200305271031.55554.efocht@hpce.nec.com>
 <200305272328.27269.efocht@hpce.nec.com> <2640000.1054072312@[10.10.2.4]>
 <200305281902.35511.efocht@hpce.nec.com>

>> And is this restricted to just clones? Doesn't seem to be, unless
>> that's implicit in homenode_unset?
>
> Hmmm... That's actually the difficult issue with fork vs. clone and
> migrating or not. When you clone a small task, you'd usually like to
> keep it on the same node. If it's a big task and faults in a lot of
> memory, it could make sense to let it migrate to another node. For
> forked tasks it's the same; you might have some more COW memory
> movement, but after that you'll usually be happy if the child went
> away from the parent's node, provided it is a long-running child.
> Not so for a short-running child.

Can you define what you mean by big vs. small? I presume you mean RSS?
There are several factors that come into play here, at least:

1. RSS (and which bits of it lie on which node)
2. CPU utilisation of the task
3. Task duration
4. Cache warmth
5. The current balance situation

I'm not convinced there's anything we can do about 3, since it's mostly
unpredictable, and I don't have crystal_ball.o ... we can kind of
predict 2, at least for running tasks (I'm not sure it makes much sense
over fork). For 1, we can kind of use that over a fork, though as we
have no idea whether it'll exec or not, I'm not sure it's much use
*unless* we're cloning.

> The current compromise:
> - start children (no matter whether forked or cloned) on the same CPU
> - allow other CPUs to steal freshly forked children by reducing the
>   cache affinity until the first steal/migration
> - this is fine for children which exec soon: they might exec before
>   anyone can steal them, drop all their pages, and find a new life on
>   the optimal node
>
> This way exec'd tasks get properly balanced, and forked/cloned ones
> get a better start than in the current scheduler. I'll post a patch
> on top of the node affine scheduler doing this in 1-2 days.

Turning on child-runs-first would instinctively be good as well, though
I have no benchmarks to prove it ... it would be nice to get the
load-balance situation fixed asap by reaching the exec sooner. That's
not on right now, IIRC.

M.
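
The compromise Erich describes above (children start on the parent's
CPU with weakened cache affinity until the first steal, and exec is the
cheap point to rebalance) amounts to a small amount of per-task
bookkeeping. The toy C sketch below is purely illustrative: the struct
and function names (toy_task, toy_fork, toy_can_steal, toy_exec) are
made up for the example and are not taken from the node affine
scheduler patch.

/* Toy model of the fork/clone/exec compromise described above.
 * All names are hypothetical; this is not the actual patch. */
struct toy_task {
	int cpu;             /* CPU the task last ran on                */
	int node;            /* current home node                       */
	int freshly_forked;  /* set at fork/clone, cleared on the first
	                      * migration or on exec                    */
};

/* fork()/clone(): the child starts on the parent's CPU, but is marked
 * so that its cache affinity is treated as weak. */
static void toy_fork(const struct toy_task *parent, struct toy_task *child)
{
	child->cpu = parent->cpu;
	child->node = parent->node;
	child->freshly_forked = 1;
}

/* Load-balance check: a freshly forked child skips the usual
 * cache-warmth barrier, so an idle CPU on another node may steal it. */
static int toy_can_steal(const struct toy_task *p, int cache_cold)
{
	return p->freshly_forked || cache_cold;
}

/* exec(): the old pages are dropped anyway, so this is the cheap
 * moment to move the task to the best (least loaded) node. */
static void toy_exec(struct toy_task *p, int best_cpu, int best_node)
{
	p->freshly_forked = 0;
	p->cpu = best_cpu;
	p->node = best_node;
}

A task that execs before anyone steals it therefore never pays a
remote-memory penalty for the fork, while a long-running child that
never execs can still be pulled to another node while its footprint is
small.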