From: Andrew Theurer
To: Nick Piggin
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Minor scheduler fix to get rid of skipping in xmms
Date: Thu, 11 Sep 2003 09:37:29 -0500
Message-Id: <200309110937.29165.habanero@us.ibm.com>
In-Reply-To: <3F607E4F.8070200@cyberone.com.au>
References: <200309102155.16407.habanero@us.ibm.com> <200309110805.02334.habanero@us.ibm.com> <3F607E4F.8070200@cyberone.com.au>

> >>> I see Nick's balance patch as somewhat harmless, at least combined
> >>> with the A3 patch. However, one concern is that the "ping-pong"
> >>> steal interval is not really 200ms, but 200ms/(nr_cpus-1), which,
> >>> without A3, could show up as a problem, especially on an 8-way box.
> >>> In addition, I do think there's a problem with the number of tasks
> >>> we steal. It should not be imbalance/2; it should be:
> >>> max_load - (node_nr_running / num_cpus_node). If we steal any more
> >>> than this, which is quite possible with imbalance/2, then it's
> >>> likely this_cpu now has too many tasks, and some other cpu will
> >>> steal again. Using *imbalance/2 works fine on 2-way SMP, but I'm
> >>> pretty sure we "over-steal" tasks on 4-way and up. Anyway, I'm
> >>> getting off topic here...
> >>
> >> IIRC max_load is supposed to be the number of tasks on the runqueue
> >> being stolen from, isn't it?
> >
> > Yes, but I think I still got this wrong. Ideally, once we finish
> > stealing, the busiest runqueue should not have more than
> > node_nr_running/nr_cpus_node, but more importantly, this_cpu should
> > not have more than node_nr_running/nr_cpus_node, so maybe it should
> > be:
> >
> > min(a,b), where:
> > a = max_load - load_average      (how much we are over the load_average)
> > b = load_average - this_load     (how much we are under the load_average)
> > load_average = node_nr_running / nr_cpus_node
> >
> > node_nr_running can be summed as we look for the busiest queue, so it
> > should not be too costly. If min(a,b) is negative (this_cpu's runqueue
> > length was greater than load_average), we don't steal at all.
>
> Oh, OK, you're thinking about balancing across the entire NUMA system.
> I was just thinking it will eventually settle down, but you're right:
> it's probably better to overdampen the balancing than to underdampen it.

Actually, this is really geared towards balancing within the node. The goal
for each cpu in the node (or on a non-NUMA system) should be to steal just
enough that rq->nr_running ends up at nr_running()/nr_cpus. I'm still not
sure how many tasks we should really steal for internode balancing.
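For what it's worth, here is a minimal user-space sketch of that min(a,b)
rule. The function name tasks_to_steal and the test harness are mine, for
illustration only; this is not the actual scheduler code, just the
arithmetic described above:

        /* Sketch of the proposed steal-count calculation (illustrative
         * names, not from kernel/sched.c). */
        #include <stdio.h>

        static long tasks_to_steal(long max_load, long this_load,
                                   long node_nr_running, long nr_cpus_node)
        {
                long load_average = node_nr_running / nr_cpus_node;
                long a = max_load - load_average;  /* busiest queue's excess over average */
                long b = load_average - this_load; /* this queue's deficit under average */
                long steal = a < b ? a : b;        /* min(a,b) */

                /* Negative means this_cpu is already at or over the
                 * average, so we don't steal at all. */
                return steal > 0 ? steal : 0;
        }

        int main(void)
        {
                /* 4-cpu node with 12 runnable tasks: load_average = 3.
                 * Busiest queue has 6 tasks, this queue has 1, so we
                 * steal min(6-3, 3-1) = 2. */
                printf("steal %ld tasks\n", tasks_to_steal(6, 1, 12, 4));
                return 0;
        }

Note that capping the steal at b is what keeps this_cpu from overshooting
the average and triggering the next round of stealing.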