MIME-Version: 1.0
In-Reply-To: <1302261350.9086.120.camel@twins>
References: <20110408002322.3A0D812217F@elm.corp.google.com>
	<1302261350.9086.120.camel@twins>
Date: Fri, 8 Apr 2011 12:29:33 -0700
Message-ID: <BANLkTin4aQ7Gc+h0YCPo=Eokc2iz06WNbw@mail.gmail.com>
Subject: Re: [PATCH] sched: fix sched-domain avg_load calculation.
From: Ken Chen <kenchen@google.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: mingo@elte.hu, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1886
Lines: 39

On Fri, Apr 8, 2011 at 4:15 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-04-07 at 17:23 -0700, Ken Chen wrote:
>> In function find_busiest_group(), the sched-domain avg_load isn't
>> calculated at all if there is a group imbalance within the domain.
>> This causes an erroneous imbalance calculation.  The reason is
>> that calculate_imbalance() sees sds->avg_load = 0 and dumps the
>> entire sds->max_load into the imbalance variable, which is later
>> used to migrate the entire load from the busiest CPU to the puller
>> CPU. This has two really bad effects:
>>
>> 1. A stampede of task migrations that cannot break out of the bad
>>    state because of a positive feedback loop: large load delta ->
>>    heavier load migration -> larger imbalance, and the cycle goes on.
>>
>> 2. Severe imbalance in CPU queue depth.  This causes long
>>    scheduling latency blips, which badly affect applications that
>>    have tight latency requirements.
>>
>> The fix is to have the kernel calculate the domain avg_load in both
>> cases. This ensures that the imbalance calculation is always sensible
>> and the target is usually halfway between the busiest and puller CPUs.
>
> Indeed so, it looks like I broke that in 866ab43efd32. Out of curiosity,
> what kind of workload did you observe this on?

This was observed on an application that serves web search queries.  CPU
queue depths were uneven across the system, which led to a long query
latency tail.  The tail latencies both occurred frequently and were
stretched out in time.

With this fix, both server throughput and latency response were improved.
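
To illustrate the arithmetic described in the quoted changelog, here is a
minimal user-space sketch.  It is not the kernel code: struct sd_stats,
sketch_avg_load() and sketch_imbalance() are hypothetical names, cpu_power
scaling is ignored, and the domain is assumed to hold only two groups.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-in for the sched-domain statistics. */
struct sd_stats {
	unsigned long max_load;   /* load of the busiest group */
	unsigned long this_load;  /* load of the pulling (local) group */
	unsigned long avg_load;   /* domain average; 0 if never calculated */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Assumption: a two-group domain, so the average is a simple mean. */
static unsigned long sketch_avg_load(unsigned long max_load,
				     unsigned long this_load)
{
	return (max_load + this_load) / 2;
}

/*
 * Simplified imbalance: pull no more than what brings the busiest
 * group down to the average, and no more than what brings the local
 * group up to the average.
 */
static unsigned long sketch_imbalance(const struct sd_stats *s)
{
	unsigned long max_pull = s->max_load - s->avg_load;
	/* Unsigned arithmetic: wraps to a huge value if avg_load < this_load. */
	unsigned long room = s->avg_load - s->this_load;

	return min_ul(max_pull, room);
}
```

With max_load = 2048 and this_load = 512, leaving avg_load at 0 makes the
sketch return 2048, i.e. the entire busiest load.  With avg_load set from
sketch_avg_load() (1280), it returns 768, which lands both groups on the
average.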

- Ken