Message-ID: <464DA995.8060400@bigpond.net.au>
Date: Fri, 18 May 2007 23:26:45 +1000
From: Peter Williams <pwil3058@bigpond.net.au>
To: Ingo Molnar
CC: Linux Kernel Mailing List
Subject: Re: [patch] CFS scheduler, -v12

Peter Williams wrote:
> Ingo Molnar wrote:
>> * Peter Williams wrote:
>>
>>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1
>>> with and without CFS; and the problem is always present.  It's not
>>> "nice" related as all four tasks are run at nice == 0.
>>
>> could you try -v13 and did this behavior get better in any way?
>
> It's still there, but I've got a theory about what the problem is that
> is supported by some other tests I've done.
>
> What I'd forgotten is that I had gkrellm running as well as top (to
> observe which CPU tasks were on) at the same time as the spinners were
> running.  This meant that between them top, gkrellm and X were using
> about 2% of the CPU -- not much, but enough to make it possible that
> at least one of them was running when the load balancer was trying to
> do its thing.
>
> This raises two possibilities: 1. the system looked balanced, and
> 2. the system didn't look balanced but one of top, gkrellm or X was
> moved instead of one of the spinners.
>
> If it's 1 then there's not much we can do about it except say that it
> only happens in these strange circumstances.  If it's 2 then we may
> have to modify the way move_tasks() selects which tasks to move (if
> we think that the circumstances warrant it -- I'm not sure that they
> do).
>
> To examine these possibilities I tried two variations of the test.
>
> a. Run the spinners at nice == -10 instead of nice == 0.  When I did
> this the load balancing was perfect on 10 consecutive runs, which
> according to my calculations makes it 99.9999997% certain that this
> didn't happen by chance.  This supports theory 2 above.
>
> b. Run the tests without gkrellm running, but with nice == 0 for the
> spinners.  When I did this the load balancing was mostly perfect but
> quite volatile (switching between a 2/2 and a 1/3 allocation of
> spinners to CPUs); the %CPU allocation was still quite good, with the
> spinners each getting approximately 49% of a CPU.  This also supports
> theory 2 above and gives weak support to theory 1.
>
> This leaves the question of what to do about it.
> Given that most CPU-intensive tasks on a real system probably only
> run for a few tens of milliseconds, it probably won't matter much on
> a real system, except that a malicious user could exploit it to
> disrupt a system.
>
> So my opinion is that we probably do need to do something about it,
> but that it's not urgent.
>
> One thing that might work is to jitter the load balancing interval a
> bit.  The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval
> (and, in this case, X would also be following this pattern as it's
> doing screen updates for top and gkrellm), and this means that it's
> possible for the load balancing interval to synchronize with their
> intervals, which in turn causes the observed problem.  A jittered
> load balancing interval should break the synchronization.  This would
> certainly be simpler than trying to change the move_tasks() logic for
> selecting which tasks to move.

I should have added that the reason I think this mooted synchronization
is the cause of the problem is that I can think of no other way that
tasks with such low activity (2% between the three of them) could make
the imbalance in the spinner-to-CPU allocation so persistent.

> What do you think?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
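
For illustration, here is a minimal, self-contained user-space sketch
of the jittered-interval idea above: each balancing round, the nominal
period gets a small random offset, so balance points cannot stay
phase-locked with other periodic activity such as top, gkrellm or X.
The names and constants are made up for the example; this is not the
scheduler's actual code.

/* Sketch only: jitter a periodic load-balance interval so it drifts
 * off any fixed grid that another periodic task could line up with.
 * Hypothetical names and values; not kernel code. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BASE_INTERVAL_MS 64     /* nominal balancing period    */
#define JITTER_RANGE_MS  16     /* +/- window added each round */

/* Next balancing delay: the base period plus a random offset in
 * [-JITTER_RANGE_MS, +JITTER_RANGE_MS]. */
static unsigned int next_balance_delay(void)
{
        int jitter = (rand() % (2 * JITTER_RANGE_MS + 1))
                     - JITTER_RANGE_MS;

        return BASE_INTERVAL_MS + jitter;
}

int main(void)
{
        unsigned long now_ms = 0;

        srand((unsigned)time(NULL));

        /* Print the first ten balance times; they no longer fall on
         * a fixed 64ms grid. */
        for (int i = 0; i < 10; i++) {
                now_ms += next_balance_delay();
                printf("balance #%d at %lu ms\n", i + 1, now_ms);
        }
        return 0;
}

Because consecutive balance points drift off any fixed grid, a task
that wakes on a constant period can no longer be the one that always
happens to be runnable at the moment the balancer looks.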