Date: Wed, 11 Nov 2009 15:21:33 +0900
From: Yasunori Goto
To: Peter Zijlstra
Cc: Miao Xie, Linux-Kernel, containers, Ingo Molnar
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

Hi. I have concerns about these issues.

> On Tue, 2009-11-03 at 11:26 +0900, Miao Xie wrote:
> > Hi, Peter.
> >
> > I found two problems with the cpu controller:
> > 1) the cpu controller didn't provide fair CPU time to groups when the tasks
> >    attached to those groups were bound to the same logical CPU.
> > 2) the cpu controller didn't provide fair CPU time to groups when the shares
> >    of each group were <= 2 * nr_cpus.
>
> 3) if you nest them too deep you're also going to see similar funnies.
>
> Too sodding bad gcc messed up unsigned long long for LP64 mode, so we're
> stuck with 64bit fixed point math where otherwise we could have used
> 128bit things.
>
> Also, I don't really care much about fairness vs affinity; if you're
> going to constrain the load-balancer and make his life impossible by
> using affinities you get to keep the pieces.

I think 1) differs from 2) and 3). 1) should be fixed at least, because 1) is
hard for normal users to understand.

------
> > a. create 2 cpu controller groups.
> > b. attach a task into one group and 2 tasks into the other.
> > c. bind the three tasks to the same logical cpu.
> >
> >        +--------+     +--------+
> >        | group1 |     | group2 |
> >        +--------+     +--------+
> >            |               |
> >  CPU0    Task A     Task B & Task C

(snip)

> > Some time later, I found that the task in group1 got 35% of the CPU time,
> > not 50%. It was very strange that this result went against the expectation.
------

If normal users start to use the cpu controller, they will probably run the
same kind of test as a trial. They will see the same results, which differ
from their expectation in spite of this being a SIMPLE test case. Then they
will think the cpu controller must have many bugs, and will never use it.

For 2) and 3), I can explain the reason to users as "it's due to rounding
error in the internal calculation", because those test cases are a bit fussy,
and users will understand that. However, I can't explain anything for 1),
because it does not seem like a fussy case.

> But you've got a point, since you can probably see the same issue (1)
> with cpusets, and that is because the whole cpu-controller vs cpusets
> thing was done wrong.

When users use cpusets/cpu affinity, they want to control cpu affinity, not
CPU time. When users use the cpu controller, they want to control cpu time,
right? However, the cpu time they get is far from their expectation. I think
it is strange.

> Someone needs to fix that if they really care.

To be honest, I don't have any good idea, because I'm not familiar with the
scheduler's code. But I have one question.

1618 static int tg_shares_up(struct task_group *tg, void *data)
1619 {
1620         unsigned long weight, rq_weight = 0, shares = 0;

(snip)

1632         for_each_cpu(i, sched_domain_span(sd)) {
1633                 weight = tg->cfs_rq[i]->load.weight;
1634                 usd->rq_weight[i] = weight;
1635
1636                 /*
1637                  * If there are currently no tasks on the cpu pretend there
1638                  * is one of average load so that when a new task gets to
1639                  * run here it will not get delayed by group starvation.
1640                  */
1641                 if (!weight)
1642                         weight = NICE_0_LOAD;        ---------(*)

I heard from the test team that when (*) was removed, 1) did not occur.
The comment says (*) is there to avoid a starvation condition. However, I
don't understand why NICE_0_LOAD must be specified. Could you tell me why a
small value (like 2 or 3) is not used for (*)? What is the side effect?
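For what it is worth, below is my own back-of-the-envelope model of what I
think (*) does to the simple test case above. This is only a sketch under my
own assumptions (4 logical CPUs, both groups left at the default cpu.shares
of 1024, NICE_0_LOAD == 1024), and it ignores the averaging that the real
update path does, so please take the numbers as illustration only.

/* Toy model of the per-cpu shares distribution in tg_shares_up() for the
 * test above.  Assumptions (mine, not from the quoted kernel code):
 *   - 4 logical CPUs, all tasks bound to CPU0
 *   - both groups use the default cpu.shares of 1024
 *   - NICE_0_LOAD == 1024
 */
#include <stdio.h>

#define NR_CPUS     4
#define TG_SHARES   1024UL
#define NICE_0_LOAD 1024UL

/* Shares handed to a group's entity on CPU0, given the load of the group's
 * tasks on CPU0 and the weight pretended for the idle CPUs by (*). */
static unsigned long cpu0_shares(unsigned long cpu0_load,
                                 unsigned long idle_pretend)
{
        unsigned long rq_weight = cpu0_load + (NR_CPUS - 1) * idle_pretend;

        return TG_SHARES * cpu0_load / rq_weight;
}

int main(void)
{
        /* group1: one nice-0 task on CPU0; group2: two nice-0 tasks on CPU0 */
        unsigned long g1, g2;

        g1 = cpu0_shares(1 * NICE_0_LOAD, NICE_0_LOAD); /* pretend NICE_0_LOAD */
        g2 = cpu0_shares(2 * NICE_0_LOAD, NICE_0_LOAD);
        printf("pretend NICE_0_LOAD: group1 %lu, group2 %lu -> group1 ~%lu%%\n",
               g1, g2, 100 * g1 / (g1 + g2));

        g1 = cpu0_shares(1 * NICE_0_LOAD, 2);           /* pretend small value */
        g2 = cpu0_shares(2 * NICE_0_LOAD, 2);
        printf("pretend 2:           group1 %lu, group2 %lu -> group1 ~%lu%%\n",
               g1, g2, 100 * g1 / (g1 + g2));

        return 0;
}

With NICE_0_LOAD pretended on the idle CPUs, this toy model gives group1
about 256 shares on CPU0 against about 409 for group2, i.e. roughly a 38%/62%
split, which is in the same ballpark as the 35% reported above. With a tiny
pretended weight it comes back to roughly 50%/50%, which matches what the
test team saw when (*) was removed.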
Thanks.

-- 
Yasunori Goto