Date: Thu, 25 Jul 2019 22:30:03 +0800
From: Aaron Lu <aaron.lu@linux.alibaba.com>
To: Aubrey Li
Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
    Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
    Thomas Gleixner, Paul Turner, Linus Torvalds,
    Linux List Kernel Mailing, Frédéric Weisbecker, Kees Cook,
    Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
    Pawan Gupta, Paolo Bonzini
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3
Message-ID: <20190725143003.GA992@aaronlu>
References: <20190531210816.GA24027@sinkpad>
 <20190606152637.GA5703@sinkpad>
 <20190612163345.GB26997@sinkpad>
 <635c01b0-d8f3-561b-5396-10c75ed03712@oracle.com>
 <20190613032246.GA17752@sinkpad>
 <20190619183302.GA6775@sinkpad>
 <20190718100714.GA469@aaronlu>

On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
> The granularity period of util_avg seems too large to decide task priority
> during pick_task(), at least it is in my case: cfs_prio_less() always picked
> the core max task, so pick_task() eventually picked idle, which makes this
> change not very helpful for my case.
>
>  <idle>-0  [057] dN..  83.716973: __schedule: max: sysbench/2578 ffff889050f68600
>  <idle>-0  [057] dN..  83.716974: __schedule: (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0  [057] dN..  83.716975: __schedule: (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0  [057] dN..  83.716975: cfs_prio_less: picked sysbench/2578 util_avg: 20 527 -507  <======= here===
>  <idle>-0  [057] dN..  83.716976: __schedule: pick_task cookie pick swapper/5/0 ffff889050f68600

I tried a different approach based on vruntime, with 3 patches following.

When the two tasks are on the same CPU, no change is made: I still route
the two sched entities up till they are in the same group (cfs_rq) and
then do the vruntime comparison.

When the two tasks are on different threads of the same core, the root
level sched entities to which the two tasks belong are used to do the
comparison. A rough sketch of this comparison, and of the core wide min
vruntime it relies on, follows at the end of this mail.

An ugly illustration for the cross CPU case:

       cpu0             cpu1
      /   |   \        /   |   \
    se1  se2  se3    se4  se5  se6
        /   \            /   \
     se21   se22      se61   se62

Assume CPU0 and CPU1 are smt siblings, task A's se is se21 and task B's
se is se61. To compare the priority of task A and task B, we compare the
priority of se2 and se6: the smaller vruntime wins.

To make this work, the root level ses on both CPUs should have a common
cfs_rq min vruntime, which I call the core cfs_rq min vruntime. This is
mostly done in patch2/3.

Test:
1 wrote a cpu intensive program that does nothing but while(1) in
  main(); let's call it cpuhog (a minimal version is sketched below);
2 started 2 cgroups, with one cgroup's cpuset bound to CPU2 and the
  other's bound to CPU3; CPU2 and CPU3 are smt siblings on the test VM;
3 enabled cpu.tag for the two cgroups;
4 started one cpuhog task in each cgroup;
5 killed both cpuhog tasks after 10 seconds;
6 checked each cgroup's cpu usage.

If the tasks are scheduled fairly, each cgroup's cpu usage should be
around 5s. With v3, the cpu usage of the two cgroups is sometimes 3s and
7s, sometimes 1s and 9s. With the 3 patches applied, the numbers are
mostly around 5s and 5s.

Another test is starting two cgroups simultaneously with cpu.tag set,
with one cgroup running:
  will-it-scale/page_fault1_processes -t 16 -s 30
and the other running:
  will-it-scale/page_fault2_processes -t 16 -s 30

With v3, like I said last time, the later started page_fault processes
can't get to run. With the 3 patches applied, both run at the same time,
with each CPU having a relatively fair score:

output line of the 16 page_fault1 processes in a 1 second interval:
  min:105225 max:131716 total:1872322
output line of the 16 page_fault2 processes in a 1 second interval:
  min:86797 max:110554 total:1581177

Note the min and max values: the smaller the gap between them, the
better the fairness.
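For illustration, here is a minimal sketch of the comparison described
above. This is not the actual patch: it assumes CONFIG_FAIR_GROUP_SCHED
and reuses the existing fair.c helpers task_cpu(), parent_entity() and
is_same_group().

/*
 * Return true if task a has lower priority (larger vruntime) than b.
 */
bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
	struct sched_entity *sea = &a->se;
	struct sched_entity *seb = &b->se;

	if (task_cpu(a) == task_cpu(b)) {
		/*
		 * Same CPU: walk both entities up until they sit in
		 * the same cfs_rq, where their vruntimes are directly
		 * comparable.
		 */
		while (!is_same_group(sea, seb)) {
			int sea_depth = sea->depth;
			int seb_depth = seb->depth;

			if (sea_depth >= seb_depth)
				sea = parent_entity(sea);
			if (sea_depth <= seb_depth)
				seb = parent_entity(seb);
		}
	} else {
		/*
		 * SMT siblings: compare the root level sched entities.
		 * Their vruntimes advance against the common core wide
		 * min vruntime, so they are comparable too.
		 */
		while (sea->parent)
			sea = sea->parent;
		while (seb->parent)
			seb = seb->parent;
	}

	/* The smaller vruntime wins. */
	return (s64)(sea->vruntime - seb->vruntime) > 0;
}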
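Similarly, a sketch of the core cfs_rq min vruntime idea: the root
cfs_rq of each smt sibling reads its min_vruntime from one designated rq
of the core. rq->core and sched_core_enabled() below are what the core
scheduling series provides; is_root_cfs_rq() is written out here only to
make the sketch self contained, so treat the details as illustrative.

static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
{
	return cfs_rq == &rq_of(cfs_rq)->cfs;
}

static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
{
	return &rq_of(cfs_rq)->core->cfs;
}

static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
{
	/* Root cfs_rqs of smt siblings share the core wide min_vruntime. */
	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
		return core_cfs_rq(cfs_rq)->min_vruntime;

	return cfs_rq->min_vruntime;
}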
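And for completeness, the cpuhog from step 1 of the test is nothing more
than this (a minimal version, in case anyone wants to reproduce):

/* cpuhog.c: burn cpu doing nothing; kill after 10s and compare usage */
int main(void)
{
	while (1)
		;
	return 0;
}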
Aubrey, I haven't been able to run your workload yet...