Date: Thu, 18 Jul 2019 18:07:14 +0800
From: Aaron Lu <aaron.lu@linux.alibaba.com>
To: Julien Desfossez
Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3
Message-ID: <20190718100714.GA469@aaronlu>
In-Reply-To: <20190619183302.GA6775@sinkpad>

On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > The result still looks unfair, and particularly, the variance is too high,
> 
> I just want to confirm that I am also seeing the same issue with a
> similar setup. I also tried with the priority boost fix we previously
> posted; the results are slightly better, but we are still seeing a very
> high variance.
> 
> On average, the results I get for 10 30-second runs are still much
> better than nosmt (both sysbench instances pinned to the same sibling)
> for the memory benchmark, and pretty similar for the CPU benchmark, but
> the high variance between runs is indeed concerning.

I was thinking of using the util_avg signal to decide which task wins in
__prio_less() in the cross-CPU case. util_avg is chosen because it
represents how CPU-intensive a task is, so the end result is that the
less CPU-intensive task will preempt the more CPU-intensive one.

Here is the test I did to see how util_avg works (on a single-node VM
with 16 cores / 32 CPUs); a rough shell sketch of the setup follows the
step list:

1. Start tmux and open 3 windows, each running bash;
2. Place two of the shells into two different cgroups, both with cpu.tag set;
3. Switch to the 1st tmux window and start
   will-it-scale/page_fault1_processes -t 16 -s 30 in the first tagged shell;
4. Switch to the 2nd tmux window;
5. Start will-it-scale/page_fault1_processes -t 16 -s 30 in the 2nd
   tagged shell;
6. Switch to the 3rd tmux window;
7. Do some simple things in the 3rd, untagged shell (e.g. ls) to see if
   an untagged task is able to proceed;
8. Wait for the two page_fault1 workloads to finish.
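
Steps 1-5 boil down to roughly the following (only a sketch: the cgroup v1
mount point /sys/fs/cgroup/cpu and the group names are assumptions about my
local setup; cpu.tag is the tagging knob from this series):

  # Windows 1 and 2: put the current shell into its own tagged cgroup.
  sudo mkdir /sys/fs/cgroup/cpu/tag1
  echo 1 | sudo tee /sys/fs/cgroup/cpu/tag1/cpu.tag
  echo $$ | sudo tee /sys/fs/cgroup/cpu/tag1/tasks

  # Window 2 does the same with a second group, e.g. tag2.

  # Windows 1 and 2: run the workload from inside each tagged shell.
  will-it-scale/page_fault1_processes -t 16 -s 30

  # Window 3: leave the shell untagged and just check that it responds.
  ls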
With v3 as it is, I cannot do step 4 and the later steps, i.e. the 16
page_fault1 processes started in step 3 occupy all 16 cores and other
tasks do not get a chance to run, including tmux, which makes switching
tmux windows impossible.

With the below patch on top of v3, which uses util_avg to decide which
task wins, I can do all 8 steps and the final scores of the 2 workloads
are 1796191 and 2199586. The scores are not close, suggesting some
unfairness remains, but I can at least finish the test now...

Here is the diff (consider it a POC):

---
 kernel/sched/core.c  | 35 ++---------------------------------
 kernel/sched/fair.c  | 36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26fea68f7f54..7557a7bbb481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
-				     max && !prio_less(max, class_pick) :
-				     max && prio_less(class_pick, max);
-		if (class_pick->core_cookie && max_is_higher)
-			return idle_sched_class.pick_task(rq);
-
-		return class_pick;
-	}
-
 	/*
 	 * If class_pick is idle or matches cookie, return early.
 	 */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..06fb00689db1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecore = task_cpu(a) == task_cpu(b);
+	struct task_struct *p;
+	s64 delta;
+
+	if (samecore) {
+		/* vruntime is per cfs_rq, walk up to a common cfs_rq first */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+	} else {
+		/* across CPUs: use util_avg to decide which task to run */
+		delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);
+	}
+
+	p = delta > 0 ? b : a;
+	trace_printk("picked %s/%d %s: %Lu %Lu %Ld\n", p->comm, p->pid,
+		     samecore ? "vruntime" : "util_avg",
+		     samecore ? sea->vruntime : sea->avg.util_avg,
+		     samecore ? seb->vruntime : seb->avg.util_avg,
+		     delta);
+
+	return delta > 0;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..02a6d71704f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2454,3 +2454,5 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
-- 
2.19.1.3.ge56e4f7
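
If you want to see which task gets picked while the test runs: the
trace_printk() calls above land in the ftrace ring buffer, so something
like this should work (assuming debugfs is mounted at /sys/kernel/debug):

  # Stream the trace buffer; cfs_prio_less() output shows up as "picked ...".
  sudo cat /sys/kernel/debug/tracing/trace_pipe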