Received: by 2002:a25:ab43:0:0:0:0:0 with SMTP id u61csp8379854ybi; Thu, 6 Jun 2019 11:15:40 -0700 (PDT) X-Google-Smtp-Source: APXvYqxXEHc58+GqT/e0e7EnjPTHbjpUYQiaX619ykvhr31W7ewKZlgGUnw1voR7mUaMLT+e+sPK X-Received: by 2002:a17:902:8a94:: with SMTP id p20mr32588540plo.312.1559844940658; Thu, 06 Jun 2019 11:15:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1559844940; cv=none; d=google.com; s=arc-20160816; b=N8GfW23oyIlPp7CJiMe09k3hFrPG7bpYaoL9BlAeLjyJadVi8e1miSmS2sIvHH+Suu cM4QomvXGD1FQt7r2PDG/ZlLqCCozkZ/EI81UTYkS7XHHvgdn73+ctWi+XoDnVoEm6Ge 1GAiS3WIZSU1ZSoWdFZOLWXp5CiPFXiZGj4++t0volpUXa8H1eozr22IgXZeB7u1DKlr DlRCm03wlL/WRjdX+/TEi/p6ScJf0CBIOZxCLowgG+JwgwRwO4yY6hN5UnEGw9niuPev 5pzpfxdF8U58vPYh1Qhzc+WhAdX4ge9urFNxn3izSaiNbmvKAbW8akFHSu57GY6s+7eE 389w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-transfer-encoding:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=7IPeRG+XRH2ia6q2FaZ0+0nEBCt2oXJyyHCFUpk0wYA=; b=lGEn6VfqwcZNtMFJoTth6pyGN6RNGC/h/0RPGGQ1o3vbljV1txLpq1o3DjH2UzVB1i LrIM3+ZHzJ4MowXRmm4/ZeHU91FezZ/qHkHwTLFNFAELX9bfCGycYmSVXxL6MUytMmrp RrShU62kzWuyPaxziMoxCdp785Pauuiceu+EyaNwgXzkabnC7zPtsb+5EpHtdvYZFXR4 H491cSisxH14bm46xPd1gOUQmOHxP9VnritiobZ//EsZ+sYjPA2KHH9kohSFGbYgrIgY Yu3+8ZzcwOP7a90AQQ3a3F4EYgUC3H92MjhesXn25rwrfgBUZq1K5rX/VRXCbBrAeB8M bS3g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=WEx1hidF; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o6si2641068pgk.52.2019.06.06.11.15.23; Thu, 06 Jun 2019 11:15:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=WEx1hidF; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729328AbfFFP0u (ORCPT + 99 others); Thu, 6 Jun 2019 11:26:50 -0400 Received: from mail-qk1-f196.google.com ([209.85.222.196]:44219 "EHLO mail-qk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729077AbfFFP0u (ORCPT ); Thu, 6 Jun 2019 11:26:50 -0400 Received: by mail-qk1-f196.google.com with SMTP id w187so1671767qkb.11 for ; Thu, 06 Jun 2019 08:26:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=digitalocean.com; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=7IPeRG+XRH2ia6q2FaZ0+0nEBCt2oXJyyHCFUpk0wYA=; b=WEx1hidFm1qwEHyuxngLYoJZ+Y5p+ZlwoQBCUqQ8ulFW6uIj3mTnFkHiOO+ZCE5N61 GhUcqJPHpwVVV5Unadw2n36NCegtBzOaGm1tc1or/fhdB2wc3bJtDnG2W0JYZj3awzFD 8ju6VjanD1VfQnfLPehzeTbyTtZqS01JExW98= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=7IPeRG+XRH2ia6q2FaZ0+0nEBCt2oXJyyHCFUpk0wYA=; b=oKktuY1VSr5xjZ02A6sr7sQdU54698oehriekNYyj/YQZvn9AiIDmncRF53LzFyjow hqYSSTtj4yo95tJZSLquH8b205ShuP428VXC1eGMXMmrP7TvQY+Aig1cGHU/FEZP7PTW 3V8BuazGkFrV2Qd3m+lNXiZndgYZ8cjhmIFJwpMrRXTSOFeude2H2CD/EjB8PAWi9dCG Gf/FaPknFVPQp6eSZNpCVXmUMJnyJXAwrTZVWvO8vliaww2VSau90kWFAbV+c5iSyzBh 9NjZPCExnJ2/PLvfZb+6hK8upbs9T8KT1AnoDZS2IihnrudxDOQCN4iefrK5iB3JN92i um0w== X-Gm-Message-State: APjAAAVB+E5Hn7xZfHeADFHA49zF2Ja4ZkyRTN9M3NPD7DpxLieF+nPw eyvQpfnB7ht7dt4RhudXo3zyGA== X-Received: by 2002:a05:620a:1285:: with SMTP id w5mr29816790qki.302.1559834808895; Thu, 06 Jun 2019 08:26:48 -0700 (PDT) Received: from sinkpad (192-222-189-155.qc.cable.ebox.net. [192.222.189.155]) by smtp.gmail.com with ESMTPSA id j26sm1325248qtj.70.2019.06.06.08.26.46 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 06 Jun 2019 08:26:47 -0700 (PDT) Date: Thu, 6 Jun 2019 11:26:37 -0400 From: Julien Desfossez To: Aaron Lu Cc: Aubrey Li , Vineeth Remanan Pillai , Nishanth Aravamudan , Peter Zijlstra , Tim Chen , Ingo Molnar , Thomas Gleixner , Paul Turner , Linus Torvalds , Linux List Kernel Mailing , Subhra Mazumdar , =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Kees Cook , Greg Kerr , Phil Auld , Valentin Schneider , Mel Gorman , Pawan Gupta , Paolo Bonzini Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3 Message-ID: <20190606152637.GA5703@sinkpad> References: <20190531210816.GA24027@sinkpad> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20190531210816.GA24027@sinkpad> X-Mailer: Mutt 1.5.24 (2015-08-30) User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 31-May-2019 05:08:16 PM, Julien Desfossez wrote: > > My first reaction is: when shell wakes up from sleep, it will > > fork date. If the script is untagged and those workloads are > > tagged and all available cores are already running workload > > threads, the forked date can lose to the running workload > > threads due to __prio_less() can't properly do vruntime comparison > > for tasks on different CPUs. So those idle siblings can't run > > date and are idled instead. See my previous post on this: > > > > https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/ > > (Now that I re-read my post, I see that I didn't make it clear > > that se_bash and se_hog are assigned different tags(e.g. hog is > > tagged and bash is untagged). > > > > Siblings being forced idle is expected due to the nature of core > > scheduling, but when two tasks belonging to two siblings are > > fighting for schedule, we should let the higher priority one win. > > > > It used to work on v2 is probably due to we mistakenly > > allow different tagged tasks to schedule on the same core at > > the same time, but that is fixed in v3. > > I confirm this is indeed what is happening, we reproduced it with a > simple script that only uses one core (cpu 2 and 38 are sibling on this > machine): > > setup: > cgcreate -g cpu,cpuset:test > cgcreate -g cpu,cpuset:test/set1 > cgcreate -g cpu,cpuset:test/set2 > echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus > echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems > echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus > echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus > echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems > echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems > echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag > > In one terminal: > sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30 > --test=cpu run > > In another one: > sudo cgexec -g cpu,cpuset:test/set2 date > > It's very clear that 'date' hangs until sysbench is done. > > We started experimenting with marking a task on the forced idle sibling > if normalized vruntimes are equal. That way, at the next compare, if the > normalized vruntimes are still equal, it prefers the task on the forced > idle sibling. It still needs more work, but in our early tests it helps. As mentioned above, we have come up with a fix for the long starvation of untagged interactive threads competing for the same core with tagged threads at the same priority. The idea is to detect the stall and boost the stalling threads priority so that it gets a chance next time. Boosting is done by a new counter(min_vruntime_boost) for every task which we subtract from vruntime before comparison. The new logic looks like this: If we see that normalized runtimes are equal, we check the min_vruntimes of their runqueues and give a chance for the task in the runqueue with less min_vruntime. That will help it to progress its vruntime. While doing this, we boost the priority of the task in the sibling so that, we don’t starve the task in the sibling until the min_vruntime of this runqueue catches up. If min_vruntimes are also equal, we do as before and consider the task ‘a’ of higher priority. Here we boost the task ‘b’ so that it gets to run next time. The min_vruntime_boost is reset to zero once the task in on cpu. So only waiting tasks will have a non-zero value if it is starved while matching a task on the other sibling. The attached patch has a sched_feature to enable the above feature so that you can compare the results with and without this feature. What we observe with this patch is that it helps for untagged interactive tasks and fairness in general, but this increases the overhead of core scheduling when there is contention for the CPU with tasks of varying cpu usage. The general trend we see is that if there is a cpu intensive thread and multiple relatively idle threads in different tags, the cpu intensive tasks continuously yields to be fair to the relatively idle threads when it becomes runnable. And if the relatively idle threads make up for most of the tasks in a system and are tagged, the cpu intensive tasks sees a considerable drop in performance. If you have any feedback or creative ideas to help improve, let us know ! Thanks diff --git a/include/linux/sched.h b/include/linux/sched.h index 1a309e8..56cad0e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -642,6 +642,7 @@ struct task_struct { struct rb_node core_node; unsigned long core_cookie; unsigned int core_occupation; + unsigned int core_vruntime_boost; #endif #ifdef CONFIG_CGROUP_SCHED diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 73329da..c302853 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) int pa = __task_prio(a), pb = __task_prio(b); + trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n", + a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline, + b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline); + if (-pa < -pb) return true; @@ -102,21 +106,36 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) return !dl_time_before(a->dl.deadline, b->dl.deadline); if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */ - u64 vruntime = b->se.vruntime; - - trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n", - a->comm, a->pid, pa, a->se.vruntime, task_cfs_rq(a)->min_vruntime, - b->comm, b->pid, pb, b->se.vruntime, task_cfs_rq(b)->min_vruntime); + u64 a_vruntime = a->se.vruntime - a->core_vruntime_boost; + u64 b_vruntime = b->se.vruntime - b->core_vruntime_boost; /* * Normalize the vruntime if tasks are in different cpus. */ if (task_cpu(a) != task_cpu(b)) { - vruntime -= task_cfs_rq(b)->min_vruntime; - vruntime += task_cfs_rq(a)->min_vruntime; + s64 min_vruntime_diff = task_cfs_rq(a)->min_vruntime - + task_cfs_rq(b)->min_vruntime; + b_vruntime += min_vruntime_diff; + + trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n", + a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime, + b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime); + + if (sched_feat(CORESCHED_STALL_FIX) && + a_vruntime == b_vruntime) { + bool less_prio = min_vruntime_diff > 0; + + if (less_prio) + a->core_vruntime_boost++; + else + b->core_vruntime_boost++; + + return less_prio; + + } } - return !((s64)(a->se.vruntime - vruntime) <= 0); + return !((s64)(a_vruntime - b_vruntime) <= 0); } return false; @@ -2456,6 +2475,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_COMPACTION p->capture_control = NULL; #endif +#ifdef CONFIG_SCHED_CORE + p->core_vruntime_boost = 0UL; +#endif init_numa_balancing(clone_flags, p); } @@ -3723,6 +3745,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) next->comm, next->pid, next->core_cookie); + next->core_vruntime_boost = 0UL; return next; } @@ -3835,6 +3858,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie); if (old_max) { + if (old_max->core_vruntime_boost) + old_max->core_vruntime_boost--; + for_each_cpu(j, smt_mask) { if (j == i) continue; @@ -3905,6 +3931,7 @@ next_class:; done: set_next_task(rq, next); + next->core_vruntime_boost = 0UL; return next; } diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..332a092 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true) * UtilEstimation. Use estimated CPU utilization. */ SCHED_FEAT(UTIL_EST, true) + +/* + * Prevent task stall due to vruntime comparison limitation across + * cpus. + */ +SCHED_FEAT(CORESCHED_STALL_FIX, false)