Received: by 2002:a25:ab43:0:0:0:0:0 with SMTP id u61csp5944681ybi; Wed, 12 Jun 2019 11:08:32 -0700 (PDT) X-Google-Smtp-Source: APXvYqxs2OoCoSoIvXEReM+VXFxdJmLU+f+7+cIdzLF6C25mvGADzyIO32DbhAPbDsflBrtlXfjt X-Received: by 2002:a17:902:28e9:: with SMTP id f96mr36726464plb.114.1560362912300; Wed, 12 Jun 2019 11:08:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1560362912; cv=none; d=google.com; s=arc-20160816; b=tIsd7GKP2etPR6ipf+CxB9LfOM8vO2EOVBxnRbmYwhPQl/1h6x12ayg+ShOHkUpgp7 ik9sRrjszgPEyEeWjvxw7RwHxhrW8SQ6IQsdQUTSu0mEFcv+QJuchq7YiLslUTxtnudl T1S4k7sSsplbLlD4wdsTzPHJ5c/jerCRcpCAXKc5rYxJCj33q0KCDQShnjIfHe3HKCF2 4D/EAYzUwLfApD+TkRuJ3Z2OI5UdB3t6+B5m95sEy5XLF+W2jk+INPSSdCkjCnZ36NRd Fbevq9dTGhdhEyI+MzZN/qbNLvhsFmHggCFWrL6rUOs0seu23rGMvznEFfrQRSmhfk1J a7Gg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-transfer-encoding:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=zBrNd1uYidSfD9I+qTo/PD9aW/Xy1RzqoQq/bRnt2Q4=; b=ek3iDQbjhu+CBDGu8D/3w5fLspIEE722uvt+BV2cWLq0+NPkmJGIn4fL4rXpbUTspG iBlhBZjAzg5NCtZdYxJE4Bbnxtv7ogBVeP/0KGDLwnhnhFetTaO6WP8zIuNfbgk3PfF7 KVgBLQtZkvdDh/X3VQrpMb2RDdshs3cJAQ3aPqA1y0C2BFwvG71AnWVqRTlaw/RvyX71 eB01gZVo8C6a3xiQ3adjHUAaT7oYp6i6uBnNVCrazRekqpoTAv/0UeuxGBQq1EbKWQwq K0TMwPND/4VS6YO2Gm3htetKcDaaHdvyxZ35K9FhsY71elmzlCcTK35ufN4ZXCZ7pGsi /ILA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=MalzMoAL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y11si302648plr.381.2019.06.12.11.08.18; Wed, 12 Jun 2019 11:08:32 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@digitalocean.com header.s=google header.b=MalzMoAL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=digitalocean.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2440312AbfFLQdu (ORCPT + 99 others); Wed, 12 Jun 2019 12:33:50 -0400 Received: from mail-qt1-f196.google.com ([209.85.160.196]:42192 "EHLO mail-qt1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2437983AbfFLQdu (ORCPT ); Wed, 12 Jun 2019 12:33:50 -0400 Received: by mail-qt1-f196.google.com with SMTP id s15so19154639qtk.9 for ; Wed, 12 Jun 2019 09:33:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=digitalocean.com; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=zBrNd1uYidSfD9I+qTo/PD9aW/Xy1RzqoQq/bRnt2Q4=; b=MalzMoALHWzRuLtvxnxmG35ccZVP+ZlPQyrZt39IxboK5uGZzLakApBh79dt0y8Qfl JIiQGF+YQliZ15W8klo+lSvE5YptH79D2MvH8e3pxHQPVAzKQYsGBK6xkYVL/ybU4Dba wmd1Bt9ue/Up8h8BwNfm/OVz7rIRjNeJu5QY0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=zBrNd1uYidSfD9I+qTo/PD9aW/Xy1RzqoQq/bRnt2Q4=; b=jVmu37mTULm9IkRfPWZ0zsPdvi5Zf9w44iwR0vds4IGnCmKzi8AOp8aSKSZKK6BwdS 71swB3pCV/iU0Aj4JwFpgkKPiOPrtQiLWiQw7LcnssFghxi7OG6gn9mI5FfSZrREQY5I TYsqM/InmZQ46KESPthXxRz8WpyHwFi+j1XnUQeB9f76cFTk/2pxurEndyiyOiTiSp/4 m6yyNsSmI8JTGYaT5tDfXcF39vyfKgDsdu14XtP1oeFogsSgAnwTGm/BIXewf+qdSqup wlspt8wj1bGRd7F3eAGl0UA57fUVQOJigh6VykNsZNOxhGdwgM7VoyZUbzfCE0a7pvIc AulA== X-Gm-Message-State: APjAAAVVzuRWmI5e7YncbRKEP8Sr3+S8ZcFleFAYtKDZzjaOQ74Gkup4 BjQjjaNmU41aOiyJfcOBLgZuDA== X-Received: by 2002:ac8:2bdc:: with SMTP id n28mr33615988qtn.197.1560357229124; Wed, 12 Jun 2019 09:33:49 -0700 (PDT) Received: from sinkpad (modemcable060.224-21-96.mc.videotron.ca. [96.21.224.60]) by smtp.gmail.com with ESMTPSA id r5sm85640qkc.42.2019.06.12.09.33.46 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jun 2019 09:33:47 -0700 (PDT) Date: Wed, 12 Jun 2019 12:33:45 -0400 From: Julien Desfossez To: Aaron Lu Cc: Aubrey Li , Vineeth Remanan Pillai , Nishanth Aravamudan , Peter Zijlstra , Tim Chen , Ingo Molnar , Thomas Gleixner , Paul Turner , Linus Torvalds , Linux List Kernel Mailing , Subhra Mazumdar , =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Kees Cook , Greg Kerr , Phil Auld , Valentin Schneider , Mel Gorman , Pawan Gupta , Paolo Bonzini Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3 Message-ID: <20190612163345.GB26997@sinkpad> References: <20190531210816.GA24027@sinkpad> <20190606152637.GA5703@sinkpad> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20190606152637.GA5703@sinkpad> X-Mailer: Mutt 1.5.24 (2015-08-30) User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org After reading more traces and trying to understand why only untagged tasks are starving when there are cpu-intensive tasks running on the same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In the case where ‘core_cookie’ is 0, we are supposed to only prefer the tagged task if it’s priority is higher, but when the priorities are equal we prefer it as well which causes the starving. ‘pick_task’ is biased toward selecting its first parameter in case of equality which in this case was the ‘class_pick’ instead of ‘max’. Reversing the order of the parameter solves this issue and matches the expected behavior. So we can get rid of this vruntime_boost concept. We have tested the fix below and it seems to work well with tagged/untagged tasks. Here are our initial test results. When core scheduling is enabled, each VM (and associated vhost threads) are in their own cgroup/tag. 1 12-vcpu VM MySQL TPC-C benchmark (IO + CPU) with 96 mostly-idle 1-vcpu VMs on each NUMA node (72 logical CPUs total with SMT on): +-------------+----------+--------------+------------+--------+ | | baseline | coresched | coresched | nosmt | | | no tag | VMs tagged | VMs tagged | no tag | | | v5.1.5 | no stall fix | stall fix | | +-------------+----------+--------------+------------+--------+ |average TPS | 1474 | 1289 | 1264 | 1339 | |stdev | 48 | 12 | 17 | 24 | |overhead | N/A | -12% | -14% | -9% | +-------------+----------+--------------+------------+--------+ 3 12-vcpu VMs running linpack (cpu-intensive), all pinned on the same NUMA node (36 logical CPUs with SMT enabled on that NUMA node): +---------------+----------+--------------+-----------+--------+ | | baseline | coresched | coresched | nosmt | | | no tag | VMs tagged | VMs tagged| no tag | | | v5.1.5 | no stall fix | stall fix | | +---------------+----------+--------------+-----------+--------+ |average gflops | 177.9 | 171.3 | 172.7 | 81.9 | |stdev | 2.6 | 10.6 | 6.4 | 8.1 | |overhead | N/A | -3.7% | -2.9% | -53.9% | +---------------+----------+--------------+-----------+--------+ This fix can be toggled dynamically with the ‘CORESCHED_STALL_FIX’ sched_feature so it’s easy to test before/after (it is disabled by default). The up-to-date git tree can also be found here in case it’s easier to follow: https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test Feedback welcome ! Thanks, Julien diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6e79421..26fea68 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma * If class_pick is tagged, return it only if it has * higher priority than max. */ - if (max && class_pick->core_cookie && - prio_less(class_pick, max)) + bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ? + max && !prio_less(max, class_pick) : + max && prio_less(class_pick, max); + if (class_pick->core_cookie && max_is_higher) return idle_sched_class.pick_task(rq); return class_pick; diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..332a092 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true) * UtilEstimation. Use estimated CPU utilization. */ SCHED_FEAT(UTIL_EST, true) + +/* + * Prevent task stall due to vruntime comparison limitation across + * cpus. + */ +SCHED_FEAT(CORESCHED_STALL_FIX, false)