Received: by 2002:ac0:a581:0:0:0:0:0 with SMTP id m1-v6csp854638imm; Tue, 3 Jul 2018 00:51:57 -0700 (PDT) X-Google-Smtp-Source: AAOMgpfA6+8SFqEwnKQay6fPuxcZjozkIiBE8Qxru7XP3g/BllHSyMGlYwESqenPMRou4KCw6kgH X-Received: by 2002:a62:6d42:: with SMTP id i63-v6mr28524812pfc.41.1530604317093; Tue, 03 Jul 2018 00:51:57 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1530604317; cv=none; d=google.com; s=arc-20160816; b=fLEdGGe0Ttm8sWC3X+6nD81SahPEArO6t9eOP5AHvp+Ll1vRmAF6Ec7k8KCDACiofs HiPtUZP/Ci19Wjbn38yvXRAFgjOK8fiBWngEs6awmqy+Pe323vK8DvxCD34wqn/dNudH l3SQHpNeMeiQ317SGvdufQlH4jx+PHQYs0hrVbHjO3Ob//n7BH/t2pTHY7iGlqY3YMeD nlGqZ4L176UNfTBMA7PBrC0cALkB88IWTp3OPU8V3rvGgISZv6A3EhyRUiUCrwUbo+Xu WFNtSaGT78YGjY0MeoDswi2lv19bu0d5pvKrYli/zJL4LkKkxDbsyLwHdzb0xK4/v/pi G5yg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-disposition :content-transfer-encoding:mime-version:robot-unsubscribe:robot-id :git-commit-id:subject:to:references:in-reply-to:reply-to:cc :message-id:from:date:arc-authentication-results; bh=aak+lqiLZBLcmu6dSmnW1Iypm81nqRA1f1eLpakG9Rk=; b=wqpQuc/JwuwOEG32GzZGk+6IsWMvBcnKX9S/MnYXvwDx+Qu2oq//rJ0HdRnppmbAad Q9avd/LRfEydq8nDqnU9NjT6XQAoYBh5KlSdQlJfv3lHgkiJ6MktHEiM2Lw//Hmtj+W+ caTPmDNgFk6Fqm8evb1Z3ANkHzV3KCZTbnOjpFX9iH3HzKwmLgahypIEy4bSfLQ09vGn FTGVvDuUe+JT8L7ZxC6uALRMYY9AEiTvXpIL/P7LgxOxNwOBl6pmqRoOUjZqrjibaaJl 0847j3THP216jDpcL+7+VX+9M5XMkXCYVE4YW1CXHTzOYAhwQixZYuK0EeeRuhn1vouZ h2sw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d14-v6si535948plr.244.2018.07.03.00.51.42; Tue, 03 Jul 2018 00:51:57 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933574AbeGCHua (ORCPT + 99 others); Tue, 3 Jul 2018 03:50:30 -0400 Received: from terminus.zytor.com ([198.137.202.136]:60879 "EHLO terminus.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932212AbeGCHu2 (ORCPT ); Tue, 3 Jul 2018 03:50:28 -0400 Received: from terminus.zytor.com (localhost [127.0.0.1]) by terminus.zytor.com (8.15.2/8.15.2) with ESMTPS id w637o3HC201783 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 3 Jul 2018 00:50:03 -0700 Received: (from tipbot@localhost) by terminus.zytor.com (8.15.2/8.15.2/Submit) id w637o2xI201780; Tue, 3 Jul 2018 00:50:02 -0700 Date: Tue, 3 Jul 2018 00:50:02 -0700 X-Authentication-Warning: terminus.zytor.com: tipbot set sender to tipbot@zytor.com using -f From: tip-bot for Xunlei Pang Message-ID: Cc: torvalds@linux-foundation.org, xlpang@linux.alibaba.com, bsegall@google.com, peterz@infradead.org, mingo@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org, tglx@linutronix.de Reply-To: mingo@kernel.org, torvalds@linux-foundation.org, xlpang@linux.alibaba.com, bsegall@google.com, peterz@infradead.org, linux-kernel@vger.kernel.org, tglx@linutronix.de, hpa@zytor.com In-Reply-To: <20180620101834.24455-1-xlpang@linux.alibaba.com> References: <20180620101834.24455-1-xlpang@linux.alibaba.com> To: linux-tip-commits@vger.kernel.org Subject: [tip:sched/core] sched/fair: Fix bandwidth timer clock drift condition Git-Commit-ID: 512ac999d2755d2b7109e996a76b6fb8b888631d X-Mailer: tip-git-log-daemon Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Content-Disposition: inline X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, DATE_IN_FUTURE_96_Q autolearn=ham autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on terminus.zytor.com Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Commit-ID: 512ac999d2755d2b7109e996a76b6fb8b888631d Gitweb: https://git.kernel.org/tip/512ac999d2755d2b7109e996a76b6fb8b888631d Author: Xunlei Pang AuthorDate: Wed, 20 Jun 2018 18:18:33 +0800 Committer: Ingo Molnar CommitDate: Tue, 3 Jul 2018 09:17:29 +0200 sched/fair: Fix bandwidth timer clock drift condition I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by this commit: a9cf55b28610 ("sched: Expire invalid runtime") ... but was changed to be the current implementation due to its locking bug. This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and uses them to figure out if clock drift happens (true if they are equal). Signed-off-by: Xunlei Pang Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Ben Segall Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period") Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 14 ++++++++------ kernel/sched/sched.h | 6 ++++-- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1866e64792a7..791707c56886 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); + cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); u64 amount = 0, min_amount, expires; + int expires_seq; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } + expires_seq = cfs_b->expires_seq; expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) * spread between our sched_clock and the one on which runtime was * issued. */ - if ((s64)(expires - cfs_rq->runtime_expires) > 0) + if (cfs_rq->expires_seq != expires_seq) { + cfs_rq->expires_seq = expires_seq; cfs_rq->runtime_expires = expires; + } return cfs_rq->runtime_remaining > 0; } @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) * has not truly expired. * * Fortunately we can check determine whether this the case by checking - * whether the global deadline has advanced. It is valid to compare - * cfs_b->runtime_expires without any locks since we only care about - * exact equality, so a partial write will still work. + * whether the global deadline(cfs_b->expires_seq) has advanced. */ - - if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { + if (cfs_rq->expires_seq == cfs_b->expires_seq) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; } else { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 27ddec334601..c7742dcc136c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -334,9 +334,10 @@ struct cfs_bandwidth { u64 runtime; s64 hierarchical_quota; u64 runtime_expires; + int expires_seq; - int idle; - int period_active; + short idle; + short period_active; struct hrtimer period_timer; struct hrtimer slack_timer; struct list_head throttled_cfs_rq; @@ -551,6 +552,7 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; + int expires_seq; u64 runtime_expires; s64 runtime_remaining;