From: Ben Hutchings
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: akpm@linux-foundation.org, "Linus Torvalds", "Xunlei Pang",
 "Peter Zijlstra (Intel)", "Ben Segall", "Ingo Molnar", "Thomas Gleixner"
Date: Sun, 11 Nov 2018 19:49:05 +0000
Subject: [PATCH 3.16 218/366] sched/fair: Fix bandwidth timer clock drift condition
X-Mailer: LinuxStableQueue (scripts by bwh)
X-Mailing-List:
linux-kernel@vger.kernel.org

3.16.61-rc1 review patch.  If anyone has any objections, please let me know.

------------------

From: Xunlei Pang

commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

I noticed that cgroup task groups constantly get throttled even if they
have low CPU usage. This causes some jitter in the response time of some
of our business containers when CPU quotas are enabled.

It's very simple to reproduce:

	mkdir /sys/fs/cgroup/cpu/test
	cd /sys/fs/cgroup/cpu/test
	echo 100000 > cpu.cfs_quota_us
	echo $$ > tasks

then repeat:

	cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will be
cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after the period timer is re-armed.

The current condition used to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires values are actually the same when clock
drift happens, so this condition can never hit. The original design was
correctly done by this commit:

  a9cf55b28610 ("sched: Expire invalid runtime")

... but was changed to the current implementation due to its locking bug.

This patch introduces another way: it adds a new field to both struct
cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
and uses them to figure out whether clock drift has happened (true if
they are equal).
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar
[bwh: Backported to 3.16:
 - Drop changes to other member types in struct cfs_bandwidth
 - Adjust context]
Signed-off-by: Ben Hutchings
---
 kernel/sched/fair.c  | 14 ++++++++------
 kernel/sched/sched.h |  6 ++++--
 2 files changed, 12 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3143,6 +3143,7 @@ void __refill_cfs_bandwidth_runtime(stru
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -3165,6 +3166,7 @@ static int assign_cfs_rq_runtime(struct
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 	u64 amount = 0, min_amount, expires;
+	int expires_seq;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -3190,6 +3192,7 @@ static int assign_cfs_rq_runtime(struct
 			cfs_b->idle = 0;
 		}
 	}
+	expires_seq = cfs_b->expires_seq;
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -3199,8 +3202,10 @@ static int assign_cfs_rq_runtime(struct
 	 * spread between our sched_clock and the one on which runtime was
 	 * issued.
 	 */
-	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+	if (cfs_rq->expires_seq != expires_seq) {
+		cfs_rq->expires_seq = expires_seq;
 		cfs_rq->runtime_expires = expires;
+	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
@@ -3226,12 +3231,9 @@ static void expire_cfs_rq_runtime(struct
 	 * has not truly expired.
 	 *
 	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline has advanced. It is valid to compare
-	 * cfs_b->runtime_expires without any locks since we only care about
-	 * exact equality, so a partial write will still work.
+	 * whether the global deadline(cfs_b->expires_seq) has advanced.
	 */
-
-	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -186,6 +186,7 @@ struct cfs_bandwidth {
 	u64 quota, runtime;
 	s64 hierarchal_quota;
 	u64 runtime_expires;
+	int expires_seq;
 
 	int idle, timer_active;
 	struct hrtimer period_timer, slack_timer;
@@ -375,6 +376,7 @@ struct cfs_rq {
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	int expires_seq;
 	u64 runtime_expires;
 	s64 runtime_remaining;