Received: by 2002:ac0:a581:0:0:0:0:0 with SMTP id m1-v6csp582342imm; Wed, 20 Jun 2018 03:20:24 -0700 (PDT) X-Google-Smtp-Source: ADUXVKLE+69H/74AqAzRtranGcG4Z1w5DyHvt1T37SJy8fz5i5SHCw28MfRBCVoOVnmhbsr8lO8f X-Received: by 2002:a17:902:82cc:: with SMTP id u12-v6mr22973737plz.83.1529490024519; Wed, 20 Jun 2018 03:20:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1529490024; cv=none; d=google.com; s=arc-20160816; b=djq62xPiCo8r2cu5WiHypLgv2/ln7j2PZ3iGqab51ir6LNH83StOGUtOUAwQMXGmbG V+AjLLcYQEMte0IiYacW7k1uSB6Met5YRr1tUW42SbnhhbBTDlyUsophPGiw6gjPSKIr lJnU9VCg4c64627PuFd5ysgADkv7+gs/KdCqELbKkcUhOY6EtpS/sGKxqyUid6KKOXCo vYwdgiFllSxgqci/PkNYZ11l1Pqm9vmf8d/F+2JdoP7IY34XMfcGfWYQAhCxcfTU8Rsa RY7Zwu1sXbGBFpvyqMlWqVYMsqjTyoL1eBsRDc69u8k0fLcqQIuBWYshureNu5IkJUIl k7Ww== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:date:subject:cc:to:from :arc-authentication-results; bh=ilOlXk9YVKdFZT3ZMDQysExrL8KFndv45vkQMKTgvH8=; b=c8x+1tXLr0g5Jg/cJ4jUHzoJKtVGq7WIV1z4AfLJvIMXSVFxQz18CyxIUBrbXZTKdo eKrurq4dJr8J6uZrgJfhg+eIruphqOQ0Gx0O8Rsv60HjHZGYdclU6abWqWYRMFnV1zwG rd4BqJ+XdgLggfRqM7Sg730vzucBeia5jEplfMlQngzcI0FMi6NZLcN/5YuxbFpOiczb aQJE6+QZnRKjVIunAaPG2JxWqad4MpXTeRmo+sv1ZK0bbW9IYV8wMSYcj5Z41f9Ku8vn Jm3CDOsITZ5bQx8lUQRjn795Zp1DUMz9ei4ySWw7KxBLumgJSvvkaDZEvmYz84UsPbNr 7qfQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d125-v6si1630140pgc.94.2018.06.20.03.20.10; Wed, 20 Jun 2018 03:20:24 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753541AbeFTKSn (ORCPT + 99 others); Wed, 20 Jun 2018 06:18:43 -0400 Received: from out30-132.freemail.mail.aliyun.com ([115.124.30.132]:52678 "EHLO out30-132.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751376AbeFTKSl (ORCPT ); Wed, 20 Jun 2018 06:18:41 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R771e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e07417;MF=xlpang@linux.alibaba.com;NM=1;PH=DS;RN=4;SR=0;TI=SMTPD_---0T322Cw8_1529489914; Received: from localhost(mailfrom:xlpang@linux.alibaba.com fp:SMTPD_---0T322Cw8_1529489914) by smtp.aliyun-inc.com(127.0.0.1); Wed, 20 Jun 2018 18:18:34 +0800 From: Xunlei Pang To: Peter Zijlstra , Ingo Molnar , Ben Segall Cc: linux-kernel@vger.kernel.org Subject: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition Date: Wed, 20 Jun 2018 18:18:33 +0800 Message-Id: <20180620101834.24455-1-xlpang@linux.alibaba.com> X-Mailer: git-send-email 2.14.1.40.g8e62ba1 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I noticed the group constantly got throttled even it consumed low cpu usage, this caused some jitters on the response time to some of our business containers enabling cpu quota. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat |grep nr_throttled // nr_throttled will increase After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"), but was changed to be the current one due to its locking issue. This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and use them to figure out if clock drift happens(true if they equal). Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period") Cc: Ben Segall Signed-off-by: Xunlei Pang --- kernel/sched/fair.c | 14 ++++++++------ kernel/sched/sched.h | 6 ++++-- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e497c05aab7f..e6bb68d52962 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); + cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); u64 amount = 0, min_amount, expires; + int expires_seq; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } + expires_seq = cfs_b->expires_seq; expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) * spread between our sched_clock and the one on which runtime was * issued. */ - if ((s64)(expires - cfs_rq->runtime_expires) > 0) + if (cfs_rq->expires_seq != expires_seq) { + cfs_rq->expires_seq = expires_seq; cfs_rq->runtime_expires = expires; + } return cfs_rq->runtime_remaining > 0; } @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) * has not truly expired. * * Fortunately we can check determine whether this the case by checking - * whether the global deadline has advanced. It is valid to compare - * cfs_b->runtime_expires without any locks since we only care about - * exact equality, so a partial write will still work. + * whether the global deadline(cfs_b->expires_seq) has advanced. */ - - if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { + if (cfs_rq->expires_seq == cfs_b->expires_seq) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; } else { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6601baf2361c..e977e04f8daf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -334,9 +334,10 @@ struct cfs_bandwidth { u64 runtime; s64 hierarchical_quota; u64 runtime_expires; + int expires_seq; - int idle; - int period_active; + short idle; + short period_active; struct hrtimer period_timer; struct hrtimer slack_timer; struct list_head throttled_cfs_rq; @@ -551,6 +552,7 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; + int expires_seq; u64 runtime_expires; s64 runtime_remaining; -- 2.14.1.40.g8e62ba1