Subject: Re: [PATCH 2/2] sched/fair: Advance global expiration when period timer is restarted
From: Xunlei Pang <xlpang@linux.alibaba.com>
To: bsegall@google.com
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org
Date: Wed, 20 Jun 2018 18:26:10 +0800
In-Reply-To: <6023891f-894a-f251-1d39-939f13143886@linux.alibaba.com>
References: <20180618091657.21939-1-xlpang@linux.alibaba.com>
 <5f5243c0-ff8c-fc11-9320-61313fd40214@linux.alibaba.com>
 <6023891f-894a-f251-1d39-939f13143886@linux.alibaba.com>

On 6/20/18 2:06 PM, Xunlei Pang wrote:
> On 6/20/18 1:49 AM, bsegall@google.com wrote:
>> Xunlei Pang writes:
>>
>>> On 6/19/18 2:58 AM, bsegall@google.com wrote:
>>>> Xunlei Pang writes:
>>>>
>>>>> I noticed the group frequently got throttled even though its cpu
>>>>> usage was low, which caused response-time jitter in some of our
>>>>> business containers that enable cpu quota.
>>>>>
>>>>> It's very easy to reproduce:
>>>>>   mkdir /sys/fs/cgroup/cpu/test
>>>>>   cd /sys/fs/cgroup/cpu/test
>>>>>   echo 100000 > cpu.cfs_quota_us
>>>>>   echo $$ > tasks
>>>>> then repeat:
>>>>>   cat cpu.stat | grep nr_throttled  # nr_throttled keeps increasing
>>>>>
>>>>> After some analysis, we found that cfs_rq::runtime_remaining will
>>>>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>>>>> "cfs_{b|q}->runtime_expires" after the period timer is re-armed.
>>>>
>>>> If this is after the first patch, then this is no longer what should
>>>> happen; instead it would incorrectly /keep/ the old local cfs_rq
>>>> runtime, and not __refill global runtime until the next period.
>>>
>>> Exactly, I can improve the changelog.
>>>
>>>>> The global expiration should be advanced accordingly when the
>>>>> bandwidth period timer is restarted.
>>>>>
>>>>> Signed-off-by: Xunlei Pang
>>>>> ---
>>>>>  kernel/sched/fair.c | 15 ++++++++++-----
>>>>>  1 file changed, 10 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 9f384264e832..bb006e671609 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>>>
>>>>>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>>  {
>>>>> +        u64 overrun;
>>>>> +
>>>>>          lockdep_assert_held(&cfs_b->lock);
>>>>>
>>>>> -        if (!cfs_b->period_active) {
>>>>> -                cfs_b->period_active = 1;
>>>>> -                hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> -                hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>>>> -        }
>>>>> +        if (cfs_b->period_active)
>>>>> +                return;
>>>>> +
>>>>> +        cfs_b->period_active = 1;
>>>>> +        overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>>>> +        cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
>>>>
>>>> I think we actually want if (overrun) __refill_cfs_bandwidth_runtime(),
>>>> much like tg_set_cfs_bandwidth
>>>
>>> It's a little different, e.g. periods like below:
>>>
>>>          Pm
>>>          |
>>>          v
>>> |-----|-----|
>>> Pn-1  Pn    Pn+1
>>>
>>> Assuming we start_cfs_bandwidth() at some moment Pm between "Pn" and
>>> "Pn+1", the cfs_b->runtime_expires we want here should be "Pn+1",
>>> while __refill_cfs_bandwidth_runtime() forces it to be "Pm+period"
>>> (the hrtimer expires at "Pn+1").
>>>
>>> Regards,
>>> Xunlei
>>
>> Hmm, yeah, it would wind up out of sync with the hrtimer unless we
>> restarted the period boundaries.
>>
>> Now that I look at it even more closely, I don't see how you're getting
>> problems like that, besides wasting tiny amounts of cpu time in expire:
>> we only disable the period timer when there has been a full period with
>> no activity, so we should have a full cfs_b->runtime, and a
>> runtime_expires from the period that was spent entirely idle (so it
>> should always be in the past). But with a working extend-local-deadline
>> check like after patch 1 it shouldn't /matter/ what the actual value of
>> runtime_expires is; we'll just waste tiny amounts of cpu time
>> incrementing runtime_expires on every account_cfs_rq_runtime() call
>> rather than aborting early.
>>
>> Actually, I think arguably the real issue may be in patch 1:
>
> Ben,
>
> Thanks for the comments.
>
> Yes, the problematic throttling is solved by patch 1; patch 2 addresses
> a different, stale-runtime issue (I will update the changelog in v2).
> Specifically: an ever-running cfs_rq tends to retain some runtime
> (min_cfs_rq_runtime), and possibly more (sysctl_sched_cfs_bandwidth_slice)
> in the worst case; this runtime should be discarded once the period is
> advanced.
>
> Without this patch, expire_cfs_rq_runtime() will see two equal
> cfs_{rq|b}->expires_seq values after the period gets restarted following
> some idle time, so it will fall into the "clock drift" branch and go on
> consuming the old runtime, if any.
>
> With the expiration update done in patch 2, we can ensure the stale
> cfs_rq->runtime_remaining gets cleared in expire_cfs_rq_runtime()
> once a new period begins.

I did some tests; as you said, the throttling problem is resolved by
patch 1, and the stale cfs_rq->runtime_remaining can also be cleared
properly thanks to the differing expires_seq. So patch 2 is now mainly
about avoiding needless expire_cfs_rq_runtime() work caused by an old
expiration time.
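For anyone following along, the expiry path we keep referring to looks
roughly like this (hand-trimmed from expire_cfs_rq_runtime() in
kernel/sched/fair.c as of v4.17, i.e. before patch 1; the comments are
mine, and patch 1 turns the runtime_expires comparison below into the
expires_seq one):

        static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

                /* local deadline still in the future: nothing to expire */
                if (likely((s64)(rq_clock(rq_of(cfs_rq)) -
                                 cfs_rq->runtime_expires) < 0))
                        return;

                if (cfs_rq->runtime_remaining < 0)
                        return;

                /*
                 * The local deadline has passed; decide whether the global
                 * period really ended or our sched_clock is merely 'fast'
                 * relative to the cpu that issued the runtime.
                 */
                if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
                        /* "clock drift" branch: only extend the local deadline */
                        cfs_rq->runtime_expires += TICK_NSEC;
                } else {
                        /* global deadline truly passed: discard local runtime */
                        cfs_rq->runtime_remaining = 0;
                }
        }

With two equal but stale expirations after the timer restart, the
pre-patch-1 code hits the else branch and zeroes runtime_remaining (the
spurious throttling in the changelog); after patch 1 the same stale
equality makes it take the drift branch instead, extending the local
deadline tick by tick until the next refill bumps expires_seq, which is
the needless work patch 2 avoids.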
I've sent v2, please have a look, thanks!

-Xunlei

>
>>
>>>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>>>          * spread between our sched_clock and the one on which runtime was
>>>>          * issued.
>>>>          */
>>>> -        if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>>>> +        if ((s64)(expires - cfs_rq->runtime_expires) > 0) {
>>>>                  cfs_rq->runtime_expires = expires;
>>>> +                cfs_rq->expires_seq = expires_seq;
>>>> +        }
>>>>
>>>>          return cfs_rq->runtime_remaining > 0;
>>>>  }
>>
>> We don't want to clobber a tick-extended runtime_expires with the
>> original one, but we /do/ want to update if expires_seq changes, no
>> matter what (rather than skipping updating both if we've tick-extended
>> past the /next/ period's runtime_expires).
>>
>> I think this should actually be
>>
>>        if (cfs_rq->expires_seq != expires_seq) {
>>                cfs_rq->runtime_expires = expires;
>>                cfs_rq->expires_seq = expires_seq;
>>        }
>>
>> Can you easily check and see if this fixes the issue without this
>> second patch? That would clear up how the actual problem is happening,
>> and I think it's a change we'd want anyway.
>
> Agree, I will have it in v2.
>
>>
>> That said, we should also just update runtime_expires like this patch
>> does regardless, and as mentioned we should have a full cfs_b->runtime,
>> so we don't need a __refill like I was originally thinking.
>>
> Regards,
> Xunlei
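
P.S. To make the Pm/Pn arithmetic above concrete (numbers invented for
illustration, period = 100ms): suppose the last refill, at the 900ms
firing, set cfs_b->runtime_expires = Pn-1 = 1000ms; the timer then fired
at 1000ms, found a fully idle period, forwarded its expiry to Pn = 1100ms
and deactivated (the idle shutdown path skips the refill, so
runtime_expires stays at 1000ms). If start_cfs_bandwidth() later runs at
Pm = 1130ms, hrtimer_forward_now() pushes the expiry to Pn+1 = 1200ms and
returns overrun = 1, so runtime_expires += (overrun + 1) * period gives
1000ms + 200ms = 1200ms, exactly the timer's new expiry.
__refill_cfs_bandwidth_runtime() would instead produce Pm + period =
1230ms, out of sync with the hrtimer.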