Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp1641772rwd; Thu, 18 May 2023 15:14:43 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7l2/C9aXyMQUtRXFlEyfAnTUwmwBoGAMFSmXePn9w1/EH4cqDMyx20EG7Cd1OoY80oa+J1 X-Received: by 2002:a17:902:f682:b0:1a9:6dfb:4b09 with SMTP id l2-20020a170902f68200b001a96dfb4b09mr443307plg.67.1684448082871; Thu, 18 May 2023 15:14:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684448082; cv=none; d=google.com; s=arc-20160816; b=SoMvNthafecJf+GJOyDj2Tv9tDeyzBqwIgGWNr2HS8yIIQAmsOlhw9kdMKv1nSgRi2 Mh9pkq+cKmlneNEtgrun0hByxLkCjy8RQNG2/Qcj3IxMXMfWk+0T4zZu/gL0bMtVZQnY SvJM+FSD1i5/DD75GHXOofe0uAmBsllkIKKeDN6D/pc7a0kzqhZ19/NLwBqADmPANNh8 ynopORogFzFMxclsPOHM4KS9c4jSdxSZGNvZ37AgenYCamuiSZyNS1LzY/q91XAwZe5F b483w8pi7B0C2ruK6/mt1xCvNVQTZTCM+K0sK2rM5VBl8T3ALMY84InCoRBeT2rcnkhp RtbQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=OBIz8BPXt7G5B0IU4THBM/FvMX06qOOWNZyF8nh3G7I=; b=AMfdghSorf7hI91Wbn/E39klUOuEqh2LQrENofFV9MFoqnjGhpvM8Vv5nZ/JZ3TK8a lYY30gmxlDL96ZhBcWMFBlKNrEOmtgtox5fDWCLIiwVG+m2OwuDkrgBvix+hZr9Zobyv yRAb/qR9hyucJdfqNmiDoO/LirBygi0XYsSvOF/gz+ddXhIS13UcwFjDPqt5WEb6S0GC 0kefSZ0DWmPVntazmgMS2UV5R7iVMXNBheyaG3TGCtESTkMAMDaD3y9Xv3aB2Qc3LxAo P1xJQ/7qPAk93LVZpte9x1b+yjpiMuT5CaD0f2IKpEPk6NA++qxUxQOuVYv03iKFl52x WxOQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WpqDnNSc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id a11-20020a170902eccb00b001aafaf1a917si2313002plh.472.2023.05.18.15.14.28; Thu, 18 May 2023 15:14:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=WpqDnNSc; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229957AbjERWCP (ORCPT + 99 others); Thu, 18 May 2023 18:02:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59166 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229487AbjERWCO (ORCPT ); Thu, 18 May 2023 18:02:14 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AB5F5E7 for ; Thu, 18 May 2023 15:01:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1684447289; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=OBIz8BPXt7G5B0IU4THBM/FvMX06qOOWNZyF8nh3G7I=; b=WpqDnNScReLzMbpeYXSyHF0Uttx7+T9z7ZC6p92ua8z0o5Xkfgr2WnUMWHhBJCoDhxJaAH PoJM5VBCfInubxpv+0dEFoKCjWh0+/I9PpYhwfVABFxXecKIdtJmDV7I0id4iw6E6RPzYS GF/tKyLFvf7HjiB25I9TC9j5VlnYhyM= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-28-ExO7J6DTNfWoDQlKK9za4Q-1; Thu, 18 May 2023 18:01:25 -0400 X-MC-Unique: ExO7J6DTNfWoDQlKK9za4Q-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id F0D8B1C05AFD; Thu, 18 May 2023 22:01:24 +0000 (UTC) Received: from lorien.usersys.redhat.com (unknown [10.22.9.172]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 7553A2026D16; Thu, 18 May 2023 22:01:24 +0000 (UTC) Date: Thu, 18 May 2023 18:01:22 -0400 From: Phil Auld To: Benjamin Segall Cc: linux-kernel@vger.kernel.org, Juri Lelli , Ingo Molnar , Daniel Bristot de Oliveira , Peter Zijlstra , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Mel Gorman Subject: Re: [PATCH RESEND] sched/nohz: Add HRTICK_BW for using cfs bandwidth with nohz_full Message-ID: <20230518220122.GB120739@lorien.usersys.redhat.com> References: <20230518132038.3534728-1-pauld@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 3.1 on 10.11.54.4 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, May 18, 2023 at 02:29:04PM -0700 Benjamin Segall wrote: > Phil Auld writes: > > > CFS bandwidth limits and NOHZ full don't play well together. Tasks > > can easily run well past their quotas before a remote tick does > > accounting. This leads to long, multi-period stalls before such > > tasks can run again. Use the hrtick mechanism to set a sched > > tick to fire at remaining_runtime in the future if we are on > > a nohz full cpu, if the task has quota and if we are likely to > > disable the tick (nr_running == 1). This allows for bandwidth > > accounting before tasks go too far over quota. > > > > A number of container workloads use a dynamic number of real > > nohz tasks but also have other work that is limited which ends > > up running on the "spare" nohz cpus. This is an artifact of > > having to specify nohz_full cpus at boot. Adding this hrtick > > resolves the issue of long stalls on these tasks. > > > > Add the sched_feat HRTICK_BW off by default to allow users to > > enable this only when needed. > > > > Signed-off-by: Phil Auld > > Suggested-by: Juri Lelli > > Cc: Ingo Molnar > > Cc: Peter Zijlstra > > Cc: Vincent Guittot > > Cc: Juri Lelli > > Cc: Dietmar Eggemann > > Cc: Valentin Schneider > > Cc: Ben Segall > > --- > > > > Resend with LKML address instead of rh list... > > > > kernel/sched/core.c | 2 +- > > kernel/sched/fair.c | 20 ++++++++++++++++++++ > > kernel/sched/features.h | 1 + > > kernel/sched/sched.h | 12 ++++++++++++ > > 4 files changed, 34 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index a68d1276bab0..76425c377245 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -6562,7 +6562,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) > > > > schedule_debug(prev, !!sched_mode); > > > > - if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) > > + if (sched_feat(HRTICK) || sched_feat(HRTICK_DL) || sched_feat(HRTICK_BW)) > > hrtick_clear(rq); > > > > local_irq_disable(); > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 373ff5f55884..0dd1f6a874bc 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -5309,6 +5309,22 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > return ret; > > } > > > > +#if defined(CONFIG_SCHED_HRTICK) && defined(CONFIG_NO_HZ_FULL) > > +static void start_hrtick_cfs_bw(struct rq *rq, struct cfs_rq *cfs_rq) > > +{ > > + if (!tick_nohz_full_cpu(rq->cpu) || !cfs_bandwidth_used() || !cfs_rq->runtime_enabled) > > + return; > > + > > + /* runtime_remaining should never be negative here but just in case */ > > + if (rq->nr_running == 1 && cfs_rq->runtime_remaining > 0) > > + hrtick_start(rq, cfs_rq->runtime_remaining); > > +} > > +#else > > +static void start_hrtick_cfs_bw(struct rq *rq, struct cfs_rq *cfs_rq) > > +{ > > +} > > +#endif > > + > > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > > { > > /* dock delta_exec before expiring quota (as it could span periods) */ > > @@ -5481,6 +5497,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) > > */ > > cfs_rq->throttled = 1; > > cfs_rq->throttled_clock = rq_clock(rq); > > + > > return true; > > } > > > > @@ -8096,6 +8113,9 @@ done: __maybe_unused; > > if (hrtick_enabled_fair(rq)) > > hrtick_start_fair(rq, p); > > > > + if (hrtick_enabled_bw(rq)) > > + start_hrtick_cfs_bw(rq, task_cfs_rq(p)); > > + > > update_misfit_status(p, rq); > > Implementation-wise this winds up with a tick of > sysctl_sched_cfs_bandwidth_slice, which I suppose the admin could _also_ > configure depending on how much NOHZ benefit vs cfsb issues they want. > If the task is using all its bw then yes. Really though, by specifying cfsb they are saying the nohz part is not that important it seems to me. Short of doing the remote tick way more often you can't really have both. The default setup favors nohz and lets the tasks run until a remote tick gets them and then they are some ~5-10 periods in the hole. I'm open to better solutions. > It's not great that this implementation winds up going all the way > through schedule() for each 5ms-default tick, though. > Yeah. We initially had the start_hrtick in __account_cfs_runtime but that was too late. It doesn't catch when we put the task on the cpu initially. Now that you mention this I do think we need that as well so that we get a new timer if the task gets more bandwidth and can stay on cpu. Not stopping the tick looks better and better... Thanks for taking a look. Cheers, Phil --