From: Josh Don
Date: Mon, 31 Oct 2022 18:01:19 -0700
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel@vger.kernel.org, Joel Fernandes
References: <20221026224449.214839-1-joshdon@google.com>

On Mon, Oct 31, 2022 at 4:53 PM Tejun Heo wrote:
>
> Hello,
>
> On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote:
> > On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo wrote:
> > Yes, but
> > schemes such as shares and idle can still end up creating
> > some severe inversions. For example, a SCHED_IDLE thread on a cpu with
> > many other threads. Eventually the SCHED_IDLE thread will get run, but
> > the round robin times can easily get pushed out to several hundred ms
> > (or even into the seconds range), due to min granularity. cpusets
> > combined with the load balancer's struggle to find low weight tasks
> > exacerbates such situations.
>
> Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
> it can get pretty bad. Outside that tho, at least I haven't seen a lot of
> problematic cases as long as the low priority one isn't tightly entangled
> with high priority tasks, mostly because 1. if the resource the low pri one
> is holding affects large part of the system, the problem is self-solving as
> the system quickly runs out of other things to do 2. if the resource isn't
> affecting large part of the system, their blast radius is usually reasonably
> confined to things tightly coupled with it. I'm sure there are exceptions
> and we definitely wanna improve the situation where it makes sense.

cgroup_mutex and kernfs rwsem beg to differ :) These are shared with
control plane threads, so it is pretty easy to starve those out even
while the system has plenty of work to do.

> > > > chatted with the folks working on the proxy execution patch series,
> > > > and it seems like that could be a better generic solution to these
> > > > types of issues.
> > >
> > > Care to elaborate?
> >
> > https://lwn.net/Articles/793502/ gives some historical context, see
> > also https://lwn.net/Articles/910302/.
>
> Ah, full blown priority inheritance. They're great to pursue but I think we
> wanna fix cpu bw control regardless.
> It's such an obvious and basic issue
> and given how much problem we have with actually understanding resource and
> control dependencies with all the custom synchronization constructs in the
> kernel, fixing it will be useful even in the future where we have a better
> priority inheritance mechanism.

Sure, even something like not throttling when there exist threads in
kernel mode (while not a complete solution) helps get some of the way
towards improving that case.

> > > I don't follow. If you only throttle at predefined safe spots, the easiest
> > > place being the kernel-user boundary, you cannot get system-wide stalls from
> > > BW restrictions, which is something the kernel shouldn't allow userspace to
> > > cause. In your example, a thread holding a kernel mutex waking back up into
> > > a hierarchy that is currently throttled should keep running in the kernel
> > > until it encounters such safe throttling point where it would have released
> > > the kernel mutex and then throttle.
> >
> > Agree except that for the task waking back up, it isn't on cpu, so
> > there is no "wait to throttle it until it returns to user", since
> > throttling happens in the context of the entire cfs_rq. We'd have to
>
> Oh yeah, we'd have to be able to allow threads running in kernel regardless
> of cfs_rq throttled state and then force charge the cpu cycles to be paid
> later. It would definitely require quite a bit of work.
>
> > treat threads in a bandwidth hierarchy that are also in kernel mode
> > specially. Mechanically, it is more straightforward to implement the
> > mechanism to wait to throttle until the cfs_rq has no more threads in
> > kernel mode, than it is to exclude a woken task from the currently
> > throttled period of its cfs_rq, though this is incomplete.
>
> My hunch is that bunching them together is likely gonna create too many
> escape scenarios and control artifacts and it'd be better to always push
> throttling decisions to the leaves (tasks) so that each task can be
> controlled separately. That'd involve architectural changes but the eventual
> behavior would be a lot better.

Also a tradeoff, since it is extra overhead to handle tasks individually
at the leaf level vs dequeuing a single cfs_rq.

> > What you're suggesting would also require that we find a way to
> > preempt the current thread to start running the thread that woke up in
> > kernel (and this becomes more complex when the current thread is also
> > in kernel, or if there are n other waiting threads that are also in
> > kernel).
>
> I don't think it needs that. What allows userspace to easily trigger
> pathological scenarios is the ability to force the machine idle when there's
> something which is ready to run in the kernel. If you take that away, most
> of the problems disappear. It's not perfect but reasonable enough and not
> worse than a system without cpu bw control.
>
> Thanks.
>
> --
> tejun
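
For illustration only, here is a rough userspace toy model of the "wait to
throttle until the cfs_rq has no more threads in kernel mode" idea discussed
above. It is not kernel code and not a proposed patch. The names Task, CfsRq,
in_kernel, throttle_pending and return_to_user are all hypothetical, and
carrying the overrun into the next period is only an assumption standing in
for the "force charge the cpu cycles to be paid later" idea.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    in_kernel: bool = False  # hypothetical flag: task is executing in kernel mode


@dataclass
class CfsRq:
    """Toy stand-in for a bandwidth-controlled cfs_rq (not the kernel struct)."""
    quota_us: int
    runtime_used_us: int = 0
    throttled: bool = False
    throttle_pending: bool = False
    tasks: list = field(default_factory=list)

    def kernel_mode_tasks(self) -> int:
        return sum(1 for t in self.tasks if t.in_kernel)

    def _try_throttle(self) -> None:
        # Defer the throttle while any task on this rq is still in the kernel.
        if self.kernel_mode_tasks() > 0:
            self.throttle_pending = True
        else:
            self.throttled = True
            self.throttle_pending = False

    def account(self, delta_us: int) -> None:
        """Charge runtime; throttle (or defer) once the quota is used up."""
        self.runtime_used_us += delta_us
        if self.runtime_used_us >= self.quota_us:
            self._try_throttle()

    def return_to_user(self, task: Task) -> None:
        """Task crosses the kernel-user boundary: a safe point to throttle."""
        task.in_kernel = False
        if self.throttle_pending:
            self._try_throttle()

    def refill(self) -> None:
        """New period: carry any overrun forward as debt (an assumption)."""
        self.runtime_used_us = max(0, self.runtime_used_us - self.quota_us)
        self.throttled = False
        self.throttle_pending = False
        if self.runtime_used_us >= self.quota_us:
            self._try_throttle()


rq = CfsRq(quota_us=1000)
holder = Task("mutex_holder", in_kernel=True)
rq.tasks.append(holder)

rq.account(1200)           # quota exceeded, but holder is in the kernel
print(rq.throttled, rq.throttle_pending)
rq.return_to_user(holder)  # holder has left the kernel: throttle now
print(rq.throttled, rq.throttle_pending)
rq.refill()                # next period starts with the overrun as debt
print(rq.throttled, rq.runtime_used_us)
```

As noted above, this is incomplete: the model only defers a throttle that has
not happened yet. It does nothing for a task that wakes up in kernel mode
into a cfs_rq that is already throttled.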