From: Josh Don
Date: Tue, 1 Nov 2022 14:59:56 -0700
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
To: Tejun Heo
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider,
    linux-kernel@vger.kernel.org, Joel Fernandes

On Tue, Nov 1, 2022 at 2:50 PM Tejun Heo wrote:
>
> Hello,
>
> On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote:
> > Maybe walking through an example would be helpful? I don't know if
> > there's anything super specific.
> > For cgroup_mutex for example, the
> > same global mutex is being taken for things like cgroup mkdir and
> > cgroup proc attach, regardless of which part of the hierarchy is being
> > modified. So, we end up sharing that mutex between random job threads
> > (ie. that may be manipulating their own cgroup sub-hierarchy), and
> > control plane threads, which are attempting to manage root-level
> > cgroups. Bad things happen when the cgroup_mutex (or similar) is held
> > by a random thread which blocks and is of low scheduling priority,
> > since when it wakes back up it may take quite a while for it to run
> > again (whether that low priority be due to CFS bandwidth, sched_idle,
> > or even just O(hundreds) of threads on a cpu). Starving out the
> > control plane causes us significant issues, since that affects machine
> > health. cgroup manipulation is not a hot path operation, but the
> > control plane tends to hit it fairly often, and so those things
> > combine at our scale to produce this rare problem.
>
> I keep asking because I'm curious about the specific details of the
> contentions. Control plane locking up is obviously bad but they can usually
> tolerate some latencies - stalling out multiple seconds (or longer) can be
> catastrophic but tens or hundreds or millisecs occasionally usually isn't.
>
> The only times we've seen latency spikes from CPU side which is enough to
> cause system-level failures were when there were severe restrictions through
> bw control. Other cases sure are possible but unless you grab these mutexes
> while IDLE inside a heavily contended cgroup (which is a bit silly) you
> gotta push *really* hard.
>
> If most of the problems were with cpu bw control, fixing that should do for
> the time being. Otherwise, we'll have to think about finishing kernfs
> locking granularity improvements and doing something similar to cgroup
> locking too.

Oh we've easily hit stalls measured in multiple seconds.
We extensively use cpu.idle to group batch tasks. One of the memory
bandwidth mitigations implemented in userspace is cpu jailing, which
can end up pushing lots and lots of these batch threads onto a small
number of cpus. 5ms min gran * 200 threads is already one second :)

We're in the process of transitioning to using bw for this instead, in
order to maintain parallelism. Fixing bw is definitely going to be
useful, but I'm afraid we'll still likely have some issues from low
throughput for non-bw reasons (some of which we can't directly
control, since arbitrary jobs can spin up and configure their
hierarchy/threads in antagonistic ways, in effect pushing out the
latency of some of their threads).

>
> Thanks.
>
> --
> tejun
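
For reference, the back-of-envelope figure in the reply (5ms min gran * 200
threads) can be written out as a small sketch. The function name and the
model are assumptions for illustration only: it supposes that every runnable
thread on the cpu gets one full minimum-granularity slice before the thread
of interest (say, a mutex holder that just woke up) is picked again. It
ignores weights, wakeup preemption and bandwidth throttling, so it is a rough
estimate, not a description of what the scheduler actually does.

```python
def worst_case_wait_ms(min_gran_ms: float, runnable_threads: int) -> float:
    """Rough estimate of how long a thread may wait for its next turn.

    Assumed model (hypothetical, not taken from the scheduler source):
    each runnable thread on the cpu runs for one full minimum-granularity
    slice before the thread we care about runs again.
    """
    if min_gran_ms < 0 or runnable_threads < 0:
        raise ValueError("inputs must be non-negative")
    return min_gran_ms * runnable_threads


if __name__ == "__main__":
    # Figures quoted in the mail: 5ms min gran, 200 batch threads on one cpu.
    wait_ms = worst_case_wait_ms(5, 200)
    print(f"estimated wait: {wait_ms / 1000:.1f} s")
```

Under this model the wait grows linearly with the number of threads packed
onto the cpu, which is why jailing many batch threads onto a few cpus turns a
short critical section into a multi-second stall.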