Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <Y1/HzzA1FIawYM11@hirez.programming.kicks-ass.net>
 <CABk29Nu=XcjwRxnGBtKHfknxnDPpspghou06+W0fufnkGF6NkA@mail.gmail.com>
 <Y2BDFNpkSawKnE9S@slm.duckdns.org> <CABk29Nta-RJpTcybgOk9u4DH=1mwQFZsOxFuQ-UpCcTwzjzAuA@mail.gmail.com>
 <Y2Bf+CeQ8x2jKQ3S@slm.duckdns.org> <CABk29Nvqv-T1JuAq2cf9=AwRu=y1+YOR4xS2qnVo6+XpWd2UNQ@mail.gmail.com>
 <Y2B6V1PPuCcTXGp6@slm.duckdns.org> <CABk29Ns1VWEVRYENud4CW3JQPrcr79i_F2PBTANqt3t-LaYCfQ@mail.gmail.com>
 <Y2FwVX42LIKXSTz3@slm.duckdns.org> <CABk29Nua8ZsDfhY+x+VfYDkbkjfXLXTZ5JMVR9uiBygraxDM+g@mail.gmail.com>
 <Y2GUg8CiI68ZBznr@slm.duckdns.org>
In-Reply-To: <Y2GUg8CiI68ZBznr@slm.duckdns.org>
From:   Josh Don <joshdon@google.com>
Date:   Tue, 1 Nov 2022 14:59:56 -0700
Message-ID: <CABk29Nvj8nRyD0HGo+gZ4CEr0kOJSsUbJnSNFs62D66EDTMGog@mail.gmail.com>
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
To:     Tejun Heo <tj@kernel.org>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Juri Lelli <juri.lelli@redhat.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        Valentin Schneider <vschneid@redhat.com>,
        linux-kernel@vger.kernel.org,
        Joel Fernandes <joel@joelfernandes.org>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Tue, Nov 1, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote:
> > Maybe walking through an example would be helpful? I don't know if
> > there's anything super specific. For cgroup_mutex for example, the
> > same global mutex is being taken for things like cgroup mkdir and
> > cgroup proc attach, regardless of which part of the hierarchy is being
> > modified. So, we end up sharing that mutex between random job threads
> > (ie. that may be manipulating their own cgroup sub-hierarchy), and
> > control plane threads, which are attempting to manage root-level
> > cgroups. Bad things happen when the cgroup_mutex (or similar) is held
> > by a random thread which blocks and is of low scheduling priority,
> > since when it wakes back up it may take quite a while for it to run
> > again (whether that low priority be due to CFS bandwidth, sched_idle,
> > or even just O(hundreds) of threads on a cpu). Starving out the
> > control plane causes us significant issues, since that affects machine
> > health. cgroup manipulation is not a hot path operation, but the
> > control plane tends to hit it fairly often, and so those things
> > combine at our scale to produce this rare problem.
>
> I keep asking because I'm curious about the specific details of the
> contentions. Control plane locking up is obviously bad but they can usually
> tolerate some latencies - stalling out multiple seconds (or longer) can be
> catastrophic but tens or hundreds of millisecs occasionally usually isn't.
>
> The only times we've seen latency spikes from the CPU side severe enough to
> cause system-level failures were when there were severe restrictions through
> bw control. Other cases sure are possible but unless you grab these mutexes
> while IDLE inside a heavily contended cgroup (which is a bit silly) you
> gotta push *really* hard.
>
> If most of the problems were with cpu bw control, fixing that should do for
> the time being. Otherwise, we'll have to think about finishing kernfs
> locking granularity improvements and doing something similar to cgroup
> locking too.

Oh we've easily hit stalls measured in multiple seconds. We
extensively use cpu.idle to group batch tasks. One of the memory
bandwidth mitigations implemented in userspace is cpu jailing, which
can end up pushing lots and lots of these batch threads onto a small
number of cpus. 5ms min gran * 200 threads is already one second :)
We're in the process of transitioning to using bw for this instead,
in order to maintain parallelism. Fixing bw is definitely
going to be useful, but I'm afraid we'll still likely have some issues
from low throughput for non-bw reasons (some of which we can't
directly control, since arbitrary jobs can spin up and configure their
hierarchy/threads in antagonistic ways, in effect pushing out the
latency of some of their threads).
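
To spell out the arithmetic behind the "one second" figure, here is a
rough back-of-the-envelope sketch. It assumes a simplified round-robin
model in which every runnable thread on the cpu gets one minimum
granularity slice before the thread we care about runs again. The
function name is made up for illustration, and this is not how CFS
actually computes anything.

```python
def approx_wait_ms(min_gran_ms: float, nr_threads: int) -> float:
    """Approximate time for one pass over all runnable threads on a cpu.

    Simplifying assumption: each thread runs for exactly one minimum
    granularity slice per pass, so a thread that just woke up may wait
    on the order of a full pass before it runs again.
    """
    return min_gran_ms * nr_threads


if __name__ == "__main__":
    # Figures from the discussion: 5ms min gran, 200 batch threads on a cpu.
    wait_ms = approx_wait_ms(5, 200)
    print(f"approx wait: {wait_ms:.0f} ms ({wait_ms / 1000:.1f} s)")
```

If the delayed thread happens to be holding cgroup_mutex, everything
queued behind that mutex waits about that long too.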

>
> Thanks.
>
> --
> tejun