Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <20221026224449.214839-1-joshdon@google.com> <Y1/HzzA1FIawYM11@hirez.programming.kicks-ass.net>
 <CABk29Nu=XcjwRxnGBtKHfknxnDPpspghou06+W0fufnkGF6NkA@mail.gmail.com> <Y2BDFNpkSawKnE9S@slm.duckdns.org>
In-Reply-To: <Y2BDFNpkSawKnE9S@slm.duckdns.org>
From:   Josh Don <joshdon@google.com>
Date:   Mon, 31 Oct 2022 16:15:54 -0700
Message-ID: <CABk29Nta-RJpTcybgOk9u4DH=1mwQFZsOxFuQ-UpCcTwzjzAuA@mail.gmail.com>
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
To:     Tejun Heo <tj@kernel.org>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Juri Lelli <juri.lelli@redhat.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        Valentin Schneider <vschneid@redhat.com>,
        linux-kernel@vger.kernel.org,
        Joel Fernandes <joel@joelfernandes.org>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote:
> > > So, TJ has been complaining about us throttling in kernel-space, causing
> > > grief when we also happen to hold a mutex or some other resource and has
> > > been prodding us to only throttle at the return-to-user boundary.
> >
> > Yea, we've been having similar priority inversion issues. It isn't
> > limited to CFS bandwidth though, such problems are also pretty easy to
> > hit with configurations of shares, cpumasks, and SCHED_IDLE. I've
>
> We need to distinguish between work-conserving and non-work-conserving
> control schemes. Work-conserving ones - such as shares and idle - shouldn't
> affect the aggregate amount of work the system can perform. There may be
> local and temporary priority inversions but they shouldn't affect the
> throughput of the system and the scheduler should be able to make the
> eventual resource distribution conform to the configured targets.
>
> CPU affinity and bw control are not work conserving and thus cause a
> different class of problems. While it is possible to slow down a system with
> overly restrictive CPU affinities, it's a lot harder to do so severely
> compared to BW control because no matter what you do, there's still at least
> one CPU which can make full forward progress. With BW control, it's really easy
> to stall the entire system almost completely because we're giving userspace
> the ability to stall tasks for an arbitrary amount of time at random places
> in the kernel. This is what cgroup1 freezer did which had exactly the same
> problems.

Yes, but schemes such as shares and idle can still end up creating
some severe inversions. For example, a SCHED_IDLE thread on a cpu with
many other threads. Eventually the SCHED_IDLE thread will get run, but
the round robin times can easily get pushed out to several hundred ms
(or even into the seconds range), due to min granularity. cpusets
combined with the load balancer's struggle to find low weight tasks
exacerbates such situations.
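
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python. It assumes a deliberately simplified model in which every other
runnable thread on the cpu runs for one minimum granularity before the
SCHED_IDLE thread is picked again. The helper name, the 3 ms default,
and the thread counts are illustrative assumptions, not measurements,
and this is not how CFS actually computes slices.

```python
def worst_case_wait_ms(n_other_threads: int,
                       min_granularity_ms: float = 3.0) -> float:
    """Rough wait between two runs of a low-weight thread.

    Simplified model (an assumption): each of the other runnable
    threads runs for one minimum granularity before the low-weight
    thread is selected again.
    """
    return n_other_threads * min_granularity_ms


if __name__ == "__main__":
    # Illustrative thread counts only.
    for n in (10, 100, 400):
        wait = worst_case_wait_ms(n)
        print(f"{n:4d} other threads -> ~{wait:.0f} ms between runs")
```

Under these assumptions, a hundred or so competing threads already put
the wait in the hundreds of ms, and a few hundred push it past a second.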

> > chatted with the folks working on the proxy execution patch series,
> > and it seems like that could be a better generic solution to these
> > types of issues.
>
> Care to elaborate?

https://lwn.net/Articles/793502/ gives some historical context, see
also https://lwn.net/Articles/910302/.

> > Throttle at return-to-user seems only mildly beneficial, and then only
> > really with preemptive kernels. Still pretty easy to get inversion
> > issues, e.g. a thread holding a kernel mutex waking back up into a
> > hierarchy that is currently throttled, or a thread holding a kernel
> > mutex exists in the hierarchy being throttled but is currently waiting
> > to run.
>
> I don't follow. If you only throttle at predefined safe spots, the easiest
> place being the kernel-user boundary, you cannot get system-wide stalls from
> BW restrictions, which is something the kernel shouldn't allow userspace to
> cause. In your example, a thread holding a kernel mutex waking back up into
> a hierarchy that is currently throttled should keep running in the kernel
> until it encounters such safe throttling point where it would have released
> the kernel mutex and then throttle.

Agree except that for the task waking back up, it isn't on cpu, so
there is no "wait to throttle it until it returns to user", since
throttling happens in the context of the entire cfs_rq. We'd have to
treat threads in a bandwidth hierarchy that are also in kernel mode
specially. Mechanically, it is more straightforward to defer throttling
until the cfs_rq has no more threads in kernel mode than it is to
exclude a woken task from the currently throttled period of its
cfs_rq, though this is incomplete.
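
As a toy illustration of the first option (holding off the throttle
until no thread in the cfs_rq is in kernel mode), here is a small Python
model. It is only a sketch: Task, ToyCfsRq, and their fields and methods
are hypothetical names made up for this example and do not correspond to
the kernel's actual data structures or code paths.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    in_kernel: bool = False  # hypothetical per-task "in kernel mode" flag


@dataclass
class ToyCfsRq:
    tasks: List[Task] = field(default_factory=list)
    quota_exhausted: bool = False
    throttled: bool = False

    def kernel_mode_tasks(self) -> List[Task]:
        return [t for t in self.tasks if t.in_kernel]

    def try_throttle(self) -> bool:
        # Defer the throttle while any thread is still in kernel mode,
        # since it may be holding a mutex or another kernel resource.
        if self.quota_exhausted and not self.kernel_mode_tasks():
            self.throttled = True
        return self.throttled

    def return_to_user(self, task: Task) -> bool:
        # A thread reaching the kernel-user boundary is a safe point
        # at which to re-check the pending throttle.
        task.in_kernel = False
        return self.try_throttle()


if __name__ == "__main__":
    rq = ToyCfsRq(tasks=[Task("a", in_kernel=True), Task("b")])
    rq.quota_exhausted = True
    print(rq.try_throttle())             # "a" is still in kernel mode
    print(rq.return_to_user(rq.tasks[0]))
```

Note that this model does not cover a task waking up in kernel mode into
a cfs_rq that is already throttled.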

What you're suggesting would also require that we find a way to
preempt the current thread to start running the thread that woke up in
kernel (and this becomes more complex when the current thread is also
in kernel, or if there are n other waiting threads that are also in
kernel).

>
> Thanks.
>
> --
> tejun

Best,
Josh