Subject: Re: CFS flat runqueue proposal fixes/update
To: Rik van Riel, Peter Zijlstra
Cc: Paul Turner, vincent.guittot, kernel-team@fb.com,
 linux-kernel@vger.kernel.org, dietmar.eggeman
References: <1609106d05a6a4a5938233e993548510f599d7d9.camel@surriel.com>
From: Dietmar Eggemann
Message-ID: <1eaeeb1d-96c7-4819-8e1e-8f190a650030@arm.com>
Date: Fri, 7 Aug 2020 16:14:19 +0200
In-Reply-To: <1609106d05a6a4a5938233e993548510f599d7d9.camel@surriel.com>

Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
>
> last year at the Linux Plumbers conference, I presented my work
> on turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
>
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.

[...]

> Possible solution
>
> A possible solution to the problems above consists of
> several parts:
> - Instead of accounting all used wall clock CPU time into
>   vruntime immediately, at the task's current hierarchical
>   priority, the vruntime can be accounted for piecemeal, for
>   example in amounts corresponding to "one timeslice at full
>   priority".
> - If there is only one runnable task on the runqueue,
>   all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
>   into vruntime at once can be removed from the root runqueue
>   and placed into their cgroup runqueue. A heap of cgroup
>   runqueues with "overloaded" tasks can be attached to the
>   main runqueue, where the left-most task from that heap of
>   heaps gets some vruntime accounted every time we go into
>   pick_next_task.
> - The difference between the vruntime of each task in that
>   heap and the vruntime of the root runqueue can help determine
>   how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
>   in the heap after getting vruntime accounted, that runqueue
>   and the queue of runqueues can be re-sorted.
> - Once a task has accounted all of its outstanding delta
>   exec runtime into vruntime, it can be moved back to the
>   main runqueue.
> - This should solve the unequal task weight scenario Paul
>   Turner pointed out last year, because after task t1 goes
>   to sleep and only t2 and t3 remain on the CPU, t2 will
>   get its delta exec runtime converted into vruntime at
>   its new priority (equal to t3's).
> - By only accounting delta exec runtime to vruntime for
>   the left-most task in the "overloaded" heap at one time,
>   we guarantee only one task at a time will be added back
>   into the root runqueue.
> - Every time a task is added to the root runqueue, that
>   slows down the rate at which vruntime advances, which
>   in turn reduces the rate at which tasks get added back
>   into the runqueue, and makes it more likely that a
>   currently running task with low hierarchical priority
>   gets booted off into the "overloaded" heap.
>
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that, were it to run even a minimal timeslice, it would
> be unable to account that time into its vruntime in one go,
> AND the CPU already has a large number of tasks on it.
>
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is, the
> overhead should be reasonable, because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
>
> Does this seem like something worth trying out?
>
> Did I overlook any other corner cases that would make this
> approach unworkable?
>
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I think I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/20190906191237.27006-1-riel@surriel.com
is pretty old by now.

Do you have a version of the patch-set against tip/sched/core? Quite
a lot has changed in the meantime (runnable load avg replaced by
runnable avg, rq->load is gone, the CFS load balance rework).

IIRC, the 'CFS flat runqueue design' has the advantage of reducing
overhead in taskgroup-heavy environments like systemd-based systems.

And I recall that v5 doesn't cover CFS bandwidth control yet.

IMHO it would be extremely helpful to have a current patch-set to
discuss how these other problems can be covered by patches on top.
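
Just to make sure I read the piecemeal accounting correctly, here is
a minimal user-space sketch of my understanding. All names and
constants (toy_task, account_piecemeal, SLICE_NS, the weights) are
made up for illustration; this is not against any tree:

/*
 * Toy model of the piecemeal vruntime accounting idea.
 * Not kernel code; all names here are invented.
 */
#include <stdio.h>
#include <stdbool.h>

#define NICE_0_LOAD	1024ULL
#define SLICE_NS	4000000ULL  /* "one timeslice at full priority" */

struct toy_task {
	const char *name;
	unsigned long long h_weight;	/* hierarchical load weight */
	unsigned long long vruntime;
	unsigned long long pending_ns;	/* delta exec not yet accounted */
	bool overloaded;		/* parked in the cgroup heap */
};

/* weight-scaled vruntime delta, in the spirit of calc_delta_fair() */
static unsigned long long scale(unsigned long long delta_ns,
				unsigned long long h_weight)
{
	return delta_ns * NICE_0_LOAD / h_weight;
}

/*
 * Account at most one full-priority timeslice worth of the pending
 * delta exec runtime into vruntime, as done per pick_next_task() for
 * the left-most task of the "overloaded" heap. Returns true once the
 * task has fully caught up and may move back to the root runqueue.
 */
static bool account_piecemeal(struct toy_task *t, int nr_running)
{
	unsigned long long chunk = t->pending_ns;

	/* single runnable task: account everything in one go */
	if (nr_running > 1 && chunk > SLICE_NS)
		chunk = SLICE_NS;

	t->vruntime += scale(chunk, t->h_weight);
	t->pending_ns -= chunk;

	t->overloaded = t->pending_ns > 0;
	return !t->overloaded;
}

int main(void)
{
	/* low hierarchical priority, 12ms of wall clock to digest */
	struct toy_task t = { "t2", 128, 0, 12000000ULL, true };
	int pick = 0;

	while (t.overloaded) {
		bool done = account_piecemeal(&t, 3);
		printf("pick %d: vruntime=%llu pending=%llu%s\n",
		       ++pick, t.vruntime, t.pending_ns,
		       done ? " -> back to root rq" : "");
	}
	return 0;
}

If that matches your intent, then the clamping of 'chunk' is
effectively where the admission control you mention lives.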