2020-07-31 07:43:00

by Rik van Riel

Subject: CFS flat runqueue proposal fixes/update

Hello,

Last year at the Linux Plumbers Conference, I presented my work
on turning the hierarchical CFS runqueue into a flat runqueue,
and Paul Turner pointed out some corner cases that could not
work with the design as it stood last year.

Paul pointed out two corner cases, and I have come up with a
third one myself, but I believe they all revolve around the
same issue, admission control, and can all be solved with the
same solution.

This is a fairly complex topic, so this email is long and
split into four parts:
- hierarchical runqueue problem statement
- quick design overview of the flat runqueue for CFS
- overview of the problems pointed out by Paul Turner
- description of a possible solution


Hierarchical runqueue problem statement

Currently, CFS with the cgroup CPU controller uses a
hierarchical design, which means that a task is enqueued
on its cgroup's runqueue, that cgroup is enqueued on its
parent's runqueue, and so on all the way up to the root
runqueue. This also means that every time a task in a
cgroup wakes up or goes to sleep, in the common case
(one runnable task on the CPU) the entire hierarchy gets
enqueued or dequeued, resulting in significant overhead
for workloads with lots of context switches.

For example, the hierarchy in a system with equal priority
cgroups A and B, with cgroup A subdivided into two unequal
priority cgroups A1 and A2, and a task in each leaf cgroup,
would look like this:

      /\
     /  \
    A    B
   / \    \
  A1  A2   t3
  |    |
  t1   t2

The common case on most systems is that CPUs are idle
some of the time, and CPUs usually have 0, 1, or 2
running tasks at a time.

That means that when task t1 wakes up while the CPU is idle,
t1 first gets enqueued on A1, which then gets enqueued
on A, which in turn gets enqueued on the root runqueue.
When t1 goes back to sleep, the whole thing is torn back
down.
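
To illustrate the cost, here is a rough, self-contained sketch of
what the hierarchical enqueue path does. The names and structures
below are invented for this email (the real code in
kernel/sched/fair.c is of course much more involved):

/*
 * Illustrative sketch only, not kernel code: enqueueing one task
 * walks up the cgroup hierarchy, touching one runqueue per level,
 * until it reaches a level that is already enqueued.
 */
struct cfs_rq;

struct sched_entity_sketch {
        struct sched_entity_sketch *parent;   /* NULL above the root */
        struct cfs_rq *cfs_rq;                /* runqueue this entity sits on */
        int on_rq;
};

/* stand-in for the real rbtree insertion and load tracking update */
static void enqueue_one_level(struct cfs_rq *cfs_rq,
                              struct sched_entity_sketch *se)
{
        (void)cfs_rq;
        se->on_rq = 1;
}

static void enqueue_task_hierarchical(struct sched_entity_sketch *se)
{
        /* walk from the task up towards the root runqueue */
        for (; se && !se->on_rq; se = se->parent)
                enqueue_one_level(se->cfs_rq, se);
}

With three levels above t1, a single wakeup runs that loop three
times, and the dequeue path at sleep time does the same walk again.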

In case all three tasks are runnable at the same time,
the scheduler will first choose between A and B on the
root runqueue and, if it chose A, between A1 and A2 at
the next level down.


CFS flat runqueue design

The code I presented last year operates on the principle
of separating the determination of a task's hierarchical
priority, which is done periodically, from the runqueue
operations, and of using that hierarchical priority to
scale the task's vruntime the way it is already done for
nice levels.

The hierarchical priority of a task is the product, over
every level of the runqueue hierarchy along the path to
that task, of that entity's fraction of the weight at that
level. In the example above, with equal priority levels for
cgroups A, B, A1, and A2, and a total weight of 1000 for
the root runqueue, tasks t1, t2, and t3 will have
hierarchical weights of 250, 250, and 500 respectively.
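
As a worked example of that calculation, here is a small,
self-contained program (names invented for this email) that
reproduces the 250/250/500 numbers above:

#include <stdio.h>

/* a child's share of its parent's hierarchical weight */
static unsigned long hier_weight(unsigned long parent_weight,
                                 unsigned long my_weight,
                                 unsigned long level_total_weight)
{
        return parent_weight * my_weight / level_total_weight;
}

int main(void)
{
        unsigned long root = 1000;
        /* A and B have equal weight at the root level */
        unsigned long a  = hier_weight(root, 1000, 2000);      /* 500 */
        unsigned long b  = hier_weight(root, 1000, 2000);      /* 500 */
        /* A1 and A2 have equal weight inside cgroup A */
        unsigned long t1 = hier_weight(a, 1000, 2000);         /* 250 */
        unsigned long t2 = hier_weight(a, 1000, 2000);         /* 250 */
        unsigned long t3 = b;   /* t3 is the only task in B: 500 */

        printf("t1=%lu t2=%lu t3=%lu\n", t1, t2, t3);
        return 0;
}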

CFS tracks both wall clock run time and vruntime for each
task, where vruntime is the wall clock run time scaled
according to the task's priority. The vruntime of every
entity on a runqueue advances at the same rate.

The vruntime is calculated as follows:

vruntime = exec_time * FIXED_POINT_ONE / priority

That is, if the priority of the task is equal to the
fixed point constant used, the vruntime corresponds
to the executed wall clock time, while a lower priority
task has its vruntime advance at a faster rate, and
a higher priority task has its vruntime advance more slowly.

With every entity on a runqueue having its vruntime
advance at the same rate, that translates into higher
priority tasks receiving more CPU time, and lower
priority tasks receiving less CPU time.
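
In code, that scaling could look something like the following
sketch (illustrative only; FIXED_POINT_ONE and the priority values
are the ones used in this email, not what the kernel actually uses):

#define FIXED_POINT_ONE 1000ULL

/* convert executed wall clock time into vruntime at a given priority */
static unsigned long long exec_to_vruntime(unsigned long long exec_ns,
                                           unsigned long long hier_priority)
{
        return exec_ns * FIXED_POINT_ONE / hier_priority;
}

A task with hierarchical priority 500 that runs for 3 ms sees its
vruntime advance by 6 ms, i.e. twice as fast as the wall clock,
while a task with hierarchical priority 2000 would advance at half
the wall clock rate.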


Problems identified by Paul Turner

Let's revisit the hierarchy from above, and assign priorities
to the cgroups, with the fixed point one being 1000. Let's
say cgroups A, A1, and B have priority 1000, while cgroup
A2 has priority 1.

      /\
     /  \
    A    B
   / \    \
  A1  A2   t3
  |    |
  t1   t2

One consequence of this is that when t1, t2, and t3 each
get a time slice, the vruntime of tasks t1 and t3 advances
at roughly the same speed as the clock time, while the
vruntime of task t2 advances 1000x faster.

This is fine if all three tasks continue to be runnable,
since t1, t2 and t3 each get their fair share of CPU time.

However, if t1 goes to sleep, t2 is the only thing running
inside cgroup A, which has the same priority as cgroup B,
and tasks t2 and t3 should be getting the same amount of
CPU time.

They eventually will, but not before task t3 has used up
enough CPU time to catch up with the enormous vruntime
advance that t2 just suffered.

That needs to be fixed, to get near-immediate convergence,
and not convergence after some unknown (potentially long)
period of time.
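
To put rough numbers on it (purely illustrative, using the fixed
point one of 1000 from the example): suppose t2 gets to run for
1 ms at its old, tiny hierarchical priority. That 1 ms of wall
clock time advances t2's vruntime by roughly a second, about a
thousand times more than the same run time advances the vruntime
of t1 or t3. Once t1 sleeps, t2 and t3 should be sharing the CPU
equally, but t3 will keep being picked until its own vruntime has
advanced by that same ~1 second, which can take hundreds of
milliseconds of exclusive CPU time before t2 gets scheduled again.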


There are similar issues with thundering herds and cgroup
cpu.max throttling, where a large number of tasks can be
woken up simultaneously.

      /\
     /  \
    A    B
    |    |
    t1   t2 t3 t4 t5 ... t42

In the current, hierarchical runqueue setup, CFS will
time slice between cgroups A and B at the top level,
which means task t1 will never go long without being
scheduled. Only the tasks in cgroup B suffer the
latency effects of cgroup B being overloaded with way
too many runnable tasks.

A similar issue can happen when a cgroup is throttled
with cpu.max, and wakes up with a large number of
runnable tasks inside.

It is desirable that fairness, both from a latency and
a throughput point of view, converge quickly rather
than slowly.


Possible solution

A possible solution to the problems above consists of
several parts (a rough code sketch follows the list below):
- Instead of accounting all used wall clock CPU time into
vruntime immediately, at the task's current hierarchical
priority, the vruntime can be accounted for piecemeal, for
example in amounts corresponding to "one timeslice at full
priority".
- If there is only one runnable task on the runqueue,
all the runtime can be accounted into vruntime in one go.
- Tasks that cannot account all of their used CPU time
into vruntime at once can be removed from the root runqueue,
and placed into the cgroup runqueue. A heap of cgroup
runqueues with "overloaded" tasks can be attached to the
main runqueue, where the left-most task from that heap of
heaps gets some vruntime accounted every time we go into
pick_next_task.
- The difference between the vruntime of each task in that
heap and the vruntime of the root runqueue can help determine
how much vruntime can be accounted to that task at once.
- If the task, or its runqueue, is no longer the left-most
in the heap after getting vruntime accounted, that runqueue
and the queue of runqueues can be resorted.
- Once a task has accounted all of its outstanding delta
exec runtime into vruntime, it can be moved back to the
main runqueue.
- This should solve the unequal task weight scenario Paul
Turner pointed out last year, because after task t1 goes
to sleep and only t2 and t3 remain on the CPU, t2 will
get its delta exec runtime converted into vruntime at
its new priority (equal to t3).
- By only accounting delta exec runtime to vruntime for
the left-most task in the "overloaded" heap at one time,
we guarantee only one task at a time will be added back
into the root runqueue.
- Every time a task is added to the root runqueue, that
slows down the rate at which vruntime advances, which
in turn reduces the rate at which tasks get added back
into the runqueue, and makes it more likely that a currently
running task with low hierarchical priority gets booted
off into the "overloaded" heap.
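
The accounting step described in the list above could look roughly
like the following sketch. All names and data structures below are
invented for this email, to illustrate the idea rather than to match
any existing code:

#define FIXED_POINT_ONE         1000ULL
#define MAX_VRUNTIME_CHUNK      3000000ULL  /* "one timeslice at full priority", in ns */

struct flat_task {
        unsigned long long delta_exec;      /* used CPU time not yet in vruntime */
        unsigned long long vruntime;
        unsigned long long hier_priority;   /* out of FIXED_POINT_ONE */
};

/*
 * Convert at most one chunk of delta exec into vruntime, at the
 * task's *current* hierarchical priority. Returns 1 when everything
 * has been accounted and the task can go back on the root runqueue,
 * 0 if it has to stay in the "overloaded" heap for now.
 */
static int account_vruntime_chunk(struct flat_task *t, int only_task_on_rq)
{
        unsigned long long vdelta =
                t->delta_exec * FIXED_POINT_ONE / t->hier_priority;

        if (only_task_on_rq || vdelta <= MAX_VRUNTIME_CHUNK) {
                /* small enough (or nobody else is waiting): account it all */
                t->vruntime += vdelta;
                t->delta_exec = 0;
                return 1;
        }

        /* account one chunk now; the rest waits for a later pick_next_task */
        t->vruntime += MAX_VRUNTIME_CHUNK;
        t->delta_exec -= MAX_VRUNTIME_CHUNK * t->hier_priority / FIXED_POINT_ONE;
        return 0;
}

A task for which this keeps returning 0 stays in the overloaded
heap, and only the left-most such task gets a chunk accounted on
each pass through pick_next_task; in the scenario above, t2 gets
its remaining delta exec converted at its new (higher) hierarchical
priority, so the huge vruntime jump never happens.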

To tackle the thundering herd at task wakeup time, another
strategy may be needed. One thing we may be able to do
there is place tasks into the "overloaded" heap immediately
on wakeup, if the hierarchical priority of the task is so
low that, were it to run even a minimal timeslice, it would
be unable to account that time into its vruntime in one go,
AND the CPU already has a large number of tasks on it.
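
The wakeup-time check could be a simple predicate along these lines
(again purely illustrative, reusing the invented names from the
sketch above; the minimal timeslice and the "large number of tasks"
threshold are arbitrary placeholder values):

#define MIN_TIMESLICE_NS        750000ULL   /* placeholder minimal slice */
#define BUSY_NR_RUNNING         8           /* placeholder "busy CPU" threshold */

/* should a waking task bypass the root runqueue entirely? */
static int wakeup_goes_to_overloaded(struct flat_task *t,
                                     unsigned int nr_running)
{
        unsigned long long vdelta =
                MIN_TIMESLICE_NS * FIXED_POINT_ONE / t->hier_priority;

        return vdelta > MAX_VRUNTIME_CHUNK && nr_running >= BUSY_NR_RUNNING;
}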

Because the common case on most systems is having just
0, 1, or 2 runnable tasks on a CPU, this fancy scheme
should rarely end up being used, and even when it is,
the overhead should be reasonable because most of the
overloaded tasks will just sit there until pick_next_task
gets around to them.

Does this seem like something worth trying out?

Did I overlook any other corner cases that would make this
approach unworkable?

Did I forget to explain anything that is needed to help
understand the problem and the proposed solution better?

--
All Rights Reversed.



2020-08-07 14:15:52

by Dietmar Eggemann

Subject: Re: CFS flat runqueue proposal fixes/update

Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
>
> last year at Linux Plumbers conference, I presented on my work
> of turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
>
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.

[...]

> Possible solution
>
> A possible solution to the problems above consists of
> three parts:
> - Instead of accounting all used wall clock CPU time into
> vruntime immediately, at the task's current hierarchical
> priority, the vruntime can be accounted for piecemeal, for
> example in amounts corresponding to "one timeslice at full
> priority".
> - If there is only one runnable task on the runqueue,
> all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
> into vruntime at once can be removed from the root runqueue,
> and placed into the cgroup runqueue. A heap of cgroup
> runqueues with "overloaded" tasks can be attached to the
> main runqueue, where the left-most task from that heap of
> heaps gets some vruntime accounted every time we go into
> pick_next_task.
> - The difference between the vruntime of each task in that
> heap and the vruntime of the root runqueue can help determine
> how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
> in the heap after getting vruntime accounted, that runqueue
> and the queue of runqueues can be resorted.
> - Once a task has accounted all of its outstanding delta
> exec runtime into vruntime, it can be moved back to the
> main runqueue.
> - This should solve the unequal task weight scenario Paul
> Turner pointed out last year, because after task t1 goes
> to sleep and only t2 and t3 remain on the CPU, t2 will
> get its delta exec runtime converted into vruntime at
> its new priority (equal to t3).
> - By only accounting delta exec runtime to vruntime for
> the left-most task in the "overloaded" heap at one time,
> we guarantee only one task at a time will be added back
> into the root runqueue.
> - Every time a task is added to the root runqueue, that
> slows down the rate at which vruntime advances, which
> in turn reduces the rate at which tasks get added back
> into the runqueue, and makes it more likely that a currently
> running task with low hierarchical priority gets booted
> off into the "overloaded" heap.
>
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that if the task were to run a minimal timeslice length,
> it would be unable to account that time into its vruntime
> in one go, AND the CPU already has a larger number of tasks
> on it.
>
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is the
> overhead should be reasonable because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
>
> Does this seem like something worth trying out?
>
> Did I overlook any other corner cases that would make this
> approach unworkable?
>
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I imagine that I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/[email protected] is
pretty old by now.

Do you have a version of the patch-set against tip/sched/core? Quite a
lot has changed (runnable load avg replaced by runnable avg, rq->load is
gone, CFS load balance rework).

IIRC, the 'CFS flat runqueue design' has the advantage of reducing
the overhead in taskgroup-heavy environments like systemd. And I
recall that v5 doesn't cover CFS bandwidth control yet.

IMHO it would be extremely helpful to have a current patch-set to
discuss how these other problems can be covered by patches on top.

2020-08-07 19:43:57

by Rik van Riel

Subject: Re: CFS flat runqueue proposal fixes/update

On Fri, 2020-08-07 at 16:14 +0200, Dietmar Eggemann wrote:
> On 31/07/2020 09:42, Rik van Riel wrote:
> > Possible solution
> > ...
> I imagine that I can see what you want to achieve here ;-)
>
> But it's hard since your v5 RFC
> https://lkml.kernel.org/r/[email protected] is
> pretty old by now.

The v5 patches also do not implement this new idea, and
still suffer from the corner cases that Paul Turner
pointed out last year.

> Do you have a version of the patch-set against tip/sched/core? Quite
> a
> lot has changed (runnable load avg replaced by runnable avg, rq->load
> is
> gone, CFS load balance rework).

Not yet. We had a baby this spring, so I've been busy
with things like milk and diapers, instead of with
code.

I wanted to get this proposal out before Plumbers, so
we could at least talk about it, and maybe find flaws
with this idea before I spend weeks/months implementing it :)

> IMHO it would be extremely helpful to have a current patch-set to
> discuss how these other problems can be covered by patches on top.

Agreed. I hope to get some time to work on that, but
no guarantees about getting it ready before Plumbers :)

--
All Rights Reversed.



2020-08-20 14:58:37

by Dietmar Eggemann

Subject: Re: CFS flat runqueue proposal fixes/update

Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:

[...]

> Lets revisit the hierarchy from above, and assign priorities
> to the cgroups, with the fixed point one being 1000. Lets
> say cgroups A, A1, and B have priority 1000, while cgroup
> A2 has priority 1.
>
> /\
> / \
> A B
> / \ \
> A1 A2 t3
> / \
> t1 t2
>
> One consequence of this is that when t1, t2, and t3 each
> get a time slice, the vruntime of tasks t1 and t3 advances
> at roughly the same speed as the clock time, while the
> vruntime of task t2 advances 1000x faster.
>
> This is fine if all three tasks continue to be runnable,
> since t1, t2 and t3 each get their fair share of CPU time.
>
> However, if t1 goes to sleep, t2 is the only thing running
> inside cgroup A, which has the same priority as cgroup B,
> and tasks t2 and t3 should be getting the same amount of
> CPU time.
>
> They eventually will, but not before task t3 has used up
> enough CPU time to catch up with the enormous vruntime
> advance that t2 just suffered.
>
> That needs to be fixed, to get near-immediate convergence,
> and not convergence after some unknown (potentially long)
> period of time.

I'm trying to understand this issue in detail ...

Since t1 and t2 are single tasks in A1 and A2, this taskgroup level
shouldn't matter for tick preemption after t1 went to sleep?

check_preempt_tick() is only invoked for 'cfs_rq->nr_running > 1' from
entity_tick().

IMHO, tick preemption is handled between A and B and, since they have
the same cpu.weight (cpu.shares), t2 and t3 get the same time slice
after t1 went to sleep.

I think that here tick preemption happens in the 'if (delta_exec >
ideal_runtime)' condition w/ delta_exec = curr->sum_exec_runtime -
curr->prev_sum_exec_runtime.

Did I miss anything?

[...]

2020-08-20 21:54:47

by Rik van Riel

Subject: Re: CFS flat runqueue proposal fixes/update

On Thu, 2020-08-20 at 16:56 +0200, Dietmar Eggemann wrote:
> Hi Rik,
>
> On 31/07/2020 09:42, Rik van Riel wrote:
>
> [...]
>
> > Lets revisit the hierarchy from above, and assign priorities
> > to the cgroups, with the fixed point one being 1000. Lets
> > say cgroups A, A1, and B have priority 1000, while cgroup
> > A2 has priority 1.
> >
> > /\
> > / \
> > A B
> > / \ \
> > A1 A2 t3
> > / \
> > t1 t2
> >
> > One consequence of this is that when t1, t2, and t3 each
> > get a time slice, the vruntime of tasks t1 and t3 advances
> > at roughly the same speed as the clock time, while the
> > vruntime of task t2 advances 1000x faster.
> >
> > This is fine if all three tasks continue to be runnable,
> > since t1, t2 and t3 each get their fair share of CPU time.
> >
> > However, if t1 goes to sleep, t2 is the only thing running
> > inside cgroup A, which has the same priority as cgroup B,
> > and tasks t2 and t3 should be getting the same amount of
> > CPU time.
> >
> > They eventually will, but not before task t3 has used up
> > enough CPU time to catch up with the enormous vruntime
> > advance that t2 just suffered.
> >
> > That needs to be fixed, to get near-immediate convergence,
> > and not convergence after some unknown (potentially long)
> > period of time.
>
> I'm trying to understand this issue in detail ...
>
> Since t1 and t2 are single tasks in A1 and A2, this taskgroup level
> shouldn't matter for tick preemption after t1 went to sleep?
>
> check_preempt_tick() is only invoked for 'cfs_rq->nr_running > 1'
> from
> entity_tick().
>
> IMHO, tick preemption is handled between A and B and since they have
> the
> same cpu.weight (cpu.shares) t2 and t3 get the same time slice after
> t1
> went to sleep.
>
> I think that here tick preemption happens in the 'if (delta_exec >
> ideal_runtime)' condition w/ delta_exec = curr->sum_exec_runtime -
> curr->prev_sum_exec_runtime.
>
> Did I miss anything?

The issue happens with a flat runqueue, when t1 goes
to sleep, but t2 and t3 continue running.

We need to make sure the vruntime for t2 has not been
advanced so far into the future that cgroup A is unable
to get its fair share of CPU while t1 is sleeping.

--
All Rights Reversed.



2020-08-21 17:54:32

by Dietmar Eggemann

Subject: Re: CFS flat runqueue proposal fixes/update

On 20.08.20 22:39, Rik van Riel wrote:
> On Thu, 2020-08-20 at 16:56 +0200, Dietmar Eggemann wrote:

[...]

> The issue happens with a flat runqueue, when t1 goes
> to sleep, but t2 and t3 continue running.
>
> We need to make sure the vruntime for t2 has not been
> advanced so far into the future that cgroup A is unable
> to get its fair share of CPU while t1 is sleeping.

Ah, these problems are related to the 'CFS flat runqueue' patch-set!
Misunderstanding on my side.

I somehow assumed that you wanted to say that these problems exist in
current mainline and could be solved with the patch-set plus some extra
functionality on top.