Date: Fri, 15 Dec 2017 16:13:09 +0000
From: Patrick Bellasi
To: Mike Galbraith
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar,
    Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
    Juri Lelli, Todd Kjos, Joel Fernandes
Subject: Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
Message-ID: <20171215161309.GF19821@e110439-lin>
In-Reply-To: <1513187793.7297.26.camel@gmx.de>
References: <20171205171018.9203-1-patrick.bellasi@arm.com>
            <1513187793.7297.26.camel@gmx.de>

Hi Mike,

On 13-Dec 18:56, Mike Galbraith wrote:
> On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > This is a respin of:
> >    https://lkml.org/lkml/2017/11/9/546
> > which has been rebased on v4.15-rc2 to have util_est now working on top
> > of the recent PeterZ's:
> >    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> >
> > The aim of this series is to improve some PELT behaviors to make it a
> > better fit for the scheduling of tasks common in embedded mobile
> > use-cases, without affecting other classes of workloads.
>
> I thought perhaps this patch set would improve the below behavior, but
> alas it does not.  That's 3 instances of firefox playing youtube clips
> being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> serializes the threaded desktop, making that threading kinda pointless,
> and CFS not all that fair.

Perhaps I don't completely get your use-case: are the cpuhog threads
pinned to a CPU, or do they just happen to always run on the same CPU?

I guess you would expect the three Firefox instances to be spread
across different CPUs. Whether that is possible, however, also depends
on the specific task composition Firefox generates, doesn't it?

Since this is a video playback pipeline, I would not be surprised if
most of the time only 1 or 2 tasks are actually RUNNABLE while the
others are sleeping... and if a HW decoder is involved, even with three
instances running you likely get only one pipeline active at any given
time.

If that's the case, why should CFS move the Firefox tasks around?
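For instance, a pipeline along the lines of the toy sketch below (just
an illustration I put together, not anything extracted from Firefox:
the thread names, the 5ms of work per stage and the handshake are all
made up) has, apart from the handover points, only one CPU-bound thread
at any time, so there is not much for the load balancer to spread:

/*
 * Purely illustrative sketch: a two stage "decode -> render" pipeline
 * where each stage blocks until the other one is done with the current
 * frame, so (apart from the handover) only one thread is RUNNABLE at
 * any time and the load balancer has nothing to spread across CPUs.
 */
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int frame_ready;                 /* 0: decoder's turn, 1: renderer's turn */

/* burn ~ms milliseconds of CPU, standing in for real decode/render work */
static void burn_cpu(int ms)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec  - start.tv_sec)  * 1000 +
                 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

static void *decoder(void *arg)
{
        for (int i = 0; i < 1000; i++) {
                burn_cpu(5);                    /* "decode" one frame */
                pthread_mutex_lock(&lock);
                frame_ready = 1;                /* hand it over ... */
                pthread_cond_signal(&cond);
                while (frame_ready)             /* ... and sleep until consumed */
                        pthread_cond_wait(&cond, &lock);
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

static void *renderer(void *arg)
{
        for (int i = 0; i < 1000; i++) {
                pthread_mutex_lock(&lock);
                while (!frame_ready)            /* sleep until a frame arrives */
                        pthread_cond_wait(&cond, &lock);
                pthread_mutex_unlock(&lock);
                burn_cpu(5);                    /* "render" it */
                pthread_mutex_lock(&lock);
                frame_ready = 0;                /* wake up the decoder */
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t d, r;

        pthread_create(&d, NULL, decoder, NULL);
        pthread_create(&r, NULL, renderer, NULL);
        pthread_join(d, NULL);
        pthread_join(r, NULL);
        return 0;
}

Whether the real playback pipeline behaves like this is of course just
an assumption on my side.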
>  PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ P COMMAND
> 6569 root     20   0    4048    704    628 R 100.0 0.004  5:10.48 7 cpuhog
> 6573 root     20   0    4048    712    636 R 100.0 0.004  5:07.47 5 cpuhog
> 6581 root     20   0    4048    696    620 R 100.0 0.004  5:07.36 1 cpuhog
> 6585 root     20   0    4048    812    736 R 100.0 0.005  5:08.14 4 cpuhog
> 6589 root     20   0    4048    712    636 R 100.0 0.004  5:06.42 6 cpuhog
> 6577 root     20   0    4048    720    644 R 99.80 0.005  5:06.52 3 cpuhog
> 6593 root     20   0    4048    728    652 R 99.60 0.005  5:04.25 0 cpuhog
> 6755 mikeg    20   0 2714788 885324 179196 S 19.96 5.544  2:14.36 2 Web Content
> 6620 mikeg    20   0 2318348 312336 145044 S 8.383 1.956  0:51.51 2 firefox
> 3190 root     20   0  323944  71704  42368 S 3.194 0.449  0:11.90 2 Xorg
> 3718 root     20   0 3009580  67112  49256 S 0.599 0.420  0:02.89 2 kwin_x11
> 3761 root     20   0  769760  90740  62048 S 0.399 0.568  0:03.46 2 konsole
> 3845 root      9 -11  791224  20132  14236 S 0.399 0.126  0:03.00 2 pulseaudio
> 3722 root     20   0 3722308 172568  88088 S 0.200 1.081  0:04.35 2 plasmashel

Is this always happening... or do the Firefox tasks sometimes get a
chance to run on CPUs other than CPU2? Could it be that, looking at
htop output, we simply don't see these small opportunities?

> ------------------------------------------------------------------------------------------------------------------------------------
>  Task                 |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum delay ms     | Maximum delay at      |
> ------------------------------------------------------------------------------------------------------------------------------------
>  Web Content:6755     |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum: 2189.472 ms | max at:    375.769240 |
>  Compositor:6680      |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum: 2478.559 ms | max at:    375.759405 |
>  MediaPl~back #3:(13) |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum: 1711.635 ms | max at:    391.123970 |
>  MediaPl~back #4:(10) |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum: 1933.886 ms | max at:    376.124211 |
>  MediaPl~back #1:(9)  |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum: 1627.852 ms | max at:    401.122567 |
>  firefox:6620         |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:  691.865 ms | max at:    385.078558 |
>  Socket Thread:6639   |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:  329.387 ms | max at:    380.827323 |
>  MediaPD~oder #1:6835 |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum: 1113.688 ms | max at:    392.575370 |
>  MediaTimer #1:6828   |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum: 3018.994 ms | max at:    380.823677 |
>  MediaPD~oder #2:6840 |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum: 1112.962 ms | max at:    380.823741 |

How do you get these stats?

It's definitely an interesting use-case; however, I think it's out of
the scope of util_est.

Regarding the specific statement "CFS not all that fair": I would say
that the fairness of CFS is defined, and has to be evaluated, within a
single CPU and on a temporal (not clock-cycle) basis.

AFAIK, vruntime progresses based on elapsed time, so you can have two
tasks which get the same slice of time but consume it at different
frequencies. In that case too we are not really that fair, are we?
(See the toy model sketched at the end of this email.)

In the end it all boils down to some (as much as possible) low-overhead
heuristics, and a proper description of a reproducible use-case can
help in improving them.

Can we model your use-case using a simple rt-app configuration? That
would likely give us a simple and reproducible testing scenario to
better understand where the issue actually is... maybe by looking at an
execution trace.
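To make the vruntime point above a bit more concrete, here is a toy
model of the accounting (just a sketch of the idea, not the kernel's
actual update_curr()/calc_delta_fair() code; the weights, frequencies
and slice lengths are made up):

/*
 * Toy model: vruntime advances with wall-clock runtime scaled by the
 * task's weight, so two tasks getting the same 10ms slices are "fair"
 * even if one runs at half the clock frequency and thus retires half
 * the cycles.
 */
#include <stdio.h>

struct toy_task {
        const char *name;
        double weight;          /* load weight, NICE_0 == 1024 */
        double freq_ghz;        /* CPU frequency while the task runs */
        double vruntime_ms;
        double cycles;          /* work actually done */
};

/* charge 'delta_ms' of runtime to a task, time-based as CFS does */
static void account(struct toy_task *t, double delta_ms)
{
        t->vruntime_ms += delta_ms * 1024.0 / t->weight;  /* time-based */
        t->cycles      += delta_ms * 1e6 * t->freq_ghz;   /* cycle-based */
}

int main(void)
{
        struct toy_task a = { "A@2.0GHz", 1024, 2.0, 0, 0 };
        struct toy_task b = { "B@1.0GHz", 1024, 1.0, 0, 0 };

        /* both get the same 10ms slices, alternating for 100ms */
        for (int i = 0; i < 5; i++) {
                account(&a, 10.0);
                account(&b, 10.0);
        }

        printf("%s: vruntime=%.0fms cycles=%.0fM\n",
               a.name, a.vruntime_ms, a.cycles / 1e6);
        printf("%s: vruntime=%.0fms cycles=%.0fM\n",
               b.name, b.vruntime_ms, b.cycles / 1e6);
        return 0;
}

Both tasks end up with the same vruntime (50ms), even though B, running
at half the frequency, has retired only half the cycles of A.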
Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi