2020-07-14 13:02:48

by Peter Puhov

Subject: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

From: Peter Puhov <[email protected]>

v0: https://lkml.org/lkml/2020/6/16/1286

Changes in v1:
- Test results formatted in a table form as suggested by Valentin Schneider
- Added explanation by Vincent Guittot why nr_running may not be sufficient

In the slow path, when selecting the idlest group, if both groups have type
group_has_spare, only the idle_cpus count gets compared.
As a result, if multiple tasks are created in a tight loop,
and go back to sleep immediately
(while waiting for all tasks to be created),
they may be scheduled on the same core, because the CPU is back to idle
when the next fork happens.

For example:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 61582
...
sudo perf script
sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new:
sysbench:129380 [120] success=1 CPU:007
sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new:
sysbench:129381 [120] success=1 CPU:007
sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new:
sysbench:129382 [120] success=1 CPU:007
sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new:
sysbench:129383 [120] success=1 CPU:007

This may have a negative impact on performance for workloads with frequent
creation of multiple threads.

In this patch we use group_util to select the idlest group when both groups
have an equal number of idle_cpus. Comparing the number of idle CPUs is
not enough in this case, because the newly forked thread sleeps
immediately, before we select the CPU for the next one.
This is shown in the trace above, where the same CPU7 is selected for
all wakeup_new events.
That's why looking at utilization when there is the same number of idle
CPUs is a good way to see where the previous task was placed. Using
nr_running doesn't solve the problem either, because the newly forked task
is not running; had it still been running, the CPU would not have been
idle and an idle CPU would have been selected instead.

With this patch newly created tasks would be better distributed.
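
The resulting rule can be illustrated with a small user-space sketch
(illustration only; sgs_stub and pick_candidate() are simplified
stand-ins for the kernel's sg_lb_stats and update_pick_idlest()):

#include <stdio.h>

struct sgs_stub {
        unsigned int idle_cpus;
        unsigned long group_util;
};

/* Return 1 if the candidate group should replace the current idlest. */
static int pick_candidate(const struct sgs_stub *idlest,
                          const struct sgs_stub *sgs)
{
        /* Prefer the group with more idle CPUs ... */
        if (idlest->idle_cpus > sgs->idle_cpus)
                return 0;

        /* ... and, on a tie, the group with the lowest utilization. */
        if (idlest->idle_cpus == sgs->idle_cpus &&
            idlest->group_util <= sgs->group_util)
                return 0;

        return 1;
}

int main(void)
{
        struct sgs_stub idlest = { .idle_cpus = 4, .group_util = 200 };
        struct sgs_stub cand   = { .idle_cpus = 4, .group_util = 100 };

        /*
         * The old rule (a bare idle_cpus >= comparison) would keep the
         * incumbent on a tie; the new rule prints 1 and picks the less
         * utilized group.
         */
        printf("%d\n", pick_candidate(&idlest, &cand));
        return 0;
}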

With this patch:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 74401
...
sudo perf script
sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new:
sysbench:129457 [120] success=1 CPU:008
sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new:
sysbench:129458 [120] success=1 CPU:009
sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new:
sysbench:129459 [120] success=1 CPU:010
sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new:
sysbench:129460 [120] success=1 CPU:011


We tested this patch with following benchmarks:
master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'

100 iterations of: perf bench -f simple futex wake -s -t 128 -w 1
Lower result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 0.33 | 0.313 | +5.152 |
| std (%) | 10.433 | 7.563 | |


100 iterations of: sysbench threads --threads=8 run
Higher result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 5235.02 | 5863.73 | +12.01 |
| std (%) | 8.166 | 10.265 | |


100 iterations of: sysbench mutex --mutex-num=1 --threads=8 run
Lower result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 0.413 | 0.404 | +2.179 |
| std (%) | 3.791 | 1.816 | |
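
For reference, the DELTA column appears to be the relative improvement
over baseline, with the sign flipped for lower-is-better metrics. A
minimal sketch reproducing the numbers above (delta_pct() is purely
illustrative, not part of the patch or of the benchmarks):

#include <stdio.h>

static double delta_pct(double base, double patched, int lower_is_better)
{
        double d = (patched - base) / base * 100.0;

        return lower_is_better ? -d : d;
}

int main(void)
{
        printf("%+.3f\n", delta_pct(0.330, 0.313, 1));     /* +5.152 */
        printf("%+.2f\n", delta_pct(5235.02, 5863.73, 0)); /* +12.01 */
        printf("%+.3f\n", delta_pct(0.413, 0.404, 1));     /* +2.179 */
        return 0;
}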


Signed-off-by: Peter Puhov <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..abcbdf80ee75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,

case group_has_spare:
/* Select group with most idle CPUs */
- if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
+ if (idlest_sgs->idle_cpus > sgs->idle_cpus)
return false;
+
+ /* Select group with lowest group_util */
+ if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
+ idlest_sgs->group_util <= sgs->group_util)
+ return false;
+
break;
}

--
2.20.1


Subject: [tip: sched/core] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3edecfef028536cb19a120ec8788bd8a11f93b9e
Gitweb: https://git.kernel.org/tip/3edecfef028536cb19a120ec8788bd8a11f93b9e
Author: Peter Puhov <[email protected]>
AuthorDate: Tue, 14 Jul 2020 08:59:41 -04:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 22 Jul 2020 10:22:04 +02:00

sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

In the slow path, when selecting the idlest group, if both groups have type
group_has_spare, only the idle_cpus count gets compared.
As a result, if multiple tasks are created in a tight loop,
and go back to sleep immediately
(while waiting for all tasks to be created),
they may be scheduled on the same core, because the CPU is back to idle
when the next fork happens.

For example:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 61582
...
sudo perf script
sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new:
sysbench:129380 [120] success=1 CPU:007
sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new:
sysbench:129381 [120] success=1 CPU:007
sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new:
sysbench:129382 [120] success=1 CPU:007
sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new:
sysbench:129383 [120] success=1 CPU:007

This may have a negative impact on performance for workloads with frequent
creation of multiple threads.

In this patch we use group_util to select the idlest group when both groups
have an equal number of idle_cpus. Comparing the number of idle CPUs is
not enough in this case, because the newly forked thread sleeps
immediately, before we select the CPU for the next one.
This is shown in the trace above, where the same CPU7 is selected for
all wakeup_new events.
That's why looking at utilization when there is the same number of idle
CPUs is a good way to see where the previous task was placed. Using
nr_running doesn't solve the problem either, because the newly forked task
is not running; had it still been running, the CPU would not have been
idle and an idle CPU would have been selected instead.

With this patch newly created tasks would be better distributed.

With this patch:
sudo perf record -e sched:sched_wakeup_new -- \
sysbench threads --threads=4 run
...
total number of events: 74401
...
sudo perf script
sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new:
sysbench:129457 [120] success=1 CPU:008
sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new:
sysbench:129458 [120] success=1 CPU:009
sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new:
sysbench:129459 [120] success=1 CPU:010
sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new:
sysbench:129460 [120] success=1 CPU:011

We tested this patch with following benchmarks:
master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'

100 iterations of: perf bench -f simple futex wake -s -t 128 -w 1
Lower result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 0.33 | 0.313 | +5.152 |
| std (%) | 10.433 | 7.563 | |

100 iterations of: sysbench threads --threads=8 run
Higher result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 5235.02 | 5863.73 | +12.01 |
| std (%) | 8.166 | 10.265 | |

100 iterations of: sysbench mutex --mutex-num=1 --threads=8 run
Lower result is better
| | BASELINE | +PATCH | DELTA (%) |
|---------|------------|----------|-------------|
| mean | 0.413 | 0.404 | +2.179 |
| std (%) | 3.791 | 1.816 | |

Signed-off-by: Peter Puhov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98a53a2..2ba8f23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8711,8 +8711,14 @@ static bool update_pick_idlest(struct sched_group *idlest,

case group_has_spare:
/* Select group with most idle CPUs */
- if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
+ if (idlest_sgs->idle_cpus > sgs->idle_cpus)
return false;
+
+ /* Select group with lowest group_util */
+ if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
+ idlest_sgs->group_util <= sgs->group_util)
+ return false;
+
break;
}

2020-11-02 10:54:41

by Mel Gorman

Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Tue, Jul 14, 2020 at 08:59:41AM -0400, [email protected] wrote:
> From: Peter Puhov <[email protected]>
>
> v0: https://lkml.org/lkml/2020/6/16/1286
>
> Changes in v1:
> - Test results formatted in a table form as suggested by Valentin Schneider
> - Added explanation by Vincent Guittot why nr_running may not be sufficient
>
> In the slow path, when selecting the idlest group, if both groups have type
> group_has_spare, only the idle_cpus count gets compared.
> As a result, if multiple tasks are created in a tight loop,
> and go back to sleep immediately
> (while waiting for all tasks to be created),
> they may be scheduled on the same core, because the CPU is back to idle
> when the next fork happens.
>

Intuitively, this made some sense but it's a regression magnet. For those
that don't know, I run a grid that, among other things, operates similarly
to the Intel 0-day bot but runs much longer-lived tests on a less frequent
basis -- it can be a few weeks, sometimes longer, depending on grid activity.
Where it finds regressions, it bisects them and generates a report.

While not all tests have completed, I currently have 14 separate
regressions across 4 separate tests on 6 machines: Broadwell,
Haswell and EPYC 2 (all x86_64 of course, but different generations
and vendors). The workload configurations in mmtests are

pagealloc-performance-aim9
workload-shellscripts
workload-kerndevel
scheduler-unbound-hackbench

When reading the reports, the first and second columns are what it was
bisecting against. The third-last column is the "last good commit"
and the last column is the "first bad commit". The first bad commit is
always this patch.

The main concern is that all of these workloads have very short-lived tasks,
which is exactly what this patch is meant to address, so either sysbench
and futex behave very differently on the machine that was tested, or their
microbenchmark nature found one good corner case but missed bad ones.

I have not investigated why because I do not have the bandwidth
to do a detailed study (I was off for a few days and my backlog is
severe). However, I recommend that before v5.10 this be reverted and
retried. If I'm cc'd on v2, I'll run the same tests through the grid
and see what falls out.

I'll show one example of each workload from one machine.

pagealloc-performance-aim9
--------------------------

While multiple tests are shown, the exec_test and fork_test are
regressing and these are very short-lived: a 67% regression for exec_test
and a 32% regression for fork_test.

initial initial last penup last penup first
good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
Min page_test 522580.00 ( 0.00%) 537880.00 ( 2.93%) 536842.11 ( 2.73%) 542300.00 ( 3.77%) 537993.33 ( 2.95%) 526660.00 ( 0.78%) 532553.33 ( 1.91%)
Min brk_test 1987866.67 ( 0.00%) 2028666.67 ( 2.05%) 2016200.00 ( 1.43%) 2014856.76 ( 1.36%) 2004663.56 ( 0.84%) 1984466.67 ( -0.17%) 2025266.67 ( 1.88%)
Min exec_test 877.75 ( 0.00%) 284.33 ( -67.61%) 285.14 ( -67.51%) 285.14 ( -67.51%) 852.10 ( -2.92%) 932.05 ( 6.19%) 285.62 ( -67.46%)
Min fork_test 3213.33 ( 0.00%) 2154.26 ( -32.96%) 2180.85 ( -32.13%) 2214.10 ( -31.10%) 3257.83 ( 1.38%) 4154.46 ( 29.29%) 2194.15 ( -31.72%)
Hmean page_test 544508.39 ( 0.00%) 545446.23 ( 0.17%) 542617.62 ( -0.35%) 546829.87 ( 0.43%) 546439.04 ( 0.35%) 541806.49 ( -0.50%) 546895.25 ( 0.44%)
Hmean brk_test 2054683.48 ( 0.00%) 2061982.39 ( 0.36%) 2029765.65 * -1.21%* 2031996.84 * -1.10%* 2040844.18 ( -0.67%) 2009345.37 * -2.21%* 2063861.59 ( 0.45%)
Hmean exec_test 896.88 ( 0.00%) 284.71 * -68.26%* 285.65 * -68.15%* 285.45 * -68.17%* 902.85 ( 0.67%) 943.16 * 5.16%* 286.26 * -68.08%*
Hmean fork_test 3394.50 ( 0.00%) 2200.37 * -35.18%* 2243.49 * -33.91%* 2244.58 * -33.88%* 3757.31 * 10.69%* 4228.87 * 24.58%* 2237.46 * -34.09%*
Stddev page_test 7358.98 ( 0.00%) 3713.10 ( 49.54%) 3177.20 ( 56.83%) 1988.97 ( 72.97%) 5174.09 ( 29.69%) 5755.98 ( 21.78%) 6270.80 ( 14.79%)
Stddev brk_test 21505.08 ( 0.00%) 20373.25 ( 5.26%) 9123.25 ( 57.58%) 8935.31 ( 58.45%) 26933.95 ( -25.24%) 11606.00 ( 46.03%) 20779.25 ( 3.38%)
Stddev exec_test 13.64 ( 0.00%) 0.36 ( 97.37%) 0.34 ( 97.49%) 0.22 ( 98.38%) 30.95 (-126.92%) 8.43 ( 38.22%) 0.48 ( 96.52%)
Stddev fork_test 115.45 ( 0.00%) 37.57 ( 67.46%) 37.22 ( 67.76%) 22.53 ( 80.49%) 274.45 (-137.72%) 32.78 ( 71.61%) 24.24 ( 79.01%)
CoeffVar page_test 1.35 ( 0.00%) 0.68 ( 49.62%) 0.59 ( 56.67%) 0.36 ( 73.08%) 0.95 ( 29.93%) 1.06 ( 21.39%) 1.15 ( 15.15%)
CoeffVar brk_test 1.05 ( 0.00%) 0.99 ( 5.60%) 0.45 ( 57.05%) 0.44 ( 57.98%) 1.32 ( -26.09%) 0.58 ( 44.81%) 1.01 ( 3.80%)
CoeffVar exec_test 1.52 ( 0.00%) 0.13 ( 91.71%) 0.12 ( 92.11%) 0.08 ( 94.92%) 3.42 (-125.23%) 0.89 ( 41.24%) 0.17 ( 89.08%)
CoeffVar fork_test 3.40 ( 0.00%) 1.71 ( 49.76%) 1.66 ( 51.19%) 1.00 ( 70.46%) 7.27 (-113.89%) 0.78 ( 77.19%) 1.08 ( 68.12%)
Max page_test 553633.33 ( 0.00%) 548986.67 ( -0.84%) 546355.76 ( -1.31%) 549666.67 ( -0.72%) 553746.67 ( 0.02%) 547286.67 ( -1.15%) 558620.00 ( 0.90%)
Max brk_test 2068087.94 ( 0.00%) 2081933.33 ( 0.67%) 2044533.33 ( -1.14%) 2045436.38 ( -1.10%) 2074000.00 ( 0.29%) 2027315.12 ( -1.97%) 2081933.33 ( 0.67%)
Max exec_test 927.33 ( 0.00%) 285.14 ( -69.25%) 286.28 ( -69.13%) 285.81 ( -69.18%) 951.00 ( 2.55%) 959.33 ( 3.45%) 287.14 ( -69.04%)
Max fork_test 3597.60 ( 0.00%) 2267.29 ( -36.98%) 2296.94 ( -36.15%) 2282.10 ( -36.57%) 4054.59 ( 12.70%) 4297.14 ( 19.44%) 2290.28 ( -36.34%)
BHmean-50 page_test 547854.63 ( 0.00%) 547923.82 ( 0.01%) 545184.27 ( -0.49%) 548296.84 ( 0.08%) 550707.83 ( 0.52%) 545502.70 ( -0.43%) 550981.79 ( 0.57%)
BHmean-50 brk_test 2063783.93 ( 0.00%) 2077311.93 ( 0.66%) 2036886.71 ( -1.30%) 2038740.90 ( -1.21%) 2066350.82 ( 0.12%) 2017773.76 ( -2.23%) 2078929.15 ( 0.73%)
BHmean-50 exec_test 906.22 ( 0.00%) 285.04 ( -68.55%) 285.94 ( -68.45%) 285.63 ( -68.48%) 928.41 ( 2.45%) 949.48 ( 4.77%) 286.65 ( -68.37%)
BHmean-50 fork_test 3485.94 ( 0.00%) 2230.56 ( -36.01%) 2273.16 ( -34.79%) 2263.22 ( -35.08%) 3973.97 ( 14.00%) 4249.44 ( 21.90%) 2254.13 ( -35.34%)
BHmean-95 page_test 546593.48 ( 0.00%) 546144.64 ( -0.08%) 543148.84 ( -0.63%) 547245.43 ( 0.12%) 547220.00 ( 0.11%) 543226.76 ( -0.62%) 548237.46 ( 0.30%)
BHmean-95 brk_test 2060981.15 ( 0.00%) 2065065.44 ( 0.20%) 2031007.95 ( -1.45%) 2033569.50 ( -1.33%) 2044198.19 ( -0.81%) 2011638.04 ( -2.39%) 2067443.29 ( 0.31%)
BHmean-95 exec_test 898.66 ( 0.00%) 284.74 ( -68.31%) 285.70 ( -68.21%) 285.48 ( -68.23%) 907.76 ( 1.01%) 944.19 ( 5.07%) 286.32 ( -68.14%)
BHmean-95 fork_test 3411.98 ( 0.00%) 2204.66 ( -35.38%) 2249.37 ( -34.07%) 2247.40 ( -34.13%) 3810.42 ( 11.68%) 4235.77 ( 24.14%) 2241.48 ( -34.31%)
BHmean-99 page_test 546593.48 ( 0.00%) 546144.64 ( -0.08%) 543148.84 ( -0.63%) 547245.43 ( 0.12%) 547220.00 ( 0.11%) 543226.76 ( -0.62%) 548237.46 ( 0.30%)
BHmean-99 brk_test 2060981.15 ( 0.00%) 2065065.44 ( 0.20%) 2031007.95 ( -1.45%) 2033569.50 ( -1.33%) 2044198.19 ( -0.81%) 2011638.04 ( -2.39%) 2067443.29 ( 0.31%)
BHmean-99 exec_test 898.66 ( 0.00%) 284.74 ( -68.31%) 285.70 ( -68.21%) 285.48 ( -68.23%) 907.76 ( 1.01%) 944.19 ( 5.07%) 286.32 ( -68.14%)
BHmean-99 fork_test 3411.98 ( 0.00%) 2204.66 ( -35.38%) 2249.37 ( -34.07%) 2247.40 ( -34.13%) 3810.42 ( 11.68%) 4235.77 ( 24.14%) 2241.48 ( -34.31%)

workload-shellscripts
---------------------

This is the git test suite. It's mostly sequential and executes lots of
small, short-lived tasks.

Comparison
==========
initial initial last penup last penup first
good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
Min User 798.92 ( 0.00%) 897.51 ( -12.34%) 894.39 ( -11.95%) 896.18 ( -12.17%) 799.34 ( -0.05%) 796.94 ( 0.25%) 894.75 ( -11.99%)
Min System 479.24 ( 0.00%) 603.24 ( -25.87%) 596.36 ( -24.44%) 597.34 ( -24.64%) 484.00 ( -0.99%) 482.64 ( -0.71%) 599.17 ( -25.03%)
Min Elapsed 1225.72 ( 0.00%) 1443.47 ( -17.77%) 1434.25 ( -17.01%) 1434.08 ( -17.00%) 1230.97 ( -0.43%) 1226.47 ( -0.06%) 1436.45 ( -17.19%)
Min CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
Amean User 799.84 ( 0.00%) 899.00 * -12.40%* 896.15 * -12.04%* 897.33 * -12.19%* 800.47 ( -0.08%) 797.96 * 0.23%* 896.35 * -12.07%*
Amean System 480.68 ( 0.00%) 605.14 * -25.89%* 598.59 * -24.53%* 599.63 * -24.75%* 485.92 * -1.09%* 483.51 * -0.59%* 600.67 * -24.96%*
Amean Elapsed 1226.35 ( 0.00%) 1444.06 * -17.75%* 1434.57 * -16.98%* 1436.62 * -17.15%* 1231.60 * -0.43%* 1228.06 ( -0.14%) 1436.65 * -17.15%*
Amean CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
Stddev User 0.79 ( 0.00%) 1.20 ( -51.32%) 1.32 ( -65.89%) 1.07 ( -35.47%) 1.12 ( -40.77%) 1.08 ( -36.32%) 1.38 ( -73.53%)
Stddev System 0.90 ( 0.00%) 1.26 ( -40.06%) 1.73 ( -91.63%) 1.50 ( -66.48%) 1.08 ( -20.06%) 0.74 ( 17.38%) 1.04 ( -15.98%)
Stddev Elapsed 0.44 ( 0.00%) 0.53 ( -18.89%) 0.28 ( 36.46%) 1.77 (-298.99%) 0.53 ( -19.49%) 2.02 (-356.27%) 0.24 ( 46.45%)
Stddev CPU 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
CoeffVar User 0.10 ( 0.00%) 0.13 ( -34.63%) 0.15 ( -48.07%) 0.12 ( -20.75%) 0.14 ( -40.66%) 0.14 ( -36.64%) 0.15 ( -54.85%)
CoeffVar System 0.19 ( 0.00%) 0.21 ( -11.26%) 0.29 ( -53.88%) 0.25 ( -33.45%) 0.22 ( -18.76%) 0.15 ( 17.86%) 0.17 ( 7.19%)
CoeffVar Elapsed 0.04 ( 0.00%) 0.04 ( -0.96%) 0.02 ( 45.68%) 0.12 (-240.59%) 0.04 ( -18.98%) 0.16 (-355.64%) 0.02 ( 54.29%)
CoeffVar CPU 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Max User 801.05 ( 0.00%) 900.27 ( -12.39%) 897.95 ( -12.10%) 899.06 ( -12.24%) 802.29 ( -0.15%) 799.47 ( 0.20%) 898.39 ( -12.15%)
Max System 481.60 ( 0.00%) 606.40 ( -25.91%) 600.94 ( -24.78%) 601.51 ( -24.90%) 486.59 ( -1.04%) 484.23 ( -0.55%) 602.04 ( -25.01%)
Max Elapsed 1226.94 ( 0.00%) 1444.85 ( -17.76%) 1434.89 ( -16.95%) 1438.52 ( -17.24%) 1232.42 ( -0.45%) 1231.52 ( -0.37%) 1437.04 ( -17.12%)
Max CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
BAmean-50 User 799.16 ( 0.00%) 897.73 ( -12.33%) 895.06 ( -12.00%) 896.51 ( -12.18%) 799.68 ( -0.07%) 797.09 ( 0.26%) 895.20 ( -12.02%)
BAmean-50 System 479.89 ( 0.00%) 603.93 ( -25.85%) 597.00 ( -24.40%) 598.39 ( -24.69%) 485.10 ( -1.09%) 482.75 ( -0.60%) 599.83 ( -24.99%)
BAmean-50 Elapsed 1225.99 ( 0.00%) 1443.59 ( -17.75%) 1434.34 ( -16.99%) 1434.97 ( -17.05%) 1231.20 ( -0.42%) 1226.66 ( -0.05%) 1436.47 ( -17.17%)
BAmean-50 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
BAmean-95 User 799.53 ( 0.00%) 898.68 ( -12.40%) 895.69 ( -12.03%) 896.90 ( -12.18%) 800.01 ( -0.06%) 797.58 ( 0.24%) 895.85 ( -12.05%)
BAmean-95 System 480.45 ( 0.00%) 604.82 ( -25.89%) 598.01 ( -24.47%) 599.16 ( -24.71%) 485.75 ( -1.10%) 483.33 ( -0.60%) 600.33 ( -24.95%)
BAmean-95 Elapsed 1226.21 ( 0.00%) 1443.86 ( -17.75%) 1434.49 ( -16.99%) 1436.15 ( -17.12%) 1231.40 ( -0.42%) 1227.20 ( -0.08%) 1436.55 ( -17.15%)
BAmean-95 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
BAmean-99 User 799.53 ( 0.00%) 898.68 ( -12.40%) 895.69 ( -12.03%) 896.90 ( -12.18%) 800.01 ( -0.06%) 797.58 ( 0.24%) 895.85 ( -12.05%)
BAmean-99 System 480.45 ( 0.00%) 604.82 ( -25.89%) 598.01 ( -24.47%) 599.16 ( -24.71%) 485.75 ( -1.10%) 483.33 ( -0.60%) 600.33 ( -24.95%)
BAmean-99 Elapsed 1226.21 ( 0.00%) 1443.86 ( -17.75%) 1434.49 ( -16.99%) 1436.15 ( -17.12%) 1231.40 ( -0.42%) 1227.20 ( -0.08%) 1436.55 ( -17.15%)
BAmean-99 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)

This is showing a 17% regression in the time to complete the test.

workload-kerndevel
------------------

This is a kernel building benchmark varying the number of subjobs with
-J

initial initial last penup last penup first
good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
Amean syst-2 138.51 ( 0.00%) 169.35 * -22.26%* 170.13 * -22.83%* 169.12 * -22.09%* 136.47 * 1.47%* 137.73 ( 0.57%) 169.24 * -22.18%*
Amean elsp-2 489.41 ( 0.00%) 542.92 * -10.93%* 548.96 * -12.17%* 544.82 * -11.32%* 485.33 * 0.83%* 487.26 ( 0.44%) 542.35 * -10.82%*
Amean syst-4 148.11 ( 0.00%) 171.27 * -15.63%* 171.14 * -15.55%* 170.82 * -15.33%* 146.13 * 1.34%* 146.38 * 1.17%* 170.52 * -15.13%*
Amean elsp-4 266.90 ( 0.00%) 285.40 * -6.93%* 286.50 * -7.34%* 285.14 * -6.83%* 263.71 * 1.20%* 264.76 * 0.80%* 285.88 * -7.11%*
Amean syst-8 158.64 ( 0.00%) 167.19 * -5.39%* 166.95 * -5.24%* 165.54 * -4.35%* 157.12 * 0.96%* 157.69 * 0.60%* 166.78 * -5.13%*
Amean elsp-8 148.42 ( 0.00%) 151.32 * -1.95%* 154.00 * -3.76%* 151.64 * -2.17%* 147.79 ( 0.42%) 148.90 ( -0.32%) 152.56 * -2.79%*
Amean syst-16 165.21 ( 0.00%) 166.41 * -0.73%* 166.96 * -1.06%* 166.17 ( -0.58%) 164.32 * 0.54%* 164.05 * 0.70%* 165.80 ( -0.36%)
Amean elsp-16 83.17 ( 0.00%) 83.23 ( -0.07%) 83.75 ( -0.69%) 83.48 ( -0.37%) 83.08 ( 0.12%) 83.07 ( 0.12%) 83.27 ( -0.12%)
Amean syst-32 164.42 ( 0.00%) 164.43 ( -0.00%) 164.43 ( -0.00%) 163.40 * 0.62%* 163.38 * 0.63%* 163.18 * 0.76%* 163.70 * 0.44%*
Amean elsp-32 47.81 ( 0.00%) 48.83 * -2.14%* 48.64 * -1.73%* 48.46 ( -1.36%) 48.28 ( -0.97%) 48.16 ( -0.74%) 48.15 ( -0.71%)
Amean syst-64 189.79 ( 0.00%) 192.63 * -1.50%* 191.19 * -0.74%* 190.86 * -0.56%* 189.08 ( 0.38%) 188.52 * 0.67%* 190.52 ( -0.39%)
Amean elsp-64 35.49 ( 0.00%) 35.89 ( -1.13%) 36.39 * -2.51%* 35.93 ( -1.23%) 34.69 * 2.28%* 35.52 ( -0.06%) 35.60 ( -0.30%)
Amean syst-128 200.15 ( 0.00%) 202.72 * -1.28%* 202.34 * -1.09%* 200.98 * -0.41%* 200.56 ( -0.20%) 198.12 * 1.02%* 201.01 * -0.43%*
Amean elsp-128 34.34 ( 0.00%) 34.99 * -1.89%* 34.92 * -1.68%* 34.90 * -1.61%* 34.51 * -0.50%* 34.37 ( -0.08%) 35.02 * -1.98%*
Amean syst-160 197.14 ( 0.00%) 199.39 * -1.14%* 198.76 * -0.82%* 197.71 ( -0.29%) 196.62 ( 0.26%) 195.55 * 0.81%* 197.06 ( 0.04%)
Amean elsp-160 34.51 ( 0.00%) 35.15 * -1.87%* 35.14 * -1.83%* 35.06 * -1.61%* 34.29 * 0.63%* 34.43 ( 0.23%) 35.10 * -1.73%*

This is showing a 10.93% regression in elapsed time with just two jobs
(elsp-2). The regression goes away when there are a larger number of jobs,
so it's short-lived tasks on a mostly idle machine that are the problem.
Interestingly, it also shows a lot of additional system CPU time
(syst-2), so it's probable that the issue can be inferred from perf, with
a perf diff showing where all the extra time is being lost.
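
For example, something along these lines (a sketch, not the exact
commands used by mmtests; <workload> stands in for the benchmark):

# on the last good commit
perf record -o perf.data.old -- <workload>
# on the first bad commit
perf record -o perf.data -- <workload>
# compare where the cycles moved between the two profiles
perf diff perf.data.old perf.data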

While I say workload-kerndevel, I was actually using a modified version of
the configuration that tested ext4 and xfs on test partitions. Both show
regressions, so this is not filesystem-specific (without the modification,
the base filesystem would be btrfs on my test grid, which some would find
less interesting).

scheduler-unbound-hackbench
---------------------------

I don't think hackbench needs an introduction. It varies the number
of groups, but as each group has lots of tasks, the machine is heavily
loaded.

initial initial last penup last penup first
good-v5.8 bad-22fbc037cd32 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
Min 1 0.6470 ( 0.00%) 0.5200 ( 19.63%) 0.6730 ( -4.02%) 0.5230 ( 19.17%) 0.6620 ( -2.32%) 0.6740 ( -4.17%) 0.6170 ( 4.64%)
Min 4 0.7510 ( 0.00%) 0.7460 ( 0.67%) 0.7230 ( 3.73%) 0.7450 ( 0.80%) 0.7540 ( -0.40%) 0.7490 ( 0.27%) 0.7520 ( -0.13%)
Min 7 0.8140 ( 0.00%) 0.8300 ( -1.97%) 0.7880 ( 3.19%) 0.7880 ( 3.19%) 0.7870 ( 3.32%) 0.8170 ( -0.37%) 0.7990 ( 1.84%)
Min 12 0.9500 ( 0.00%) 0.9140 ( 3.79%) 0.9070 ( 4.53%) 0.9200 ( 3.16%) 0.9290 ( 2.21%) 0.9180 ( 3.37%) 0.9070 ( 4.53%)
Min 21 1.2210 ( 0.00%) 1.1560 ( 5.32%) 1.1230 ( 8.03%) 1.1480 ( 5.98%) 1.2730 ( -4.26%) 1.2010 ( 1.64%) 1.1610 ( 4.91%)
Min 30 1.6500 ( 0.00%) 1.5960 ( 3.27%) 1.5010 ( 9.03%) 1.5620 ( 5.33%) 1.6130 ( 2.24%) 1.5920 ( 3.52%) 1.5540 ( 5.82%)
Min 48 2.2550 ( 0.00%) 2.2610 ( -0.27%) 2.2040 ( 2.26%) 2.1940 ( 2.71%) 2.1090 ( 6.47%) 2.0910 ( 7.27%) 2.1300 ( 5.54%)
Min 79 2.9090 ( 0.00%) 3.3210 ( -14.16%) 3.2140 ( -10.48%) 3.1310 ( -7.63%) 2.8970 ( 0.41%) 3.0400 ( -4.50%) 3.2590 ( -12.03%)
Min 110 3.5080 ( 0.00%) 4.2600 ( -21.44%) 4.0060 ( -14.20%) 4.1110 ( -17.19%) 3.6680 ( -4.56%) 3.5370 ( -0.83%) 4.0550 ( -15.59%)
Min 141 4.1840 ( 0.00%) 4.8090 ( -14.94%) 4.9600 ( -18.55%) 4.7310 ( -13.07%) 4.3650 ( -4.33%) 4.2590 ( -1.79%) 4.8320 ( -15.49%)
Min 172 5.2690 ( 0.00%) 5.6350 ( -6.95%) 5.6140 ( -6.55%) 5.5550 ( -5.43%) 5.0390 ( 4.37%) 5.0940 ( 3.32%) 5.8190 ( -10.44%)
Amean 1 0.6867 ( 0.00%) 0.6470 ( 5.78%) 0.6830 ( 0.53%) 0.5993 ( 12.72%) 0.6857 ( 0.15%) 0.6897 ( -0.44%) 0.6600 ( 3.88%)
Amean 4 0.7603 ( 0.00%) 0.7477 ( 1.67%) 0.7413 ( 2.50%) 0.7517 ( 1.14%) 0.7667 ( -0.83%) 0.7557 ( 0.61%) 0.7583 ( 0.26%)
Amean 7 0.8377 ( 0.00%) 0.8347 ( 0.36%) 0.8333 ( 0.52%) 0.7997 * 4.54%* 0.8160 ( 2.59%) 0.8323 ( 0.64%) 0.8183 ( 2.31%)
Amean 12 0.9653 ( 0.00%) 0.9390 ( 2.73%) 0.9173 * 4.97%* 0.9357 ( 3.07%) 0.9460 ( 2.00%) 0.9383 ( 2.80%) 0.9230 * 4.39%*
Amean 21 1.2400 ( 0.00%) 1.2260 ( 1.13%) 1.1733 ( 5.38%) 1.1833 * 4.57%* 1.2893 * -3.98%* 1.2620 ( -1.77%) 1.2043 ( 2.88%)
Amean 30 1.6743 ( 0.00%) 1.6393 ( 2.09%) 1.5530 * 7.25%* 1.6070 ( 4.02%) 1.6293 ( 2.69%) 1.6167 * 3.44%* 1.6280 ( 2.77%)
Amean 48 2.2760 ( 0.00%) 2.2987 ( -1.00%) 2.2257 * 2.21%* 2.2423 ( 1.48%) 2.1843 * 4.03%* 2.1687 * 4.72%* 2.1890 * 3.82%*
Amean 79 3.0977 ( 0.00%) 3.3847 * -9.27%* 3.2540 ( -5.05%) 3.2367 ( -4.49%) 3.0067 ( 2.94%) 3.1263 ( -0.93%) 3.2983 ( -6.48%)
Amean 110 3.6460 ( 0.00%) 4.3140 * -18.32%* 4.1720 * -14.43%* 4.1980 * -15.14%* 3.7230 ( -2.11%) 3.6990 ( -1.45%) 4.1790 * -14.62%*
Amean 141 4.2420 ( 0.00%) 4.9697 * -17.15%* 4.9973 * -17.81%* 4.8940 * -15.37%* 4.4057 * -3.86%* 4.3493 ( -2.53%) 4.9610 * -16.95%*
Amean 172 5.2830 ( 0.00%) 5.8717 * -11.14%* 5.7370 * -8.59%* 5.8280 * -10.32%* 5.0943 * 3.57%* 5.1083 * 3.31%* 5.9720 * -13.04%*

This is showing a 12% regression for 79 groups. This one is interesting
because EPYC 2 seems to be the machine most affected by this. EPYC 2
has a different topology than a similar x86 machine in that it has
multiple last-level caches, so the sched domain groups it looks at are
smaller than on other machines.

Given the number of machines and workloads affected, can we revert and
retry? I have not tested against current mainline as scheduler details
have changed again but I can do so if desired.

--
Mel Gorman
SUSE Labs

2020-11-02 11:08:48

by Vincent Guittot

Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, 2 Nov 2020 at 11:50, Mel Gorman <[email protected]> wrote:
>
> On Tue, Jul 14, 2020 at 08:59:41AM -0400, [email protected] wrote:
> > From: Peter Puhov <[email protected]>
> >
> > [ ... patch changelog snipped ... ]
> >
>
> [ ... regression summary and test list snipped; see Mel's mail above ... ]
>
> I have not investigated why because I do not have the bandwidth
> to do a detailed study (I was off for a few days and my backlog is
> severe). However, I recommend that before v5.10 this be reverted and
> retried. If I'm cc'd on v2, I'll run the same tests through the grid
> and see what falls out.

I'm going to have a look at the regressions and see if patches that
have been queued for v5.10, or even more recent patches, can help, or if
the patch should be adjusted.

> [ ... full benchmark tables snipped; see Mel's mail above ... ]

2020-11-02 14:48:37

by Phil Auld

Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

Hi,

On Mon, Nov 02, 2020 at 12:06:21PM +0100 Vincent Guittot wrote:
> On Mon, 2 Nov 2020 at 11:50, Mel Gorman <[email protected]> wrote:
> >
> > On Tue, Jul 14, 2020 at 08:59:41AM -0400, [email protected] wrote:
> > > From: Peter Puhov <[email protected]>
> > >
> > > [ ... patch changelog snipped ... ]
> > >
> >
> > [ ... Mel's regression report snipped; see his mail above ... ]
>
> I'm going to have a look at the regressions and see if patches that
> have been queued for v5.10, or even more recent patches, can help, or if
> the patch should be adjusted.
>

Fwiw, we have pulled this in, along with some of the 5.10-rc1 fixes in this
area and in the load balancing code.

We found some load balancing improvements and some minor overall perf
gains in a few places, but generally did not see any difference compared
to before the commit mentioned here.

I'm wondering, Mel, if you have compared against 5.10-rc1?

We don't have everything, though, so it's possible something we have
not pulled back is interacting with this patch, or we are missing something
in our testing, or it's better with the later fixes in 5.10 or ...
something else :)

I'll try to see if we can run some direct 5.8 - 5.9 tests like these.

Cheers,
Phil

> >
> > I'll show one example of each workload from one machine.
> >
> > pagealloc-performance-aim9
> > --------------------------
> >
> > While multiple tests are shown, the exec_test and fork_test are
> > regressing and these are very short lived. 67% regression for exec_test
> > and 32% regression for fork_test
> >
> > initial initial last penup last penup first
> > good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
> > Min page_test 522580.00 ( 0.00%) 537880.00 ( 2.93%) 536842.11 ( 2.73%) 542300.00 ( 3.77%) 537993.33 ( 2.95%) 526660.00 ( 0.78%) 532553.33 ( 1.91%)
> > Min brk_test 1987866.67 ( 0.00%) 2028666.67 ( 2.05%) 2016200.00 ( 1.43%) 2014856.76 ( 1.36%) 2004663.56 ( 0.84%) 1984466.67 ( -0.17%) 2025266.67 ( 1.88%)
> > Min exec_test 877.75 ( 0.00%) 284.33 ( -67.61%) 285.14 ( -67.51%) 285.14 ( -67.51%) 852.10 ( -2.92%) 932.05 ( 6.19%) 285.62 ( -67.46%)
> > Min fork_test 3213.33 ( 0.00%) 2154.26 ( -32.96%) 2180.85 ( -32.13%) 2214.10 ( -31.10%) 3257.83 ( 1.38%) 4154.46 ( 29.29%) 2194.15 ( -31.72%)
> > Hmean page_test 544508.39 ( 0.00%) 545446.23 ( 0.17%) 542617.62 ( -0.35%) 546829.87 ( 0.43%) 546439.04 ( 0.35%) 541806.49 ( -0.50%) 546895.25 ( 0.44%)
> > Hmean brk_test 2054683.48 ( 0.00%) 2061982.39 ( 0.36%) 2029765.65 * -1.21%* 2031996.84 * -1.10%* 2040844.18 ( -0.67%) 2009345.37 * -2.21%* 2063861.59 ( 0.45%)
> > Hmean exec_test 896.88 ( 0.00%) 284.71 * -68.26%* 285.65 * -68.15%* 285.45 * -68.17%* 902.85 ( 0.67%) 943.16 * 5.16%* 286.26 * -68.08%*
> > Hmean fork_test 3394.50 ( 0.00%) 2200.37 * -35.18%* 2243.49 * -33.91%* 2244.58 * -33.88%* 3757.31 * 10.69%* 4228.87 * 24.58%* 2237.46 * -34.09%*
> > Stddev page_test 7358.98 ( 0.00%) 3713.10 ( 49.54%) 3177.20 ( 56.83%) 1988.97 ( 72.97%) 5174.09 ( 29.69%) 5755.98 ( 21.78%) 6270.80 ( 14.79%)
> > Stddev brk_test 21505.08 ( 0.00%) 20373.25 ( 5.26%) 9123.25 ( 57.58%) 8935.31 ( 58.45%) 26933.95 ( -25.24%) 11606.00 ( 46.03%) 20779.25 ( 3.38%)
> > Stddev exec_test 13.64 ( 0.00%) 0.36 ( 97.37%) 0.34 ( 97.49%) 0.22 ( 98.38%) 30.95 (-126.92%) 8.43 ( 38.22%) 0.48 ( 96.52%)
> > Stddev fork_test 115.45 ( 0.00%) 37.57 ( 67.46%) 37.22 ( 67.76%) 22.53 ( 80.49%) 274.45 (-137.72%) 32.78 ( 71.61%) 24.24 ( 79.01%)
> > CoeffVar page_test 1.35 ( 0.00%) 0.68 ( 49.62%) 0.59 ( 56.67%) 0.36 ( 73.08%) 0.95 ( 29.93%) 1.06 ( 21.39%) 1.15 ( 15.15%)
> > CoeffVar brk_test 1.05 ( 0.00%) 0.99 ( 5.60%) 0.45 ( 57.05%) 0.44 ( 57.98%) 1.32 ( -26.09%) 0.58 ( 44.81%) 1.01 ( 3.80%)
> > CoeffVar exec_test 1.52 ( 0.00%) 0.13 ( 91.71%) 0.12 ( 92.11%) 0.08 ( 94.92%) 3.42 (-125.23%) 0.89 ( 41.24%) 0.17 ( 89.08%)
> > CoeffVar fork_test 3.40 ( 0.00%) 1.71 ( 49.76%) 1.66 ( 51.19%) 1.00 ( 70.46%) 7.27 (-113.89%) 0.78 ( 77.19%) 1.08 ( 68.12%)
> > Max page_test 553633.33 ( 0.00%) 548986.67 ( -0.84%) 546355.76 ( -1.31%) 549666.67 ( -0.72%) 553746.67 ( 0.02%) 547286.67 ( -1.15%) 558620.00 ( 0.90%)
> > Max brk_test 2068087.94 ( 0.00%) 2081933.33 ( 0.67%) 2044533.33 ( -1.14%) 2045436.38 ( -1.10%) 2074000.00 ( 0.29%) 2027315.12 ( -1.97%) 2081933.33 ( 0.67%)
> > Max exec_test 927.33 ( 0.00%) 285.14 ( -69.25%) 286.28 ( -69.13%) 285.81 ( -69.18%) 951.00 ( 2.55%) 959.33 ( 3.45%) 287.14 ( -69.04%)
> > Max fork_test 3597.60 ( 0.00%) 2267.29 ( -36.98%) 2296.94 ( -36.15%) 2282.10 ( -36.57%) 4054.59 ( 12.70%) 4297.14 ( 19.44%) 2290.28 ( -36.34%)
> > BHmean-50 page_test 547854.63 ( 0.00%) 547923.82 ( 0.01%) 545184.27 ( -0.49%) 548296.84 ( 0.08%) 550707.83 ( 0.52%) 545502.70 ( -0.43%) 550981.79 ( 0.57%)
> > BHmean-50 brk_test 2063783.93 ( 0.00%) 2077311.93 ( 0.66%) 2036886.71 ( -1.30%) 2038740.90 ( -1.21%) 2066350.82 ( 0.12%) 2017773.76 ( -2.23%) 2078929.15 ( 0.73%)
> > BHmean-50 exec_test 906.22 ( 0.00%) 285.04 ( -68.55%) 285.94 ( -68.45%) 285.63 ( -68.48%) 928.41 ( 2.45%) 949.48 ( 4.77%) 286.65 ( -68.37%)
> > BHmean-50 fork_test 3485.94 ( 0.00%) 2230.56 ( -36.01%) 2273.16 ( -34.79%) 2263.22 ( -35.08%) 3973.97 ( 14.00%) 4249.44 ( 21.90%) 2254.13 ( -35.34%)
> > BHmean-95 page_test 546593.48 ( 0.00%) 546144.64 ( -0.08%) 543148.84 ( -0.63%) 547245.43 ( 0.12%) 547220.00 ( 0.11%) 543226.76 ( -0.62%) 548237.46 ( 0.30%)
> > BHmean-95 brk_test 2060981.15 ( 0.00%) 2065065.44 ( 0.20%) 2031007.95 ( -1.45%) 2033569.50 ( -1.33%) 2044198.19 ( -0.81%) 2011638.04 ( -2.39%) 2067443.29 ( 0.31%)
> > BHmean-95 exec_test 898.66 ( 0.00%) 284.74 ( -68.31%) 285.70 ( -68.21%) 285.48 ( -68.23%) 907.76 ( 1.01%) 944.19 ( 5.07%) 286.32 ( -68.14%)
> > BHmean-95 fork_test 3411.98 ( 0.00%) 2204.66 ( -35.38%) 2249.37 ( -34.07%) 2247.40 ( -34.13%) 3810.42 ( 11.68%) 4235.77 ( 24.14%) 2241.48 ( -34.31%)
> > BHmean-99 page_test 546593.48 ( 0.00%) 546144.64 ( -0.08%) 543148.84 ( -0.63%) 547245.43 ( 0.12%) 547220.00 ( 0.11%) 543226.76 ( -0.62%) 548237.46 ( 0.30%)
> > BHmean-99 brk_test 2060981.15 ( 0.00%) 2065065.44 ( 0.20%) 2031007.95 ( -1.45%) 2033569.50 ( -1.33%) 2044198.19 ( -0.81%) 2011638.04 ( -2.39%) 2067443.29 ( 0.31%)
> > BHmean-99 exec_test 898.66 ( 0.00%) 284.74 ( -68.31%) 285.70 ( -68.21%) 285.48 ( -68.23%) 907.76 ( 1.01%) 944.19 ( 5.07%) 286.32 ( -68.14%)
> > BHmean-99 fork_test 3411.98 ( 0.00%) 2204.66 ( -35.38%) 2249.37 ( -34.07%) 2247.40 ( -34.13%) 3810.42 ( 11.68%) 4235.77 ( 24.14%) 2241.48 ( -34.31%)
> >
> > workload-shellscripts
> > ---------------------
> >
> > This is the git test suite. It's mostly sequential that executes lots of
> > small short-lived tasks
> >
> > Comparison
> > ==========
> > initial initial last penup last penup first
> > good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
> > Min User 798.92 ( 0.00%) 897.51 ( -12.34%) 894.39 ( -11.95%) 896.18 ( -12.17%) 799.34 ( -0.05%) 796.94 ( 0.25%) 894.75 ( -11.99%)
> > Min System 479.24 ( 0.00%) 603.24 ( -25.87%) 596.36 ( -24.44%) 597.34 ( -24.64%) 484.00 ( -0.99%) 482.64 ( -0.71%) 599.17 ( -25.03%)
> > Min Elapsed 1225.72 ( 0.00%) 1443.47 ( -17.77%) 1434.25 ( -17.01%) 1434.08 ( -17.00%) 1230.97 ( -0.43%) 1226.47 ( -0.06%) 1436.45 ( -17.19%)
> > Min CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> > Amean User 799.84 ( 0.00%) 899.00 * -12.40%* 896.15 * -12.04%* 897.33 * -12.19%* 800.47 ( -0.08%) 797.96 * 0.23%* 896.35 * -12.07%*
> > Amean System 480.68 ( 0.00%) 605.14 * -25.89%* 598.59 * -24.53%* 599.63 * -24.75%* 485.92 * -1.09%* 483.51 * -0.59%* 600.67 * -24.96%*
> > Amean Elapsed 1226.35 ( 0.00%) 1444.06 * -17.75%* 1434.57 * -16.98%* 1436.62 * -17.15%* 1231.60 * -0.43%* 1228.06 ( -0.14%) 1436.65 * -17.15%*
> > Amean CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> > Stddev User 0.79 ( 0.00%) 1.20 ( -51.32%) 1.32 ( -65.89%) 1.07 ( -35.47%) 1.12 ( -40.77%) 1.08 ( -36.32%) 1.38 ( -73.53%)
> > Stddev System 0.90 ( 0.00%) 1.26 ( -40.06%) 1.73 ( -91.63%) 1.50 ( -66.48%) 1.08 ( -20.06%) 0.74 ( 17.38%) 1.04 ( -15.98%)
> > Stddev Elapsed 0.44 ( 0.00%) 0.53 ( -18.89%) 0.28 ( 36.46%) 1.77 (-298.99%) 0.53 ( -19.49%) 2.02 (-356.27%) 0.24 ( 46.45%)
> > Stddev CPU 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > CoeffVar User 0.10 ( 0.00%) 0.13 ( -34.63%) 0.15 ( -48.07%) 0.12 ( -20.75%) 0.14 ( -40.66%) 0.14 ( -36.64%) 0.15 ( -54.85%)
> > CoeffVar System 0.19 ( 0.00%) 0.21 ( -11.26%) 0.29 ( -53.88%) 0.25 ( -33.45%) 0.22 ( -18.76%) 0.15 ( 17.86%) 0.17 ( 7.19%)
> > CoeffVar Elapsed 0.04 ( 0.00%) 0.04 ( -0.96%) 0.02 ( 45.68%) 0.12 (-240.59%) 0.04 ( -18.98%) 0.16 (-355.64%) 0.02 ( 54.29%)
> > CoeffVar CPU 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > Max User 801.05 ( 0.00%) 900.27 ( -12.39%) 897.95 ( -12.10%) 899.06 ( -12.24%) 802.29 ( -0.15%) 799.47 ( 0.20%) 898.39 ( -12.15%)
> > Max System 481.60 ( 0.00%) 606.40 ( -25.91%) 600.94 ( -24.78%) 601.51 ( -24.90%) 486.59 ( -1.04%) 484.23 ( -0.55%) 602.04 ( -25.01%)
> > Max Elapsed 1226.94 ( 0.00%) 1444.85 ( -17.76%) 1434.89 ( -16.95%) 1438.52 ( -17.24%) 1232.42 ( -0.45%) 1231.52 ( -0.37%) 1437.04 ( -17.12%)
> > Max CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> > BAmean-50 User 799.16 ( 0.00%) 897.73 ( -12.33%) 895.06 ( -12.00%) 896.51 ( -12.18%) 799.68 ( -0.07%) 797.09 ( 0.26%) 895.20 ( -12.02%)
> > BAmean-50 System 479.89 ( 0.00%) 603.93 ( -25.85%) 597.00 ( -24.40%) 598.39 ( -24.69%) 485.10 ( -1.09%) 482.75 ( -0.60%) 599.83 ( -24.99%)
> > BAmean-50 Elapsed 1225.99 ( 0.00%) 1443.59 ( -17.75%) 1434.34 ( -16.99%) 1434.97 ( -17.05%) 1231.20 ( -0.42%) 1226.66 ( -0.05%) 1436.47 ( -17.17%)
> > BAmean-50 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> > BAmean-95 User 799.53 ( 0.00%) 898.68 ( -12.40%) 895.69 ( -12.03%) 896.90 ( -12.18%) 800.01 ( -0.06%) 797.58 ( 0.24%) 895.85 ( -12.05%)
> > BAmean-95 System 480.45 ( 0.00%) 604.82 ( -25.89%) 598.01 ( -24.47%) 599.16 ( -24.71%) 485.75 ( -1.10%) 483.33 ( -0.60%) 600.33 ( -24.95%)
> > BAmean-95 Elapsed 1226.21 ( 0.00%) 1443.86 ( -17.75%) 1434.49 ( -16.99%) 1436.15 ( -17.12%) 1231.40 ( -0.42%) 1227.20 ( -0.08%) 1436.55 ( -17.15%)
> > BAmean-95 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> > BAmean-99 User 799.53 ( 0.00%) 898.68 ( -12.40%) 895.69 ( -12.03%) 896.90 ( -12.18%) 800.01 ( -0.06%) 797.58 ( 0.24%) 895.85 ( -12.05%)
> > BAmean-99 System 480.45 ( 0.00%) 604.82 ( -25.89%) 598.01 ( -24.47%) 599.16 ( -24.71%) 485.75 ( -1.10%) 483.33 ( -0.60%) 600.33 ( -24.95%)
> > BAmean-99 Elapsed 1226.21 ( 0.00%) 1443.86 ( -17.75%) 1434.49 ( -16.99%) 1436.15 ( -17.12%) 1231.40 ( -0.42%) 1227.20 ( -0.08%) 1436.55 ( -17.15%)
> > BAmean-99 CPU 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%) 104.00 ( 0.00%)
> >
> > This is showing a 17% regression in the time to complete the test.
> >
> > workload-kerndevel
> > ------------------
> >
> > This is a kernel building benchmark varying the number of build jobs
> > with -j.
> >
> > initial initial last penup last penup first
> > good-v5.8 bad-v5.9 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
> > Amean syst-2 138.51 ( 0.00%) 169.35 * -22.26%* 170.13 * -22.83%* 169.12 * -22.09%* 136.47 * 1.47%* 137.73 ( 0.57%) 169.24 * -22.18%*
> > Amean elsp-2 489.41 ( 0.00%) 542.92 * -10.93%* 548.96 * -12.17%* 544.82 * -11.32%* 485.33 * 0.83%* 487.26 ( 0.44%) 542.35 * -10.82%*
> > Amean syst-4 148.11 ( 0.00%) 171.27 * -15.63%* 171.14 * -15.55%* 170.82 * -15.33%* 146.13 * 1.34%* 146.38 * 1.17%* 170.52 * -15.13%*
> > Amean elsp-4 266.90 ( 0.00%) 285.40 * -6.93%* 286.50 * -7.34%* 285.14 * -6.83%* 263.71 * 1.20%* 264.76 * 0.80%* 285.88 * -7.11%*
> > Amean syst-8 158.64 ( 0.00%) 167.19 * -5.39%* 166.95 * -5.24%* 165.54 * -4.35%* 157.12 * 0.96%* 157.69 * 0.60%* 166.78 * -5.13%*
> > Amean elsp-8 148.42 ( 0.00%) 151.32 * -1.95%* 154.00 * -3.76%* 151.64 * -2.17%* 147.79 ( 0.42%) 148.90 ( -0.32%) 152.56 * -2.79%*
> > Amean syst-16 165.21 ( 0.00%) 166.41 * -0.73%* 166.96 * -1.06%* 166.17 ( -0.58%) 164.32 * 0.54%* 164.05 * 0.70%* 165.80 ( -0.36%)
> > Amean elsp-16 83.17 ( 0.00%) 83.23 ( -0.07%) 83.75 ( -0.69%) 83.48 ( -0.37%) 83.08 ( 0.12%) 83.07 ( 0.12%) 83.27 ( -0.12%)
> > Amean syst-32 164.42 ( 0.00%) 164.43 ( -0.00%) 164.43 ( -0.00%) 163.40 * 0.62%* 163.38 * 0.63%* 163.18 * 0.76%* 163.70 * 0.44%*
> > Amean elsp-32 47.81 ( 0.00%) 48.83 * -2.14%* 48.64 * -1.73%* 48.46 ( -1.36%) 48.28 ( -0.97%) 48.16 ( -0.74%) 48.15 ( -0.71%)
> > Amean syst-64 189.79 ( 0.00%) 192.63 * -1.50%* 191.19 * -0.74%* 190.86 * -0.56%* 189.08 ( 0.38%) 188.52 * 0.67%* 190.52 ( -0.39%)
> > Amean elsp-64 35.49 ( 0.00%) 35.89 ( -1.13%) 36.39 * -2.51%* 35.93 ( -1.23%) 34.69 * 2.28%* 35.52 ( -0.06%) 35.60 ( -0.30%)
> > Amean syst-128 200.15 ( 0.00%) 202.72 * -1.28%* 202.34 * -1.09%* 200.98 * -0.41%* 200.56 ( -0.20%) 198.12 * 1.02%* 201.01 * -0.43%*
> > Amean elsp-128 34.34 ( 0.00%) 34.99 * -1.89%* 34.92 * -1.68%* 34.90 * -1.61%* 34.51 * -0.50%* 34.37 ( -0.08%) 35.02 * -1.98%*
> > Amean syst-160 197.14 ( 0.00%) 199.39 * -1.14%* 198.76 * -0.82%* 197.71 ( -0.29%) 196.62 ( 0.26%) 195.55 * 0.81%* 197.06 ( 0.04%)
> > Amean elsp-160 34.51 ( 0.00%) 35.15 * -1.87%* 35.14 * -1.83%* 35.06 * -1.61%* 34.29 * 0.63%* 34.43 ( 0.23%) 35.10 * -1.73%*
> >
> > This is showing a 10.93% regression in elapsed time with just two jobs
> > (elsp-2). The regression goes away when there is a larger number of jobs,
> > so it's short-lived tasks on a mostly idle machine that are the problem.
> > Interestingly, it also shows a lot of additional system CPU usage time
> > (syst-2), so it's probable that the issue can be inferred from perf with
> > a perf diff to show where all the extra time is being lost.
> >
> > While I say workload-kerndevel, I was actually using a modified version of
> > the configuration that tested ext4 and xfs on test partitions. Both show
> > regressions, so this is not filesystem-specific (without the modification,
> > the base filesystem would be btrfs on my test grid, which some would find
> > less interesting).
> >
> > scheduler-unbound-hackbench
> > ---------------------------
> >
> > I don't think hackbench needs an introduction. It varies the number of
> > groups, but as each group has lots of tasks, the machine is heavily
> > loaded.
> >
> > initial initial last penup last penup first
> > good-v5.8 bad-22fbc037cd32 bad-58934356 bad-e0078e2e good-46132e3a good-aa93cd53 bad-3edecfef
> > Min 1 0.6470 ( 0.00%) 0.5200 ( 19.63%) 0.6730 ( -4.02%) 0.5230 ( 19.17%) 0.6620 ( -2.32%) 0.6740 ( -4.17%) 0.6170 ( 4.64%)
> > Min 4 0.7510 ( 0.00%) 0.7460 ( 0.67%) 0.7230 ( 3.73%) 0.7450 ( 0.80%) 0.7540 ( -0.40%) 0.7490 ( 0.27%) 0.7520 ( -0.13%)
> > Min 7 0.8140 ( 0.00%) 0.8300 ( -1.97%) 0.7880 ( 3.19%) 0.7880 ( 3.19%) 0.7870 ( 3.32%) 0.8170 ( -0.37%) 0.7990 ( 1.84%)
> > Min 12 0.9500 ( 0.00%) 0.9140 ( 3.79%) 0.9070 ( 4.53%) 0.9200 ( 3.16%) 0.9290 ( 2.21%) 0.9180 ( 3.37%) 0.9070 ( 4.53%)
> > Min 21 1.2210 ( 0.00%) 1.1560 ( 5.32%) 1.1230 ( 8.03%) 1.1480 ( 5.98%) 1.2730 ( -4.26%) 1.2010 ( 1.64%) 1.1610 ( 4.91%)
> > Min 30 1.6500 ( 0.00%) 1.5960 ( 3.27%) 1.5010 ( 9.03%) 1.5620 ( 5.33%) 1.6130 ( 2.24%) 1.5920 ( 3.52%) 1.5540 ( 5.82%)
> > Min 48 2.2550 ( 0.00%) 2.2610 ( -0.27%) 2.2040 ( 2.26%) 2.1940 ( 2.71%) 2.1090 ( 6.47%) 2.0910 ( 7.27%) 2.1300 ( 5.54%)
> > Min 79 2.9090 ( 0.00%) 3.3210 ( -14.16%) 3.2140 ( -10.48%) 3.1310 ( -7.63%) 2.8970 ( 0.41%) 3.0400 ( -4.50%) 3.2590 ( -12.03%)
> > Min 110 3.5080 ( 0.00%) 4.2600 ( -21.44%) 4.0060 ( -14.20%) 4.1110 ( -17.19%) 3.6680 ( -4.56%) 3.5370 ( -0.83%) 4.0550 ( -15.59%)
> > Min 141 4.1840 ( 0.00%) 4.8090 ( -14.94%) 4.9600 ( -18.55%) 4.7310 ( -13.07%) 4.3650 ( -4.33%) 4.2590 ( -1.79%) 4.8320 ( -15.49%)
> > Min 172 5.2690 ( 0.00%) 5.6350 ( -6.95%) 5.6140 ( -6.55%) 5.5550 ( -5.43%) 5.0390 ( 4.37%) 5.0940 ( 3.32%) 5.8190 ( -10.44%)
> > Amean 1 0.6867 ( 0.00%) 0.6470 ( 5.78%) 0.6830 ( 0.53%) 0.5993 ( 12.72%) 0.6857 ( 0.15%) 0.6897 ( -0.44%) 0.6600 ( 3.88%)
> > Amean 4 0.7603 ( 0.00%) 0.7477 ( 1.67%) 0.7413 ( 2.50%) 0.7517 ( 1.14%) 0.7667 ( -0.83%) 0.7557 ( 0.61%) 0.7583 ( 0.26%)
> > Amean 7 0.8377 ( 0.00%) 0.8347 ( 0.36%) 0.8333 ( 0.52%) 0.7997 * 4.54%* 0.8160 ( 2.59%) 0.8323 ( 0.64%) 0.8183 ( 2.31%)
> > Amean 12 0.9653 ( 0.00%) 0.9390 ( 2.73%) 0.9173 * 4.97%* 0.9357 ( 3.07%) 0.9460 ( 2.00%) 0.9383 ( 2.80%) 0.9230 * 4.39%*
> > Amean 21 1.2400 ( 0.00%) 1.2260 ( 1.13%) 1.1733 ( 5.38%) 1.1833 * 4.57%* 1.2893 * -3.98%* 1.2620 ( -1.77%) 1.2043 ( 2.88%)
> > Amean 30 1.6743 ( 0.00%) 1.6393 ( 2.09%) 1.5530 * 7.25%* 1.6070 ( 4.02%) 1.6293 ( 2.69%) 1.6167 * 3.44%* 1.6280 ( 2.77%)
> > Amean 48 2.2760 ( 0.00%) 2.2987 ( -1.00%) 2.2257 * 2.21%* 2.2423 ( 1.48%) 2.1843 * 4.03%* 2.1687 * 4.72%* 2.1890 * 3.82%*
> > Amean 79 3.0977 ( 0.00%) 3.3847 * -9.27%* 3.2540 ( -5.05%) 3.2367 ( -4.49%) 3.0067 ( 2.94%) 3.1263 ( -0.93%) 3.2983 ( -6.48%)
> > Amean 110 3.6460 ( 0.00%) 4.3140 * -18.32%* 4.1720 * -14.43%* 4.1980 * -15.14%* 3.7230 ( -2.11%) 3.6990 ( -1.45%) 4.1790 * -14.62%*
> > Amean 141 4.2420 ( 0.00%) 4.9697 * -17.15%* 4.9973 * -17.81%* 4.8940 * -15.37%* 4.4057 * -3.86%* 4.3493 ( -2.53%) 4.9610 * -16.95%*
> > Amean 172 5.2830 ( 0.00%) 5.8717 * -11.14%* 5.7370 * -8.59%* 5.8280 * -10.32%* 5.0943 * 3.57%* 5.1083 * 3.31%* 5.9720 * -13.04%*
> >
> > This is showing a 12% regression for 79 groups. This one is interesting
> > because EPYC2 seems to be the machine that is most affected by this. EPYC2
> > has a different topology than a similar x86 machine in that it has
> > multiple last-level caches, so the sched domain groups it looks at are
> > smaller than on other machines.
> >
> > Given the number of machines and workloads affected, can we revert and
> > retry? I have not tested against current mainline as scheduler details
> > have changed again but I can do so if desired.
> >
> > --
> > Mel Gorman
> > SUSE Labs
>

--

2020-11-02 16:54:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, Nov 02, 2020 at 09:44:18AM -0500, Phil Auld wrote:
> > I'm going to have a look at the regressions and see if patches that
> > have been queued for v5.10 or even more recent patch can help or if
> > the patch should be adjusted
> >
>
> Fwiw, we have pulled this in, along with some of the 5.10-rc1 fixes in this
> area and in the load balancing code.
>

I assume you mean a distro kernel but in this case, all the bisections
were vanilla mainline.

> We found some load balancing improvements and some minor overall perf
> gains in a few places, but generally did not see any difference from before
> the commit mentioned here.
>
> I'm wondering, Mel, if you have compared 5.10-rc1?
>

No, but it's queued now -- 5.9 vs 5.9-revert vs 5.10-rc2 vs
5.10-rc2-revert. It's only one machine queued but hopefully it'll
reproduce. Both 5.9 and 5.10 are being tested in case one of the changes
merged in 5.10 masks the problem. Ordinarily I would have checked first
but I'm backlogged so I took a report-first-test-later approach this
time around.

> We don't have everything though so it's possible something we have
> not pulled back is interacting with this patch, or we are missing something
> in our testing, or it's better with the later fixes in 5.10 or ...
> something else :)
>

Add to that userspace differences, core counts, CPU generation, the volume
of scheduler changes with interactions, different implementations of tests,
masking from cpufreq changes, the phase of the moon and just general plain
old bad luck.

> I'll try to see if we can run some direct 5.8 - 5.9 tests like these.
>

That would be nice. While I often see false positive bisections for
performance bugs, the number of identical reports and different machines
made this more suspicious.

--
Mel Gorman
SUSE Labs

2020-11-04 09:45:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, Nov 02, 2020 at 09:44:18AM -0500, Phil Auld wrote:
> > > I have not investigated why because I do not have the bandwidth
> > > to do a detailed study (I was off for a few days and my backlog is
> > > severe). However, I recommend that before v5.10 this be reverted and retried.
> > > If I'm cc'd on v2, I'll run the same tests through the grid and see what
> > > falls out.
> >
> > I'm going to have a look at the regressions and see if patches that
> > have been queued for v5.10 or even more recent patch can help or if
> > the patch should be adjusted
> >
>
> Fwiw, we have pulled this in, along with some of the 5.10-rc1 fixes in this
> area and in the load balancing code.
>
> We found some load balancing improvements and some minor overall perf
> gains in a few places, but generally did not see any difference from before
> the commit mentioned here.
>
> I'm wondering, Mel, if you have compared 5.10-rc1?
>

The results indicate that reverting on 5.9 would have been the right
decision. It's less clear for 5.10-rc2 so I'm only showing the 5.10-rc2
comparison. Bear in mind that this is one machine only so I'll be
rerunning against all the affected machines according to the bisections
run against 5.9.

aim9
5.10.0-rc2 5.10.0-rc2
vanilla 5.10-rc2-revert
Hmean page_test 510863.13 ( 0.00%) 517673.91 * 1.33%*
Hmean brk_test 1807400.76 ( 0.00%) 1814815.29 * 0.41%*
Hmean exec_test 821.41 ( 0.00%) 841.05 * 2.39%*
Hmean fork_test 4399.71 ( 0.00%) 5124.86 * 16.48%*

Reverting shows a 16.48% gain for fork_test and minor gains for others.
To be fair, I don't generally consider the fork_test to be particularly
important because fork microbenchmarks that do no real work are rarely
representative of anything useful. It tends to go up and down a lot, and
it's rare that a regression in fork_test correlates to anything else.

Hackbench failed to run because I typo'd the configuration. The kernel build
benchmark and the git test suite were both inconclusive for 5.10-rc2
(neutral results), although they showed a 10-20% gain for kernbench and a
24% gain for the git test suite by reverting in 5.9.

The gitsource test was interesting for a few reasons. First, the big
difference between 5.9 and 5.10 is that the workload is mostly concentrated
on one NUMA node. mpstat shows that 5.10-rc2 uses all of the CPUs on one
node lightly. Reverting the patch shows that far fewer CPUs are used at
a higher utilisation -- not particularly high utilisation because of the
nature of the workload, but noticeable. In other words, gitsource with
the revert packs the workload onto fewer CPUs. The same holds for
fork_test -- reverting packs the workload onto fewer CPUs with higher
utilisation on each of them. Generally this plays well with cpufreq
without schedutil: using fewer CPUs means each CPU is likely to reach
higher frequencies.

While it's possible that some other factor masked the impact of the patch,
the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
indicates that if the patch was implemented against 5.10-rc2, it would
likely not have been merged. I've queued the tests on the remaining
machines to see if something more conclusive falls out.

--
Mel Gorman
SUSE Labs

2020-11-04 10:08:31

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Wed, 4 Nov 2020 at 10:42, Mel Gorman <[email protected]> wrote:
>
> On Mon, Nov 02, 2020 at 09:44:18AM -0500, Phil Auld wrote:
> > > > I have not investigated why because I do not have the bandwidth
> > > > to do a detailed study (I was off for a few days and my backlog is
> > > > severe). However, I recommend that before v5.10 this be reverted and retried.
> > > > If I'm cc'd on v2, I'll run the same tests through the grid and see what
> > > > falls out.
> > >
> > > I'm going to have a look at the regressions and see if patches that
> > > have been queued for v5.10 or even more recent patch can help or if
> > > the patch should be adjusted
> > >
> >
> > Fwiw, we have pulled this in, along with some of the 5.10-rc1 fixes in this
> > area and in the load balancing code.
> >
> > We found some load balancing improvements and some minor overall perf
> > gains in a few places, but generally did not see any difference from before
> > the commit mentioned here.
> >
> > I'm wondering, Mel, if you have compared 5.10-rc1?
> >
>
> The results indicate that reverting on 5.9 would have been the right
> decision. It's less clear for 5.10-rc2 so I'm only showing the 5.10-rc2
> comparison. Bear in mind that this is one machine only so I'll be
> rerunning against all the affected machines according to the bisections
> run against 5.9.
>
> aim9
> 5.10.0-rc2 5.10.0-rc2
> vanilla 5.10-rc2-revert
> Hmean page_test 510863.13 ( 0.00%) 517673.91 * 1.33%*
> Hmean brk_test 1807400.76 ( 0.00%) 1814815.29 * 0.41%*
> Hmean exec_test 821.41 ( 0.00%) 841.05 * 2.39%*
> Hmean fork_test 4399.71 ( 0.00%) 5124.86 * 16.48%*
>
> Reverting shows a 16.48% gain for fork_test and minor gains for others.
> To be fair, I don't generally consider the fork_test to be particularly
> important because fork microbenchmarks that do no real work are rarely
> representative of anything useful. It tends to go up and down a lot and
> it's rare a regression in fork_test correlates to anything else.

The patch makes a difference only when most of the CPUs are already idle,
because it will select the one with the lowest utilization -- which roughly
translates to the least recently used one.
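
For reference, the comparison the patch adds in update_pick_idlest() for
groups of type group_has_spare is roughly the following (a paraphrased
sketch of the hunk under discussion, not a verbatim copy of fair.c):

	case group_has_spare:
		/* Select the group with the most idle CPUs */
		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
			return false;

		/*
		 * When the idle CPU counts are equal, fall back to
		 * utilization so that a group whose CPUs just went back
		 * to idle after running newly forked (now sleeping)
		 * tasks is not picked again and again.
		 */
		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
		    idlest_sgs->group_util <= sgs->group_util)
			return false;
		break;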

>
> Hackbench failed to run because I typo'd the configuration. Kernel build
> benchmark and git test suite both were inconclusive for 5.10-rc2
> (neutral results) although they showed 10-20% gain for kernbench and 24%
> gain in git test suite by reverting in 5.9.
>
> The gitsource test was interesting for a few reasons. First, the big
> difference between 5.9 and 5.10 is that the workload is mostly concentrated
> on one NUMA node. mpstat shows that 5.10-rc2 uses all of the CPUs on one
> node lightly. Reverting the patch shows that far fewer CPUs are used at
> a higher utilisation -- not particularly high utilisation because of the
> nature of the workload but noticeable. i.e. gitsource with the revert
> packs the workload onto fewer CPUs. The same holds for fork_test --
> reverting packs the workload onto fewer CPUs with higher utilisation on
> each of them. Generally this plays well with cpufreq without schedutil:
> using fewer CPUs means the CPU is likely to reach higher frequencies.

Which cpufreq governor are you using ?

>
> While it's possible that some other factor masked the impact of the patch,
> the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> indicates that if the patch was implemented against 5.10-rc2, it would
> likely not have been merged. I've queued the tests on the remaining
> machines to see if something more conclusive falls out.

I don't think that the goal of the patch is stressed by those benchmarks.
I typically try to optimize the sequence:
1-fork a lot of threads that immediately wait
2-wake up all threads simultaneously to run in parallel
3-wait for the end of all threads

Without the patch, all newly forked threads were packed on a few CPUs
which were already idle when the next fork happened. Then the threads
were spread across CPUs in the LLC at wakeup, but they had to wait for a
load balance to fill the other sched domains.
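
A minimal sketch of that sequence using POSIX threads (illustrative only,
not a stock benchmark; compile with -lpthread):

	#include <pthread.h>

	#define NR_THREADS 64

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t go = PTHREAD_COND_INITIALIZER;
	static int start;

	static void *worker(void *arg)
	{
		(void)arg;

		/* 1: the newly forked thread immediately waits */
		pthread_mutex_lock(&lock);
		while (!start)
			pthread_cond_wait(&go, &lock);
		pthread_mutex_unlock(&lock);

		/* ... do the parallel work here ... */
		return NULL;
	}

	int main(void)
	{
		pthread_t tids[NR_THREADS];
		int i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&tids[i], NULL, worker, NULL);

		/* 2: wake up all threads simultaneously */
		pthread_mutex_lock(&lock);
		start = 1;
		pthread_cond_broadcast(&go);
		pthread_mutex_unlock(&lock);

		/* 3: wait for the end of all threads */
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}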

>
> --
> Mel Gorman
> SUSE Labs

2020-11-04 10:50:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Wed, Nov 04, 2020 at 11:06:06AM +0100, Vincent Guittot wrote:
> >
> > Hackbench failed to run because I typo'd the configuration. Kernel build
> > benchmark and git test suite both were inconclusive for 5.10-rc2
> > (neutral results) although they showed 10-20% gain for kernbench and 24%
> > gain in git test suite by reverting in 5.9.
> >
> > The gitsource test was interesting for a few reasons. First, the big
> > difference between 5.9 and 5.10 is that the workload is mostly concentrated
> > on one NUMA node. mpstat shows that 5.10-rc2 uses all of the CPUs on one
> > node lightly. Reverting the patch shows that far fewer CPUs are used at
> > a higher utilisation -- not particularly high utilisation because of the
> > nature of the workload but noticeable. i.e. gitsource with the revert
> > packs the workload onto fewer CPUs. The same holds for fork_test --
> > reverting packs the workload onto fewer CPUs with higher utilisation on
> > each of them. Generally this plays well with cpufreq without schedutil:
> > using fewer CPUs means the CPU is likely to reach higher frequencies.
>
> Which cpufreq governor are you using ?
>

Uhh, intel_pstate with ondemand .... which is surprising, I would have
expected powersave. I'd have to look closer at what happened there. It
might be a variation of the Kconfig mess selecting the wrong governors when
"yes '' | make oldconfig" is used.

> >
> > While it's possible that some other factor masked the impact of the patch,
> > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > indicates that if the patch was implemented against 5.10-rc2, it would
> > likely not have been merged. I've queued the tests on the remaining
> > machines to see if something more conclusive falls out.
>
> I don't think that the goal of the patch is stressed by those benchmarks.
> I typically try to optimize the sequence:
> 1-fork a lot of threads that immediately wait
> 2-wake up all threads simultaneously to run in parallel
> 3-wait the end of all threads
>

Out of curiosity, have you a stock benchmark that does this with some
associated metric? sysbench-threads wouldn't do it. While I know of at
least one benchmark that *does* exhibit this pattern, it's a Real Workload
that cannot be shared (so I can't discuss it) and it's *complex* with a
minimal kernel footprint so analysing it is non-trivial.

I could develop one on my own but if you had one already, I'd wire it into
mmtests and add it to the stock collection of scheduler loads. schbench
*might* match what you're talking about but I'd rather not guess.
schbench is also more of a latency wakeup benchmark than it is a throughput
one. Latency ones tend to be more important but optimising purely for
wakeup-latency also tends to kick other workloads into a hole.

> Without the patch all newly forked threads were packed on few CPUs
> which were already idle when the next fork happened. Then the threads
> were spread on CPUs at wakeup in the LLC but they have to wait for a
> LB to fill other sched domain
>

Which is fair enough but it's a tradeoff because there are plenty of
workloads that fork/exec and do something immediately and this is not
the first time we've had to tradeoff between workloads.

The other aspect I find interesting is that we get slightly burned by
the initial fork path because of this thing;

/*
 * Otherwise, keep the task on this node to stay close
 * its wakeup source and improve locality. If there is
 * a real need of migration, periodic load balance will
 * take care of it.
 */
if (local_sgs.idle_cpus)
	return NULL;

For a workload that creates a lot of new threads that go idle and then
wakeup (think worker pool threads that receive requests at unpredictable
times), it packs one node too tightly when the threads wakeup -- it's
also visible from page fault microbenchmarks that scale the number of
threads. It's a vaguely similar class of problem but the patches are
taking very different approaches.

It'd been in my mind to consider reconciling that chunk with
adjust_numa_imbalance() but I had not gotten around to seeing how it
should be reconciled without introducing another regression.
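
For reference, adjust_numa_imbalance() at the time was roughly the
following (a simplified paraphrase of the v5.9 helper in
kernel/sched/fair.c, not a verbatim copy):

	static inline long adjust_numa_imbalance(int imbalance, int nr_running)
	{
		/*
		 * Allow a small imbalance so that a pair of communicating
		 * tasks stays local when the source domain is almost idle.
		 */
		unsigned int imbalance_min = 2;

		if (nr_running <= imbalance_min)
			return 0;

		return imbalance;
	}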

The longer I work on the scheduler, the more I feel it's like juggling
while someone is firing arrows at you :D .

--
Mel Gorman
SUSE Labs

2020-11-04 11:39:01

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Wed, 4 Nov 2020 at 11:47, Mel Gorman <[email protected]> wrote:
>
> On Wed, Nov 04, 2020 at 11:06:06AM +0100, Vincent Guittot wrote:
> > >
> > > Hackbench failed to run because I typo'd the configuration. Kernel build
> > > benchmark and git test suite both were inconclusive for 5.10-rc2
> > > (neutral results) although they showed 10-20% gain for kernbench and 24%
> > > gain in git test suite by reverting in 5.9.
> > >
> > > The gitsource test was interesting for a few reasons. First, the big
> > > difference between 5.9 and 5.10 is that the workload is mostly concentrated
> > > on one NUMA node. mpstat shows that 5.10-rc2 uses all of the CPUs on one
> > > node lightly. Reverting the patch shows that far fewer CPUs are used at
> > > a higher utilisation -- not particularly high utilisation because of the
> > > nature of the workload but noticeable. i.e. gitsource with the revert
> > > packs the workload onto fewer CPUs. The same holds for fork_test --
> > > reverting packs the workload onto fewer CPUs with higher utilisation on
> > > each of them. Generally this plays well with cpufreq without schedutil:
> > > using fewer CPUs means the CPU is likely to reach higher frequencies.
> >
> > Which cpufreq governor are you using ?
> >
>
> Uhh, intel_pstate with ondemand .... which is surprising, I would have
> expected powersave. I'd have to look closer at what happened there. It
> might be a variation of the Kconfig mess selecting the wrong governors when
> "yes '' | make oldconfig" is used.
>
> > >
> > > While it's possible that some other factor masked the impact of the patch,
> > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > likely not have been merged. I've queued the tests on the remaining
> > > machines to see if something more conclusive falls out.
> >
> > I don't think that the goal of the patch is stressed by those benchmarks.
> > I typically try to optimize the sequence:
> > 1-fork a lot of threads that immediately wait
> > 2-wake up all threads simultaneously to run in parallel
> > 3-wait the end of all threads
> >
>
> Out of curiosity, have you a stock benchmark that does this with some
> associated metric? sysbench-threads wouldn't do it. While I know of at
> least one benchmark that *does* exhibit this pattern, it's a Real Workload
> that cannot be shared (so I can't discuss it) and it's *complex* with a
> minimal kernel footprint so analysing it is non-trivial.

Same for me, a real workload highlighted the behavior but i don't have
a stock benchmark

>
> I could develop one on my own but if you had one already, I'd wire it into
> mmtests and add it to the stock collection of scheduler loads. schbench
> *might* match what you're talking about but I'd rather not guess.
> schbench is also more of a latency wakeup benchmark than it is a throughput

We are interested in the latency at fork but not in the next wakeup,
which is what schbench really monitors IIUC. I don't know if we can
make schbench run only for the 1st loop.

> one. Latency ones tend to be more important but optimising purely for
> wakeup-latency also tends to kick other workloads into a hole.
>
> > Without the patch all newly forked threads were packed on few CPUs
> > which were already idle when the next fork happened. Then the threads
> > were spread on CPUs at wakeup in the LLC but they have to wait for a
> > LB to fill other sched domain
> >
>
> Which is fair enough but it's a tradeoff because there are plenty of
> workloads that fork/exec and do something immediately and this is not
> the first time we've had to tradeoff between workloads.

Those cases are caught by the previous test, which compares idle_cpus.

>
> The other aspect I find interesting is that we get slightly burned by
> the initial fork path because of this thing;
>
> /*
>  * Otherwise, keep the task on this node to stay close
>  * its wakeup source and improve locality. If there is
>  * a real need of migration, periodic load balance will
>  * take care of it.
>  */
> if (local_sgs.idle_cpus)
> 	return NULL;
>
> For a workload that creates a lot of new threads that go idle and then
> wakeup (think worker pool threads that receive requests at unpredictable
> times), it packs one node too tightly when the threads wakeup -- it's
> also visible from page fault microbenchmarks that scale the number of
> threads. It's a vaguely similar class of problem but the patches are
> taking very different approaches.

The patch at least ensures a spread within the current node. But I agree
that we can't go across nodes with the condition above.

It's all about how aggressive we want to be in the spreading. IIRC,
spreading across nodes at fork was too aggressive because of data
locality.

>
> It'd been in my mind to consider reconciling that chunk with the
> adjust_numa_imbalance but had not gotten around to seeing how it should
> be reconciled without introducing another regression.
>
> The longer I work on the scheduler, the more I feel it's like juggling
> while someone is firing arrows at you :D .

;-D

>
> --
> Mel Gorman
> SUSE Labs

2020-11-06 12:08:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> While it's possible that some other factor masked the impact of the patch,
> the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> indicates that if the patch was implemented against 5.10-rc2, it would
> likely not have been merged. I've queued the tests on the remaining
> machines to see if something more conclusive falls out.
>

It's not as conclusive as I would like. fork_test generally benefits
across the board but I do not put much weight in that.

Otherwise, it's workload and machine-specific.

schbench (wakeup latency sensitive): all machines benefitted from the
revert at low utilisation except one 2-socket Haswell machine
which showed higher variability when the machine was fully
utilised.

hackbench: Neutral except for the same 2-socket Haswell machine, which
took an 8% performance penalty for smaller numbers of groups
and 4% for higher numbers of groups.

pipetest: Mostly neutral except for the *same* machine showing an 18%
performance gain by reverting.

kernbench: Shows small gains at low job counts across the board -- from a
lowest gain of 0.84% up to 5.93% depending on the machine.

gitsource: low utilisation execution of the git test suite. This was
mostly a win for the revert. For the list of machines tested it was

14.48% gain (2 socket but SNC enabled to 4 NUMA nodes)
neutral (2 socket broadwell)
36.37% gain (1 socket skylake machine)
3.18% gain (2 socket broadwell)
4.4% (2 socket EPYC 2)
1.85% gain (2 socket EPYC 1)

While it was clear-cut for 5.9, it's less clear-cut for 5.10-rc2, although
gitsource shows some severe differences depending on the machine, which
is worth being extremely cautious about. I would still prefer a revert
but I'm also extremely biased and I know there are other patches in the
pipeline that may change the picture. A wider battery of tests might
paint a clearer picture but may not be worth the time investment.

So maybe let's just keep an eye on this one. When the scheduler pipeline
dies down a bit (does that happen?), we should at least revisit it.

--
Mel Gorman
SUSE Labs

2020-11-06 13:36:28

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
>
> On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > While it's possible that some other factor masked the impact of the patch,
> > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > indicates that if the patch was implemented against 5.10-rc2, it would
> > likely not have been merged. I've queued the tests on the remaining
> > machines to see if something more conclusive falls out.
> >
>
> It's not as conclusive as I would like. fork_test generally benefits
> across the board but I do not put much weight in that.
>
> Otherwise, it's workload and machine-specific.
>
> schbench: (wakeup latency sensitive), all machines benefitted from the
> revert at the low utilisation except one 2-socket haswell machine
> which showed higher variability when the machine was fully
> utilised.

There is a pending patch that should improve this bench:
https://lore.kernel.org/patchwork/patch/1330614/

>
> hackbench: Neutral except for the same 2-socket Haswell machine which
> > took an 8% performance penalty for smaller number of groups
> and 4% for higher number of groups.
>
> pipetest: Mostly neutral except for the *same* machine showing an 18%
> performance gain by reverting.
>
> kernbench: Shows small gains at low job counts across the board -- 0.84%
> lowest gain up to 5.93% depending on the machine
>
> gitsource: low utilisation execution of the git test suite. This was
> mostly a win for the revert. For the list of machines tested it was
>
> 14.48% gain (2 socket but SNC enabled to 4 NUMA nodes)
> neutral (2 socket broadwell)
> 36.37% gain (1 socket skylake machine)
> 3.18% gain (2 socket broadwell)
> 4.4% (2 socket EPYC 2)
> 1.85% gain (2 socket EPYC 1)
>
> While it was clear-cut for 5.9, it's less clear-cut for 5.10-rc2 although
> the gitsource shows some severe differences depending on the machine that
> is worth being extremely cautious about. I would still prefer a revert
> but I'm also extremely biased and I know there are other patches in the

This one from Julia can also have an impact

> pipeline that may change the picture. A wider battery of tests might
> paint a clearer picture but may not be worth the time investment.

hackbench and pipetest are the ones that I usually run, and they were not
facing a regression: results were either neutral or a small gain, which
seems aligned with the fact that there is not much fork/exec involved in
these benchmarks.
My machine has faced some connection issues over the last couple of days
so I don't have all the results, especially the gitsource and
kernbench ones.

>
> So maybe lets just keep an eye on this one. When the scheduler pipeline
> dies down a bit (does that happen?), we should at least revisit it.
>
> --
> Mel Gorman
> SUSE Labs

2020-11-06 16:05:24

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> >
> > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > While it's possible that some other factor masked the impact of the patch,
> > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > likely not have been merged. I've queued the tests on the remaining
> > > machines to see if something more conclusive falls out.
> > >
> >
> > It's not as conclusive as I would like. fork_test generally benefits
> > across the board but I do not put much weight in that.
> >
> > Otherwise, it's workload and machine-specific.
> >
> > schbench: (wakeup latency sensitive), all machines benefitted from the
> > revert at the low utilisation except one 2-socket haswell machine
> > which showed higher variability when the machine was fully
> > utilised.
>
> There is a pending patch that should improve this bench:
> https://lore.kernel.org/patchwork/patch/1330614/
>

Ok, I've slotted this one in with a bunch of other stuff I wanted to run
over the weekend. That particular patch was on my radar anyway. It just
got bumped up the schedule a little bit.

> > hackbench: Neutral except for the same 2-socket Haswell machine which
> > took an 8% performance penalty for smaller number of groups
> > and 4% for higher number of groups.
> >
> > pipetest: Mostly neutral except for the *same* machine showing an 18%
> > performance gain by reverting.
> >
> > kernbench: Shows small gains at low job counts across the board -- 0.84%
> > lowest gain up to 5.93% depending on the machine
> >
> > gitsource: low utilisation execution of the git test suite. This was
> > mostly a win for the revert. For the list of machines tested it was
> >
> > 14.48% gain (2 socket but SNC enabled to 4 NUMA nodes)
> > neutral (2 socket broadwell)
> > 36.37% gain (1 socket skylake machine)
> > 3.18% gain (2 socket broadwell)
> > 4.4% (2 socket EPYC 2)
> > 1.85% gain (2 socket EPYC 1)
> >
> > While it was clear-cut for 5.9, it's less clear-cut for 5.10-rc2 although
> > the gitsource shows some severe differences depending on the machine that
> > is worth being extremely cautious about. I would still prefer a revert
> > but I'm also extremely biased and I know there are other patches in the
>
> This one from Julia can also have an impact
>

Which one? I'm guessing "[PATCH v2] sched/fair: check for idle core"

--
Mel Gorman
SUSE Labs

2020-11-06 16:09:42

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Fri, 6 Nov 2020 at 17:00, Mel Gorman <[email protected]> wrote:
>
> On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> > >
> > > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > > While it's possible that some other factor masked the impact of the patch,
> > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > likely not have been merged. I've queued the tests on the remaining
> > > > machines to see if something more conclusive falls out.
> > > >
> > >
> > > It's not as conclusive as I would like. fork_test generally benefits
> > > across the board but I do not put much weight in that.
> > >
> > > Otherwise, it's workload and machine-specific.
> > >
> > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > revert at the low utilisation except one 2-socket haswell machine
> > > which showed higher variability when the machine was fully
> > > utilised.
> >
> > There is a pending patch that should improve this bench:
> > https://lore.kernel.org/patchwork/patch/1330614/
> >
>
> Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> over the weekend. That particular patch was on my radar anyway. It just
> got bumped up the schedule a little bit.
>
> > > hackbench: Neutral except for the same 2-socket Haswell machine which
> > > took an 8% performance penalty for smaller number of groups
> > > and 4% for higher number of groups.
> > >
> > > pipetest: Mostly neutral except for the *same* machine showing an 18%
> > > performance gain by reverting.
> > >
> > > kernbench: Shows small gains at low job counts across the board -- 0.84%
> > > lowest gain up to 5.93% depending on the machine
> > >
> > > gitsource: low utilisation execution of the git test suite. This was
> > > mostly a win for the revert. For the list of machines tested it was
> > >
> > > 14.48% gain (2 socket but SNC enabled to 4 NUMA nodes)
> > > neutral (2 socket broadwell)
> > > 36.37% gain (1 socket skylake machine)
> > > 3.18% gain (2 socket broadwell)
> > > 4.4% (2 socket EPYC 2)
> > > 1.85% gain (2 socket EPYC 1)
> > >
> > > While it was clear-cut for 5.9, it's less clear-cut for 5.10-rc2 although
> > > the gitsource shows some severe differences depending on the machine that
> > > is worth being extremely cautious about. I would still prefer a revert
> > > but I'm also extremely biased and I know there are other patches in the
> >
> > This one from Julia can also have an impact
> >
>
> Which one? I'm guessing "[PATCH v2] sched/fair: check for idle core"

Yes, sorry, I sent my answer before adding the link.

>
> --
> Mel Gorman
> SUSE Labs

2020-11-06 17:05:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Fri, Nov 06, 2020 at 05:06:59PM +0100, Vincent Guittot wrote:
> > > > While it was clear-cut for 5.9, it's less clear-cut for 5.10-rc2 although
> > > > the gitsource shows some severe differences depending on the machine that
> > > > is worth being extremely cautious about. I would still prefer a revert
> > > > but I'm also extremely biased and I know there are other patches in the
> > >
> > > This one from Julia can also have an impact
> > >
> >
> > Which one? I'm guessing "[PATCH v2] sched/fair: check for idle core"
>
> Yes, Sorry I sent my answer before adding the link
>

Grand, that's added to the mix on top to see how both patches measure up
versus a revert. No guarantee I'll have full results by Monday. As usual,
the test grid is loaded up to the eyeballs.

--
Mel Gorman
SUSE Labs

2020-11-09 15:28:47

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

Hi,

On Fri, Nov 06, 2020 at 04:00:10PM +0000 Mel Gorman wrote:
> On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> > >
> > > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > > While it's possible that some other factor masked the impact of the patch,
> > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > likely not have been merged. I've queued the tests on the remaining
> > > > machines to see if something more conclusive falls out.
> > > >
> > >
> > > It's not as conclusive as I would like. fork_test generally benefits
> > > across the board but I do not put much weight in that.
> > >
> > > Otherwise, it's workload and machine-specific.
> > >
> > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > revert at the low utilisation except one 2-socket haswell machine
> > > which showed higher variability when the machine was fully
> > > utilised.
> >
> > There is a pending patch that should improve this bench:
> > https://lore.kernel.org/patchwork/patch/1330614/
> >
>
> Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> over the weekend. That particular patch was on my radar anyway. It just
> got bumped up the schedule a little bit.
>


We've run some of our perf tests against various kernels in this thread.
By default RHEL configs run with the performance governor.


For 5.8 to 5.9 we can confirm Mel's results, but mostly in microbenchmarks.
We see microbenchmark hits with fork, exec and unmap. Real workloads showed
no difference between the two except for the EPYC first generation (Naples)
servers. On those systems, NAS and SPECjvm2008 showed a drop of about 10%,
but with very high variance.


With the spread LLC patch from Vincent on 5.9 we saw no performance change
in our benchmarks.


On 5.9, runs with and without Julia's patch showed no real performance
change. The only difference was an increase in hackbench latency on the
same EPYC first gen servers.


As I mentioned earlier in the thread we have all the 5.9 patches in this area
in our development distro kernel (plus a handful from 5.10-rc) and don't see
the same effect we see here between 5.8 and 5.9 caused by this patch. But
there are other variables there. We've queued up a comparison between that
kernel and one with just the patch in question reverted. That may tell us
if there is an effect that is otherwise being masked.


Jirka - feel free to correct me if I mis-summarized your results :)

Cheers,
Phil

--

2020-11-09 15:40:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, Nov 09, 2020 at 10:24:11AM -0500, Phil Auld wrote:
> Hi,
>
> On Fri, Nov 06, 2020 at 04:00:10PM +0000 Mel Gorman wrote:
> > On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > > On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> > > >
> > > > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > > > While it's possible that some other factor masked the impact of the patch,
> > > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > > likely not have been merged. I've queued the tests on the remaining
> > > > > machines to see if something more conclusive falls out.
> > > > >
> > > >
> > > > It's not as conclusive as I would like. fork_test generally benefits
> > > > across the board but I do not put much weight in that.
> > > >
> > > > Otherwise, it's workload and machine-specific.
> > > >
> > > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > > revert at the low utilisation except one 2-socket haswell machine
> > > > which showed higher variability when the machine was fully
> > > > utilised.
> > >
> > > There is a pending patch that should improve this bench:
> > > https://lore.kernel.org/patchwork/patch/1330614/
> > >
> >
> > Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> > over the weekend. That particular patch was on my radar anyway. It just
> > got bumped up the schedule a little bit.
> >
>
>
> We've run some of our perf tests against various kernels in this thread.
> By default RHEL configs run with the performance governor.
>

This aspect is somewhat critical because the patches affect CPU
selection. If a mostly idle CPU is used due to spreading load wider,
it can take longer to ramp up to the highest frequency. It can be a
dominating factor and may account for some of the differences.

Generally my tests are not based on the performance governor because a)
it's not a universal win and b) the powersave/ondemand governors should
be able to function reasonably well. For short-lived workloads it may
not matter, but ultimately schedutil should be good enough that it can
keep track of task utilisation after migrations and select appropriate
frequencies based on the task's historical behaviour.

--
Mel Gorman
SUSE Labs

2020-11-09 15:50:16

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, Nov 09, 2020 at 03:38:15PM +0000 Mel Gorman wrote:
> On Mon, Nov 09, 2020 at 10:24:11AM -0500, Phil Auld wrote:
> > Hi,
> >
> > On Fri, Nov 06, 2020 at 04:00:10PM +0000 Mel Gorman wrote:
> > > On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > > > On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> > > > >
> > > > > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > > > > While it's possible that some other factor masked the impact of the patch,
> > > > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > > > likely not have been merged. I've queued the tests on the remaining
> > > > > > machines to see if something more conclusive falls out.
> > > > > >
> > > > >
> > > > > It's not as conclusive as I would like. fork_test generally benefits
> > > > > across the board but I do not put much weight in that.
> > > > >
> > > > > Otherwise, it's workload and machine-specific.
> > > > >
> > > > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > > > revert at the low utilisation except one 2-socket haswell machine
> > > > > which showed higher variability when the machine was fully
> > > > > utilised.
> > > >
> > > > There is a pending patch that should improve this bench:
> > > > https://lore.kernel.org/patchwork/patch/1330614/
> > > >
> > >
> > > Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> > > over the weekend. That particular patch was on my radar anyway. It just
> > > got bumped up the schedule a little bit.
> > >
> >
> >
> > We've run some of our perf tests against various kernels in this thread.
> > By default RHEL configs run with the performance governor.
> >
>
> This aspect is somewhat critical because the patches affect CPU
> selection. If a mostly idle CPU is used due to spreading load wider,
> it can take longer to ramp up to the highest frequency. It can be a
> dominating factor and may account for some of the differences.
>
> Generally my tests are not based on the performance governor because a)
> it's not a universal win and b) the powersave/ondemand governors should
> be able to function reasonably well. For short-lived workloads it may
> not matter but ultimately schedutil should be good enough that it can
> keep track of task utilisation after migrations and select appropriate
> frequencies based on the task's historical behaviour.
>

Yes, I suspect that is why we don't see the more general performance hits
you're seeing.

I agree that schedutil would be nice. I don't think it's quite there yet, but
that's anecdotal. Current RHEL configs don't even enable it so it's harder to
test. That's something I'm working on getting changed. I'd like to make it
the default eventually but at least we need to have it available...

Cheers,
Phil


> --
> Mel Gorman
> SUSE Labs
>

--

2020-11-09 15:52:20

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, 9 Nov 2020 at 16:38, Mel Gorman <[email protected]> wrote:
>
> On Mon, Nov 09, 2020 at 10:24:11AM -0500, Phil Auld wrote:
> > Hi,
> >
> > On Fri, Nov 06, 2020 at 04:00:10PM +0000 Mel Gorman wrote:
> > > On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > > > On Fri, 6 Nov 2020 at 13:03, Mel Gorman <[email protected]> wrote:
> > > > >
> > > > > On Wed, Nov 04, 2020 at 09:42:05AM +0000, Mel Gorman wrote:
> > > > > > While it's possible that some other factor masked the impact of the patch,
> > > > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > > > likely not have been merged. I've queued the tests on the remaining
> > > > > > machines to see if something more conclusive falls out.
> > > > > >
> > > > >
> > > > > It's not as conclusive as I would like. fork_test generally benefits
> > > > > across the board but I do not put much weight in that.
> > > > >
> > > > > Otherwise, it's workload and machine-specific.
> > > > >
> > > > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > > > revert at the low utilisation except one 2-socket haswell machine
> > > > > which showed higher variability when the machine was fully
> > > > > utilised.
> > > >
> > > > There is a pending patch that should improve this bench:
> > > > https://lore.kernel.org/patchwork/patch/1330614/
> > > >
> > >
> > > Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> > > over the weekend. That particular patch was on my radar anyway. It just
> > > got bumped up the schedule a little bit.
> > >
> >
> >
> > We've run some of our perf tests against various kernels in this thread.
> > By default RHEL configs run with the performance governor.
> >
>
> This aspect is somewhat critical because the patches affect CPU
> selection. If a mostly idle CPU is used due to spreading load wider,
> it can take longer to ramp up to the highest frequency. It can be a
> dominating factor and may account for some of the differences.

I agree but that also highlights that the problem comes from frequency
selection more than task placement. In such a case, instead of trying
to bias task placement to compensate for wrong freq selection, we
should look at the freq selection itself. Not sure if that's the case,
but it's worth identifying whether the perf regression comes from task
placement and data locality or from freq selection.

>
> Generally my tests are not based on the performance governor because a)
> it's not a universal win and b) the powersave/ondemand governors should
> be able to function reasonably well. For short-lived workloads it may
> not matter but ultimately schedutil should be good enough that it can

Yeah, schedutil should be able to manage this. But there is another
place which impacts benchmarks that are based on a lot of fork/exec:
the initial value of a task's PELT signal. The current implementation
tries to accommodate both performance and embedded systems but might
end up satisfying neither.
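
For context, the initial value is seeded roughly like this (a simplified
paraphrase of post_init_entity_util_avg() in kernel/sched/fair.c around
v5.9, not a verbatim copy):

	/*
	 * Cap the initial contribution at half of the spare capacity
	 * of the CPU the task starts on.
	 */
	long cap = (long)(cpu_scale - cfs_rq->avg.util_avg) / 2;

	if (cap > 0) {
		if (cfs_rq->avg.util_avg != 0) {
			/* Inherit a share of the runqueue's utilization */
			sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
			sa->util_avg /= (cfs_rq->avg.load_avg + 1);

			if (sa->util_avg > cap)
				sa->util_avg = cap;
		} else {
			/* Idle runqueue: start at the cap */
			sa->util_avg = cap;
		}
	}

That seeded utilization then contributes to the group_util that the patch
compares, which is why the seeding policy matters for fork-heavy loads.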

> keep track of task utilisation after migrations and select appropriate
> frequencies based on the task's historical behaviour.
>
> --
> Mel Gorman
> SUSE Labs

2020-11-10 14:09:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

On Mon, Nov 09, 2020 at 04:49:07PM +0100, Vincent Guittot wrote:
> > This aspect is somewhat critical because the patches affect CPU
> > selection. If a mostly idle CPU is used due to spreading load wider,
> > it can take longer to ramp up to the highest frequency. It can be a
> > dominating factor and may account for some of the differences.
>
> I agree but that also highlights that the problem comes from frequency
> selection more than task placement. In such a case, instead of trying
> to bias task placement to compensate for wrong freq selection, we
> should look at the freq selection itself. Not sure if it's the case
> but it's worth identifying if perf regression comes from task
> placement and data locality or from freq selection
>

That's a fair point, although it's worth noting that biasing the freq
selection itself means that schedutil needs to become the default, which is
not quite there yet. Otherwise, the machine is often relying on firmware
to give hints as to how quickly it should ramp up, or on per-driver hacks,
which is the road to hell.

> >
> > Generally my tests are not based on the performance governor because a)
> > it's not a universal win and b) the powersave/ondemand govenors should
> > be able to function reasonably well. For short-lived workloads it may
> > not matter but ultimately schedutil should be good enough that it can
>
> Yeah, schedutil should be able to manage this. But there is another
> place which impacts benchmarks that are based on a lot of fork/exec:
> the initial value of a task's PELT signal. The current implementation
> tries to accommodate both performance and embedded systems but might
> end up satisfying neither.
>

Quite likely. Assuming schedutil becomes the default, it may be necessary
to have either a tunable or a Kconfig option that affects the initial PELT
signal as to whether it should start low and ramp up, pick a midpoint, or
start high and scale down.

--
Mel Gorman
SUSE Labs