Binder (Android's IPC mechanism) uses sync wakeups during synchronous
transactions to indicate to the scheduler that the waker is about to sleep
soon. The current wakeup path can be improved when the sync flag is passed,
resulting in higher binder performance. In this patch we more strongly wake up
the wakee on the waker's CPU if sync is passed, subject to a few other
conditions such as wake_cap and cpus_allowed. wake_wide is checked only after
the sync flag check so that it doesn't override sync. Binder throughput tests
see a good improvement when waking up the wakee (calling thread) on the CPU of
the waker (called thread) with this flag. Some test results are below:
On an 8-core ARM64 system, the following is data from running
hwbinderThroughputTest with a variable number of workers and services (the
workers are clients that call into the service threads and sleep until the
service replies to them).
2 workers calling into 4 services:
Without patch: iterations per sec: 62757
With patch: iterations per sec: 75236 (+19.88%)
4 workers calling into 2 services:
Without patch: iterations per sec: 82379
With patch: iterations per sec: 85829 (+4.18%)
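For context, the sync hint here is the WF_SYNC wake flag that the *_sync
wakeup helpers pass down to the scheduler. A minimal sketch of a waker using
it (illustrative only; binder's actual wakeup path is more involved and the
function below is made up for the example):

#include <linux/wait.h>
#include <linux/completion.h>

/*
 * Illustrative sketch: a waker that knows it is about to sleep uses the
 * _sync wakeup variant, which passes WF_SYNC down through try_to_wake_up()
 * to select_task_rq_fair().
 */
static void send_reply_and_wait(wait_queue_head_t *peer_wait,
				struct completion *reply_done)
{
	/* wake the peer with the sync hint ... */
	wake_up_interruptible_sync(peer_wait);

	/* ... and go to sleep right away, as the hint promises */
	wait_for_completion(reply_done);
}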
Cc: Peter Zijlstra <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
Rik, as we discussed on IRC, I am hoping that the lkp bot will also do its own
tests with this patch. I'm not sure if anything special needs to be in the
subject line to trigger the tests; if that's the case, let me know. Thanks!
kernel/sched/fair.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca6a57527f9..808571bc8ebe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6276,10 +6276,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE) {
+ if (sd_flag & SD_BALANCE_WAKE)
record_wakee(p);
- want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
- && cpumask_test_cpu(cpu, &p->cpus_allowed);
+
+ if ((sd_flag & SD_BALANCE_WAKE) && !wake_cap(p, cpu, prev_cpu)
+ && cpumask_test_cpu(cpu, &p->cpus_allowed)) {
+ /*
+	 * WF_SYNC indicates the waker will go to sleep soon. In case the waker
+	 * is the only one running, select the waker's CPU to run the wakee.
+ */
+ if (sync && cpu_rq(cpu)->nr_running < 2)
+ return cpu;
+
+ want_affine = !wake_wide(p);
}
rcu_read_lock();
--
2.14.1.342.g6490525c54-goog
On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
> Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
> transactions to the scheduler to indicate that the waker is about to sleep
> soon. The current wake up path can improved when the sync flag is passed
> resulting in higher binder performance. In this patch we more strongly wake up
> the wakee on the waker's CPU if sync is passed based on a few other conditions
> such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
> check so that it doesn't mess up sync. Binder throughput tests see good
> improvement improvement when waking up wakee (calling thread) on the waker's
> CPU (called thread) with this flag. Some tests results are below:
Sync is not a contract, it's a hint. If you really want sync behavior,
you need to create a contract signed in blood to signal that you really
really are passing the baton.
Sync wakeups make tons of sense when the waker really really has one
and only one wakee, AND really really is going to sleep immediately,
with zero overlap that can be converted to throughput by waking to an
idle core. With no L2 misses to slow them down, pass the baton
microbenchmarks that do no real work can generate impressive ping-pong
numbers... but the real world tends to do more than just bat a byte
around endlessly. The existing hint is not strong enough for your
needs; its current users may overlap with their wakee(s), change their
minds about sleeping (be handed more work to do), etc.
-Mike
On Sun, 2017-08-27 at 07:44 +0200, Mike Galbraith wrote:
> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
> > Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
> > transactions to the scheduler to indicate that the waker is about to sleep
> > soon. The current wake up path can improved when the sync flag is passed
> > resulting in higher binder performance. In this patch we more strongly wake up
> > the wakee on the waker's CPU if sync is passed based on a few other conditions
> > such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
> > check so that it doesn't mess up sync. Binder throughput tests see good
> > improvement improvement when waking up wakee (calling thread) on the waker's
> > CPU (called thread) with this flag. Some tests results are below:
>
> Sync is not a contract, it's a hint. If you really want sync behavior,
> you need to create a contract signed in blood to signal that you really
> really are passing the baton.
P.S. to get the most bang for your synchronous buck, you want a
preemptive wakeup.. but that butts heads with the fair engine.
-Mike
(Sorry my last reply was incorrectly formatted, resending..)
Hi Mike,
On Sat, Aug 26, 2017 at 10:44 PM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
>> Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
>> transactions to the scheduler to indicate that the waker is about to sleep
>> soon. The current wake up path can improved when the sync flag is passed
>> resulting in higher binder performance. In this patch we more strongly wake up
>> the wakee on the waker's CPU if sync is passed based on a few other conditions
>> such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
>> check so that it doesn't mess up sync. Binder throughput tests see good
>> improvement improvement when waking up wakee (calling thread) on the waker's
>> CPU (called thread) with this flag. Some tests results are below:
>
> Sync is not a contract, it's a hint. If you really want sync behavior,
> you need to create a contract signed in blood to signal that you really
> really are passing the baton.
Yes, that is the use case of binder; we are really passing the baton when we
pass sync. We also make binder not pass sync if there's more work to do
and more tasks to wake up. In all current and past products, we have been
using sync as a hard contract as you said. Are you proposing the addition of
another flag to differentiate between the existing hint and the contract?
I tried making sync be ignored if wake_wide = 1 as well, but it's not working
well for our use cases and hurts performance.
> Sync wakeups make tons of sense when the waker really really has one
> and only one wakee, AND really really is going to sleep immediately,
Yes, that is the case for binder. If we're going to be doing more work and
waking up others before going to sleep, we wouldn't pass sync.
Binder is actually an RPC mechanism, where the calling thread and called
thread are essentially a single entity but are split across process
boundaries. By using thread pools, we increase the likelihood that there's
a single thread available for each caller which will go back to sleep after
replying.
> with zero overlap that can be converted to throughput by waking to an
> idle core.
That's exactly why the microbenchmark speeds up too; from our observation,
waking up an idle core increases latency (and it probably also wastes
power).
thanks,
-Joel
Hi Mike,
On Sat, Aug 26, 2017 at 11:08 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-08-27 at 07:44 +0200, Mike Galbraith wrote:
>> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
>> > Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
>> > transactions to the scheduler to indicate that the waker is about to sleep
>> > soon. The current wake up path can improved when the sync flag is passed
>> > resulting in higher binder performance. In this patch we more strongly wake up
>> > the wakee on the waker's CPU if sync is passed based on a few other conditions
>> > such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
>> > check so that it doesn't mess up sync. Binder throughput tests see good
>> > improvement improvement when waking up wakee (calling thread) on the waker's
>> > CPU (called thread) with this flag. Some tests results are below:
>>
>> Sync is not a contract, it's a hint. If you really want sync behavior,
>> you need to create a contract signed in blood to signal that you really
>> really are passing the baton.
>
> P.S. to get the most bang for your synchronous buck, you want a
> preemptive wakeup.. but that butts heads with the fair engine.
>
By preemptive wake up I guess you mean the waker would give up its
time slice and let the wakee use it? That's a cool idea but I agree it
would be against the fair task behavior.
Also, about real-world benchmarks: in Android we have use cases showing that
graphics performance suffers and we risk frame drops if we don't use the
sync flag, so this is a real-world need. Binder is the backbone of Android,
and for the benefit of these use cases, Android kernels make sync a contract
(written in blood, as you put it).
thanks,
-Joel
> -Mike
On Sat, 2017-08-26 at 23:18 -0700, Joel Fernandes wrote:
>
> > Sync is not a contract, it's a hint. If you really want sync behavior,
> > you need to create a contract signed in blood to signal that you really
> > really are passing the baton.
>
> Yes that is the usecase of binder, we are really passing the baton
> when we pass sync. We also make binder to not pass sync if there's
> more work todo and more tasks to wake up. In all current and past
> products, we have been using sync has a hard contract as you said.
> Are you proposing addition of another flag to differentiate between
> the existing hint and the contract?
Yes, the alternative being to unilaterally (as you did) make the
existing hint a contract, in which case you would get to deal with any
fallout. You'd certainly get some "yay, you rock" mail, but you'd also
get some "boo, you suck rocks" to go with it :)
-Mike
On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>
> > P.S. to get the most bang for your synchronous buck, you want a
> > preemptive wakeup.. but that butts heads with the fair engine.
> >
>
> By preemptive wake up I guess you mean the waker would give up its
> time slice and let the wakee use it? That's a cool idea but I agree it
> would be against the fair task behavior.
No, I meant a preemption, that being the cheapest switch. Any mucking
about with vruntime is a non-starter (NAK bait), making guaranteed
preemption a non-starter.
-Mike
On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>
> Also about real world benchmarks, in Android we have usecases that
> show that the graphics performance and we have risk of frame drops if
> we don't use the sync flag so this is a real world need.
That likely has everything to do with cpufreq not realizing that your
CPUs really are quite busy when scheduling cross core at fairly high
frequency, and not clocking up properly.
-Mike
Hi Mike,
On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>>
>> Also about real world benchmarks, in Android we have usecases that
>> show that the graphics performance and we have risk of frame drops if
>> we don't use the sync flag so this is a real world need.
>
> That likely has everything to do with cpufreq not realizing that your
> CPUs really are quite busy when scheduling cross core at fairly high
> frequency, and not clocking up properly.
>
I'm glad you brought this point up. Since Android O, the userspace
processes are much more split across procedure calls due to a feature
called Treble (which does this for security, modularity, etc.). Due to
this, a lot of things that were happening within a process boundary now
happen across process boundaries over the binder bus. Early on, folks
noticed that this caused performance issues without the sync flag being
used as a stronger hint. This can happen when 2 threads are in different
frequency domains on different CPUs and are communicating over binder;
the combined load of both threads is divided between the individual CPUs
and causes them to run at a lower frequency. Whereas if they are running
together on the same CPU, they would run at a higher frequency and perform
better, as their combined load would run at that higher frequency. So a
stronger sync actually helps this case if we're careful about using it when
possible.
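A toy numerical illustration of this effect (made-up numbers; assumes a
schedutil-style governor where next_freq is roughly 1.25 * max_freq * util / max):

#include <stdio.h>

/*
 * Toy illustration only: two communicating threads of ~30% utilization each
 * request a much lower frequency when split across two CPUs than when their
 * combined load sits on one CPU.
 */
int main(void)
{
	double max_freq = 2000.0;		/* MHz, illustrative */
	double util_a = 0.30, util_b = 0.30;	/* per-thread utilization */

	double split  = 1.25 * max_freq * util_a;		/* each CPU alone */
	double shared = 1.25 * max_freq * (util_a + util_b);	/* both on one CPU */

	printf("split: ~%.0f MHz per CPU, shared: ~%.0f MHz\n", split, shared);
	return 0;
}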
thanks,
-Joel
> -Mike
On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
> >>
> >> Also about real world benchmarks, in Android we have usecases that
> >> show that the graphics performance and we have risk of frame drops if
> >> we don't use the sync flag so this is a real world need.
> >
> > That likely has everything to do with cpufreq not realizing that your
> > CPUs really are quite busy when scheduling cross core at fairly high
> > frequency, and not clocking up properly.
> >
>
> I'm glad you brought this point up. Since Android O, the userspace
> processes are much more split across procedure calls due to a feature
> called treble (which does this for security, modularity etc). Due to
> this, a lot of things that were happening within a process boundary
> happen now across process boundaries over the binder bus. Early on
> folks noticed that this caused performance issues without sync flag
> being used as a more strong hint. This can happen when there are 2
> threads are in different frequency domains on different CPUs and are
> communicating over binder, due to this the combined load of both
> threads is divided between the individual CPUs and causes them to run
> at lower frequency. Where as if they are running together on the same
> CPUs, then they would run at a higher frequency and perform better as
> their combined load would run at a higher frequency. So a stronger
> sync actually helps this case if we're careful about using it when
> possible.
Sure, but isn't that really a cpufreq issue? We schedule cross core
quite aggressively for obvious reasons. Now on mostly idle handheld
devices, you may get better battery life by stacking tasks a bit more,
in which case a sync-me-harder flag may be what you really want/need,
but with modern CPUs, I'm kinda skeptical of that, would have to see
cold hard numbers to become a believer. Iff deeper cstate etc for
longer does make a big difference, I can imagine wakeup-time migration
leftward if capacity exists as an "on battery" tactic. (though that
thought also invokes some unpleasant bounce fest images)
-Mike
On Mon, 2017-08-28 at 08:10 +0200, Mike Galbraith wrote:
> Iff deeper cstate etc for
> longer does make a big difference, I can imagine wakeup time migrate
> leftward if capacity exists as an "on battery" tactic. (though that
> thought also invokes some unpleasant bounce fest images)
(consolidate left would have to be LB global to avoid fight with self)
Hi Mike,
On Sun, Aug 27, 2017 at 11:10 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
>> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>> >>
>> >> Also about real world benchmarks, in Android we have usecases that
>> >> show that the graphics performance and we have risk of frame drops if
>> >> we don't use the sync flag so this is a real world need.
>> >
>> > That likely has everything to do with cpufreq not realizing that your
>> > CPUs really are quite busy when scheduling cross core at fairly high
>> > frequency, and not clocking up properly.
>> >
>>
>> I'm glad you brought this point up. Since Android O, the userspace
>> processes are much more split across procedure calls due to a feature
>> called treble (which does this for security, modularity etc). Due to
>> this, a lot of things that were happening within a process boundary
>> happen now across process boundaries over the binder bus. Early on
>> folks noticed that this caused performance issues without sync flag
>> being used as a more strong hint. This can happen when there are 2
>> threads are in different frequency domains on different CPUs and are
>> communicating over binder, due to this the combined load of both
>> threads is divided between the individual CPUs and causes them to run
>> at lower frequency. Where as if they are running together on the same
>> CPUs, then they would run at a higher frequency and perform better as
>> their combined load would run at a higher frequency. So a stronger
>> sync actually helps this case if we're careful about using it when
>> possible.
>
> Sure, but isn't that really a cpufreq issue? We schedule cross core
IMO it's an issue with the scheduler not being aware of the
relationship between groups of tasks doing work as a pipeline. Sync
seems to me one way to communicate that.
> quite aggressively for obvious reasons. Now on mostly idle handheld
> devices, you may get better battery life by stacking tasks a bit more,
> in which case a sync-me-harder flag may be what you really want/need,
> but with modern CPUs, I'm kinda skeptical of that, would have to see
> cold hard numbers to become a believer. Iff deeper cstate etc for
If you can suggest any tests, I could run them on my Intel machine. By
the way, CPUs on handhelds are pretty modern these days ;-)
> longer does make a big difference, I can imagine wakeup time migrate
> leftward if capacity exists as an "on battery" tactic. (though that
> thought also invokes some unpleasant bounce fest images)
I'm assuming you mean that tasks are packed together more closely and
scheduled on more energy-efficient CPUs at wake-up time. This is exactly
what we do in the energy-aware scheduling (EAS) that ARM has been working
on and that is integrated into the Android kernels, though it's done
whether on battery or plugged in. Also, instead of migrating leftward
or in a fixed direction, it checks what the change in energy would be,
based on an "energy model" and the current utilization, before deciding
whether to wake up on a different CPU. It's also not perfect and is
an approximation, but overall it seems to provide good energy savings.
thanks,
-Joel
>
> -Mike
On Mon, 2017-08-28 at 09:20 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Aug 27, 2017 at 11:10 PM, Mike Galbraith <[email protected]> wrote:
> > On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
> >> Hi Mike,
> >>
> >> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> >> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
> >> >>
> >> >> Also about real world benchmarks, in Android we have usecases that
> >> >> show that the graphics performance and we have risk of frame drops if
> >> >> we don't use the sync flag so this is a real world need.
> >> >
> >> > That likely has everything to do with cpufreq not realizing that your
> >> > CPUs really are quite busy when scheduling cross core at fairly high
> >> > frequency, and not clocking up properly.
> >> >
> >>
> >> I'm glad you brought this point up. Since Android O, the userspace
> >> processes are much more split across procedure calls due to a feature
> >> called treble (which does this for security, modularity etc). Due to
> >> this, a lot of things that were happening within a process boundary
> >> happen now across process boundaries over the binder bus. Early on
> >> folks noticed that this caused performance issues without sync flag
> >> being used as a more strong hint. This can happen when there are 2
> >> threads are in different frequency domains on different CPUs and are
> >> communicating over binder, due to this the combined load of both
> >> threads is divided between the individual CPUs and causes them to run
> >> at lower frequency. Where as if they are running together on the same
> >> CPUs, then they would run at a higher frequency and perform better as
> >> their combined load would run at a higher frequency. So a stronger
> >> sync actually helps this case if we're careful about using it when
> >> possible.
> >
> > Sure, but isn't that really a cpufreq issue? We schedule cross core
>
> IMO its an issue with the scheduler not being aware of the
> relationship between groups of tasks doing work as a pipeline. Sync
> seems to me one way to communicate that.
It's certainly simple, which is requirement #1.
> > quite aggressively for obvious reasons. Now on mostly idle handheld
> > devices, you may get better battery life by stacking tasks a bit more,
> > in which case a sync-me-harder flag may be what you really want/need,
> > but with modern CPUs, I'm kinda skeptical of that, would have to see
> > cold hard numbers to become a believer. Iff deeper cstate etc for
>
> If you can suggest any tests, I could run them on my Intel machine? By
> the way CPUs on handhelds are pretty modern these days ;-)
Nah, that would be wasting your time doing what I could do myself.
> > longer does make a big difference, I can imagine wakeup time migrate
> > leftward if capacity exists as an "on battery" tactic. (though that
> > thought also invokes some unpleasant bounce fest images)
>
> I'm assuming you mean that tasks are packed together more closely and
> scheduled on energy aware CPUs at wake up time. This is exactly what
> we do in the Energy aware scheduling (EAS) that ARM has been working
> on and is integrated into the Android kernels but its being done
> whether on battery or plugged in. Also instead of migrating leftward
> or a fixed diretion, it checks for what would the changes in energy be
> based on an "energy model" and the current utilization before deciding
> on whether to wake up on a different CPU. Its also not perfect and is
> an approximation but overall seems to provide good energy savings.
OK, so you're already doing some packing.
Personally, I still think that there's a bit of a disconnect between
cpufreq and scheduler, but that aside, sync was invented for what
you're trying to do. If you can improve it without wreckage, cool,
generic improvement rocks, so good luck.
-Mike
Greeting,
FYI, we noticed a -11.3% regression of netperf.Throughput_tps due to commit:
commit: 6d46bd3d9705555382b83554b56a34f231d5d1dd ("sched/fair: Improve the behavior of sync flag")
url: https://github.com/0day-ci/linux/commits/Joel-Fernandes/sched-fair-Improve-the-behavior-of-sync-flag/20170829-191129
in testcase: netperf
on test machine: 8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz with 16G memory
with following parameters:
ip: ipv4
runtime: 300s
nr_threads: 25%
cluster: cs-localhost
test: TCP_CRR
cpufreq_governor: performance
test-description: Netperf is a benchmark that can be used to measure various aspects of networking performance.
test-url: http://www.netperf.org/netperf/
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
testcase/path_params/tbox_group/run: netperf/ipv4-300s-25%-cs-localhost-TCP_CRR-performance/lkp-avoton3
9c8783201cb58e9a 6d46bd3d9705555382b83554b5
---------------- --------------------------
6829 -11% 6054 netperf.Throughput_tps
68768 ± 7% 10484% 7278249 netperf.time.involuntary_context_switches
165 -28% 119 netperf.time.percent_of_cpu_this_job_got
497 -29% 354 netperf.time.system_time
4106701 -100% 17497 ± 3% netperf.time.voluntary_context_switches
8921 8696 vmstat.system.in
62311 -21% 49214 vmstat.system.cs
11.61 3% 12.00 boot-time.dhcp
12.69 3% 13.08 boot-time.kernel_boot
193 3% 198 boot-time.idle
27.06 27.81 boot-time.boot
11950 399% 59639 perf-stat.cpu-migrations
0.22 10% 0.25 perf-stat.ipc
1.62 1.58 perf-stat.iTLB-load-miss-rate%
4.49 -9% 4.08 perf-stat.cpi
1.19 ± 5% -14% 1.03 perf-stat.cache-miss-rate%
8.983e+10 -14% 7.732e+10 perf-stat.branch-instructions
4.855e+11 -14% 4.166e+11 perf-stat.iTLB-loads
4.857e+11 -15% 4.145e+11 perf-stat.instructions
7.988e+09 -17% 6.669e+09 perf-stat.iTLB-load-misses
7.348e+10 -20% 5.849e+10 perf-stat.cache-references
19124461 -21% 15098659 perf-stat.context-switches
2.181e+12 -23% 1.69e+12 perf-stat.cpu-cycles
12.15 -25% 9.15 perf-stat.branch-miss-rate%
8.72e+08 ± 5% -31% 6.002e+08 perf-stat.cache-misses
1.091e+10 -35% 7.072e+09 perf-stat.branch-misses
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Xiaolong
Hi,
On Sun, Sep 10, 2017 at 6:40 AM, kernel test robot
<[email protected]> wrote:
>
> Greeting,
>
> FYI, we noticed a -11.3% regression of netperf.Throughput_tps due to commit:
>
>
> commit: 6d46bd3d9705555382b83554b56a34f231d5d1dd ("sched/fair: Improve the behavior of sync flag")
> url: https://github.com/0day-ci/linux/commits/Joel-Fernandes/sched-fair-Improve-the-behavior-of-sync-flag/20170829-191129
>
Glad to see the bot tested this RFC/RFT patch. Anyone know what in the
netperf test triggers use of the sync flag?
thanks,
-Joel
[..]
On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
>
> Anyone know what in the netperf test triggers use of the sync flag?
homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
The same as metric tons of other stuff.
Once upon a time, we had avg_overlap to help decide whether to wake
core affine or not, on top of the wake_affine() imbalance constraint,
but instrumentation showed it to be too error prone, so it had to die.
These days, an affine wakeup generally means cache affine, and the
sync hint gives you a wee bit more chance of migration near to tasty
hot data being approved.
The sync hint was born back in the bad old days, when communicating
tasks not sharing L2 may as well have been talking over two tin cans
and a limp string. These days, things are oodles better, but truly
synchronous stuff could still benefit from core affinity (up to hugely
for very fast/light stuff) if it weren't for all the caveats that can
lead to tossing concurrency opportunities out the window.
-Mike
Hi Mike,
Thanks a lot for sharing the history of this.
On Sun, Sep 10, 2017 at 7:55 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
>>
>> Anyone know what in the netperf test triggers use of the sync flag?
>
> homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
> net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
> net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
>
> The same as metric tons of other stuff.
>
> Once upon a time, we had avg_overlap to help decide whether to wake
> core affine or not, on top of the wake_affine() imbalance constraint,
> but instrumentation showed it to be too error prone, so it had to die.
> These days, an affine wakeup generally means cache affine, and the
> sync hint gives you a wee bit more chance of migration near to tasty
> hot data being approved.
>
> The sync hint was born back in the bad old days, when communicating
> tasks not sharing L2 may as well have been talking over two tin cans
> and a limp string. These days, things are oodles better, but truly
> synchronous stuff could still benefit from core affinity (up to hugely
> for very fast/light stuff) if it weren't for all the caveats that can
> lead to tossing concurrency opportunities out the window.
Cool, thanks. For this test I suspect it's the other way? I think the
reason why it regresses is that the 'nr_running < 2' check is too weak a
check to prevent sync in all bad situations ('bad' being pulling a
task to a crowded CPU). Could we be having a situation in this
test where, if the blocked load of a CPU is high (many tasks recently were
running on it and went to sleep), the 'nr_running < 2' check is a false
positive, and in such a scenario we listened to the sync flag when we
shouldn't have?
To make the load check more meaningful, I am wondering if using
wake_affine()'s balance check is a better thing to do than the
'nr_running < 2' check I used in this patch. Then again, since commit
3fed382b46baac ("sched/numa: Implement NUMA node level wake_affine()"),
wake_affine() doesn't do a balance check for CPUs within a socket, so
probably bringing back something like the *old* wake_affine() that
checked load between different CPUs within a socket is needed to avoid
a potentially disastrous sync decision? The commit I refer to was
added because select_idle_sibling was selecting cores
anywhere within a socket, but with my patch we're more specifically
selecting the waker's CPU on passing the sync flag. Could you share
your thoughts about this?
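Concretely, what I have in mind is roughly the check below (a simplified
sketch from memory, not the exact old wake_affine() code; the load values
would come from the usual weighted CPU load helpers):

#include <linux/types.h>

/*
 * Simplified sketch of an old-style wake_affine() imbalance check: only
 * allow the affine (sync) pull if the waker's CPU, with the wakee's load
 * added, would not be much busier than the wakee's previous CPU with that
 * load removed.
 */
static bool affine_pull_ok(unsigned long this_load, unsigned long prev_load,
			   unsigned long task_load, unsigned int imbalance_pct)
{
	unsigned long this_eff = this_load + task_load;
	unsigned long prev_eff = prev_load > task_load ? prev_load - task_load : 0;

	return this_eff * 100 <= prev_eff * imbalance_pct;
}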
I will run some tracing on this netperf test and try to understand the
undesirable behavior better as well,
thanks,
-Joel
On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
> Hi Mike,
> Thanks a lot for sharing the history of this.
>
> On Sun, Sep 10, 2017 at 7:55 PM, Mike Galbraith <[email protected]> wrote:
> > On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
> >>
> >> Anyone know what in the netperf test triggers use of the sync flag?
> >
> > homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
> > net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> > net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> > net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> > net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> > net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> > net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> > net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
> > net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
> >
> > The same as metric tons of other stuff.
> >
> > Once upon a time, we had avg_overlap to help decide whether to wake
> > core affine or not, on top of the wake_affine() imbalance constraint,
> > but instrumentation showed it to be too error prone, so it had to die.
> > These days, an affine wakeup generally means cache affine, and the
> > sync hint gives you a wee bit more chance of migration near to tasty
> > hot data being approved.
> >
> > The sync hint was born back in the bad old days, when communicating
> > tasks not sharing L2 may as well have been talking over two tin cans
> > and a limp string. These days, things are oodles better, but truly
> > synchronous stuff could still benefit from core affinity (up to hugely
> > for very fast/light stuff) if it weren't for all the caveats that can
> > lead to tossing concurrency opportunities out the window.
>
> Cool, thanks. For this test I suspect its the other way? I think the
> reason why regresses is that the 'nr_running < 2' check is too weak of
> a check to prevent sync in all bad situations ('bad' being pulling a
> task to a crowded CPU). Could we maybe be having a situation for this
> test where if the blocked load a CPU is high (many tasks recently were
> running on it and went to sleep), then the nr_running < 2 is a false
> positive and in such a scenario we listened to the sync flag when we
> shouldn't have?
nr_running has ~little to do with synchronous baton passing, or rather
with overlap being too small to overcome costs (L2 miss/cstate..). If
you run TCP_RR, pipe-test or ilk (tiny ball ping-pong), you should see
gorgeous numbers, and may find yourself thinking sync is a wonder
elixir for network-land, but 'taint necessarily so in the real world.
The only way to win the "Rob Peter to pay Paul" game is through the use
of knowledge, acquisition of which is expensive. Try resurrecting the
cheap avg_overlap, you'll love it.. for a little while.
> To make the load check more meaningful, I am thinking if using
> wake_affine()'s balance check is a better thing to do than the
> 'nr_running < 2' check I used in this patch. Then again, since commit
> 3fed382b46baac ("sched/numa: Implement NUMA node level wake_affine()",
> wake_affine() doesn't do balance check for CPUs within a socket so
> probably bringing back something like the *old* wake_affine that
> checked load between different CPUs within a socket is needed to avoid
> a potentially disastrous sync decision? The commit I refer to was
> added with the reason that select_idle_sibling was selecting cores
> anywhere within a socket, but with my patch we're more specifically
> selecting the waker's CPU on passing the sync flag. Could you share
> your thoughts about this?
Mucking about with wake_affine() in core affinity context would be a
step backward methinks. The sync hint has not meant CPU affine for
quite some time now.
If you can come up with a sync hint modification that is win/win,
super, but I suspect you'll find it a lot easier to create a new
'really really' flag or make your governor more responsive.
-Mike
On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
>
> To make the load check more meaningful, I am thinking if using
> wake_affine()'s balance check is a better thing to do than the
> 'nr_running < 2' check I used in this patch. Then again, since commit
> 3fed382b46baac ("sched/numa: Implement NUMA node level
> wake_affine()",
> wake_affine() doesn't do balance check for CPUs within a socket so
> probably bringing back something like the *old* wake_affine that
> checked load between different CPUs within a socket is needed to
> avoid
> a potentially disastrous sync decision?
This is because regardless of whether or not we did
an affine wakeup, the code called select_idle_sibling
within that socket, anyway.
In other words, the behavior for within-socket
wakeups was not substantially different with or
without an affine wakeup.
All that changed is which CPU select_idle_sibling
starts searching at, and that only if the woken
task's previous CPU is not idle.
> The commit I refer to was
> added with the reason that select_idle_sibling was selecting cores
> anywhere within a socket, but with my patch we're more specifically
> selecting the waker's CPU on passing the sync flag. Could you share
> your thoughts about this?
On systems with SMT, it may make more sense for
sync wakeups to look for idle threads of the same
core, than to have the woken task end up on the
same thread, and wait for the current task to stop
running.
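Something along the lines of the existing select_idle_smt() scan in
kernel/sched/fair.c, roughly (illustrative sketch, assuming fair.c context):

static int find_idle_smt_sibling(struct task_struct *p, int target)
{
	int cpu;

	/* look for an idle SMT sibling of @target that @p may run on */
	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (idle_cpu(cpu))
			return cpu;
	}

	return -1;	/* no idle sibling found */
}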
"Strong sync" wakeups like you propose would also
change the semantics of wake_wide() and potentially
other bits of code...
--
All rights reversed
On Thu, 2017-09-14 at 11:56 -0400, Rik van Riel wrote:
>
> On systems with SMT, it may make more sense for
> sync wakeups to look for idle threads of the same
> core, than to have the woken task end up on the
> same thread, and wait for the current task to stop
> running.
Depends.
homer:/root # taskset -c 3 pipe-test
1.412185 usecs/loop -- avg 1.412185 1416.2 KHz
homer:/root # taskset -c 2,3 pipe-test
2.298820 usecs/loop -- avg 2.298820 870.0 KHz
homer:/root # taskset -c 3,7 pipe-test
1.899164 usecs/loop -- avg 1.899164 1053.1 KHz
For pipe-test, having ~zero overlap as well as ~zero footprint, that's
a good choice, but..
homer:/root # taskset -c 3 tbench.sh 1 10 2>&1|grep Throughput
Throughput 844.04 MB/sec 1 clients 1 procs max_latency=0.042 ms
homer:/root # taskset -c 2,3 tbench.sh 1 10 2>&1|grep Throughput
Throughput 713.25 MB/sec 1 clients 1 procs max_latency=0.324 ms
homer:/root # taskset -c 3,7 tbench.sh 1 10 2>&1|grep Throughput
Throughput 512.866 MB/sec 1 clients 1 procs max_latency=0.454 ms
..for tbench, where my crusty ole Q6600 turns in a win by scheduling
the pair on separate L2 sharing cores, for the more modern SMT equipped
i4790, targeting shared L2 is the worst choice.
Bigger issue is that while microbenchmark behavior is consistent,
applications tend to process data and react to it (vs merely batting it
about like playful kittens, cute, but not all that productive), likely
mucking up any heuristic anyone invents with depressing regularity.
-Mike
Hi Rik,
On Thu, Sep 14, 2017 at 8:56 AM, Rik van Riel <[email protected]> wrote:
> On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
>>
>> To make the load check more meaningful, I am thinking if using
>> wake_affine()'s balance check is a better thing to do than the
>> 'nr_running < 2' check I used in this patch. Then again, since commit
>> 3fed382b46baac ("sched/numa: Implement NUMA node level
>> wake_affine()",
>> wake_affine() doesn't do balance check for CPUs within a socket so
>> probably bringing back something like the *old* wake_affine that
>> checked load between different CPUs within a socket is needed to
>> avoid
>> a potentially disastrous sync decision?
>
> This is because regardless of whether or not we did
> an affine wakeup, the code called select_idle_sibling
> within that socket, anyway.
>
> In other words, the behavior for within-socket
> wakeups was not substantially different with or
> without an affine wakeup.
>
> All that changed is which CPU select_idle_sibling
> starts searching at, and that only if the woken
> task's previous CPU is not idle.
Yes, I understand. However, with my 'strong sync' patch, such a
balancing check could be useful, which is what I was trying to do in a
different way in my patch - but it could be that my way is not good
enough and potentially the old wake_affine check could help here. I
thought of spending some time next week, after LPC travel, to
reintroduce the old wake_affine and monitor this signal with some
tracing for the regressing netperf use case.
>> The commit I refer to was
>> added with the reason that select_idle_sibling was selecting cores
>> anywhere within a socket, but with my patch we're more specifically
>> selecting the waker's CPU on passing the sync flag. Could you share
>> your thoughts about this?
>
> On systems with SMT, it may make more sense for
> sync wakeups to look for idle threads of the same
> core, than to have the woken task end up on the
> same thread, and wait for the current task to stop
> running.
I am OK with additionally doing a select_idle_smt for the SMT cases.
However, Mike shows that it doesn't necessarily cause a performance
improvement. But if there is consensus on checking for idle SMT
threads, then I'm OK with doing that.
>
> "Strong sync" wakeups like you propose would also
> change the semantics of wake_wide() and potentially
> other bits of code...
>
I understand. I am not very confident that wake_wide does the right
thing anyway. At least for Android, wake_wide doesn't seem to mirror
the most common use case of the display pipeline well. It seems that we
have cases where the 'flip count' is really high and causes wake_wide
all the time, sending us straight to the wake-up slow path and causing
regressions in Android benchmarks.
At least with the sync flag, the caller provides a meaningful
indication, and I think making that flag stronger / more preferred than
wake_wide makes sense from that perspective, since it's not a signal
that's guessed but rather an input request.
thanks,
-Joel
On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>
> Yes I understand. However with my 'strong sync' patch, such a
> balancing check could be useful which is what I was trying to do in a
> different way in my patch - but it could be that my way is not good
> enough and potentially the old wake_affine check could help here
Help how? The old wake_affine() check contained zero concurrency
information, it served to exclude excessive stacking, defeating the
purpose of SMP. A truly synchronous wakeup has absolutely nothing to
do with load balance in the general case: you can neither generate nor
cure an imbalance by replacing one (nice zero) task with another. The
mere existence of a load based braking mechanism speaks volumes.
> On systems with SMT, it may make more sense for
> > sync wakeups to look for idle threads of the same
> > core, than to have the woken task end up on the
> > same thread, and wait for the current task to stop
> > running.
>
> I am ok with additionally doing an select_idle_smt for the SMT cases.
> However Mike shows that it doesn't necessarily cause a performance
> improvement. But if there is consensus on checking for idle SMT
> threads, then I'm Ok with doing that.
select_idle_sibling() used to check thread first, that was changed to
core first for performance reasons.
> > "Strong sync" wakeups like you propose would also
> > change the semantics of wake_wide() and potentially
> > other bits of code...
> >
>
> I understand, I am not very confident that wake_wide does the right
> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
> the most common usecase of display pipeline well. It seems that we
> have cases where the 'flip count' is really high and causes wake_wide
> all the time and sends us straight to the wake up slow path causing
> regressions in Android benchmarks.
Hm. It didn't pull those counts out of the vacuum, it measured them.
It definitely does not force Android into the full balance path, that
is being done by Android developers, as SD_BALANCE_WAKE is off by
default. It was briefly on by default, but was quickly turned back off
because it... induced performance regressions.
In any case, if you have cause to believe that wake_wide() is causing
you grief, why the heck are you bending up the sync hint?
> Atleast with the sync flag, the caller provides a meaningful
> indication and I think making that flag stronger / more preferred than
> wake_wide makes sense from that perspective since its not a signal
> that's guessed, but is rather an input request.
Sort of, if you disregard the history I mentioned...
https://www.youtube.com/watch?v=Yho1Eydh1mM :)
Lacking solid concurrency information to base your decision on, you'll
end up robbing Peter to pay Paul forever, you absolutely will stack
non-synchronous tasks, inducing needless latency hits and injuring
scalability. We've been down that road. $subject was a very small
sample of what lies down this path <bang bang bang>.
-Mike
Hi Mike,
On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>>
>> Yes I understand. However with my 'strong sync' patch, such a
>> balancing check could be useful which is what I was trying to do in a
>> different way in my patch - but it could be that my way is not good
>> enough and potentially the old wake_affine check could help here
>
> Help how? The old wake_affine() check contained zero concurrency
> information, it served to exclude excessive stacking, defeating the
> purpose of SMP. A truly synchronous wakeup has absolutely nothing to
> do with load balance in the general case: you can neither generate nor
> cure an imbalance by replacing one (nice zero) task with another. The
> mere existence of a load based braking mechanism speaks volumes.
This is the part I didn't get.. you said "neither generate an
imbalance", but it is possible that a CPU with a high blocked load just
happens to be running 1 task at the time and does a sync wake-up
of another task. In this case, dragging the task onto the waker's CPU
might hurt it, since it will face more competition than if it were
woken up on its previous CPU, which is possibly more lightly loaded than
the waker's CPU?
Also, the other thing I didn't fully get is why concurrency is a
discussion point here: in this case A wakes up B and goes to sleep,
and then B wakes up A. They never run concurrently. Could you let me
know what I am missing?
>> On systems with SMT, it may make more sense for
>> > sync wakeups to look for idle threads of the same
>> > core, than to have the woken task end up on the
>> > same thread, and wait for the current task to stop
>> > running.
>>
>> I am ok with additionally doing an select_idle_smt for the SMT cases.
>> However Mike shows that it doesn't necessarily cause a performance
>> improvement. But if there is consensus on checking for idle SMT
>> threads, then I'm Ok with doing that.
>
> select_idle_sibling() used to check thread first, that was changed to
> core first for performance reasons.
Oh I see. Ok.
>> > "Strong sync" wakeups like you propose would also
>> > change the semantics of wake_wide() and potentially
>> > other bits of code...
>> >
>>
>> I understand, I am not very confident that wake_wide does the right
>> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
>> the most common usecase of display pipeline well. It seems that we
>> have cases where the 'flip count' is really high and causes wake_wide
>> all the time and sends us straight to the wake up slow path causing
>> regressions in Android benchmarks.
>
> Hm. It didn't pull those counts out of the vacuum, it measured them.
> It definitely does not force Android into the full balance path, that
> is being done by Android developers, as SD_BALANCE_WAKE is off by
> default. It was briefly on by default, but was quickly turned back off
> because it... induced performance regressions.
>
> In any case, if you have cause to believe that wake_wide() is causing
> you grief, why the heck are you bending up the sync hint?
So it's not just wake_wide causing the grief; even select_idle_sibling
doesn't seem to be doing the right thing. We really don't want to wake
up a task on an idle CPU if the current CPU is a better candidate.
Binder microbenchmarks show (as you can see in the results in
this patch) that it performs better to wake up on the current CPU (no
wake-up-from-idle latency on a sibling, etc.). Perhaps we should fix
select_idle_sibling to also consider the latency for a CPU to come out
of idle? Using the sync hint was/is a way on the product kernel to
prevent both these paths. On current products, sync is ignored if
the system is in an "overutilized" state (any of the CPUs are more
than 80% utilized), but otherwise sync is used as a hard flag. This
is all probably wrong though, considering the concurrency point you
brought up..
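For reference, the product-kernel policy I'm describing is roughly the
following (an illustrative sketch assuming fair.c context, not the actual
Android code; 80% of original capacity is the assumed threshold):

/*
 * Honor sync as a hard hint only while no CPU is "overutilized".
 */
static bool want_hard_sync(struct task_struct *p, int waker_cpu)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpu_util(cpu) * 100 > capacity_orig_of(cpu) * 80)
			return false;	/* system overutilized: ignore sync */
	}

	return cpumask_test_cpu(waker_cpu, &p->cpus_allowed);
}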
>> Atleast with the sync flag, the caller provides a meaningful
>> indication and I think making that flag stronger / more preferred than
>> wake_wide makes sense from that perspective since its not a signal
>> that's guessed, but is rather an input request.
>
> Sort of, if you disregard the history I mentioned...
>
> https://www.youtube.com/watch?v=Yho1Eydh1mM :)
No no, definitely not disregarding anything :) - just trying to make
sense of it all (and the history). Thanks so much for sharing it and
all the current/previous discussion.
>
> Lacking solid concurrency information to base your decision on, you'll
> end up robbing Peter to pay Paul forever, you absolutely will stack
> non-synchronous tasks, inducing needless latency hits and injuring
> scalability. We've been down that road. $subject was a very small
> sample of what lies down this path <bang bang bang>.
Ok. That is pretty scary. I am going to spend some time first going
through all the previous emails in this and other threads on the
balance flag usages, and try to work on a solution after that. I will
probably start to document all the use cases somewhere for my own
sanity :)
thanks,
- Joel
On Sun, 2017-09-17 at 14:41 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
> > On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
> >>
> >> Yes I understand. However with my 'strong sync' patch, such a
> >> balancing check could be useful which is what I was trying to do in a
> >> different way in my patch - but it could be that my way is not good
> >> enough and potentially the old wake_affine check could help here
> >
> > Help how? The old wake_affine() check contained zero concurrency
> > information, it served to exclude excessive stacking, defeating the
> > purpose of SMP. A truly synchronous wakeup has absolutely nothing to
> > do with load balance in the general case: you can neither generate nor
> > cure an imbalance by replacing one (nice zero) task with another. The
> > mere existence of a load based braking mechanism speaks volumes.
>
> This is the part I didn't get.. you said "neither generate an
> imbalance", it is possible that a CPU with a high blocked load but
> just happen to be running 1 task at the time and did a sync wake up
> for another task. In this case dragging the task onto the waker's CPU
> might hurt it since it will face more competition than if it were
> woken up on its previous CPU which is possibly lighty loaded than the
> waker's CPU?
Sure, a preexisting imbalance, but what does that have to do with
generic pass the baton behavior, and knowing whether pull will result
in a performance win or loss? In the overloaded and imbalanced case, a
lower load number on this/that CPU means little in and of itself wrt
latency expectations... the wakee may or may not be next in line, may
or may not be able to cut the line (preempt).
> Also the other thing I didn't fully get is why is concurrency a
> discussion point here, in this case A wakes up B and goes to sleep,
> and then B wakes up A. They never run concurrently. Could you let me
> know what I am missing?
If waker does not IMMEDIATELY sleep, and you wake CPU affine, you toss
the time interval between wakeup and context switch out the window, the
wakee could have already been executing elsewhere when the waker gets
around to getting the hell out of the way.
The point I'm belaboring here is that we had that load check and more,
and it proved insufficient. Yeah, pipe-test absolutely loves running
on one CPU, but who cares, pipe-test is an overhead measurement tool,
NOT a benchmark, those try to emulate the real world.
Hell, take two large footprint tasks, and let them exchange data
briefly via pipe.. what would make it a wonderful idea to pull these
tasks together each and every time they did that? Because pipe-test
can thus play super fast ping-pong with itself? Obviously not.
> >> On systems with SMT, it may make more sense for
> >> > sync wakeups to look for idle threads of the same
> >> > core, than to have the woken task end up on the
> >> > same thread, and wait for the current task to stop
> >> > running.
> >>
> >> I am ok with additionally doing an select_idle_smt for the SMT cases.
> >> However Mike shows that it doesn't necessarily cause a performance
> >> improvement. But if there is consensus on checking for idle SMT
> >> threads, then I'm Ok with doing that.
> >
> > select_idle_sibling() used to check thread first, that was changed to
> > core first for performance reasons.
>
> Oh I see. Ok.
>
> >> > "Strong sync" wakeups like you propose would also
> >> > change the semantics of wake_wide() and potentially
> >> > other bits of code...
> >> >
> >>
> >> I understand, I am not very confident that wake_wide does the right
> >> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
> >> the most common usecase of display pipeline well. It seems that we
> >> have cases where the 'flip count' is really high and causes wake_wide
> >> all the time and sends us straight to the wake up slow path causing
> >> regressions in Android benchmarks.
> >
> > Hm. It didn't pull those counts out of the vacuum, it measured them.
> > It definitely does not force Android into the full balance path, that
> > is being done by Android developers, as SD_BALANCE_WAKE is off by
> > default. It was briefly on by default, but was quickly turned back off
> > because it... induced performance regressions.
> >
> > In any case, if you have cause to believe that wake_wide() is causing
> > you grief, why the heck are you bending up the sync hint?
>
> So its not just wake_wide causing the grief, even select_idle_sibling
> doesn't seem to be doing the right thing. We really don't want to wake
> up a task on an idle CPU if the current CPU is a better candidate.
That is the nut, defining better candidate is not so trivial. Lacking
concrete knowledge (omniscience?), you're always gambling, betting on
concurrency potential existing is much safer than betting against it if
your prime directive is to maximize utilization.
> Binder microbenchmarks should that (as you can see in the results in
> this patch) it performs better to wake up on the current CPU (no wake
> up from idle latency on a sibling etc). Perhaps we should fix
> select_idle_sibling to also consider latency of CPU to come out of
> idle?
That would add complexity. If you're gonna do that, you may as well go
the extra mile, nuke it, and unify the full balance path instead.
> Using the sync hint was/is a way on the product kernel to
> prevent both these paths. On current products, sync is ignored though
> if the system is in an "overutilized" state (any of the CPUs are more
> than 80% utilized) but otherwise the sync is used as a hard flag. This
> is all probably wrong though - considering the concurrency point you
> brought up..
My experience with this space was that it's remarkably easy to wreck
performance, why I was inspired to bang on podium with shoe. I'm not
trying to discourage your explorations, merely passing along a well
meant "Danger Will Robinson".
-Mike
Hi Mike,
On Sun, Sep 17, 2017 at 10:30 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-09-17 at 14:41 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
>> > On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>> >>
>> >> Yes I understand. However with my 'strong sync' patch, such a
>> >> balancing check could be useful which is what I was trying to do in a
>> >> different way in my patch - but it could be that my way is not good
>> >> enough and potentially the old wake_affine check could help here
>> >
>> > Help how? The old wake_affine() check contained zero concurrency
>> > information, it served to exclude excessive stacking, defeating the
>> > purpose of SMP. A truly synchronous wakeup has absolutely nothing to
>> > do with load balance in the general case: you can neither generate nor
>> > cure an imbalance by replacing one (nice zero) task with another. The
>> > mere existence of a load based braking mechanism speaks volumes.
>>
>> This is the part I didn't get.. you said "neither generate an
>> imbalance", it is possible that a CPU with a high blocked load but
>> just happen to be running 1 task at the time and did a sync wake up
>> for another task. In this case dragging the task onto the waker's CPU
>> might hurt it since it will face more competition than if it were
>> woken up on its previous CPU which is possibly lighty loaded than the
>> waker's CPU?
>
> Sure, a preexisting imbalance, but what does that have to do with
> generic pass the baton behavior, and knowing whether pull will result
> in a performance win or loss? In the overloaded and imbalanced case, a
> lower load number on this/that CPU means little in and of itself wrt
> latency expectations... the wakee may or may not be next in line, may
> or may not be able to cut the line (preempt).
In my patch, I used nr_running == 1, so if the waker slept immediately,
then the wakee would be next in line. I guess this is the part I
misunderstood - I assumed for all cases that the sync waker sleeps
immediately, even though it only does so in the binder use cases. We have
control over the binder userspace bits that sleep immediately since
it's part of the framework - user apps don't decide whether to stay
awake or not; the framework will put them to sleep, since any progress
can happen only after waiting for the transaction to return. However,
this assumption may not be true for other users of the regular sync
flag, as you pointed out :-/.
Just for completeness' sake, I wanted to mention that we have a patch
similar to this on the Pixel product and it works really well for all the
use cases we care about:
https://android.googlesource.com/kernel/msm.git/+/android-msm-marlin-3.18-oreo-r6/kernel/sched/fair.c#5678
It will probably not work well if someone tried to run netperf or
other use cases where the waker doesn't sleep immediately, and it is
definitely not mainlinable considering the wreckage in $SUBJECT.
>
>> Also the other thing I didn't fully get is why is concurrency a
>> discussion point here, in this case A wakes up B and goes to sleep,
>> and then B wakes up A. They never run concurrently. Could you let me
>> know what I am missing?
>
> If waker does not IMMEDIATELY sleep, and you wake CPU affine, you toss
> the time interval between wakeup and context switch out the window, the
> wakee could have already been executing elsewhere when the waker gets
> around to getting the hell out of the way.
>
> The point I'm belaboring here is that we had that load check and more,
> and it proved insufficient. Yeah, pipe-test absolutely loves running
> on one CPU, but who cares, pipe-test is an overhead measurement tool,
> NOT a benchmark, those try to emulate the real world.
Ok.
> Hell, take two large footprint tasks, and let them exchange data
> briefly via pipe.. what would make it a wonderful idea to pull these
> tasks together each and every time they did that? Because pipe-test
> can thus play super fast ping-pong with itself? Obviously not.
Yes, I agree with you.
[..]
>>
>> >> > "Strong sync" wakeups like you propose would also
>> >> > change the semantics of wake_wide() and potentially
>> >> > other bits of code...
>> >> >
>> >>
>> >> I understand, I am not very confident that wake_wide does the right
>> >> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
>> >> the most common usecase of display pipeline well. It seems that we
>> >> have cases where the 'flip count' is really high and causes wake_wide
>> >> all the time and sends us straight to the wake up slow path causing
>> >> regressions in Android benchmarks.
>> >
>> > Hm. It didn't pull those counts out of the vacuum, it measured them.
>> > It definitely does not force Android into the full balance path, that
>> > is being done by Android developers, as SD_BALANCE_WAKE is off by
>> > default. It was briefly on by default, but was quickly turned back off
>> > because it... induced performance regressions.
>> >
>> > In any case, if you have cause to believe that wake_wide() is causing
>> > you grief, why the heck are you bending up the sync hint?
>>
>> So its not just wake_wide causing the grief, even select_idle_sibling
>> doesn't seem to be doing the right thing. We really don't want to wake
>> up a task on an idle CPU if the current CPU is a better candidate.
>
> That is the nut, defining better candidate is not so trivial. Lacking
> concrete knowledge (omniscience?), you're always gambling, betting on
> concurrency potential existing is much safer than betting against it if
> your prime directive is to maximize utilization.
Ok, so I think then reusing the sync flag for binder in mainline
doesn't make much sense, and we could define a new flag that indicates
the waker will sleep immediately and call it WF_HSYNC (as in hard
sync) - since in the binder case we have the knowledge that the waker is
going to sleep waiting for the transaction to finish.. I think the
mistake I made was trying to change the semantics of the original sync
flag, which cannot assume that the waker sleeps immediately.
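As a very rough sketch of what I mean (the flag name, value and placement
are purely illustrative, not a proposal for the final form):

/* hypothetical "hard sync" flag; value chosen only for illustration */
#define WF_HSYNC	0x10	/* waker guarantees it will sleep immediately */

	/* in select_task_rq_fair(), before the wake_wide()/want_affine logic: */
	if ((wake_flags & WF_HSYNC) && (sd_flag & SD_BALANCE_WAKE) &&
	    cpu_rq(cpu)->nr_running < 2 &&
	    !wake_cap(p, cpu, prev_cpu) &&
	    cpumask_test_cpu(cpu, &p->cpus_allowed))
		return cpu;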
>> Binder microbenchmarks should that (as you can see in the results in
>> this patch) it performs better to wake up on the current CPU (no wake
>> up from idle latency on a sibling etc). Perhaps we should fix
>> select_idle_sibling to also consider latency of CPU to come out of
>> idle?
>
> That would add complexity. If you're gonna do that, you may as well go
> the extra mile, nuke it, and unify the full balance path instead.
Ok, I agree it's not a good idea to make the fast path any slower just for this.
>> Using the sync hint was/is a way on the product kernel to
>> prevent both these paths. On current products, sync is ignored though
>> if the system is in an "overutilized" state (any of the CPUs are more
>> than 80% utilized) but otherwise the sync is used as a hard flag. This
>> is all probably wrong though - considering the concurrency point you
>> brought up..
>
> My experience with this space was that it's remarkably easy to wreck
> performance, why I was inspired to bang on podium with shoe. I'm not
> trying to discourage your explorations, merely passing along a well
> meant "Danger Will Robinson".
Thanks for the guidance, and sharing your knowledge/experience about this,
regards,
- Joel