Binder (Android's IPC mechanism) uses sync wakeups during synchronous
transactions to indicate to the scheduler that the waker is about to sleep
soon. The current wakeup path can be improved when the sync flag is passed,
resulting in higher binder performance. In this patch we more strongly wake up
the wakee on the waker's CPU if sync is passed, subject to a few other
conditions such as wake_cap and cpus_allowed. wake_wide is checked only after
the sync flag check so that it doesn't override sync. Binder throughput tests
see a good improvement when waking up the wakee (calling thread) on the CPU of
the waker (called thread) with this flag. Some test results are below:
On an 8-core ARM64 system, the following is data from running
hwbinderThroughputTest with a variable number of workers and services (the
workers are clients that call into the service threads and sleep until the
service replies to them).
2 workers calling into 4 services:
Without patch: iterations per sec: 62757
With patch: iterations per sec: 75236 (+19.88%)
4 workers calling into 2 services:
Without patch: iterations per sec: 82379
With patch: iterations per sec: 85829 (+4.18%)
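For context, the sync hint here is the WF_SYNC wake flag that the *_sync
wakeup helpers pass down to the scheduler. A minimal sketch of a waker using
it (illustrative only; binder's actual wakeup path is more involved and the
function below is made up for the example):

#include <linux/wait.h>
#include <linux/completion.h>

/*
 * Illustrative sketch: a waker that knows it is about to sleep uses the
 * _sync wakeup variant, which passes WF_SYNC down through try_to_wake_up()
 * to select_task_rq_fair().
 */
static void send_reply_and_wait(wait_queue_head_t *peer_wait,
				struct completion *reply_done)
{
	/* wake the peer with the sync hint ... */
	wake_up_interruptible_sync(peer_wait);

	/* ... and go to sleep right away, as the hint promises */
	wait_for_completion(reply_done);
}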
Cc: Peter Zijlstra <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
Rik, as we discussed on IRC, I am hoping that the lkp bot will also do its own
tests with this patch. I'm not sure if anything special needs to be in the
subject line to trigger the tests; if that's the case, let me know. Thanks!
kernel/sched/fair.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca6a57527f9..808571bc8ebe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6276,10 +6276,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE) {
+ if (sd_flag & SD_BALANCE_WAKE)
record_wakee(p);
- want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
- && cpumask_test_cpu(cpu, &p->cpus_allowed);
+
+ if ((sd_flag & SD_BALANCE_WAKE) && !wake_cap(p, cpu, prev_cpu)
+ && cpumask_test_cpu(cpu, &p->cpus_allowed)) {
+ /*
+	 * WF_SYNC indicates the waker will go to sleep soon. In case the waker
+	 * is the only one running, select the waker's CPU to run the wakee.
+ */
+ if (sync && cpu_rq(cpu)->nr_running < 2)
+ return cpu;
+
+ want_affine = !wake_wide(p);
}
rcu_read_lock();
--
2.14.1.342.g6490525c54-goog
On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
> Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
> transactions to the scheduler to indicate that the waker is about to sleep
> soon. The current wake up path can improved when the sync flag is passed
> resulting in higher binder performance. In this patch we more strongly wake up
> the wakee on the waker's CPU if sync is passed based on a few other conditions
> such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
> check so that it doesn't mess up sync. Binder throughput tests see good
> improvement improvement when waking up wakee (calling thread) on the waker's
> CPU (called thread) with this flag. Some tests results are below:
Sync is not a contract, it's a hint. If you really want sync behavior,
you need to create a contract signed in blood to signal that you really
really are passing the baton.
Sync wakeups make tons of sense when the waker really really has one
and only one wakee, AND really really is going to sleep immediately,
with zero overlap that can be converted to throughput by waking to an
idle core. With no L2 misses to slow them down, pass the baton
microbenchmarks that do no real work can generate impressive ping-pong
numbers... but the real world tends to do more than just bat a byte
around endlessly. The existing hint is not strong enough for your
needs; its current users may overlap with their wakee(s), change their
minds about sleeping (be handed more work to do), etc.
-Mike
On Sun, 2017-08-27 at 07:44 +0200, Mike Galbraith wrote:
> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
> > Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
> > transactions to the scheduler to indicate that the waker is about to sleep
> > soon. The current wake up path can improved when the sync flag is passed
> > resulting in higher binder performance. In this patch we more strongly wake up
> > the wakee on the waker's CPU if sync is passed based on a few other conditions
> > such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
> > check so that it doesn't mess up sync. Binder throughput tests see good
> > improvement improvement when waking up wakee (calling thread) on the waker's
> > CPU (called thread) with this flag. Some tests results are below:
>
> Sync is not a contract, it's a hint. If you really want sync behavior,
> you need to create a contract signed in blood to signal that you really
> really are passing the baton.
P.S. to get the most bang for your synchronous buck, you want a
preemptive wakeup.. but that butts heads with the fair engine.
-Mike
(Sorry my last reply was incorrectly formatted, resending..)
Hi Mike,
On Sat, Aug 26, 2017 at 10:44 PM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
>> Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
>> transactions to the scheduler to indicate that the waker is about to sleep
>> soon. The current wake up path can improved when the sync flag is passed
>> resulting in higher binder performance. In this patch we more strongly wake up
>> the wakee on the waker's CPU if sync is passed based on a few other conditions
>> such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
>> check so that it doesn't mess up sync. Binder throughput tests see good
>> improvement improvement when waking up wakee (calling thread) on the waker's
>> CPU (called thread) with this flag. Some tests results are below:
>
> Sync is not a contract, it's a hint. If you really want sync behavior,
> you need to create a contract signed in blood to signal that you really
> really are passing the baton.
Yes, that is the use case of binder; we are really passing the baton when we
pass sync. We also make binder not pass sync if there's more work to do
and more tasks to wake up. In all current and past products, we have been
using sync as a hard contract as you said. Are you proposing the addition of
another flag to differentiate between the existing hint and the contract?
I tried making sync be ignored if wake_wide = 1 as well, but it's not working
well for our use cases and hurts performance.
> Sync wakeups make tons of sense when the waker really really has one
> and only one wakee, AND really really is going to sleep immediately,
Yes, that is the case for binder. If we're going to be doing more work and
waking up others before going to sleep, we wouldn't pass sync.
Binder is actually an RPC mechanism, where the calling thread and called
thread are essentially a single entity but are split across process
boundaries. By using thread pools, we increase the likelihood that there's
a single thread available for each caller which will go back to sleep after
replying.
> with zero overlap that can be converted to throughput by waking to an
> idle core.
That's exactly why the microbenchmark speeds up too; from our observation,
waking up an idle core increases latency (and it probably also wastes
power).
thanks,
-Joel
Hi Mike,
On Sat, Aug 26, 2017 at 11:08 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-08-27 at 07:44 +0200, Mike Galbraith wrote:
>> On Sat, 2017-08-26 at 18:02 -0700, Joel Fernandes wrote:
>> > Binder (Android's IPC mechanism) which uses sync wake ups during synchronous
>> > transactions to the scheduler to indicate that the waker is about to sleep
>> > soon. The current wake up path can improved when the sync flag is passed
>> > resulting in higher binder performance. In this patch we more strongly wake up
>> > the wakee on the waker's CPU if sync is passed based on a few other conditions
>> > such as wake_cap, cpus allowed. wake_wide is checked only after the sync flag
>> > check so that it doesn't mess up sync. Binder throughput tests see good
>> > improvement improvement when waking up wakee (calling thread) on the waker's
>> > CPU (called thread) with this flag. Some tests results are below:
>>
>> Sync is not a contract, it's a hint. If you really want sync behavior,
>> you need to create a contract signed in blood to signal that you really
>> really are passing the baton.
>
> P.S. to get the most bang for your synchronous buck, you want a
> preemptive wakeup.. but that butts heads with the fair engine.
>
By preemptive wake up I guess you mean the waker would give up its
time slice and let the wakee use it? That's a cool idea but I agree it
would be against the fair task behavior.
Also, about real-world benchmarks: in Android we have use cases showing that
graphics performance suffers and we risk frame drops if we don't use the
sync flag, so this is a real-world need. Binder is the backbone of Android,
and for the benefit of these use cases, Android kernels make sync a contract
(written in blood, as you put it).
thanks,
-Joel
> -Mike
On Sat, 2017-08-26 at 23:18 -0700, Joel Fernandes wrote:
>
> > Sync is not a contract, it's a hint. If you really want sync behavior,
> > you need to create a contract signed in blood to signal that you really
> > really are passing the baton.
>
> Yes that is the usecase of binder, we are really passing the baton
> when we pass sync. We also make binder to not pass sync if there's
> more work todo and more tasks to wake up. In all current and past
> products, we have been using sync has a hard contract as you said.
> Are you proposing addition of another flag to differentiate between
> the existing hint and the contract?
Yes, the alternative being to unilaterally (as you did) make the
existing hint a contract, in which case you would get to deal with any
fallout. You'd certainly get some "yay, you rock" mail, but you'd also
get some "boo, you suck rocks" to go with it :)
-Mike
On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>
> > P.S. to get the most bang for your synchronous buck, you want a
> > preemptive wakeup.. but that butts heads with the fair engine.
> >
>
> By preemptive wake up I guess you mean the waker would give up its
> time slice and let the wakee use it? That's a cool idea but I agree it
> would be against the fair task behavior.
No, I meant a preemption, that being the cheapest switch. Any mucking
about with vruntime is a non-starter (NAK bait), making guaranteed
preemption a non-starter.
-Mike
On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>
> Also about real world benchmarks, in Android we have usecases that
> show that the graphics performance and we have risk of frame drops if
> we don't use the sync flag so this is a real world need.
That likely has everything to do with cpufreq not realizing that your
CPUs really are quite busy when scheduling cross core at fairly high
frequency, and not clocking up properly.
-Mike
Hi Mike,
On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>>
>> Also about real world benchmarks, in Android we have usecases that
>> show that the graphics performance and we have risk of frame drops if
>> we don't use the sync flag so this is a real world need.
>
> That likely has everything to do with cpufreq not realizing that your
> CPUs really are quite busy when scheduling cross core at fairly high
> frequency, and not clocking up properly.
>
I'm glad you brought this point up. Since Android O, the userspace
processes are much more split across procedure calls due to a feature
called Treble (which does this for security, modularity, etc.). Due to
this, a lot of things that were happening within a process boundary now
happen across process boundaries over the binder bus. Early on, folks
noticed that this caused performance issues without the sync flag being
used as a stronger hint. This can happen when 2 threads are in different
frequency domains on different CPUs and are communicating over binder;
the combined load of both threads is divided between the individual CPUs
and causes them to run at a lower frequency. Whereas if they are running
together on the same CPU, they would run at a higher frequency and perform
better, as their combined load would run at that higher frequency. So a
stronger sync actually helps this case if we're careful about using it when
possible.
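A toy numerical illustration of this effect (made-up numbers; assumes a
schedutil-style governor where next_freq is roughly 1.25 * max_freq * util / max):

#include <stdio.h>

/*
 * Toy illustration only: two communicating threads of ~30% utilization each
 * request a much lower frequency when split across two CPUs than when their
 * combined load sits on one CPU.
 */
int main(void)
{
	double max_freq = 2000.0;		/* MHz, illustrative */
	double util_a = 0.30, util_b = 0.30;	/* per-thread utilization */

	double split  = 1.25 * max_freq * util_a;		/* each CPU alone */
	double shared = 1.25 * max_freq * (util_a + util_b);	/* both on one CPU */

	printf("split: ~%.0f MHz per CPU, shared: ~%.0f MHz\n", split, shared);
	return 0;
}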
thanks,
-Joel
> -Mike
On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
> >>
> >> Also about real world benchmarks, in Android we have usecases that
> >> show that the graphics performance and we have risk of frame drops if
> >> we don't use the sync flag so this is a real world need.
> >
> > That likely has everything to do with cpufreq not realizing that your
> > CPUs really are quite busy when scheduling cross core at fairly high
> > frequency, and not clocking up properly.
> >
>
> I'm glad you brought this point up. Since Android O, the userspace
> processes are much more split across procedure calls due to a feature
> called treble (which does this for security, modularity etc). Due to
> this, a lot of things that were happening within a process boundary
> happen now across process boundaries over the binder bus. Early on
> folks noticed that this caused performance issues without sync flag
> being used as a more strong hint. This can happen when there are 2
> threads are in different frequency domains on different CPUs and are
> communicating over binder, due to this the combined load of both
> threads is divided between the individual CPUs and causes them to run
> at lower frequency. Where as if they are running together on the same
> CPUs, then they would run at a higher frequency and perform better as
> their combined load would run at a higher frequency. So a stronger
> sync actually helps this case if we're careful about using it when
> possible.
Sure, but isn't that really a cpufreq issue? We schedule cross core
quite aggressively for obvious reasons. Now on mostly idle handheld
devices, you may get better battery life by stacking tasks a bit more,
in which case a sync-me-harder flag may be what you really want/need,
but with modern CPUs, I'm kinda skeptical of that, would have to see
cold hard numbers to become a believer. Iff deeper cstate etc for
longer does make a big difference, I can imagine wakeup-time migration
leftward if capacity exists as an "on battery" tactic. (though that
thought also invokes some unpleasant bounce fest images)
-Mike
On Mon, 2017-08-28 at 08:10 +0200, Mike Galbraith wrote:
> Iff deeper cstate etc for
> longer does make a big difference, I can imagine wakeup time migrate
> leftward if capacity exists as an "on battery" tactic. (though that
> thought also invokes some unpleasant bounce fest images)
(consolidate left would have to be LB global to avoid fight with self)
Hi Mike,
On Sun, Aug 27, 2017 at 11:10 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
>> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
>> >>
>> >> Also about real world benchmarks, in Android we have usecases that
>> >> show that the graphics performance and we have risk of frame drops if
>> >> we don't use the sync flag so this is a real world need.
>> >
>> > That likely has everything to do with cpufreq not realizing that your
>> > CPUs really are quite busy when scheduling cross core at fairly high
>> > frequency, and not clocking up properly.
>> >
>>
>> I'm glad you brought this point up. Since Android O, the userspace
>> processes are much more split across procedure calls due to a feature
>> called treble (which does this for security, modularity etc). Due to
>> this, a lot of things that were happening within a process boundary
>> happen now across process boundaries over the binder bus. Early on
>> folks noticed that this caused performance issues without sync flag
>> being used as a more strong hint. This can happen when there are 2
>> threads are in different frequency domains on different CPUs and are
>> communicating over binder, due to this the combined load of both
>> threads is divided between the individual CPUs and causes them to run
>> at lower frequency. Where as if they are running together on the same
>> CPUs, then they would run at a higher frequency and perform better as
>> their combined load would run at a higher frequency. So a stronger
>> sync actually helps this case if we're careful about using it when
>> possible.
>
> Sure, but isn't that really a cpufreq issue? We schedule cross core
IMO it's an issue with the scheduler not being aware of the
relationship between groups of tasks doing work as a pipeline. Sync
seems to me one way to communicate that.
> quite aggressively for obvious reasons. Now on mostly idle handheld
> devices, you may get better battery life by stacking tasks a bit more,
> in which case a sync-me-harder flag may be what you really want/need,
> but with modern CPUs, I'm kinda skeptical of that, would have to see
> cold hard numbers to become a believer. Iff deeper cstate etc for
If you can suggest any tests, I could run them on my Intel machine. By
the way, CPUs on handhelds are pretty modern these days ;-)
> longer does make a big difference, I can imagine wakeup time migrate
> leftward if capacity exists as an "on battery" tactic. (though that
> thought also invokes some unpleasant bounce fest images)
I'm assuming you mean that tasks are packed together more closely and
scheduled on more energy-efficient CPUs at wake-up time. This is exactly
what we do in the energy-aware scheduling (EAS) that ARM has been working
on and that is integrated into the Android kernels, though it's done
whether on battery or plugged in. Also, instead of migrating leftward
or in a fixed direction, it checks what the change in energy would be,
based on an "energy model" and the current utilization, before deciding
whether to wake up on a different CPU. It's also not perfect and is
an approximation, but overall it seems to provide good energy savings.
thanks,
-Joel
>
> -Mike
On Mon, 2017-08-28 at 09:20 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Aug 27, 2017 at 11:10 PM, Mike Galbraith <[email protected]> wrote:
> > On Sun, 2017-08-27 at 22:27 -0700, Joel Fernandes wrote:
> >> Hi Mike,
> >>
> >> On Sun, Aug 27, 2017 at 11:07 AM, Mike Galbraith <[email protected]> wrote:
> >> > On Sat, 2017-08-26 at 23:39 -0700, Joel Fernandes wrote:
> >> >>
> >> >> Also about real world benchmarks, in Android we have usecases that
> >> >> show that the graphics performance and we have risk of frame drops if
> >> >> we don't use the sync flag so this is a real world need.
> >> >
> >> > That likely has everything to do with cpufreq not realizing that your
> >> > CPUs really are quite busy when scheduling cross core at fairly high
> >> > frequency, and not clocking up properly.
> >> >
> >>
> >> I'm glad you brought this point up. Since Android O, the userspace
> >> processes are much more split across procedure calls due to a feature
> >> called treble (which does this for security, modularity etc). Due to
> >> this, a lot of things that were happening within a process boundary
> >> happen now across process boundaries over the binder bus. Early on
> >> folks noticed that this caused performance issues without sync flag
> >> being used as a more strong hint. This can happen when there are 2
> >> threads are in different frequency domains on different CPUs and are
> >> communicating over binder, due to this the combined load of both
> >> threads is divided between the individual CPUs and causes them to run
> >> at lower frequency. Where as if they are running together on the same
> >> CPUs, then they would run at a higher frequency and perform better as
> >> their combined load would run at a higher frequency. So a stronger
> >> sync actually helps this case if we're careful about using it when
> >> possible.
> >
> > Sure, but isn't that really a cpufreq issue? We schedule cross core
>
> IMO its an issue with the scheduler not being aware of the
> relationship between groups of tasks doing work as a pipeline. Sync
> seems to me one way to communicate that.
It's certainly simple, which is requirement #1.
> > quite aggressively for obvious reasons. Now on mostly idle handheld
> > devices, you may get better battery life by stacking tasks a bit more,
> > in which case a sync-me-harder flag may be what you really want/need,
> > but with modern CPUs, I'm kinda skeptical of that, would have to see
> > cold hard numbers to become a believer. Iff deeper cstate etc for
>
> If you can suggest any tests, I could run them on my Intel machine? By
> the way CPUs on handhelds are pretty modern these days ;-)
Nah, that would be wasting your time doing what I could do myself.
> > longer does make a big difference, I can imagine wakeup time migrate
> > leftward if capacity exists as an "on battery" tactic. (though that
> > thought also invokes some unpleasant bounce fest images)
>
> I'm assuming you mean that tasks are packed together more closely and
> scheduled on energy aware CPUs at wake up time. This is exactly what
> we do in the Energy aware scheduling (EAS) that ARM has been working
> on and is integrated into the Android kernels but its being done
> whether on battery or plugged in. Also instead of migrating leftward
> or a fixed diretion, it checks for what would the changes in energy be
> based on an "energy model" and the current utilization before deciding
> on whether to wake up on a different CPU. Its also not perfect and is
> an approximation but overall seems to provide good energy savings.
OK, so you're already doing some packing.
Personally, I still think that there's a bit of a disconnect between
cpufreq and scheduler, but that aside, sync was invented for what
you're trying to do. If you can improve it without wreckage, cool,
generic improvement rocks, so good luck.
-Mike
Greeting,
FYI, we noticed a -11.3% regression of netperf.Throughput_tps due to commit:
commit: 6d46bd3d9705555382b83554b56a34f231d5d1dd ("sched/fair: Improve the behavior of sync flag")
url: https://github.com/0day-ci/linux/commits/Joel-Fernandes/sched-fair-Improve-the-behavior-of-sync-flag/20170829-191129
in testcase: netperf
on test machine: 8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz with 16G memory
with following parameters:
ip: ipv4
runtime: 300s
nr_threads: 25%
cluster: cs-localhost
test: TCP_CRR
cpufreq_governor: performance
test-description: Netperf is a benchmark that can be used to measure various aspects of networking performance.
test-url: http://www.netperf.org/netperf/
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
testcase/path_params/tbox_group/run: netperf/ipv4-300s-25%-cs-localhost-TCP_CRR-performance/lkp-avoton3
9c8783201cb58e9a 6d46bd3d9705555382b83554b5
---------------- --------------------------
6829 -11% 6054 netperf.Throughput_tps
68768 ± 7% 10484% 7278249 netperf.time.involuntary_context_switches
165 -28% 119 netperf.time.percent_of_cpu_this_job_got
497 -29% 354 netperf.time.system_time
4106701 -100% 17497 ± 3% netperf.time.voluntary_context_switches
8921 8696 vmstat.system.in
62311 -21% 49214 vmstat.system.cs
11.61 3% 12.00 boot-time.dhcp
12.69 3% 13.08 boot-time.kernel_boot
193 3% 198 boot-time.idle
27.06 27.81 boot-time.boot
11950 399% 59639 perf-stat.cpu-migrations
0.22 10% 0.25 perf-stat.ipc
1.62 1.58 perf-stat.iTLB-load-miss-rate%
4.49 -9% 4.08 perf-stat.cpi
1.19 ± 5% -14% 1.03 perf-stat.cache-miss-rate%
8.983e+10 -14% 7.732e+10 perf-stat.branch-instructions
4.855e+11 -14% 4.166e+11 perf-stat.iTLB-loads
4.857e+11 -15% 4.145e+11 perf-stat.instructions
7.988e+09 -17% 6.669e+09 perf-stat.iTLB-load-misses
7.348e+10 -20% 5.849e+10 perf-stat.cache-references
19124461 -21% 15098659 perf-stat.context-switches
2.181e+12 -23% 1.69e+12 perf-stat.cpu-cycles
12.15 -25% 9.15 perf-stat.branch-miss-rate%
8.72e+08 ± 5% -31% 6.002e+08 perf-stat.cache-misses
1.091e+10 -35% 7.072e+09 perf-stat.branch-misses
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Xiaolong
Hi,
On Sun, Sep 10, 2017 at 6:40 AM, kernel test robot
<[email protected]> wrote:
>
> Greeting,
>
> FYI, we noticed a -11.3% regression of netperf.Throughput_tps due to commit:
>
>
> commit: 6d46bd3d9705555382b83554b56a34f231d5d1dd ("sched/fair: Improve the behavior of sync flag")
> url: https://github.com/0day-ci/linux/commits/Joel-Fernandes/sched-fair-Improve-the-behavior-of-sync-flag/20170829-191129
>
Glad to see the bot tested this RFC/RFT patch. Anyone know what in the
netperf test triggers use of the sync flag?
thanks,
-Joel
[..]
On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
>
> Anyone know what in the netperf test triggers use of the sync flag?
homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
The same as metric tons of other stuff.
Once upon a time, we had avg_overlap to help decide whether to wake
core affine or not, on top of the wake_affine() imbalance constraint,
but instrumentation showed it to be too error prone, so it had to die.
These days, an affine wakeup generally means cache affine, and the
sync hint gives you a wee bit more chance of migration near to tasty
hot data being approved.
The sync hint was born back in the bad old days, when communicating
tasks not sharing L2 may as well have been talking over two tin cans
and a limp string. These days, things are oodles better, but truly
synchronous stuff could still benefit from core affinity (up to hugely
for very fast/light stuff) if it weren't for all the caveats that can
lead to tossing concurrency opportunities out the window.
-Mike
Hi Mike,
Thanks a lot for sharing the history of this.
On Sun, Sep 10, 2017 at 7:55 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
>>
>> Anyone know what in the netperf test triggers use of the sync flag?
>
> homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
> net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
> net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
>
> The same as metric tons of other stuff.
>
> Once upon a time, we had avg_overlap to help decide whether to wake
> core affine or not, on top of the wake_affine() imbalance constraint,
> but instrumentation showed it to be too error prone, so it had to die.
> These days, an affine wakeup generally means cache affine, and the
> sync hint gives you a wee bit more chance of migration near to tasty
> hot data being approved.
>
> The sync hint was born back in the bad old days, when communicating
> tasks not sharing L2 may as well have been talking over two tin cans
> and a limp string. These days, things are oodles better, but truly
> synchronous stuff could still benefit from core affinity (up to hugely
> for very fast/light stuff) if it weren't for all the caveats that can
> lead to tossing concurrency opportunities out the window.
Cool, thanks. For this test I suspect it's the other way? I think the
reason why it regresses is that the 'nr_running < 2' check is too weak a
check to prevent sync in all bad situations ('bad' being pulling a
task to a crowded CPU). Could we be having a situation in this
test where, if the blocked load of a CPU is high (many tasks recently were
running on it and went to sleep), the 'nr_running < 2' check is a false
positive, and in such a scenario we listened to the sync flag when we
shouldn't have?
To make the load check more meaningful, I am wondering if using
wake_affine()'s balance check is a better thing to do than the
'nr_running < 2' check I used in this patch. Then again, since commit
3fed382b46baac ("sched/numa: Implement NUMA node level wake_affine()"),
wake_affine() doesn't do a balance check for CPUs within a socket, so
probably bringing back something like the *old* wake_affine() that
checked load between different CPUs within a socket is needed to avoid
a potentially disastrous sync decision? The commit I refer to was
added because select_idle_sibling was selecting cores
anywhere within a socket, but with my patch we're more specifically
selecting the waker's CPU on passing the sync flag. Could you share
your thoughts about this?
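Concretely, what I have in mind is roughly the check below (a simplified
sketch from memory, not the exact old wake_affine() code; the load values
would come from the usual weighted CPU load helpers):

#include <linux/types.h>

/*
 * Simplified sketch of an old-style wake_affine() imbalance check: only
 * allow the affine (sync) pull if the waker's CPU, with the wakee's load
 * added, would not be much busier than the wakee's previous CPU with that
 * load removed.
 */
static bool affine_pull_ok(unsigned long this_load, unsigned long prev_load,
			   unsigned long task_load, unsigned int imbalance_pct)
{
	unsigned long this_eff = this_load + task_load;
	unsigned long prev_eff = prev_load > task_load ? prev_load - task_load : 0;

	return this_eff * 100 <= prev_eff * imbalance_pct;
}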
I will run some tracing on this netperf test and try to understand the
undesirable behavior better as well,
thanks,
-Joel
On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
> Hi Mike,
> Thanks a lot for sharing the history of this.
>
> On Sun, Sep 10, 2017 at 7:55 PM, Mike Galbraith <[email protected]> wrote:
> > On Sun, 2017-09-10 at 09:53 -0700, Joel Fernandes wrote:
> >>
> >> Anyone know what in the netperf test triggers use of the sync flag?
> >
> > homer:..kernel/linux-master # git grep wake_up_interruptible_sync_poll net
> > net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> > net/core/sock.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> > net/sctp/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> > net/smc/smc_rx.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
> > net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
> > net/tipc/socket.c: wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
> > net/unix/af_unix.c: wake_up_interruptible_sync_poll(&wq->wait,
> > net/unix/af_unix.c: wake_up_interruptible_sync_poll(&u->peer_wait,
> >
> > The same as metric tons of other stuff.
> >
> > Once upon a time, we had avg_overlap to help decide whether to wake
> > core affine or not, on top of the wake_affine() imbalance constraint,
> > but instrumentation showed it to be too error prone, so it had to die.
> > These days, an affine wakeup generally means cache affine, and the
> > sync hint gives you a wee bit more chance of migration near to tasty
> > hot data being approved.
> >
> > The sync hint was born back in the bad old days, when communicating
> > tasks not sharing L2 may as well have been talking over two tin cans
> > and a limp string. These days, things are oodles better, but truly
> > synchronous stuff could still benefit from core affinity (up to hugely
> > for very fast/light stuff) if it weren't for all the caveats that can
> > lead to tossing concurrency opportunities out the window.
>
> Cool, thanks. For this test I suspect its the other way? I think the
> reason why regresses is that the 'nr_running < 2' check is too weak of
> a check to prevent sync in all bad situations ('bad' being pulling a
> task to a crowded CPU). Could we maybe be having a situation for this
> test where if the blocked load a CPU is high (many tasks recently were
> running on it and went to sleep), then the nr_running < 2 is a false
> positive and in such a scenario we listened to the sync flag when we
> shouldn't have?
nr_running has ~little to do with synchronous baton passing, or rather
with overlap being too small to overcome costs (L2 miss/cstate..). If
you run TCP_RR, pipe-test or ilk (tiny ball ping-pong), you should see
gorgeous numbers, and may find yourself thinking sync is a wonder
elixir for network-land, but 'taint necessarily so in the real world.
The only way to win the "Rob Peter to pay Paul" game is through the use
of knowledge, acquisition of which is expensive. Try resurrecting the
cheap avg_overlap, you'll love it.. for a little while.
> To make the load check more meaningful, I am thinking if using
> wake_affine()'s balance check is a better thing to do than the
> 'nr_running < 2' check I used in this patch. Then again, since commit
> 3fed382b46baac ("sched/numa: Implement NUMA node level wake_affine()",
> wake_affine() doesn't do balance check for CPUs within a socket so
> probably bringing back something like the *old* wake_affine that
> checked load between different CPUs within a socket is needed to avoid
> a potentially disastrous sync decision? The commit I refer to was
> added with the reason that select_idle_sibling was selecting cores
> anywhere within a socket, but with my patch we're more specifically
> selecting the waker's CPU on passing the sync flag. Could you share
> your thoughts about this?
Mucking about with wake_affine() in core affinity context would be a
step backward methinks. The sync hint has not meant CPU affine for
quite some time now.
If you can come up with a sync hint modification that is win/win,
super, but I suspect you'll find it a lot easier to create a new
'really really' flag or make your governor more responsive.
-Mike
On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
>
> To make the load check more meaningful, I am thinking if using
> wake_affine()'s balance check is a better thing to do than the
> 'nr_running < 2' check I used in this patch. Then again, since commit
> 3fed382b46baac ("sched/numa: Implement NUMA node level
> wake_affine()",
> wake_affine() doesn't do balance check for CPUs within a socket so
> probably bringing back something like the *old* wake_affine that
> checked load between different CPUs within a socket is needed to
> avoid
> a potentially disastrous sync decision?
This is because regardless of whether or not we did
an affine wakeup, the code called select_idle_sibling
within that socket, anyway.
In other words, the behavior for within-socket
wakeups was not substantially different with or
without an affine wakeup.
All that changed is which CPU select_idle_sibling
starts searching at, and that only if the woken
task's previous CPU is not idle.
> The commit I refer to was
> added with the reason that select_idle_sibling was selecting cores
> anywhere within a socket, but with my patch we're more specifically
> selecting the waker's CPU on passing the sync flag. Could you share
> your thoughts about this?
On systems with SMT, it may make more sense for
sync wakeups to look for idle threads of the same
core, than to have the woken task end up on the
same thread, and wait for the current task to stop
running.
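Something along the lines of the existing select_idle_smt() scan in
kernel/sched/fair.c, roughly (illustrative sketch, assuming fair.c context):

static int find_idle_smt_sibling(struct task_struct *p, int target)
{
	int cpu;

	/* look for an idle SMT sibling of @target that @p may run on */
	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (idle_cpu(cpu))
			return cpu;
	}

	return -1;	/* no idle sibling found */
}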
"Strong sync" wakeups like you propose would also
change the semantics of wake_wide() and potentially
other bits of code...
--
All rights reversed
On Thu, 2017-09-14 at 11:56 -0400, Rik van Riel wrote:
>
> On systems with SMT, it may make more sense for
> sync wakeups to look for idle threads of the same
> core, than to have the woken task end up on the
> same thread, and wait for the current task to stop
> running.
Depends.
homer:/root # taskset -c 3 pipe-test
1.412185 usecs/loop -- avg 1.412185 1416.2 KHz
homer:/root # taskset -c 2,3 pipe-test
2.298820 usecs/loop -- avg 2.298820 870.0 KHz
homer:/root # taskset -c 3,7 pipe-test
1.899164 usecs/loop -- avg 1.899164 1053.1 KHz
For pipe-test, having ~zero overlap as well as ~zero footprint, that's
a good choice, but..
homer:/root # taskset -c 3 tbench.sh 1 10 2>&1|grep Throughput
Throughput 844.04 MB/sec 1 clients 1 procs max_latency=0.042 ms
homer:/root # taskset -c 2,3 tbench.sh 1 10 2>&1|grep Throughput
Throughput 713.25 MB/sec 1 clients 1 procs max_latency=0.324 ms
homer:/root # taskset -c 3,7 tbench.sh 1 10 2>&1|grep Throughput
Throughput 512.866 MB/sec 1 clients 1 procs max_latency=0.454 ms
..for tbench, where my crusty ole Q6600 turns in a win by scheduling
the pair on separate L2 sharing cores, for the more modern SMT equipped
i4790, targeting shared L2 is the worst choice.
Bigger issue is that while microbenchmark behavior is consistent,
applications tend to process data and react to it (vs merely batting it
about like playful kittens, cute, but not all that productive), likely
mucking up any heuristic anyone invents with depressing regularity.
-Mike
Hi Rik,
On Thu, Sep 14, 2017 at 8:56 AM, Rik van Riel <[email protected]> wrote:
> On Sun, 2017-09-10 at 23:32 -0700, Joel Fernandes wrote:
>>
>> To make the load check more meaningful, I am thinking if using
>> wake_affine()'s balance check is a better thing to do than the
>> 'nr_running < 2' check I used in this patch. Then again, since commit
>> 3fed382b46baac ("sched/numa: Implement NUMA node level
>> wake_affine()",
>> wake_affine() doesn't do balance check for CPUs within a socket so
>> probably bringing back something like the *old* wake_affine that
>> checked load between different CPUs within a socket is needed to
>> avoid
>> a potentially disastrous sync decision?
>
> This is because regardless of whether or not we did
> an affine wakeup, the code called select_idle_sibling
> within that socket, anyway.
>
> In other words, the behavior for within-socket
> wakeups was not substantially different with or
> without an affine wakeup.
>
> All that changed is which CPU select_idle_sibling
> starts searching at, and that only if the woken
> task's previous CPU is not idle.
Yes, I understand. However, with my 'strong sync' patch, such a
balancing check could be useful, which is what I was trying to do in a
different way in my patch - but it could be that my way is not good
enough and potentially the old wake_affine check could help here. I
thought of spending some time next week, after LPC travel, to
reintroduce the old wake_affine and monitor this signal with some
tracing for the regressing netperf use case.
>> The commit I refer to was
>> added with the reason that select_idle_sibling was selecting cores
>> anywhere within a socket, but with my patch we're more specifically
>> selecting the waker's CPU on passing the sync flag. Could you share
>> your thoughts about this?
>
> On systems with SMT, it may make more sense for
> sync wakeups to look for idle threads of the same
> core, than to have the woken task end up on the
> same thread, and wait for the current task to stop
> running.
I am OK with additionally doing a select_idle_smt for the SMT cases.
However, Mike shows that it doesn't necessarily cause a performance
improvement. But if there is consensus on checking for idle SMT
threads, then I'm OK with doing that.
>
> "Strong sync" wakeups like you propose would also
> change the semantics of wake_wide() and potentially
> other bits of code...
>
I understand. I am not very confident that wake_wide does the right
thing anyway. At least for Android, wake_wide doesn't seem to mirror
the most common use case of the display pipeline well. It seems that we
have cases where the 'flip count' is really high and causes wake_wide
all the time, sending us straight to the wake-up slow path and causing
regressions in Android benchmarks.
At least with the sync flag, the caller provides a meaningful
indication, and I think making that flag stronger / more preferred than
wake_wide makes sense from that perspective, since it's not a signal
that's guessed but rather an input request.
thanks,
-Joel
On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>
> Yes I understand. However with my 'strong sync' patch, such a
> balancing check could be useful which is what I was trying to do in a
> different way in my patch - but it could be that my way is not good
> enough and potentially the old wake_affine check could help here
Help how? The old wake_affine() check contained zero concurrency
information, it served to exclude excessive stacking, defeating the
purpose of SMP. A truly synchronous wakeup has absolutely nothing to
do with load balance in the general case: you can neither generate nor
cure an imbalance by replacing one (nice zero) task with another. The
mere existence of a load based braking mechanism speaks volumes.
> On systems with SMT, it may make more sense for
> > sync wakeups to look for idle threads of the same
> > core, than to have the woken task end up on the
> > same thread, and wait for the current task to stop
> > running.
>
> I am ok with additionally doing an select_idle_smt for the SMT cases.
> However Mike shows that it doesn't necessarily cause a performance
> improvement. But if there is consensus on checking for idle SMT
> threads, then I'm Ok with doing that.
select_idle_sibling() used to check thread first, that was changed to
core first for performance reasons.
> > "Strong sync" wakeups like you propose would also
> > change the semantics of wake_wide() and potentially
> > other bits of code...
> >
>
> I understand, I am not very confident that wake_wide does the right
> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
> the most common usecase of display pipeline well. It seems that we
> have cases where the 'flip count' is really high and causes wake_wide
> all the time and sends us straight to the wake up slow path causing
> regressions in Android benchmarks.
Hm. It didn't pull those counts out of the vacuum, it measured them.
It definitely does not force Android into the full balance path, that
is being done by Android developers, as SD_BALANCE_WAKE is off by
default. It was briefly on by default, but was quickly turned back off
because it... induced performance regressions.
In any case, if you have cause to believe that wake_wide() is causing
you grief, why the heck are you bending up the sync hint?
> Atleast with the sync flag, the caller provides a meaningful
> indication and I think making that flag stronger / more preferred than
> wake_wide makes sense from that perspective since its not a signal
> that's guessed, but is rather an input request.
Sort of, if you disregard the history I mentioned...
https://www.youtube.com/watch?v=Yho1Eydh1mM :)
Lacking solid concurrency information to base your decision on, you'll
end up robbing Peter to pay Paul forever, you absolutely will stack
non-synchronous tasks, inducing needless latency hits and injuring
scalability. We've been down that road. $subject was a very small
sample of what lies down this path <bang bang bang>.
-Mike
Hi Mike,
On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>>
>> Yes I understand. However with my 'strong sync' patch, such a
>> balancing check could be useful which is what I was trying to do in a
>> different way in my patch - but it could be that my way is not good
>> enough and potentially the old wake_affine check could help here
>
> Help how? The old wake_affine() check contained zero concurrency
> information, it served to exclude excessive stacking, defeating the
> purpose of SMP. A truly synchronous wakeup has absolutely nothing to
> do with load balance in the general case: you can neither generate nor
> cure an imbalance by replacing one (nice zero) task with another. The
> mere existence of a load based braking mechanism speaks volumes.
This is the part I didn't get.. you said "neither generate an
imbalance", but it is possible that a CPU with a high blocked load just
happens to be running 1 task at the time and does a sync wake-up
of another task. In this case, dragging the task onto the waker's CPU
might hurt it, since it will face more competition than if it were
woken up on its previous CPU, which is possibly more lightly loaded than
the waker's CPU?
Also, the other thing I didn't fully get is why concurrency is a
discussion point here: in this case A wakes up B and goes to sleep,
and then B wakes up A. They never run concurrently. Could you let me
know what I am missing?
>> On systems with SMT, it may make more sense for
>> > sync wakeups to look for idle threads of the same
>> > core, than to have the woken task end up on the
>> > same thread, and wait for the current task to stop
>> > running.
>>
>> I am ok with additionally doing an select_idle_smt for the SMT cases.
>> However Mike shows that it doesn't necessarily cause a performance
>> improvement. But if there is consensus on checking for idle SMT
>> threads, then I'm Ok with doing that.
>
> select_idle_sibling() used to check thread first, that was changed to
> core first for performance reasons.
Oh I see. Ok.
>> > "Strong sync" wakeups like you propose would also
>> > change the semantics of wake_wide() and potentially
>> > other bits of code...
>> >
>>
>> I understand, I am not very confident that wake_wide does the right
>> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
>> the most common usecase of display pipeline well. It seems that we
>> have cases where the 'flip count' is really high and causes wake_wide
>> all the time and sends us straight to the wake up slow path causing
>> regressions in Android benchmarks.
>
> Hm. It didn't pull those counts out of the vacuum, it measured them.
> It definitely does not force Android into the full balance path, that
> is being done by Android developers, as SD_BALANCE_WAKE is off by
> default. It was briefly on by default, but was quickly turned back off
> because it... induced performance regressions.
>
> In any case, if you have cause to believe that wake_wide() is causing
> you grief, why the heck are you bending up the sync hint?
So it's not just wake_wide causing the grief; even select_idle_sibling
doesn't seem to be doing the right thing. We really don't want to wake
up a task on an idle CPU if the current CPU is a better candidate.
Binder microbenchmarks show (as you can see in the results in
this patch) that it performs better to wake up on the current CPU (no
wake-up-from-idle latency on a sibling, etc.). Perhaps we should fix
select_idle_sibling to also consider the latency for a CPU to come out
of idle? Using the sync hint was/is a way on the product kernel to
prevent both these paths. On current products, sync is ignored if
the system is in an "overutilized" state (any of the CPUs are more
than 80% utilized), but otherwise sync is used as a hard flag. This
is all probably wrong though, considering the concurrency point you
brought up..
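For reference, the product-kernel policy I'm describing is roughly the
following (an illustrative sketch assuming fair.c context, not the actual
Android code; 80% of original capacity is the assumed threshold):

/*
 * Honor sync as a hard hint only while no CPU is "overutilized".
 */
static bool want_hard_sync(struct task_struct *p, int waker_cpu)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpu_util(cpu) * 100 > capacity_orig_of(cpu) * 80)
			return false;	/* system overutilized: ignore sync */
	}

	return cpumask_test_cpu(waker_cpu, &p->cpus_allowed);
}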
>> Atleast with the sync flag, the caller provides a meaningful
>> indication and I think making that flag stronger / more preferred than
>> wake_wide makes sense from that perspective since its not a signal
>> that's guessed, but is rather an input request.
>
> Sort of, if you disregard the history I mentioned...
>
> https://www.youtube.com/watch?v=Yho1Eydh1mM :)
No no, definitely not disregarding anything :) - just trying to make
sense of it all (and the history). Thanks so much for sharing it and
all the current/previous discussion.
>
> Lacking solid concurrency information to base your decision on, you'll
> end up robbing Peter to pay Paul forever, you absolutely will stack
> non-synchronous tasks, inducing needless latency hits and injuring
> scalability. We've been down that road. $subject was a very small
> sample of what lies down this path <bang bang bang>.
Ok. That is pretty scary. I am going to spend some time first going
through all the previous emails in this and other threads on the
balance flag usages, and try to work on a solution after that. I will
probably start to document all the use cases somewhere for my own
sanity :)
thanks,
- Joel
On Sun, 2017-09-17 at 14:41 -0700, Joel Fernandes wrote:
> Hi Mike,
>
> On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
> > On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
> >>
> >> Yes I understand. However with my 'strong sync' patch, such a
> >> balancing check could be useful which is what I was trying to do in a
> >> different way in my patch - but it could be that my way is not good
> >> enough and potentially the old wake_affine check could help here
> >
> > Help how? The old wake_affine() check contained zero concurrency
> > information, it served to exclude excessive stacking, defeating the
> > purpose of SMP. A truly synchronous wakeup has absolutely nothing to
> > do with load balance in the general case: you can neither generate nor
> > cure an imbalance by replacing one (nice zero) task with another. The
> > mere existence of a load based braking mechanism speaks volumes.
>
> This is the part I didn't get.. you said "neither generate an
> imbalance", it is possible that a CPU with a high blocked load but
> just happen to be running 1 task at the time and did a sync wake up
> for another task. In this case dragging the task onto the waker's CPU
> might hurt it since it will face more competition than if it were
> woken up on its previous CPU which is possibly lighty loaded than the
> waker's CPU?
Sure, a preexisting imbalance, but what does that have to do with
generic pass the baton behavior, and knowing whether pull will result
in a performance win or loss? In the overloaded and imbalanced case, a
lower load number on this/that CPU means little in and of itself wrt
latency expectations... the wakee may or may not be next in line, may
or may not be able to cut the line (preempt).
> Also the other thing I didn't fully get is why is concurrency a
> discussion point here, in this case A wakes up B and goes to sleep,
> and then B wakes up A. They never run concurrently. Could you let me
> know what I am missing?
If waker does not IMMEDIATELY sleep, and you wake CPU affine, you toss
the time interval between wakeup and context switch out the window, the
wakee could have already been executing elsewhere when the waker gets
around to getting the hell out of the way.
The point I'm belaboring here is that we had that load check and more,
and it proved insufficient. Yeah, pipe-test absolutely loves running
on one CPU, but who cares, pipe-test is an overhead measurement tool,
NOT a benchmark, those try to emulate the real world.
Hell, take two large footprint tasks, and let them exchange data
briefly via pipe.. what would make it a wonderful idea to pull these
tasks together each and every time they did that? Because pipe-test
can thus play super fast ping-pong with itself? Obviously not.
> >> On systems with SMT, it may make more sense for
> >> > sync wakeups to look for idle threads of the same
> >> > core, than to have the woken task end up on the
> >> > same thread, and wait for the current task to stop
> >> > running.
> >>
> >> I am ok with additionally doing an select_idle_smt for the SMT cases.
> >> However Mike shows that it doesn't necessarily cause a performance
> >> improvement. But if there is consensus on checking for idle SMT
> >> threads, then I'm Ok with doing that.
> >
> > select_idle_sibling() used to check thread first, that was changed to
> > core first for performance reasons.
>
> Oh I see. Ok.
>
> >> > "Strong sync" wakeups like you propose would also
> >> > change the semantics of wake_wide() and potentially
> >> > other bits of code...
> >> >
> >>
> >> I understand, I am not very confident that wake_wide does the right
> >> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
> >> the most common usecase of display pipeline well. It seems that we
> >> have cases where the 'flip count' is really high and causes wake_wide
> >> all the time and sends us straight to the wake up slow path causing
> >> regressions in Android benchmarks.
> >
> > Hm. It didn't pull those counts out of the vacuum, it measured them.
> > It definitely does not force Android into the full balance path, that
> > is being done by Android developers, as SD_BALANCE_WAKE is off by
> > default. It was briefly on by default, but was quickly turned back off
> > because it... induced performance regressions.
> >
> > In any case, if you have cause to believe that wake_wide() is causing
> > you grief, why the heck are you bending up the sync hint?
>
> So its not just wake_wide causing the grief, even select_idle_sibling
> doesn't seem to be doing the right thing. We really don't want to wake
> up a task on an idle CPU if the current CPU is a better candidate.
That is the nut, defining better candidate is not so trivial. Lacking
concrete knowledge (omniscience?), you're always gambling, betting on
concurrency potential existing is much safer than betting against it if
your prime directive is to maximize utilization.
> Binder microbenchmarks should that (as you can see in the results in
> this patch) it performs better to wake up on the current CPU (no wake
> up from idle latency on a sibling etc). Perhaps we should fix
> select_idle_sibling to also consider latency of CPU to come out of
> idle?
That would add complexity. If you're gonna do that, you may as well go
the extra mile, nuke it, and unify the full balance path instead.
> Using the sync hint was/is a way on the product kernel to
> prevent both these paths. On current products, sync is ignored though
> if the system is in an "overutilized" state (any of the CPUs are more
> than 80% utilized) but otherwise the sync is used as a hard flag. This
> is all probably wrong though - considering the concurrency point you
> brought up..
My experience with this space was that it's remarkably easy to wreck
performance, why I was inspired to bang on podium with shoe. I'm not
trying to discourage your explorations, merely passing along a well
meant "Danger Will Robinson".
-Mike
Hi Mike,
On Sun, Sep 17, 2017 at 10:30 PM, Mike Galbraith <[email protected]> wrote:
> On Sun, 2017-09-17 at 14:41 -0700, Joel Fernandes wrote:
>> Hi Mike,
>>
>> On Sun, Sep 17, 2017 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
>> > On Sat, 2017-09-16 at 23:42 -0700, Joel Fernandes wrote:
>> >>
>> >> Yes I understand. However with my 'strong sync' patch, such a
>> >> balancing check could be useful which is what I was trying to do in a
>> >> different way in my patch - but it could be that my way is not good
>> >> enough and potentially the old wake_affine check could help here
>> >
>> > Help how? The old wake_affine() check contained zero concurrency
>> > information, it served to exclude excessive stacking, defeating the
>> > purpose of SMP. A truly synchronous wakeup has absolutely nothing to
>> > do with load balance in the general case: you can neither generate nor
>> > cure an imbalance by replacing one (nice zero) task with another. The
>> > mere existence of a load based braking mechanism speaks volumes.
>>
>> This is the part I didn't get.. you said "neither generate an
>> imbalance", it is possible that a CPU with a high blocked load but
>> just happen to be running 1 task at the time and did a sync wake up
>> for another task. In this case dragging the task onto the waker's CPU
>> might hurt it since it will face more competition than if it were
>> woken up on its previous CPU which is possibly lighty loaded than the
>> waker's CPU?
>
> Sure, a preexisting imbalance, but what does that have to do with
> generic pass the baton behavior, and knowing whether pull will result
> in a performance win or loss? In the overloaded and imbalanced case, a
> lower load number on this/that CPU means little in and of itself wrt
> latency expectations... the wakee may or may not be next in line, may
> or may not be able to cut the line (preempt).
In my patch, I used nr_running == 1, so if the waker slept immediately,
then the wakee would be next in line. I guess this is the part I
misunderstood - I assumed for all cases that the sync waker sleeps
immediately, even though it only does so in the binder use cases. We have
control over the binder userspace bits that sleep immediately since
it's part of the framework - user apps don't decide whether to stay
awake or not; the framework will put them to sleep, since any progress
can happen only after waiting for the transaction to return. However,
this assumption may not be true for other users of the regular sync
flag, as you pointed out :-/.
Just for completeness' sake, I wanted to mention that we have a patch
similar to this on the Pixel product and it works really well for all the
use cases we care about:
https://android.googlesource.com/kernel/msm.git/+/android-msm-marlin-3.18-oreo-r6/kernel/sched/fair.c#5678
It will probably not work well if someone tried to run netperf or
other use cases where the waker doesn't sleep immediately, and it is
definitely not mainlinable considering the wreckage in $SUBJECT.
>
>> Also the other thing I didn't fully get is why is concurrency a
>> discussion point here, in this case A wakes up B and goes to sleep,
>> and then B wakes up A. They never run concurrently. Could you let me
>> know what I am missing?
>
> If waker does not IMMEDIATELY sleep, and you wake CPU affine, you toss
> the time interval between wakeup and context switch out the window, the
> wakee could have already been executing elsewhere when the waker gets
> around to getting the hell out of the way.
>
> The point I'm belaboring here is that we had that load check and more,
> and it proved insufficient. Yeah, pipe-test absolutely loves running
> on one CPU, but who cares, pipe-test is an overhead measurement tool,
> NOT a benchmark, those try to emulate the real world.
Ok.
> Hell, take two large footprint tasks, and let them exchange data
> briefly via pipe.. what would make it a wonderful idea to pull these
> tasks together each and every time they did that? Because pipe-test
> can thus play super fast ping-pong with itself? Obviously not.
Yes, I agree with you.
[..]
>>
>> >> > "Strong sync" wakeups like you propose would also
>> >> > change the semantics of wake_wide() and potentially
>> >> > other bits of code...
>> >> >
>> >>
>> >> I understand, I am not very confident that wake_wide does the right
>> >> thing anyway. Atleast for Android, wake_wide doesn't seem to mirror
>> >> the most common usecase of display pipeline well. It seems that we
>> >> have cases where the 'flip count' is really high and causes wake_wide
>> >> all the time and sends us straight to the wake up slow path causing
>> >> regressions in Android benchmarks.
>> >
>> > Hm. It didn't pull those counts out of the vacuum, it measured them.
>> > It definitely does not force Android into the full balance path, that
>> > is being done by Android developers, as SD_BALANCE_WAKE is off by
>> > default. It was briefly on by default, but was quickly turned back off
>> > because it... induced performance regressions.
>> >
>> > In any case, if you have cause to believe that wake_wide() is causing
>> > you grief, why the heck are you bending up the sync hint?
>>
>> So its not just wake_wide causing the grief, even select_idle_sibling
>> doesn't seem to be doing the right thing. We really don't want to wake
>> up a task on an idle CPU if the current CPU is a better candidate.
>
> That is the nut, defining better candidate is not so trivial. Lacking
> concrete knowledge (omniscience?), you're always gambling, betting on
> concurrency potential existing is much safer than betting against it if
> your prime directive is to maximize utilization.
Ok, so I think then reusing the sync flag for binder in mainline
doesn't make much sense, and we could define a new flag that indicates
the waker will sleep immediately and call it WF_HSYNC (as in hard
sync) - since in the binder case we have the knowledge that the waker is
going to sleep waiting for the transaction to finish.. I think the
mistake I made was trying to change the semantics of the original sync
flag, which cannot assume that the waker sleeps immediately.
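As a very rough sketch of what I mean (the flag name, value and placement
are purely illustrative, not a proposal for the final form):

/* hypothetical "hard sync" flag; value chosen only for illustration */
#define WF_HSYNC	0x10	/* waker guarantees it will sleep immediately */

	/* in select_task_rq_fair(), before the wake_wide()/want_affine logic: */
	if ((wake_flags & WF_HSYNC) && (sd_flag & SD_BALANCE_WAKE) &&
	    cpu_rq(cpu)->nr_running < 2 &&
	    !wake_cap(p, cpu, prev_cpu) &&
	    cpumask_test_cpu(cpu, &p->cpus_allowed))
		return cpu;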
>> Binder microbenchmarks should that (as you can see in the results in
>> this patch) it performs better to wake up on the current CPU (no wake
>> up from idle latency on a sibling etc). Perhaps we should fix
>> select_idle_sibling to also consider latency of CPU to come out of
>> idle?
>
> That would add complexity. If you're gonna do that, you may as well go
> the extra mile, nuke it, and unify the full balance path instead.
Ok, I agree it's not a good idea to make the fast path any slower just for this.
>> Using the sync hint was/is a way on the product kernel to
>> prevent both these paths. On current products, sync is ignored though
>> if the system is in an "overutilized" state (any of the CPUs are more
>> than 80% utilized) but otherwise the sync is used as a hard flag. This
>> is all probably wrong though - considering the concurrency point you
>> brought up..
>
> My experience with this space was that it's remarkably easy to wreck
> performance, why I was inspired to bang on podium with shoe. I'm not
> trying to discourage your explorations, merely passing along a well
> meant "Danger Will Robinson".
Thanks for the guidance, and sharing your knowledge/experience about this,
regards,
- Joel