2019-06-27 01:38:04

by Subhra Mazumdar

Subject: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

Hi,

Resending this patchset; it would be good to get some feedback. Any
suggestions that would make it more acceptable are welcome. We have been
shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.

Currently, select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse with more cores in the future.

This patchset addresses the scalability problem by:
- Setting an upper and lower limit on the idle cpu search in select_idle_cpu
to keep the search time low and constant
- Adding a new sched feature, SIS_CORE, to disable select_idle_core

Additionally, it introduces a new per-cpu variable, next_cpu, to track the
boundary of the search so that each search starts where the previous one
ended. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found under high load.
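
For illustration, here is a condensed sketch of the resulting
select_idle_cpu (names follow the patches, but the SIS_PROP proportional
scaling and other details are elided, so read it as a sketch of the idea
rather than the literal code):

/*
 * Sketch only: bounded, rotating idle-cpu scan. The scan budget is
 * derived from the SMT sibling count, and the scan resumes from the
 * per-cpu next_cpu boundary recorded by the previous attempt.
 */
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
			   int target)
{
	int start = per_cpu(next_cpu, target);	/* resume point */
	int floor, limit, cpu, nr = 0;

	floor = topology_sibling_weight(target);	/* SMT siblings */
	if (floor < 2)
		floor = 2;				/* lower limit */
	limit = floor << 1;				/* upper limit */

	for_each_cpu_wrap(cpu, sched_domain_span(sd), start) {
		if (++nr > limit) {
			/* budget spent: remember where we stopped */
			per_cpu(next_cpu, target) = cpu;
			return -1;
		}
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu)) {
			per_cpu(next_cpu, target) = cpu;
			return cpu;
		}
	}
	return -1;	/* scanned the whole LLC span without luck */
}

With SIS_CORE false, select_idle_sibling skips select_idle_core entirely
and falls through to this bounded scan.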

Following are the performance numbers for various benchmarks with SIS_CORE
true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5816    8.94    0.5903 (-1.5%)   11.28
2       0.6428    10.64   0.5843 (9.1%)    4.93
4       1.0152    1.99    0.9965 (1.84%)   1.83
8       1.8128    1.4     1.7921 (1.14%)   1.76
16      3.1666    0.8     3.1345 (1.01%)   0.81
32      5.6084    0.83    5.5677 (0.73%)   0.8

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads  baseline  %stdev  patch             %stdev
8        2095.45   1.82    2102.6 (0.34%)    2.11
16       4218.45   0.06    4221.35 (0.07%)   0.38
32       7531.36   0.49    7607.18 (1.01%)   0.25
48       10206.42  0.21    10324.26 (1.15%)  0.13
64       12053.73  0.1     12158.3 (0.87%)   0.24
128      14810.33  0.04    14840.4 (0.2%)    0.38

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch           %stdev
20     1         0.9     1.0068 (0.68%)  0.27
40     1         0.8     1.0103 (1.03%)  1.24
60     1         0.34    1.0178 (1.78%)  0.49
80     1         0.53    1.0092 (0.92%)  1.5
100    1         0.79    1.0090 (0.9%)   0.88
120    1         0.06    1.0048 (0.48%)  0.72
140    1         0.22    1.0116 (1.16%)  0.05
160    1         0.57    1.0264 (2.64%)  0.67
180    1         0.81    1.0194 (1.94%)  0.91
200    1         0.44    1.028 (2.8%)    3.09
220    1         1.74    1.0229 (2.29%)  0.21

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch           %stdev
8        45.36     0.43    46.28 (2.01%)   0.29
16       87.81     0.82    89.67 (2.12%)   0.38
32       151.19    0.02    153.5 (1.53%)   0.41
48       190.2     0.21    194.79 (2.41%)  0.07
64       190.42    0.35    202.9 (6.55%)   1.66
128      323.86    0.28    343.56 (6.08%)  1.34

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        629.8     603.83 (-4.12%)
2        1159.65   1155.75 (-0.34%)
4        2121.61   2093.99 (-1.3%)
8        2620.52   2641.51 (0.8%)
16       2879.31   2897.6 (0.64%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        256.41    255.8 (-0.24%)
2        509.89    504.52 (-1.05%)
4        999.44    1003.74 (0.43%)
8        1982.7    1976.42 (-0.32%)
16       3891.51   3916.04 (0.63%)
32       6819.24   6845.06 (0.38%)
64       8542.95   8568.28 (0.3%)
128      15277.6   15754.6 (3.12%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    92 (2.13%)         2.17
75          124       2.13    122 (1.61%)        1.42
90          152       1.74    151 (0.66%)        0.66
95          171       2.11    170 (0.58%)        0
99          512.67    104.96  208.33 (59.36%)    1.2
99.5        2296      82.55   3674.66 (-60.05%)  22.19
99.9        12517.33  2.38    12784 (-2.13%)     0.66

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       1.3085    6.65    1.2213 (6.66%)   10.32
2       1.4559    8.55    1.5048 (-3.36%)  4.72
4       2.6271    1.74    2.5532 (2.81%)   2.02
8       4.7089    3.01    4.5118 (4.19%)   2.74
16      8.7406    2.25    8.6801 (0.69%)   4.78
32      17.7835   1.01    16.759 (5.76%)   1.38
64      36.1901   0.65    34.6652 (4.21%)  1.24
128     72.6585   0.51    70.9762 (2.32%)  0.9

Following are the performance numbers for various benchmarks with SIS_CORE
false (idle core search disabled). This improves the throughput of certain
workloads but increases the latency of others.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5816    8.94    0.5835 (-0.33%)  8.21
2       0.6428    10.64   0.5752 (10.52%)  4.05
4       1.0152    1.99    0.9946 (2.03%)   2.56
8       1.8128    1.4     1.7619 (2.81%)   1.88
16      3.1666    0.8     3.1275 (1.23%)   0.42
32      5.6084    0.83    5.5856 (0.41%)   0.89

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads  baseline  %stdev  patch              %stdev
8        2095.45   1.82    2084.72 (-0.51%)   1.65
16       4218.45   0.06    4179.69 (-0.92%)   0.18
32       7531.36   0.49    7623.18 (1.22%)    0.39
48       10206.42  0.21    10159.16 (-0.46%)  0.21
64       12053.73  0.1     12087.21 (0.28%)   0.19
128      14810.33  0.04    14894.08 (0.57%)   0.08

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch            %stdev
20     1         0.9     1.0056 (0.56%)   0.34
40     1         0.8     1.0173 (1.73%)   0.13
60     1         0.34    0.9995 (-0.05%)  0.85
80     1         0.53    1.0175 (1.75%)   1.56
100    1         0.79    1.0151 (1.51%)   1.31
120    1         0.06    1.0244 (2.44%)   0.5
140    1         0.22    1.034 (3.4%)     0.66
160    1         0.57    1.0362 (3.62%)   0.07
180    1         0.81    1.041 (4.1%)     0.8
200    1         0.44    1.0233 (2.33%)   1.4
220    1         1.74    1.0125 (1.25%)   1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch            %stdev
8        45.36     0.43    46.94 (3.48%)    0.2
16       87.81     0.82    91.75 (4.49%)    0.43
32       151.19    0.02    167.74 (10.95%)  1.29
48       190.2     0.21    200.57 (5.45%)   0.89
64       190.42    0.35    226.74 (19.07%)  1.79
128      323.86    0.28    348.12 (7.49%)   0.77

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        629.8     600.19 (-4.7%)
2        1159.65   1162.07 (0.21%)
4        2121.61   2112.27 (-0.44%)
8        2620.52   2645.55 (0.96%)
16       2879.31   2828.87 (-1.75%)
32       2791.24   2760.97 (-1.08%)
64       1853.07   1747.66 (-5.69%)
128      1484.95   1459.81 (-1.69%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients  baseline  patch
1        256.41    258.11 (0.67%)
2        509.89    509.13 (-0.15%)
4        999.44    1016.58 (1.72%)
8        1982.7    2006.53 (1.2%)
16       3891.51   3964.43 (1.87%)
32       6819.24   7376.92 (8.18%)
64       8542.95   9660.45 (13.08%)
128      15277.6   15438.4 (1.05%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile  baseline  %stdev  patch                %stdev
50          94        2.82    94.67 (-0.71%)       2.2
75          124       2.13    124.67 (-0.54%)      1.67
90          152       1.74    154.33 (-1.54%)      0.75
95          171       2.11    176.67 (-3.31%)      0.86
99          512.67    104.96  4130.33 (-705.65%)   79.41
99.5        2296      82.55   10066.67 (-338.44%)  26.15
99.9        12517.33  2.38    12869.33 (-2.81%)    0.8

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       1.3085    6.65    1.2514 (4.36%)   11.1
2       1.4559    8.55    1.5433 (-6%)     3.05
4       2.6271    1.74    2.5626 (2.5%)    2.69
8       4.7089    3.01    4.5316 (3.77%)   2.95
16      8.7406    2.25    8.6585 (0.94%)   2.91
32      17.7835   1.01    17.175 (3.42%)   1.38
64      36.1901   0.65    35.5294 (1.83%)  1.02
128     72.6585   0.51    71.8821 (1.07%)  1.05

Following are the schbench performance numbers with SIS_CORE false and
SIS_PROP false. This recovers the latency increase caused by disabling
SIS_CORE.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    93.33 (0.71%)      1.24
75          124       2.13    122.67 (1.08%)     1.7
90          152       1.74    149.33 (1.75%)     2.35
95          171       2.11    167 (2.34%)        2.74
99          512.67    104.96  206 (59.82%)       8.86
99.5        2296      82.55   3121.67 (-35.96%)  97.37
99.9        12517.33  2.38    12592 (-0.6%)      1.67

Changes from v2->v3:
- Use a shift operation instead of multiplication to compute the limit
- Use a per-CPU variable to precompute the number of SMT siblings for x86

subhra mazumdar (7):
sched: limit cpu search in select_idle_cpu
sched: introduce per-cpu var next_cpu to track search limit
sched: rotate the cpu search window for better spread
sched: add sched feature to disable idle core search
sched: SIS_CORE to disable idle core search
x86/smpboot: introduce per-cpu variable for HT siblings
sched: use per-cpu variable cpumask_weight_sibling

arch/x86/include/asm/smp.h | 1 +
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/smpboot.c | 17 ++++++++++++++++-
include/linux/topology.h | 4 ++++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 31 +++++++++++++++++++++++--------
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
8 files changed, 49 insertions(+), 9 deletions(-)

--
2.9.3


2019-06-27 01:38:05

by Subhra Mazumdar

Subject: [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling

Use the per-cpu variable cpumask_weight_sibling for a quick lookup in
select_idle_cpu. This is the scheduler fast path, where every cycle is
worth saving; cpumask_weight recomputes the sibling count with a bitmap
walk on every call.

Signed-off-by: subhra mazumdar <[email protected]>
---
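For context, topology_sibling_weight comes from patches 6-7/7 of this
series; roughly like the following (a reconstructed sketch, illustrative
rather than verbatim from the patch):

/* Cached SMT sibling count, updated on CPU hotplug, so the fast path
 * avoids the bitmap walk done by cpumask_weight(). */
DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);

#define topology_sibling_weight(cpu) \
	(per_cpu(cpumask_weight_sibling, cpu))

/* called whenever the sibling masks are (re)built in smpboot: */
static void set_cpu_sibling_weight(int cpu)
{
	per_cpu(cpumask_weight_sibling, cpu) =
		cpumask_weight(topology_sibling_cpumask(cpu));
}
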
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
- floor = cpumask_weight(topology_sibling_cpumask(target));
+ floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
limit = floor << 1;
--
2.9.3

2019-07-01 09:03:30

by Peter Zijlstra

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> Hi,
>
> Resending this patchset; it would be good to get some feedback. Any
> suggestions that would make it more acceptable are welcome. We have been
> shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.
>
> Currently, select_idle_sibling first tries to find a fully idle core using
> select_idle_core, which can potentially search all cores; if that fails, it
> looks for any idle cpu using select_idle_cpu, which can potentially search
> all cpus in the LLC domain. This doesn't scale for large LLC domains and
> will only get worse with more cores in the future.
>
> This patchset addresses the scalability problem by:
> - Setting an upper and lower limit on the idle cpu search in select_idle_cpu
> to keep the search time low and constant
> - Adding a new sched feature, SIS_CORE, to disable select_idle_core
>
> Additionally, it introduces a new per-cpu variable, next_cpu, to track the
> boundary of the search so that each search starts where the previous one
> ended. This rotating search window over the cpus in the LLC domain ensures
> that idle cpus are eventually found under high load.

Right, so we had a wee conversation about this patch series at OSPM, and
I don't see any of that reflected here :-(

Specifically, given that some people _really_ want the whole L3 mask
scanned to reduce tail latency over raw throughput, while you guys
prefer the other way around, it was proposed to extend the task model.

Specifically something like a latency-nice was mentioned (IIRC) where a
task can give a bias but not specify specific behaviour. This is very
important since we don't want to be ABI tied to specific behaviour.

Some of the things we could tie to this would be:

- select_idle_siblings; -nice would scan more than +nice,

- wakeup preemption; when the wakee has a relative smaller
latency-nice value than the current running task, it might preempt
sooner and the other way around of course.

- pack-vs-spread; +nice would pack more with like tasks (since we
already spread by default [0] I don't think -nice would affect much
here).


Hmmm?

2019-07-01 15:21:30

by Patrick Bellasi

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On 01-Jul 11:02, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > Hi,
> >
> > Resending this patchset; it would be good to get some feedback. Any
> > suggestions that would make it more acceptable are welcome. We have been
> > shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.
> >
> > Currently, select_idle_sibling first tries to find a fully idle core using
> > select_idle_core, which can potentially search all cores; if that fails, it
> > looks for any idle cpu using select_idle_cpu, which can potentially search
> > all cpus in the LLC domain. This doesn't scale for large LLC domains and
> > will only get worse with more cores in the future.
> >
> > This patchset addresses the scalability problem by:
> > - Setting an upper and lower limit on the idle cpu search in select_idle_cpu
> > to keep the search time low and constant
> > - Adding a new sched feature, SIS_CORE, to disable select_idle_core
> >
> > Additionally, it introduces a new per-cpu variable, next_cpu, to track the
> > boundary of the search so that each search starts where the previous one
> > ended. This rotating search window over the cpus in the LLC domain ensures
> > that idle cpus are eventually found under high load.
>
> Right, so we had a wee conversation about this patch series at OSPM, and
> I don't see any of that reflected here :-(
>
> Specifically, given that some people _really_ want the whole L3 mask
> scanned to reduce tail latency over raw throughput, while you guys
> prefer the other way around, it was proposed to extend the task model.
>
> Specifically something like a latency-nice was mentioned (IIRC) where a

Right, AFAIR PaulT suggested to add support for the concept of a task
being "latency tolerant": meaning we can spend more time to search for
a CPU and/or avoid preempting the current task.

> task can give a bias but not specify specific behaviour. This is very
> important since we don't want to be ABI tied to specific behaviour.

I like the idea of biasing, especially considering we are still in the
domain of the FAIR scheduler. If something more mandatory is required,
there are other classes which are likely more appropriate.

> Some of the things we could tie to this would be:
>
> - select_idle_siblings; -nice would scan more than +nice,

Just to be sure, you are not proposing to use the nice value we
already have, i.e.
p->{static,normal}_prio
but instead a new similar concept, right?

Otherwise, the pro would be that we don't touch userspace, but as a con
we would have side effects, e.g. on bandwidth allocation.
And I think we don't want to mix "specific behaviors" with "biases".

> - wakeup preemption; when the wakee has a relative smaller
> latency-nice value than the current running task, it might preempt
> sooner and the other way around of course.

I think we currently have a single system-wide parameter for that:

sched_wakeup_granularity_ns ==> sysctl_sched_wakeup_granularity

which is used in:

wakeup_gran() for the wakeup path
check_preempt_tick() for the periodic tick

That's where it should be possible to extend the heuristics with some
biasing based on the latency-nice attribute of a task, right?
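
For instance (purely illustrative, and latency_nice_of() is a
hypothetical accessor), the bias could scale the granularity:

static unsigned long wakeup_gran(struct sched_entity *se)
{
	long gran = sysctl_sched_wakeup_granularity;

	/*
	 * Hypothetical: scale by the wakee's latency-nice in [-20, 19];
	 * a -nice (latency sensitive) wakee gets a smaller granularity
	 * and thus preempts sooner, a +nice one later.
	 */
	gran += gran * latency_nice_of(se) / 20;
	if (gran < 0)
		gran = 0;

	return calc_delta_fair(gran, se);
}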

> - pack-vs-spread; +nice would pack more with like tasks (since we
> already spread by default [0] I don't think -nice would affect much
> here).

That will be very useful for the Android case too.
In Android we used to call it "prefer_idle"; that's probably not the
best name, but it is conceptually similar.

In Android we would use a latency-nice concept to go for either the
fast (select_idle_siblings) or the slow (energy aware) path.

> Hmmm?

Just one more requirement I think is worth considering from the
beginning: CGroups support

That would be a very welcome interface, simply because it is so much more
convenient (and safe) to set these biases on a group of tasks depending
on their role in the system.

Do you have any idea how we can expose such a "latency-nice" property
via CGroups? It's very similar to cpu.shares but it does not represent
a resource which can be partitioned.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-07-01 15:25:16

by Peter Zijlstra

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:

> > Some of the things we could tie to this would be:
> >
> > - select_idle_siblings; -nice would scan more than +nice,
>
> Just to be sure, you are not proposing to use the nice value we
> already have, i.e.
> p->{static,normal}_prio
> but instead a new similar concept, right?

Correct; a new sched_attr::sched_latency_nice value, which is like
sched_nice, but controls a different dimension of behaviour.
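
To make the shape concrete, a sketch of the extended uapi (nothing here
is ABI; the new field is only a proposal):

struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;

	__s32 sched_nice;		/* existing: CPU-time weight */
	__u32 sched_priority;		/* RT priority */

	__u64 sched_runtime;		/* DEADLINE */
	__u64 sched_deadline;
	__u64 sched_period;

	__s32 sched_latency_nice;	/* proposed: latency bias,
					 * -20 (sensitive) .. +19
					 * (tolerant) */
};

Set per task via sched_setattr(), like the existing fields.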

2019-07-01 15:26:01

by Peter Zijlstra

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:

> > Hmmm?
>
> Just one more requirement I think is worth considering from the
> beginning: CGroups support
>
> That would be a very welcome interface, simply because it is so much more
> convenient (and safe) to set these biases on a group of tasks depending
> on their role in the system.
>
> Do you have any idea how we can expose such a "latency-nice" property
> via CGroups? It's very similar to cpu.shares but it does not represent
> a resource which can be partitioned.

If the latency_nice idea lives: exactly like the normal nice? That is,
IIRC cgroupv2 has a nice value interface (see cpu_weight_nice_*()).

But yes, this isn't a resource per se; just a shared-attribute-like
thing.
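
Something like this, say (hypothetical knob and task_group field,
modeled on the existing cpu.weight.nice handlers):

static s64 cpu_latency_nice_read_s64(struct cgroup_subsys_state *css,
				     struct cftype *cft)
{
	return css_tg(css)->latency_nice;	/* hypothetical field */
}

static int cpu_latency_nice_write_s64(struct cgroup_subsys_state *css,
				      struct cftype *cft, s64 nice)
{
	if (nice < -20 || nice > 19)
		return -ERANGE;

	css_tg(css)->latency_nice = nice;
	return 0;
}

/* exposed as "cpu.latency.nice" in the cpu controller's file table */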

2019-07-02 00:09:02

by Subhra Mazumdar

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path


On 7/1/19 6:55 AM, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:
>> On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
>>> Hi,
>>>
>>> Resending this patchset; it would be good to get some feedback. Any
>>> suggestions that would make it more acceptable are welcome. We have been
>>> shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.
>>>
>>> Currently, select_idle_sibling first tries to find a fully idle core using
>>> select_idle_core, which can potentially search all cores; if that fails, it
>>> looks for any idle cpu using select_idle_cpu, which can potentially search
>>> all cpus in the LLC domain. This doesn't scale for large LLC domains and
>>> will only get worse with more cores in the future.
>>>
>>> This patchset addresses the scalability problem by:
>>> - Setting an upper and lower limit on the idle cpu search in select_idle_cpu
>>> to keep the search time low and constant
>>> - Adding a new sched feature, SIS_CORE, to disable select_idle_core
>>>
>>> Additionally, it introduces a new per-cpu variable, next_cpu, to track the
>>> boundary of the search so that each search starts where the previous one
>>> ended. This rotating search window over the cpus in the LLC domain ensures
>>> that idle cpus are eventually found under high load.
>> Right, so we had a wee conversation about this patch series at OSPM, and
>> I don't see any of that reflected here :-(
>>
>> Specifically, given that some people _really_ want the whole L3 mask
>> scanned to reduce tail latency over raw throughput, while you guys
>> prefer the other way around, it was proposed to extend the task model.
>>
>> Specifically something like a latency-nice was mentioned (IIRC) where a
> Right, AFAIR PaulT suggested to add support for the concept of a task
> being "latency tolerant": meaning we can spend more time to search for
> a CPU and/or avoid preempting the current task.
>
Wondering if searching and preempting needs will ever be conflicting?
Otherwise sounds like a good direction to me. For the searching aspect, can
we map latency nice values to the % of cores we search in select_idle_cpu?
Thus the search cost can be controlled by latency nice value. But the issue
is that if more latency-tolerant workloads are set to search less, we still
need some mechanism to achieve a good spread of threads. Can we keep the
sliding window mechanism in that case? Also will latency nice do anything
for select_idle_core and select_idle_smt?
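
As a strawman, the mapping could look something like this (hypothetical,
just to make the idea concrete):

/* Map latency-nice in [-20, 19] to a scan budget: -20 scans the whole
 * LLC span, +19 roughly 1/40th of it, never fewer than 2 cpus. */
static int sis_search_limit(struct sched_domain *sd, int latency_nice)
{
	int nr = sd->span_weight * (20 - latency_nice) / 40;

	return max(nr, 2);
}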

2019-07-02 08:55:24

by Patrick Bellasi

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On 01-Jul 17:01, Subhra Mazumdar wrote:
>
> On 7/1/19 6:55 AM, Patrick Bellasi wrote:
> > On 01-Jul 11:02, Peter Zijlstra wrote:
> > > On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > > > Hi,
> > > >
> > > > Resending this patchset; it would be good to get some feedback. Any
> > > > suggestions that would make it more acceptable are welcome. We have been
> > > > shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.
> > > >
> > > > Currently, select_idle_sibling first tries to find a fully idle core using
> > > > select_idle_core, which can potentially search all cores; if that fails, it
> > > > looks for any idle cpu using select_idle_cpu, which can potentially search
> > > > all cpus in the LLC domain. This doesn't scale for large LLC domains and
> > > > will only get worse with more cores in the future.
> > > >
> > > > This patchset addresses the scalability problem by:
> > > > - Setting an upper and lower limit on the idle cpu search in select_idle_cpu
> > > > to keep the search time low and constant
> > > > - Adding a new sched feature, SIS_CORE, to disable select_idle_core
> > > >
> > > > Additionally, it introduces a new per-cpu variable, next_cpu, to track the
> > > > boundary of the search so that each search starts where the previous one
> > > > ended. This rotating search window over the cpus in the LLC domain ensures
> > > > that idle cpus are eventually found under high load.
> > > Right, so we had a wee conversation about this patch series at OSPM, and
> > > I don't see any of that reflected here :-(
> > >
> > > Specifically, given that some people _really_ want the whole L3 mask
> > > scanned to reduce tail latency over raw throughput, while you guys
> > > prefer the other way around, it was proposed to extend the task model.
> > >
> > > Specifically something like a latency-nice was mentioned (IIRC) where a
> > Right, AFAIR PaulT suggested to add support for the concept of a task
> > being "latency tolerant": meaning we can spend more time to search for
> > a CPU and/or avoid preempting the current task.
> >
> Wondering if searching and preempting needs will ever be conflicting?

I guess the winning point is that we don't commit behaviors to
userspace, but just abstract concepts which are turned into biases.

I don't see conflicts right now: if you are latency tolerant that
means you can spend more time to try finding a better CPU (e.g. we can
use the energy model to compare multiple CPUs) _and/or_ give the
current task a better chance to complete by delaying its preemption.

> Otherwise sounds like a good direction to me. For the searching aspect, can
> we map latency nice values to the % of cores we search in select_idle_cpu?
> Thus the search cost can be controlled by latency nice value.

I guess that's worth a try, only caveat I see is that it's turning the
bias into something very platform specific. Meaning, the same
latency-nice value on different machines can have very different
results.

Would it not be better to try finding a more platform-independent mapping?

Maybe something time bounded, e.g. the higher the latency-nice the more
time we can spend looking for CPUs?
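
E.g. something along these lines (latency_nice_to_search_ns() is a
hypothetical helper; fragment of the scan loop only):

	u64 deadline = local_clock() + latency_nice_to_search_ns(p);

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (local_clock() >= deadline)
			break;			/* time budget spent */
		if (available_idle_cpu(cpu))
			return cpu;
	}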

> But the issue is that if more latency-tolerant workloads are set to
> search less, we still need some mechanism to achieve a good spread of
> threads.

I don't get this example: why should more latency-tolerant workloads
require less search?

> Can we keep the sliding window mechanism in that case?

Which one? Sorry, I did not go through the patches; can you briefly
summarize the idea?

> Also will latency nice do anything for select_idle_core and
> select_idle_smt?

I guess in principle the same bias can be used at different levels, maybe
with different mappings.

In the mobile world use-case we will likely use it only to switch from
select_idle_sibling to the energy aware slow path. And perhaps to see
if we can bias the wakeup preemption granularity.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-07-03 03:55:21

by Subhra Mazumdar

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path


On 7/2/19 1:54 AM, Patrick Bellasi wrote:
> Wondering if searching and preempting needs will ever be conflicting?
> I guess the winning point is that we don't commit behaviors to
> userspace, but just abstract concepts which are turned into biases.
>
> I don't see conflicts right now: if you are latency tolerant that
> means you can spend more time to try finding a better CPU (e.g. we can
> use the energy model to compare multiple CPUs) _and/or_ give the
> current task a better chance to complete by delaying its preemption.
OK
>
>> Otherwise sounds like a good direction to me. For the searching aspect, can
>> we map latency nice values to the % of cores we search in select_idle_cpu?
>> Thus the search cost can be controlled by latency nice value.
> I guess that's worth a try, only caveat I see is that it's turning the
> bias into something very platform specific. Meaning, the same
> latency-nice value on different machines can have very different
> results.
>
> Would it not be better to try finding a more platform-independent mapping?
>
> Maybe something time bounded, e.g. the higher the latency-nice the more
> time we can spend looking for CPUs?
The issue I see is: suppose we have a range of latency-nice values, then it
should cover the entire range of search (one core to all cores). As Peter
said, some workloads will want to search the LLC fully. If we use absolute
time, the mapping of the latency-nice value range to those times will be
arbitrary. If you have something in mind let me know, maybe I am thinking
differently.
>
>> But the issue is that if more latency-tolerant workloads are set to
>> search less, we still need some mechanism to achieve a good spread of
>> threads.
> I don't get this example: why should more latency-tolerant workloads
> require less search?
I guess I got the definition of "latency tolerant" backwards.
>
>> Can we keep the sliding window mechanism in that case?
> Which one? Sorry, I did not go through the patches; can you briefly
> summarize the idea?
If a workload is set to low latency tolerance, then the search will be
shorter. That can lead to localization of threads on a few CPUs as we are
not searching the entire LLC even if there are idle CPUs available. For
this I had introduced a per-CPU variable (for the target CPU) to track the
boundary of the search, so that each search starts from that boundary, thus
sliding the window. So even if we are searching very little, the search
window keeps shifting and gives us a good spread. This is orthogonal to the
latency-nice thing.
>
>> Also will latency nice do anything for select_idle_core and
>> select_idle_smt?
> I guess in principle the same bias can be used at different levels, maybe
> with different mappings.
Doing it for select_idle_core will have the issue that the dynamic flag
(whether an idle core is present or not) can only be updated by threads
which are doing the full search.

Thanks,
Subhra

> In the mobile world use-case we will likely use it only to switch from
> select_idle_sibling to the energy aware slow path. And perhaps to see
> if we can bias the wakeup preemption granularity.
>
> Best,
> Patrick
>

2019-07-04 11:37:16

by Parth Shah

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

Hi,

On 7/3/19 9:22 AM, Subhra Mazumdar wrote:
>
> On 7/2/19 1:54 AM, Patrick Bellasi wrote:
>> Wondering if searching and preempting needs will ever be conflicting?
>> I guess the winning point is that we don't commit behaviors to
>> userspace, but just abstract concepts which are turned into biases.
>>
>> I don't see conflicts right now: if you are latency tolerant that
>> means you can spend more time to try finding a better CPU (e.g. we can
>> use the energy model to compare multiple CPUs) _and/or_ give the
>> current task a better chance to complete by delaying its preemption.
> OK
>>
>>> Otherwise sounds like a good direction to me. For the searching aspect, can
>>> we map latency nice values to the % of cores we search in select_idle_cpu?
>>> Thus the search cost can be controlled by latency nice value.
>> I guess that's worth a try, only caveat I see is that it's turning the
>> bias into something very platform specific. Meaning, the same
>> latency-nice value on different machines can have very different
>> results.
>>
>> Would it not be better to try finding a more platform-independent mapping?
>>
>> Maybe something time bounded, e.g. the higher the latency-nice the more
>> time we can spend looking for CPUs?
> The issue I see is: suppose we have a range of latency-nice values, then it
> should cover the entire range of search (one core to all cores). As Peter
> said, some workloads will want to search the LLC fully. If we use absolute
> time, the mapping of the latency-nice value range to those times will be
> arbitrary. If you have something in mind let me know, maybe I am thinking
> differently.
>>
>>> But the issue is that if more latency-tolerant workloads are set to
>>> search less, we still need some mechanism to achieve a good spread of
>>> threads.
>> I don't get this example: why should more latency-tolerant workloads
>> require less search?
> I guess I got the definition of "latency tolerant" backwards.
>>
>>> Can we keep the sliding window mechanism in that case?
>> Which one? Sorry, I did not go through the patches; can you briefly
>> summarize the idea?
> If a workload is set to low latency tolerance, then the search will be
> shorter. That can lead to localization of threads on a few CPUs as we are
> not searching the entire LLC even if there are idle CPUs available. For
> this I had introduced a per-CPU variable (for the target CPU) to track the
> boundary of the search, so that each search starts from that boundary, thus
> sliding the window. So even if we are searching very little, the search
> window keeps shifting and gives us a good spread. This is orthogonal to the
> latency-nice thing.

Could we do something like turning off the idle core search if the wakee
task is latency tolerant (more latency-nice)? We search for an idle core to
get faster resource allocation, so such tasks don't need an idle core and
can jump directly to finding an idle CPU.
This could include the sliding window mechanism as well, but as I commented
previously, it introduces a task ping-pong problem as the sliding window
gets away from the target_cpu. So maybe we can first search the core with
the target_cpu and, if no idle CPUs are found there, bail to the sliding
window mechanism.
Just a thought.
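
Roughly (task_latency_tolerant() is a hypothetical helper; fragment of
select_idle_sibling()):

	/* Skip the expensive idle-core search for latency-tolerant
	 * tasks, but still try the target's own core before falling
	 * back to the sliding window. */
	if (!task_latency_tolerant(p)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}

	i = select_idle_smt(p, target);		/* target's core first */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);	/* sliding window scan */
	if ((unsigned)i < nr_cpumask_bits)
		return i;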

Best,
Parth


>>
>>> Also will latency nice do anything for select_idle_core and
>>> select_idle_smt?
>> I guess in principle the same bias can be used at different levels, maybe
>> with different mappings.
> Doing it for select_idle_core will have the issue that the dynamic flag
> (whether an idle core is present or not) can only be updated by threads
> which are doing the full search.
>
> Thanks,
> Subhra
>
>> In the mobile world use-case we will likely use it only to switch from
>> select_idle_sibling to the energy aware slow path. And perhaps to see
>> if we can bias the wakeup preemption granularity.
>>
>> Best,
>> Patrick
>>
>

2019-07-08 23:34:06

by Tim Chen

Subject: Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

On 7/1/19 7:04 AM, Peter Zijlstra wrote:
> On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
>> On 01-Jul 11:02, Peter Zijlstra wrote:
>
>>> Some of the things we could tie to this would be:
>>>
>>> - select_idle_siblings; -nice would scan more than +nice,
>>
>> Just to be sure, you are not proposing to use the nice value we
>> already have, i.e.
>> p->{static,normal}_prio
>> but instead a new similar concept, right?
>
> Correct; a new sched_attr::sched_latency_nice value, which is like
> sched_nice, but controls a different dimension of behaviour.
>

I think the sched_latency_nice value could also be useful for AVX512 on
x86. Running an AVX512 task can drop the frequency of the cpu, including
its sibling hardware thread. So scheduling tasks that don't mind latency
on the sibling while an AVX512 task runs would make sense.

Tim