2018-01-29 23:23:40

by Rohit Jain

Subject: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

Currently the fast path in the scheduler looks for an idle CPU to
schedule threads on. Capacity is taken into account in
'select_task_rq_fair' when it calls 'wake_cap', however that check
ignores the instantaneous capacity and looks only at the original
capacity. Furthermore, the select_idle_sibling path ignores the RT/IRQ
threads which are also running on the CPUs it is looking to schedule
fair threads on.

We don't necessarily have to force the code onto the slow path (by
modifying wake_cap); instead we can make a better selection of the CPU
within the current domain itself (i.e. in the fast path).

This patch makes the fast path aware of capacity, resulting in overall
performance improvements as shown in the test results.
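
To make the threshold concrete, here is a small standalone sketch
(userspace C with made-up numbers, not kernel code) of how the proposed
full_capacity() check behaves as RT/IRQ time eats into a CPU's
instantaneous capacity; the 768 cut-off simply falls out of the 3/4
threshold against an example original capacity of 1024:

#include <stdbool.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* Stand-in for capacity_orig_of(); the real value comes from the rq. */
static unsigned long capacity_orig = SCHED_CAPACITY_SCALE;

/* Same 3/4 test as the patch, applied to an example capacity value. */
static bool full_capacity_example(unsigned long cap)
{
	return cap >= (capacity_orig * 3) / 4;
}

int main(void)
{
	/* Hypothetical capacity_of() values left after RT/IRQ pressure. */
	unsigned long caps[] = { 1024, 800, 768, 600, 300 };
	unsigned int i;

	for (i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
		printf("capacity_of=%4lu -> full_capacity=%d\n",
		       caps[i], full_capacity_example(caps[i]));
	return 0;
}

So an idle CPU whose remaining capacity has dropped below 768 (out of
1024) is only picked when no less-pressured idle CPU is available.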

1) KVM Test:
-----------------------------------------------------------------------
In this test KVM is configured with an Ubuntu VM (unchanged kernel,
Ubuntu Server 16.04) which runs a ping workload in one taskset along
with hackbench in a separate taskset. The VM uses virtio (which means
the host is taking and processing the interrupts). In this case, we want
to avoid scheduling vCPUs on CPUs which are very busy processing
interrupts if there is a better choice available.

The host is a 2 socket, 20 core, 40 CPU Intel x86 machine.

lscpu output:
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPUs: 0-9,20-29
NUMA node1 CPUs: 10-19,30-39

The setup is meant to mirror realistic use cases. KVM is bound to a
full NUMA node, with vCPUs pinned to whole cores, i.e. vCPUs 0 and 1
are bound to core 0, vCPUs 2 and 3 to core 1, and so on.

virsh vcpupin output:
VCPU: CPU Affinity
----------------------------------
0: 0,20
1: 0,20
2: 1,21
3: 1,21
4: 2,22
5: 2,22
6: 3,23
7: 3,23
8: 4,24
9: 4,24
10: 5,25
11: 5,25
12: 6,26
13: 6,26
14: 7,27
15: 7,27
16: 8,28
17: 8,28
18: 9,29
19: 9,29

Here are the results seen with ping and hackbench running inside the
KVM guest. Note: hackbench was run for 10000 loops (lower is better).
(Improvement is shown in brackets; +ve is good, -ve is bad)
+-------------------+-----------------+---------------------------+
| | Without patch | With patch |
+---+-------+-------+-------+---------+-----------------+---------+
|FD | Groups| Tasks | Mean | Std Dev | Mean | Std Dev |
+---+-------+-------+-------+---------+-----------------+---------+
|4 | 1 | 4 | 0.059 | 0.021 | 0.034 (+42.37%) | 0.008 |
|4 | 2 | 8 | 0.087 | 0.031 | 0.075 (+13.79%) | 0.021 |
|4 | 4 | 16 | 0.124 | 0.022 | 0.089 (+28.23%) | 0.013 |
|4 | 8 | 32 | 0.149 | 0.031 | 0.126 (+15.43%) | 0.022 |
+---+-------+-------+-------+---------+-----------------+---------+
|8 | 1 | 8 | 0.212 | 0.025 | 0.211 (+0.47%) | 0.023 |
|8 | 2 | 16 | 0.246 | 0.055 | 0.225 (+8.54%) | 0.024 |
|8 | 4 | 32 | 0.298 | 0.047 | 0.294 (+1.34%) | 0.022 |
|8 | 8 | 64 | 0.407 | 0.03 | 0.378 (+7.13%) | 0.032 |
+---+-------+-------+-------+---------+-----------------+---------+
|40 | 1 | 40 | 1.703 | 0.133 | 1.451 (+14.80%) | 0.072 |
|40 | 2 | 80 | 2.912 | 0.204 | 2.431 (+16.52%) | 0.075 |
+---+-------+-------+-------+---------+-----------------+---------+

2) ping + hackbench test on x86 machine:
-----------------------------------------------------------------------

Here hackbench is running in threaded mode, along with ping running on
CPUs 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on a 2 socket, 20 core, 40 thread Intel x86
machine; runtime is in seconds (lower is better):

+-----------------------------+----------------+---------------------------+
| | Without patch | With patch |
+----------+----+------+------+-------+--------+----------------+----------+
|Loops | FD |Groups|Tasks | Mean | Std Dev|Mean | Std Dev |
+----------+----+------+------+-------+--------+----------------+----------+
|1,000,000 | 4 |1 |4 | 2.375 | 0.818 |1.785 (+24.84%) | 0.21 |
|1,000,000 | 4 |2 |8 | 2.748 | 0.694 |2.102 (+23.51%) | 0.239 |
|1,000,000 | 4 |4 |16 | 3.237 | 0.256 |2.922 (+9.73%) | 0.169 |
|1,000,000 | 4 |8 |32 | 3.786 | 0.185 |3.637 (+3.94%) | 0.471 |
+----------+----+------+------+-------+--------+----------------+----------+
|1,000,000 | 8 |1 |8 | 7.287 | 1.03 |5.364 (+26.39%) | 1.085 |
|1,000,000 | 8 |2 |16 | 7.963 | 0.951 |6.474 (+18.70%) | 0.397 |
|1,000,000 | 8 |4 |32 | 8.991 | 0.618 |8.32 (+7.46%) | 0.69 |
|1,000,000 | 8 |8 |64 | 13.868| 1.195 |12.881 (+7.12%) | 0.722 |
+----------+----+------+------+-------+--------+----------------+----------+
|10,000 | 40 |1 |40 | 0.828 | 0.032 |0.784 (+5.31%) | 0.010 |
|10,000 | 40 |2 |80 | 1.087 | 0.246 |0.980 (+9.84%) | 0.037 |
|10,000 | 40 |4 |160 | 1.611 | 0.055 |1.591 (+1.24%) | 0.039 |
|10,000 | 40 |8 |320 | 2.827 | 0.031 |2.817 (+0.35%) | 0.025 |
|10,000 | 40 |16 |640 | 5.107 | 0.085 |5.087 (+0.39%) | 0.045 |
|10,000 | 40 |25 |1000 | 7.503 | 0.143 |7.468 (+0.46%) | 0.045 |
+----------+----+------+------+-------+--------+----------------+----------+

Sanity tests:
-----------------------------------------------------------------------
schbench results on a 2 socket, 44 core, 88 thread Intel x86 machine,
with 2 message threads (lower is better):

+----------+-------------+----------+------------------+
|Threads | Latency | Without | With Patch |
| | percentiles | Patch | |
| | | (usec) | (usec) |
+----------+-------------+----------+------------------+
|16 | 50.0000th | 60 | 62 (-3.33%) |
|16 | 75.0000th | 72 | 68 (+5.56%) |
|16 | 90.0000th | 81 | 80 (+2.46%) |
|16 | 95.0000th | 88 | 83 (+5.68%) |
|16 | *99.0000th | 100 | 92 (+8.00%) |
|16 | 99.5000th | 105 | 97 (+7.62%) |
|16 | 99.9000th | 110 | 100 (+9.09%) |
+----------+-------------+----------+------------------+
|32 | 50.0000th | 62 | 62 (0%) |
|32 | 75.0000th | 80 | 81 (0%) |
|32 | 90.0000th | 93 | 94 (-1.07%) |
|32 | 95.0000th | 103 | 105 (-1.94%) |
|32 | *99.0000th | 121 | 121 (0%) |
|32 | 99.5000th | 127 | 125 (+1.57%) |
|32 | 99.9000th | 143 | 135 (+5.59%) |
+----------+-------------+----------+------------------+
|44 | 50.0000th | 79 | 79 (0%) |
|44 | 75.0000th | 104 | 104 (0%) |
|44 | 90.0000th | 126 | 125 (+0.79%) |
|44 | 95.0000th | 138 | 137 (+0.72%) |
|44 | *99.0000th | 163 | 163 (0%) |
|44 | 99.5000th | 174 | 171 (+1.72%) |
|44 | 99.9000th | 10832 | 11248 (-3.84%) |
+----------+-------------+----------+------------------+

I also ran uperf and sysbench MySQL workloads but saw no statistically
significant change.

Signed-off-by: Rohit Jain <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26a71eb..ce5ccf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
return cpu_rq(cpu)->cpu_capacity_orig;
}

+static inline bool full_capacity(int cpu)
+{
+ return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
+}
+
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

for_each_cpu(cpu, cpu_smt_mask(core)) {
cpumask_clear_cpu(cpu, cpus);
- if (!idle_cpu(cpu))
+ if (!idle_cpu(cpu) || !full_capacity(cpu))
idle = false;
}

@@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
*/
static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
{
- int cpu;
+ int cpu, rcpu = -1;
+ unsigned long max_cap = 0;

if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu(cpu, cpu_smt_mask(target)) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
- return cpu;
+ if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
+ max_cap = capacity_of(cpu);
+ rcpu = cpu;
+ }
}

- return -1;
+ return rcpu;
}

#else /* CONFIG_SCHED_SMT */
@@ -6143,6 +6151,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 time, cost;
s64 delta;
int cpu, nr = INT_MAX;
+ int best_cpu = -1;
+ unsigned int best_cap = 0;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
- break;
+ if (idle_cpu(cpu)) {
+ if (full_capacity(cpu)) {
+ best_cpu = cpu;
+ break;
+ } else if (capacity_of(cpu) > best_cap) {
+ best_cap = capacity_of(cpu);
+ best_cpu = cpu;
+ }
+ }
}

time = local_clock() - time;
@@ -6182,7 +6199,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
delta = (s64)(time - cost) / 8;
this_sd->avg_scan_cost += delta;

- return cpu;
+ return best_cpu;
}

/*
@@ -6193,13 +6210,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
struct sched_domain *sd;
int i;

- if (idle_cpu(target))
+ if (idle_cpu(target) && full_capacity(target))
return target;

/*
* If the previous cpu is cache affine and idle, don't be stupid.
*/
- if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+ if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev) &&
+ full_capacity(prev))
return prev;

sd = rcu_dereference(per_cpu(sd_llc, target));
--
2.7.4



2018-01-30 03:40:26

by Joel Fernandes

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

Hi Rohit,

On Mon, Jan 29, 2018 at 3:27 PM, Rohit Jain <[email protected]> wrote:
> Currently fast path in the scheduler looks for an idle CPU to schedule
> threads on. Capacity is taken into account in the function
> 'select_task_rq_fair' when it calls 'wake_cap', however it ignores the
> instantaneous capacity and looks at the original capacity. Furthermore
> select_idle_sibling path of the code, ignores the RT/IRQ threads which
> are also running on the CPUs it is looking to schedule fair threads on.
>
> We don't necessarily have to force the code to go to slow path (by
> modifying wake_cap), instead we could do a better selection of the CPU
> in the current domain itself (i.e. in the fast path).
>
> This patch makes the fast path aware of capacity, resulting in overall
> performance improvements as shown in the test results.
>
[...]
>
> I also ran uperf and sysbench MySQL workloads but I see no statistically
> significant change.
>
> Signed-off-by: Rohit Jain <[email protected]>
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
> 1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26a71eb..ce5ccf8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity_orig;
> }
>
> +static inline bool full_capacity(int cpu)
> +{
> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
> +}
> +
> static unsigned long cpu_avg_load_per_task(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> cpumask_clear_cpu(cpu, cpus);
> - if (!idle_cpu(cpu))
> + if (!idle_cpu(cpu) || !full_capacity(cpu))
> idle = false;
> }

There's some difference in logic between select_idle_core and
select_idle_cpu as far as the full_capacity stuff you're adding goes.
In select_idle_core, if all CPUs are !full_capacity, you're returning
-1. But in select_idle_cpu you're returning the best idle CPU, i.e. the
one with the most capacity among the !full_capacity ones. Why is there
this difference in logic? Did I miss something?

>
> @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> */
> static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> {
> - int cpu;
> + int cpu, rcpu = -1;
> + unsigned long max_cap = 0;
>
> if (!static_branch_likely(&sched_smt_present))
> return -1;
> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> for_each_cpu(cpu, cpu_smt_mask(target)) {
> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> continue;
> - if (idle_cpu(cpu))
> - return cpu;
> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
> + max_cap = capacity_of(cpu);
> + rcpu = cpu;

At the SMT level, do you need to bother with choosing best capacity
among threads? If RT is eating into one of the SMT thread's underlying
capacity, it would eat into the other's. Wondering what's the benefit
of doing this here.

thanks,

- Joel

2018-01-30 20:13:20

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

Hi Joel,

On 1/29/2018 7:39 PM, Joel Fernandes wrote:
> Hi Rohit,
>
> On Mon, Jan 29, 2018 at 3:27 PM, Rohit Jain<[email protected]> wrote:
>> Currently fast path in the scheduler looks for an idle CPU to schedule
>> threads on. Capacity is taken into account in the function
>> 'select_task_rq_fair' when it calls 'wake_cap', however it ignores the
>> instantaneous capacity and looks at the original capacity. Furthermore
>> select_idle_sibling path of the code, ignores the RT/IRQ threads which
>> are also running on the CPUs it is looking to schedule fair threads on.
>>
>> We don't necessarily have to force the code to go to slow path (by
>> modifying wake_cap), instead we could do a better selection of the CPU
>> in the current domain itself (i.e. in the fast path).
>>
>> This patch makes the fast path aware of capacity, resulting in overall
>> performance improvements as shown in the test results.
>>
> [...]
>> I also ran uperf and sysbench MySQL workloads but I see no statistically
>> significant change.
>>
>> Signed-off-by: Rohit Jain<[email protected]>
>> ---
>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 26a71eb..ce5ccf8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>> return cpu_rq(cpu)->cpu_capacity_orig;
>> }
>>
>> +static inline bool full_capacity(int cpu)
>> +{
>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>> +}
>> +
>> static unsigned long cpu_avg_load_per_task(int cpu)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>
>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>> cpumask_clear_cpu(cpu, cpus);
>> - if (!idle_cpu(cpu))
>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>> idle = false;
>> }
> There's some difference in logic between select_idle_core and
> select_idle_cpu as far as the full_capacity stuff you're adding goes.
> In select_idle_core, if all CPUs are !full_capacity, you're returning
> -1. But in select_idle_cpu you're returning the best idle CPU that's
> the most cap among the !full_capacity ones. Why there is this
> different in logic? Did I miss something?

This is the previous discussion on this same code. I measured the
performance difference and saw no statistically significant impact,
hence went with your suggestion of simpler code.

https://lkml.org/lkml/2017/10/3/1001

>
>> @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>> */
>> static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>> {
>> - int cpu;
>> + int cpu, rcpu = -1;
>> + unsigned long max_cap = 0;
>>
>> if (!static_branch_likely(&sched_smt_present))
>> return -1;
>> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>> continue;
>> - if (idle_cpu(cpu))
>> - return cpu;
>> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
>> + max_cap = capacity_of(cpu);
>> + rcpu = cpu;
> At the SMT level, do you need to bother with choosing best capacity
> among threads? If RT is eating into one of the SMT thread's underlying
> capacity, it would eat into the other's. Wondering what's the benefit
> of doing this here.

Yes, you are right because of SD_SHARE_CPUCAPACITY; however, the benefit
is that if we don't do this check, we might end up picking an SMT thread
which has "high" RT/IRQ activity and sitting on its run queue for a
while, till the pull side can bail us out.

Thanks,
Rohit

> thanks,
>
> - Joel


2018-01-31 02:38:17

by Joel Fernandes

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Tue, Jan 30, 2018 at 11:47 AM, Rohit Jain <[email protected]> wrote:
[...]
>>>
>>> Currently fast path in the scheduler looks for an idle CPU to schedule
>>> threads on. Capacity is taken into account in the function
>>> 'select_task_rq_fair' when it calls 'wake_cap', however it ignores the
>>> instantaneous capacity and looks at the original capacity. Furthermore
>>> select_idle_sibling path of the code, ignores the RT/IRQ threads which
>>> are also running on the CPUs it is looking to schedule fair threads on.
>>>
>>> We don't necessarily have to force the code to go to slow path (by
>>> modifying wake_cap), instead we could do a better selection of the CPU
>>> in the current domain itself (i.e. in the fast path).
>>>
>>> This patch makes the fast path aware of capacity, resulting in overall
>>> performance improvements as shown in the test results.
>>>
>> [...]
>>>
>>> I also ran uperf and sysbench MySQL workloads but I see no statistically
>>> significant change.
>>>
>>> Signed-off-by: Rohit Jain<[email protected]>
>>> ---
>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 26a71eb..ce5ccf8 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>> }
>>>
>>> +static inline bool full_capacity(int cpu)
>>> +{
>>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>>> +}
>>> +
>>> static unsigned long cpu_avg_load_per_task(int cpu)
>>> {
>>> struct rq *rq = cpu_rq(cpu);
>>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p,
>>> struct sched_domain *sd, int
>>>
>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>> cpumask_clear_cpu(cpu, cpus);
>>> - if (!idle_cpu(cpu))
>>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>>> idle = false;
>>> }
>>
>> There's some difference in logic between select_idle_core and
>> select_idle_cpu as far as the full_capacity stuff you're adding goes.
>> In select_idle_core, if all CPUs are !full_capacity, you're returning
>> -1. But in select_idle_cpu you're returning the best idle CPU that's
>> the most cap among the !full_capacity ones. Why there is this
>> different in logic? Did I miss something?
>
>
> This is the previous discussion on this same code. I measured the
> performance difference and saw no statistically significant impact,
> hence went with your suggestion of simpler code.

Dude :) That is hardly an answer to the question I asked. Hint:
*different in logic*.

- Joel

2018-01-31 18:00:17

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On 01/30/2018 05:57 PM, Joel Fernandes wrote:

<snip>
>>>> Signed-off-by: Rohit Jain<[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>>>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 26a71eb..ce5ccf8 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>>> }
>>>>
>>>> +static inline bool full_capacity(int cpu)
>>>> +{
>>>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>>>> +}
>>>> +
>>>> static unsigned long cpu_avg_load_per_task(int cpu)
>>>> {
>>>> struct rq *rq = cpu_rq(cpu);
>>>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p,
>>>> struct sched_domain *sd, int
>>>>
>>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>>> cpumask_clear_cpu(cpu, cpus);
>>>> - if (!idle_cpu(cpu))
>>>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>>>> idle = false;
>>>> }
>>> There's some difference in logic between select_idle_core and
>>> select_idle_cpu as far as the full_capacity stuff you're adding goes.
>>> In select_idle_core, if all CPUs are !full_capacity, you're returning
>>> -1. But in select_idle_cpu you're returning the best idle CPU that's
>>> the most cap among the !full_capacity ones. Why there is this
>>> different in logic? Did I miss something?
>>>

<snip>

> Dude :) That is hardly an answer to the question I asked. Hint:
> *different in logic*.

Let me re-try :)

For select_idle_core, we are searching for a fully idle, full-capacity
core; the fail-safe is select_idle_cpu, because there we will re-scan
the CPUs. The notion is to select an idle CPU no matter what, because
being on an idle CPU is better than waiting on a non-idle one. In
select_idle_core you can be slightly picky about the core because
select_idle_cpu is a fail-safe. I measured the performance impact of
choosing the "best among low cap" CPU vs the code changes I have (for
select_idle_core) and could not find a statistically significant
difference, hence went with the simpler code changes.
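
For reference, the fallback ordering I'm describing is roughly this
(paraphrasing the existing select_idle_sibling() flow of this era, not
part of the patch):

	i = select_idle_core(p, sd, target);	/* picky: whole idle, unpressured core */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);	/* fail-safe: any idle CPU in the LLC */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, sd, target);	/* last resort: an idle sibling of target */
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;				/* nothing idle found; stay on target */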

Hope I answered your question.

Thanks,
Rohit

2018-02-06 06:44:13

by Joel Fernandes

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Tue, Jan 30, 2018 at 11:47 AM, Rohit Jain <[email protected]> wrote:
[...]
>>> @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p,
>>> struct sched_domain *sd, int
>>> */
>>> static int select_idle_smt(struct task_struct *p, struct sched_domain
>>> *sd, int target)
>>> {
>>> - int cpu;
>>> + int cpu, rcpu = -1;
>>> + unsigned long max_cap = 0;
>>>
>>> if (!static_branch_likely(&sched_smt_present))
>>> return -1;
>>> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p,
>>> struct sched_domain *sd, int t
>>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>> continue;
>>> - if (idle_cpu(cpu))
>>> - return cpu;
>>> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
>>> + max_cap = capacity_of(cpu);
>>> + rcpu = cpu;
>>
>> At the SMT level, do you need to bother with choosing best capacity
>> among threads? If RT is eating into one of the SMT thread's underlying
>> capacity, it would eat into the other's. Wondering what's the benefit
>> of doing this here.
>
>
> Yes, you are right because of SD_SHARE_CPUCAPACITY, however the benefit
> is that if don't do this check, we might end up picking a SMT thread
> which has "high" RT/IRQ activity and be on the run queue for a while,
> till the pull side can bail us out.

Do your tests show a difference in results though with such change
(for select_idle_smt)?

thanks,

- Joel

2018-02-06 06:52:29

by Joel Fernandes

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Wed, Jan 31, 2018 at 9:50 AM, Rohit Jain <[email protected]> wrote:
>>>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 26a71eb..ce5ccf8 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>>>> }
>>>>>
>>>>> +static inline bool full_capacity(int cpu)
>>>>> +{
>>>>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>>>>> +}
>>>>> +
>>>>> static unsigned long cpu_avg_load_per_task(int cpu)
>>>>> {
>>>>> struct rq *rq = cpu_rq(cpu);
>>>>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct
>>>>> *p,
>>>>> struct sched_domain *sd, int
>>>>>
>>>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>>>> cpumask_clear_cpu(cpu, cpus);
>>>>> - if (!idle_cpu(cpu))
>>>>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>>>>> idle = false;
>>>>> }
>>>>
>>>> There's some difference in logic between select_idle_core and
>>>> select_idle_cpu as far as the full_capacity stuff you're adding goes.
>>>> In select_idle_core, if all CPUs are !full_capacity, you're returning
>>>> -1. But in select_idle_cpu you're returning the best idle CPU that's
>>>> the most cap among the !full_capacity ones. Why there is this
>>>> different in logic? Did I miss something?
>>>>
>
> <snip>
>
>> Dude :) That is hardly an answer to the question I asked. Hint:
>> *different in logic*.
>
>
> Let me re-try :)
>
> For select_idle_core, we are doing a search for a fully idle and full
> capacity core, the fail-safe is select_idle_cpu because we will re-scan
> the CPUs. The notion is to select an idle CPU no matter what, because
> being on an idle CPU is better than waiting on a non-idle one. In
> select_idle_core you can be slightly picky about the core because
> select_idle_cpu is a fail safe. I measured the performance impact of
> choosing the "best among low cap" vs the code changes I have (for
> select_idle_core) and could not find a statistically significant impact,
> hence went with the simpler code changes.

That's Ok with me. Just that I remember Peter messing with this path
and that it was expensive to scan too much for some systems. The other
thing is you're really doing to do a "fail safe" as you call it search
here with SIS_PROP set. Do you see a difference in perf when doing the
same approach as you took in select_idle_core?

Peter, are you OK with the approach Rohit has adopted to pick the best
capacity idle CPU in select_idle_cpu? I guess nr--; will bail out
early if we have SIS_PROP set, in case the scan cost gets too high, but
then again we might end up scanning too few CPUs.

thanks,

- Joel

2018-02-06 06:54:04

by Joel Fernandes

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Mon, Feb 5, 2018 at 10:50 PM, Joel Fernandes <[email protected]> wrote:
> On Wed, Jan 31, 2018 at 9:50 AM, Rohit Jain <[email protected]> wrote:
>>>>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>>>>>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 26a71eb..ce5ccf8 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>>>>> }
>>>>>>
>>>>>> +static inline bool full_capacity(int cpu)
>>>>>> +{
>>>>>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>>>>>> +}
>>>>>> +
>>>>>> static unsigned long cpu_avg_load_per_task(int cpu)
>>>>>> {
>>>>>> struct rq *rq = cpu_rq(cpu);
>>>>>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct
>>>>>> *p,
>>>>>> struct sched_domain *sd, int
>>>>>>
>>>>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>>>>> cpumask_clear_cpu(cpu, cpus);
>>>>>> - if (!idle_cpu(cpu))
>>>>>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>>>>>> idle = false;
>>>>>> }
>>>>>
>>>>> There's some difference in logic between select_idle_core and
>>>>> select_idle_cpu as far as the full_capacity stuff you're adding goes.
>>>>> In select_idle_core, if all CPUs are !full_capacity, you're returning
>>>>> -1. But in select_idle_cpu you're returning the best idle CPU that's
>>>>> the most cap among the !full_capacity ones. Why there is this
>>>>> different in logic? Did I miss something?
>>>>>
>>
>> <snip>
>>
>>> Dude :) That is hardly an answer to the question I asked. Hint:
>>> *different in logic*.
>>
>>
>> Let me re-try :)
>>
>> For select_idle_core, we are doing a search for a fully idle and full
>> capacity core, the fail-safe is select_idle_cpu because we will re-scan
>> the CPUs. The notion is to select an idle CPU no matter what, because
>> being on an idle CPU is better than waiting on a non-idle one. In
>> select_idle_core you can be slightly picky about the core because
>> select_idle_cpu is a fail safe. I measured the performance impact of
>> choosing the "best among low cap" vs the code changes I have (for
>> select_idle_core) and could not find a statistically significant impact,
>> hence went with the simpler code changes.
>
> That's Ok with me. Just that I remember Peter messing with this path
> and that it was expensive to scan too much for some systems. The other
> thing is you're really doing to do a "fail safe" as you call it search

I meant to say here, "you're NOT really going to be doing a 'fail
safe' search here as you call it"

thanks,

- Joel

2018-02-06 17:36:14

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling



On 02/05/2018 10:42 PM, Joel Fernandes wrote:
> On Tue, Jan 30, 2018 at 11:47 AM, Rohit Jain <[email protected]> wrote:
> [...]
>>>> @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p,
>>>> struct sched_domain *sd, int
>>>> */
>>>> static int select_idle_smt(struct task_struct *p, struct sched_domain
>>>> *sd, int target)
>>>> {
>>>> - int cpu;
>>>> + int cpu, rcpu = -1;
>>>> + unsigned long max_cap = 0;
>>>>
>>>> if (!static_branch_likely(&sched_smt_present))
>>>> return -1;
>>>> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p,
>>>> struct sched_domain *sd, int t
>>>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>>>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>>> continue;
>>>> - if (idle_cpu(cpu))
>>>> - return cpu;
>>>> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
>>>> + max_cap = capacity_of(cpu);
>>>> + rcpu = cpu;
>>> At the SMT level, do you need to bother with choosing best capacity
>>> among threads? If RT is eating into one of the SMT thread's underlying
>>> capacity, it would eat into the other's. Wondering what's the benefit
>>> of doing this here.
>>
>> Yes, you are right because of SD_SHARE_CPUCAPACITY, however the benefit
>> is that if don't do this check, we might end up picking a SMT thread
>> which has "high" RT/IRQ activity and be on the run queue for a while,
>> till the pull side can bail us out.
> Do your tests show a difference in results though with such change
> (for select_idle_smt)?

I don't have the numbers readily available, but I did see a measurable
difference with the select_idle_smt changes.

Thanks,
Rohit

>
> thanks,
>
> - Joel


2018-02-06 17:38:40

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling



On 02/05/2018 10:50 PM, Joel Fernandes wrote:
> On Wed, Jan 31, 2018 at 9:50 AM, Rohit Jain <[email protected]> wrote:
>>>>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++----------
>>>>>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 26a71eb..ce5ccf8 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>>>>> }
>>>>>>
>>>>>> +static inline bool full_capacity(int cpu)
>>>>>> +{
>>>>>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>>>>>> +}
>>>>>> +
>>>>>> static unsigned long cpu_avg_load_per_task(int cpu)
>>>>>> {
>>>>>> struct rq *rq = cpu_rq(cpu);
>>>>>> @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct
>>>>>> *p,
>>>>>> struct sched_domain *sd, int
>>>>>>
>>>>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>>>>> cpumask_clear_cpu(cpu, cpus);
>>>>>> - if (!idle_cpu(cpu))
>>>>>> + if (!idle_cpu(cpu) || !full_capacity(cpu))
>>>>>> idle = false;
>>>>>> }
>>>>> There's some difference in logic between select_idle_core and
>>>>> select_idle_cpu as far as the full_capacity stuff you're adding goes.
>>>>> In select_idle_core, if all CPUs are !full_capacity, you're returning
>>>>> -1. But in select_idle_cpu you're returning the best idle CPU that's
>>>>> the most cap among the !full_capacity ones. Why there is this
>>>>> different in logic? Did I miss something?
>>>>>
>> <snip>
>>
>>
>>
>> Let me re-try :)
>>
>> For select_idle_core, we are doing a search for a fully idle and full
>> capacity core, the fail-safe is select_idle_cpu because we will re-scan
>> the CPUs. The notion is to select an idle CPU no matter what, because
>> being on an idle CPU is better than waiting on a non-idle one. In
>> select_idle_core you can be slightly picky about the core because
>> select_idle_cpu is a fail safe. I measured the performance impact of
>> choosing the "best among low cap" vs the code changes I have (for
>> select_idle_core) and could not find a statistically significant impact,
>> hence went with the simpler code changes.
> That's Ok with me. Just that I remember Peter messing with this path
> and that it was expensive to scan too much for some systems. The other
> thing is you're really doing to do a "fail safe" as you call it search
> here with SIS_PROP set. Do you see a difference in perf when doing the
> same approach as you took in select_idle_core?

I didn't see any measurable impact by changing select_idle_core from the
above logic to be the same logic as select_idle_cpu. I am OK with either
if there are concerns.

Thanks,
Rohit

>
> Peter, are you with the approach Rohit has adopted to pick best
> capacity idle CPU in select_idle_cpu? I guess nr--; will bail out
> early if we have SIS_PROP set, incase the scan cost gets too much but
> then again we might end scanning too few CPUs.
>
> thanks,
>
> - Joel


2018-02-09 12:38:19

by Peter Zijlstra

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Mon, Jan 29, 2018 at 07:39:15PM -0800, Joel Fernandes wrote:

> > @@ -6081,7 +6086,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> >
> > for_each_cpu(cpu, cpu_smt_mask(core)) {
> > cpumask_clear_cpu(cpu, cpus);
> > - if (!idle_cpu(cpu))
> > + if (!idle_cpu(cpu) || !full_capacity(cpu))
> > idle = false;
> > }
>
> There's some difference in logic between select_idle_core and
> select_idle_cpu as far as the full_capacity stuff you're adding goes.
> In select_idle_core, if all CPUs are !full_capacity, you're returning
> -1. But in select_idle_cpu you're returning the best idle CPU that's
> the most cap among the !full_capacity ones. Why there is this
> different in logic? Did I miss something?

select_idle_core() wants to find a whole core that's idle; the way he
changed it, we'll not consider a core idle if one (or more) of the
siblings has a heavy IRQ load.

select_idle_cpu() just wants an idle (logical) CPU, and here it looks
for the one with the most capacity left when none of the idle ones has
full capacity.

> >
> > @@ -6102,7 +6107,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
> > */
> > static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > - int cpu;
> > + int cpu, rcpu = -1;
> > + unsigned long max_cap = 0;
> >
> > if (!static_branch_likely(&sched_smt_present))
> > return -1;
> > @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> > for_each_cpu(cpu, cpu_smt_mask(target)) {
> > if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > continue;
> > - if (idle_cpu(cpu))
> > - return cpu;
> > + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
> > + max_cap = capacity_of(cpu);
> > + rcpu = cpu;
>
> At the SMT level, do you need to bother with choosing best capacity
> among threads? If RT is eating into one of the SMT thread's underlying
> capacity, it would eat into the other's. Wondering what's the benefit
> of doing this here.

It's about latency mostly, I think; scheduling on the other sibling gets
you to run faster -- the core will interleave the SMT threads and you
don't get to suffer the interrupt load _as_ badly.

If people really cared about their RT workload, they would not allow
regular tasks on its siblings in any case.

2018-02-09 12:38:32

by Peter Zijlstra

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Mon, Feb 05, 2018 at 10:50:41PM -0800, Joel Fernandes wrote:

> That's Ok with me. Just that I remember Peter messing with this path
> and that it was expensive to scan too much for some systems. The other
> thing is you're really doing to do a "fail safe" as you call it search
> here with SIS_PROP set. Do you see a difference in perf when doing the
> same approach as you took in select_idle_core?
>
> Peter, are you with the approach Rohit has adopted to pick best
> capacity idle CPU in select_idle_cpu? I guess nr--; will bail out
> early if we have SIS_PROP set, incase the scan cost gets too much but
> then again we might end scanning too few CPUs.

_IF_ you want to consider capacity, I don't think there's much else you
can do. And yes, nr-- might have to save the day more often.

2018-02-09 12:55:33

by Peter Zijlstra

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On Mon, Jan 29, 2018 at 03:27:09PM -0800, Rohit Jain wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26a71eb..ce5ccf8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity_orig;
> }
>
> +static inline bool full_capacity(int cpu)
> +{
> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
> +}

I don't like that name; >.75 != 1.

Maybe invert things and do something like:

static inline bool reduced_capacity(int cpu)
{
	return capacity_of(cpu) < (3*capacity_orig_of(cpu))/4;
}

> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> for_each_cpu(cpu, cpu_smt_mask(target)) {
> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> continue;
> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
> + max_cap = capacity_of(cpu);
> + rcpu = cpu;
> + }

if (idle_cpu(cpu)) {
	if (!reduced_capacity(cpu))
		return cpu;

	if (capacity_of(cpu) > max_cap) {
		max_cap = capacity_of(cpu);
		rcpu = cpu;
	}
}

Would be more consistent, I think.

> }
>
> - return -1;
> + return rcpu;
> }



> @@ -6143,6 +6151,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> u64 time, cost;
> s64 delta;
> int cpu, nr = INT_MAX;
> + int best_cpu = -1;
> + unsigned int best_cap = 0;

Randomly different names for the same thing as in select_idle_smt().
Thinking up two different names for the same thing is more work; be more
lazy.

> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> if (!this_sd)
> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> return -1;
> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> continue;
> + if (idle_cpu(cpu)) {
> + if (full_capacity(cpu)) {
> + best_cpu = cpu;
> + break;
> + } else if (capacity_of(cpu) > best_cap) {
> + best_cap = capacity_of(cpu);
> + best_cpu = cpu;
> + }
> + }

No need for the else. And you'll note you're once again inconsistent
with your previous self.

But here I worry about big.little a wee bit. I think we're allowed big
and little cores on the same L3 these days, and you can't directly
compare capacity between them.

Morten / Dietmar, any comments?

> @@ -6193,13 +6210,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> struct sched_domain *sd;
> int i;
>
> - if (idle_cpu(target))
> + if (idle_cpu(target) && full_capacity(target))
> return target;
>
> /*
> * If the previous cpu is cache affine and idle, don't be stupid.
> */
> - if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev) &&
> + full_capacity(prev))
> return prev;

split before idle_cpu() for a better balance.

2018-02-09 15:47:53

by Dietmar Eggemann

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On 02/09/2018 01:53 PM, Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 03:27:09PM -0800, Rohit Jain wrote:

[...]

>> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> return -1;
>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>> continue;
>> + if (idle_cpu(cpu)) {
>> + if (full_capacity(cpu)) {
>> + best_cpu = cpu;
>> + break;
>> + } else if (capacity_of(cpu) > best_cap) {
>> + best_cap = capacity_of(cpu);
>> + best_cpu = cpu;
>> + }
>> + }
>
> No need for the else. And you'll note you're once again inconsistent
> with your previous self.
>
> But here I worry about big.little a wee bit. I think we're allowed big
> and little cores on the same L3 these days, and you can't directly
> compare capacity between them.
>
> Morten / Dietmar, any comments?

Yes, for DynamIQ (big.little successor) systems, those cpus can have
different capacity_orig_of() values already.


2018-02-09 22:01:37

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling



On 02/09/2018 07:46 AM, Dietmar Eggemann wrote:
> On 02/09/2018 01:53 PM, Peter Zijlstra wrote:
>> On Mon, Jan 29, 2018 at 03:27:09PM -0800, Rohit Jain wrote:
>
> [...]
>
>>> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct
>>> *p, struct sched_domain *sd, int t
>>>               return -1;
>>>           if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>>               continue;
>>> +        if (idle_cpu(cpu)) {
>>> +            if (full_capacity(cpu)) {
>>> +                best_cpu = cpu;
>>> +                break;
>>> +            } else if (capacity_of(cpu) > best_cap) {
>>> +                best_cap = capacity_of(cpu);
>>> +                best_cpu = cpu;
>>> +            }
>>> +        }
>>
>> No need for the else. And you'll note you're once again inconsistent
>> with your previous self.
>>
>> But here I worry about big.little a wee bit. I think we're allowed big
>> and little cores on the same L3 these days, and you can't directly
>> compare capacity between them.
>>
>> Morten / Dietmar, any comments?
>
> Yes, for DynamIQ (big.little successor) systems, those cpus can have
> different capacity_orig_of() values already.
>

OK, given that there can be asymmetric capacities among the cores
sharing an L3, we would probably have something like the below(?) in
select_idle_cpu:

          if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
              continue;
+        if (idle_cpu(cpu) && !reduced_capacity(cpu))
+            break;


2018-02-09 22:30:37

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling



On 02/09/2018 04:53 AM, Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 03:27:09PM -0800, Rohit Jain wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 26a71eb..ce5ccf8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5625,6 +5625,11 @@ static unsigned long capacity_orig_of(int cpu)
>> return cpu_rq(cpu)->cpu_capacity_orig;
>> }
>>
>> +static inline bool full_capacity(int cpu)
>> +{
>> + return capacity_of(cpu) >= (capacity_orig_of(cpu)*3)/4;
>> +}
> I don't like that name; >.75 != 1.
>
> Maybe invert things and do something like:
>
> static inline bool reduced_capacity(int cpu)
> {
> return capacity_of(cpu) < (3*capacity_orig_of(cpu))/4;
> }

OK, I will change the name and invert the logic.

>> @@ -6110,11 +6116,13 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>> continue;
>> + if (idle_cpu(cpu) && (capacity_of(cpu) > max_cap)) {
>> + max_cap = capacity_of(cpu);
>> + rcpu = cpu;
>> + }
> if (idle_cpu(cpu)) {
> if (!reduced_capacity(cpu))
> return cpu;
>
> if (capacity_cpu(cpu) > max_cap) {
> max_cap = capacity_cpu(cpu);
> rcpu = cpu;
> }
> }
>
> Would be more consistent, I think.

OK

>
>> }
>>
>> - return -1;
>> + return rcpu;
>> }
>
>
>> @@ -6143,6 +6151,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> u64 time, cost;
>> s64 delta;
>> int cpu, nr = INT_MAX;
>> + int best_cpu = -1;
>> + unsigned int best_cap = 0;
> Randomly different names for the same thing as in select_idle_smt().
> Thinking up two different names for the same thing is more work; be more
> lazy.

OK, will be more consistent in v1

>
>> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> if (!this_sd)
>> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> return -1;
>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>> continue;
>> + if (idle_cpu(cpu)) {
>> + if (full_capacity(cpu)) {
>> + best_cpu = cpu;
>> + break;
>> + } else if (capacity_of(cpu) > best_cap) {
>> + best_cap = capacity_of(cpu);
>> + best_cpu = cpu;
>> + }
>> + }
> No need for the else. And you'll note you're once again inconsistent
> with your previous self.
>
> But here I worry about big.little a wee bit. I think we're allowed big
> and little cores on the same L3 these days, and you can't directly
> compare capacity between them.
>
> Morten / Dietmar, any comments?
>
>> @@ -6193,13 +6210,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> struct sched_domain *sd;
>> int i;
>>
>> - if (idle_cpu(target))
>> + if (idle_cpu(target) && full_capacity(target))
>> return target;
>>
>> /*
>> * If the previous cpu is cache affine and idle, don't be stupid.
>> */
>> - if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
>> + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev) &&
>> + full_capacity(prev))
>> return prev;
> split before idle_cpu() for a better balance.


OK

Thanks,
Rohit

2018-02-14 09:12:22

by Dietmar Eggemann

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

On 02/09/2018 11:05 PM, Rohit Jain wrote:
>
> On 02/09/2018 07:46 AM, Dietmar Eggemann wrote:
>> On 02/09/2018 01:53 PM, Peter Zijlstra wrote:
>>> On Mon, Jan 29, 2018 at 03:27:09PM -0800, Rohit Jain wrote:
>>
>> [...]
>>
>>>> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct
>>>> *p, struct sched_domain *sd, int t
>>>>               return -1;
>>>>           if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>>>               continue;
>>>> +        if (idle_cpu(cpu)) {
>>>> +            if (full_capacity(cpu)) {
>>>> +                best_cpu = cpu;
>>>> +                break;
>>>> +            } else if (capacity_of(cpu) > best_cap) {
>>>> +                best_cap = capacity_of(cpu);
>>>> +                best_cpu = cpu;
>>>> +            }
>>>> +        }
>>>
>>> No need for the else. And you'll note you're once again inconsistent
>>> with your previous self.
>>>
>>> But here I worry about big.little a wee bit. I think we're allowed big
>>> and little cores on the same L3 these days, and you can't directly
>>> compare capacity between them.
>>>
>>> Morten / Dietmar, any comments?
>>
>> Yes, for DynamIQ (big.little successor) systems, those cpus can have
>> different capacity_orig_of() values already.
>>
>
> OK, given that there are asymmetric capacities in L3 cores, we would
> probably have something like the below(?) in select_idle_cpu:
>
>           if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>               continue;
> +        if (idle_cpu(cpu) && !reduced_capacity(cpu))
> +            break;
>

Only returning the first idle cpu w/o reduced capacity will definitely
work better here on future DynamIQ systems than trying to find the
best_cap cpu by taking capacity_of() into consideration.

2018-03-10 20:45:10

by Rohit Jain

Subject: Re: [RESEND PATCH] sched/fair: consider RT/IRQ pressure in select_idle_sibling

Hi Peter,

On 02/09/2018 04:53 AM, Peter Zijlstra wrote:

<snip>

>> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> if (!this_sd)
>> @@ -6173,8 +6183,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>> return -1;
>> if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>> continue;
>> + if (idle_cpu(cpu)) {
>> + if (full_capacity(cpu)) {
>> + best_cpu = cpu;
>> + break;
>> + } else if (capacity_of(cpu) > best_cap) {
>> + best_cap = capacity_of(cpu);
>> + best_cpu = cpu;
>> + }
>> + }
> No need for the else. And you'll note you're once again inconsistent
> with your previous self.
>
> But here I worry about big.little a wee bit. I think we're allowed big
> and little cores on the same L3 these days, and you can't directly
> compare capacity between them.
>
>
<snip>

After pulling in the latest code I see that the changes by Mel Gorman
(commit 32e839dda3ba576943365f0f5817ce5c843137dc) have created a short
path for returning an idle CPU.

The fact that there now exists a short path that bypasses the rest of
select_idle_sibling (SIS) causes a regression in the
"hackbench + ping" testcase *when* I add capacity awareness to the
baseline code as was discussed here.

In detail: today's baseline has a shortcut via recent_used_cpu that
bypasses SIS. When I add capacity awareness to the SIS code path, the
extra search to find a better CPU itself takes more time than the
benefit it provides.
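
For context, the short path added by that commit looks roughly like
this (paraphrased from memory of the current code, so treat it as a
sketch rather than an exact quote); note that it returns before the
capacity-aware scan in select_idle_core()/select_idle_cpu() ever runs:

	/* Check a recently used CPU as a potential idle candidate */
	recent_used_cpu = p->recent_used_cpu;
	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    idle_cpu(recent_used_cpu) &&
	    cpumask_test_cpu(recent_used_cpu, &p->cpus_allowed)) {
		/* rotate the candidate so prev can be tried next time */
		p->recent_used_cpu = prev;
		return recent_used_cpu;
	}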

However, there are certain patches which reduce the SIS cost while
maintaining a similar spread of threads across CPUs. When I use those
patches I see that the benefit of adding capacity awareness is
restored. Please suggest how to proceed on this.

Thanks,
Rohit