2017-12-05 19:11:10

by Atish Patra

Subject: [PATCH RFC v2] Fix race window during idle cpu selection.

This patch tries to fix a possible race window that lets multiple
tasks wake up on the same cpu. schbench, uperf & Jackbench (Android)
mostly showed improvements, while none of the other standard benchmarks
showed any regression.

changes from v1->v2:
1. Added the performance numbers.
2. Did not include the debug patch.
It can still be found at
https://lkml.org/lkml/2017/10/31/16

Atish Patra (1):
sched: Minimize the idle cpu selection race window.

kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 12 +++++++++---
kernel/sched/idle_task.c | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 15 insertions(+), 3 deletions(-)

--
2.7.4


2017-12-05 19:11:15

by Atish Patra

Subject: [PATCH RFC v2] sched: Minimize the idle cpu selection race window.

Currently, multiple tasks can wake up on the same cpu from the
select_idle_sibling() path if they wake up simultaneously and last
ran on the same llc. This happens because an idle cpu is not updated
until its idle task is scheduled out. Any task waking up during that
period may potentially select that cpu as a wakeup candidate.

Introduce a per-cpu variable that is set as soon as a cpu is selected
for the wakeup of any task. This prevents other tasks from selecting
the same cpu again. Note: this does not close the race window, but
narrows it to the access of the per-cpu variable. If two wakee tasks
read the per-cpu variable at the same time, they may still select the
same cpu. But the race window is reduced considerably.
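
To illustrate why the window is narrowed but not closed, here is a
minimal userspace model of the claim/release idea (this is not the
kernel code in this patch; claim[], try_claim_cpu_racy(),
try_claim_cpu_atomic() and release_cpu() are made-up names for
illustration). The racy variant mirrors a plain flag test followed by
a plain store, which is the pattern used here; an atomic exchange
would close the window completely, at the cost of an atomic on the
wakeup fast path:

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 8

static atomic_int claim[NR_CPUS];

/*
 * Racy variant: two wakers can both observe the flag clear before
 * either of them sets it, so both may pick the same cpu.
 */
static bool try_claim_cpu_racy(int cpu)
{
	if (atomic_load_explicit(&claim[cpu], memory_order_relaxed))
		return false;	/* already claimed by another waker */
	/* <-- a second waker can slip in here and also read 0 */
	atomic_store_explicit(&claim[cpu], 1, memory_order_relaxed);
	return true;
}

/*
 * Closed-window variant: the test and the set happen in one atomic
 * step, so exactly one waker wins the claim.
 */
static bool try_claim_cpu_atomic(int cpu)
{
	return atomic_exchange_explicit(&claim[cpu], 1,
					memory_order_acq_rel) == 0;
}

/* Dropping the claim corresponds to the cpu going back to idle. */
static void release_cpu(int cpu)
{
	atomic_store_explicit(&claim[cpu], 0, memory_order_release);
}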

Here are some performance numbers:

Hardware config: 20-core (40 hyperthreaded cpus) x86 box.
uperf config: ping-pong test on loopback interface with message size = 8k

               Baseline (4.14)            Baseline + pcpu
   Threads      Mean      stdev            Mean      stdev   Improvement(%)
         1     9.056       0.02           8.966      0.083        -0.993
         2    17.664       0.13          17.448      0.303        -1.222
         4    32.03        0.22          31.972      0.129        -0.181
         8    58.198       0.31          58.588      0.198         0.670
        16   101.018       0.67         100.056      0.455        -0.952
        32   148.1        15.41         164.494      2.312        11.069
        64   203.66        1.16         203.042      1.348        -0.3073
       128   197.12        1.04         194.722      1.174        -1.2165

schbench config: message threads = 2; time = 180s; worker threads = 19
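
For reference, a schbench invocation matching this config would look
roughly like the line below (a sketch only; -m, -t and -r select the
message threads, worker threads and runtime in schbench, but the exact
command line used for this run is not stated here):

schbench -m 2 -t 19 -r 180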

                Baseline (4.14)             Base + pcpu
                 Mean      stdev           Mean      stdev   Improvement(%)
Max latency     16457    2046.51        12985.4      48.46   21.0949747828

Latency in usec (lower is better)
                 Mean      stdev           Mean      stdev   Improvement(%)
50%                63      4.774             58          4    7.936
75%              81.6      1.2             79.4        0.8    2.696
90%              92.8      0.979           90.6      0.489    2.370
95%             102.6      1.624             99      0.894    3.508
99%            126.25      4.573          120.4      3.872    4.63
99.5%          2712.2   4772.883          133.6      5.571   95.074

Reported by Joel:
Jackbench (Android)
count 16304 (@60 fps, 4.5 minutes)

              Without patch      With patch
mean             5.196633        4.429641 (+14.75%)
std              2.030054        2.310025
25%              5.606810        1.991017 (+64.48%)
50%              5.824013        5.716631 (+1.84%)
75%              5.987102        5.932751 (+0.90%)
95%              6.461230        6.301318 (+2.47%)
99%              9.828959        9.697076 (+1.34%)

The details are at:
https://lkml.org/lkml/2017/11/22/5

Suggested-by: Peter Zijlstra <[email protected]>
Tested-by: Joel Fernandes <[email protected]>
Signed-off-by: Atish Patra <[email protected]>
---
kernel/sched/core.c | 4 ++++
kernel/sched/fair.c | 12 +++++++++---
kernel/sched/idle_task.c | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2288a14..d9d501c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3896,6 +3896,7 @@ int task_prio(const struct task_struct *p)
return p->prio - MAX_RT_PRIO;
}

+DEFINE_PER_CPU(int, claim_wakeup);
/**
* idle_cpu - is a given CPU idle currently?
* @cpu: the processor in question.
@@ -3917,6 +3918,9 @@ int idle_cpu(int cpu)
return 0;
#endif

+ if (per_cpu(claim_wakeup, cpu))
+ return 0;
+
return 1;
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13393bb..885023a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,8 +6077,10 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
idle = false;
}

- if (idle)
+ if (idle) {
+ per_cpu(claim_wakeup, core) = 1;
return core;
+ }
}

/*
@@ -6102,8 +6104,10 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu(cpu, cpu_smt_mask(target)) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
+ if (idle_cpu(cpu)) {
+ per_cpu(claim_wakeup, cpu) = 1;
return cpu;
+ }
}

return -1;
@@ -6165,8 +6169,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
+ if (idle_cpu(cpu)) {
+ per_cpu(claim_wakeup, cpu) = 1;
break;
+ }
}

time = local_clock() - time;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..64d6495 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
{
put_prev_task(rq, prev);
update_idle_core(rq);
+ this_cpu_write(claim_wakeup, 0);
schedstat_inc(rq->sched_goidle);
return rq->idle;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8aa24b4..5f70b98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);

struct sched_group_capacity {
atomic_t ref;
--
2.7.4

2018-02-07 14:01:26

by Matt Fleming

Subject: Re: [PATCH RFC v2] sched: Minimize the idle cpu selection race window.

On Tue, 05 Dec, at 01:09:07PM, Atish Patra wrote:
> Currently, multiple tasks can wake up on the same cpu from the
> select_idle_sibling() path if they wake up simultaneously and last
> ran on the same llc. This happens because an idle cpu is not updated
> until its idle task is scheduled out. Any task waking up during that
> period may potentially select that cpu as a wakeup candidate.
>
> Introduce a per-cpu variable that is set as soon as a cpu is selected
> for the wakeup of any task. This prevents other tasks from selecting
> the same cpu again. Note: this does not close the race window, but
> narrows it to the access of the per-cpu variable. If two wakee tasks
> read the per-cpu variable at the same time, they may still select the
> same cpu. But the race window is reduced considerably.
>
> Here are some performance numbers:

I ran this patch through some tests here on the SUSE performance grid
and there's a definite regression for Mike's personal favourite
benchmark, tbench.

Here are the results: vanilla 4.15-rc9 on the left, -rc9 plus this
patch on the right.

tbench4
                               4.15.0-rc9                      4.15.0-rc9
                                  vanilla  sched-minimize-idle-cpu-window
Min mb/sec-1 484.50 ( 0.00%) 463.03 ( -4.43%)
Min mb/sec-2 961.43 ( 0.00%) 959.35 ( -0.22%)
Min mb/sec-4 1789.60 ( 0.00%) 1760.21 ( -1.64%)
Min mb/sec-8 3518.51 ( 0.00%) 3471.47 ( -1.34%)
Min mb/sec-16 5521.12 ( 0.00%) 5409.77 ( -2.02%)
Min mb/sec-32 7268.61 ( 0.00%) 7491.29 ( 3.06%)
Min mb/sec-64 14413.45 ( 0.00%) 14347.69 ( -0.46%)
Min mb/sec-128 13501.84 ( 0.00%) 13413.82 ( -0.65%)
Min mb/sec-192 13237.02 ( 0.00%) 13231.43 ( -0.04%)
Hmean mb/sec-1 505.20 ( 0.00%) 485.81 ( -3.84%)
Hmean mb/sec-2 973.12 ( 0.00%) 970.67 ( -0.25%)
Hmean mb/sec-4 1835.22 ( 0.00%) 1788.54 ( -2.54%)
Hmean mb/sec-8 3529.35 ( 0.00%) 3487.20 ( -1.19%)
Hmean mb/sec-16 5531.16 ( 0.00%) 5437.43 ( -1.69%)
Hmean mb/sec-32 7627.96 ( 0.00%) 8021.26 ( 5.16%)
Hmean mb/sec-64 14441.20 ( 0.00%) 14395.08 ( -0.32%)
Hmean mb/sec-128 13620.40 ( 0.00%) 13569.17 ( -0.38%)
Hmean mb/sec-192 13265.26 ( 0.00%) 13263.98 ( -0.01%)
Max mb/sec-1 510.30 ( 0.00%) 489.89 ( -4.00%)
Max mb/sec-2 989.45 ( 0.00%) 976.10 ( -1.35%)
Max mb/sec-4 1845.65 ( 0.00%) 1795.50 ( -2.72%)
Max mb/sec-8 3574.03 ( 0.00%) 3547.56 ( -0.74%)
Max mb/sec-16 5556.99 ( 0.00%) 5564.80 ( 0.14%)
Max mb/sec-32 7678.18 ( 0.00%) 8098.63 ( 5.48%)
Max mb/sec-64 14463.07 ( 0.00%) 14437.58 ( -0.18%)
Max mb/sec-128 13659.67 ( 0.00%) 13602.65 ( -0.42%)
Max mb/sec-192 13612.01 ( 0.00%) 13832.98 ( 1.62%)

There's a nice little performance bump around the 32-client mark.
Incidentally, my test machine has 2 NUMA nodes with 24 cpus (12 cores,
2 threads) each. So 32 clients is the point at which things no longer
fit on a single node.

It doesn't look like the regression is caused by the schedule() path
being slightly longer (i.e. it's not a latency issue), because the
schbench results show improvements at the low end:

schbench
                               4.15.0-rc9                      4.15.0-rc9
                                  vanilla  sched-minimize-idle-cpu-window
Lat 50.00th-qrtle-1 46.00 ( 0.00%) 36.00 ( 21.74%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 37.00 ( 24.49%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 38.00 ( 26.92%)
Lat 95.00th-qrtle-1 56.00 ( 0.00%) 41.00 ( 26.79%)
Lat 99.00th-qrtle-1 61.00 ( 0.00%) 46.00 ( 24.59%)
Lat 99.50th-qrtle-1 63.00 ( 0.00%) 48.00 ( 23.81%)
Lat 99.90th-qrtle-1 77.00 ( 0.00%) 64.00 ( 16.88%)
Lat 50.00th-qrtle-2 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-2 47.00 ( 0.00%) 46.00 ( 2.13%)
Lat 90.00th-qrtle-2 50.00 ( 0.00%) 49.00 ( 2.00%)
Lat 95.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 99.00th-qrtle-2 58.00 ( 0.00%) 57.00 ( 1.72%)
Lat 99.50th-qrtle-2 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 99.90th-qrtle-2 72.00 ( 0.00%) 69.00 ( 4.17%)
Lat 50.00th-qrtle-4 46.00 ( 0.00%) 45.00 ( 2.17%)
Lat 75.00th-qrtle-4 49.00 ( 0.00%) 48.00 ( 2.04%)
Lat 90.00th-qrtle-4 52.00 ( 0.00%) 51.00 ( 1.92%)
Lat 95.00th-qrtle-4 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 59.00 ( 3.28%)
Lat 99.50th-qrtle-4 63.00 ( 0.00%) 61.00 ( 3.17%)
Lat 99.90th-qrtle-4 69.00 ( 0.00%) 74.00 ( -7.25%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 50.00 ( -4.17%)
Lat 75.00th-qrtle-8 52.00 ( 0.00%) 54.00 ( -3.85%)
Lat 90.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%)
Lat 95.00th-qrtle-8 57.00 ( 0.00%) 61.00 ( -7.02%)
Lat 99.00th-qrtle-8 64.00 ( 0.00%) 68.00 ( -6.25%)
Lat 99.50th-qrtle-8 67.00 ( 0.00%) 72.00 ( -7.46%)
Lat 99.90th-qrtle-8 81.00 ( 0.00%) 81.00 ( 0.00%)
Lat 50.00th-qrtle-16 50.00 ( 0.00%) 47.00 ( 6.00%)
Lat 75.00th-qrtle-16 59.00 ( 0.00%) 57.00 ( 3.39%)
Lat 90.00th-qrtle-16 66.00 ( 0.00%) 65.00 ( 1.52%)
Lat 95.00th-qrtle-16 69.00 ( 0.00%) 68.00 ( 1.45%)
Lat 99.00th-qrtle-16 76.00 ( 0.00%) 75.00 ( 1.32%)
Lat 99.50th-qrtle-16 79.00 ( 0.00%) 79.00 ( 0.00%)
Lat 99.90th-qrtle-16 86.00 ( 0.00%) 89.00 ( -3.49%)
Lat 50.00th-qrtle-23 52.00 ( 0.00%) 52.00 ( 0.00%)
Lat 75.00th-qrtle-23 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 90.00th-qrtle-23 75.00 ( 0.00%) 74.00 ( 1.33%)
Lat 95.00th-qrtle-23 81.00 ( 0.00%) 79.00 ( 2.47%)
Lat 99.00th-qrtle-23 95.00 ( 0.00%) 90.00 ( 5.26%)
Lat 99.50th-qrtle-23 12624.00 ( 0.00%) 1050.00 ( 91.68%)
Lat 99.90th-qrtle-23 15184.00 ( 0.00%) 13872.00 ( 8.64%)

If you'd like to run these tests on your own machines, they're all
available at https://github.com/gormanm/mmtests.git.
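
For anyone unfamiliar with mmtests, a comparison run would look roughly
like the sketch below. run-mmtests.sh and its --config option come from
the repository above, but the config file name is only a placeholder;
pick the actual tbench/schbench configs shipped under configs/ in your
checkout:

git clone https://github.com/gormanm/mmtests.git
cd mmtests
# one run per kernel, each with its own result name
./run-mmtests.sh --config configs/config-network-tbench vanilla-4.15-rc9
# ... reboot into the patched kernel, then:
./run-mmtests.sh --config configs/config-network-tbench sched-minimize-idle-cpu-window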