2023-09-26 08:19:55

by Chen Yu

Subject: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

RFC -> v1:
- drop RFC
- Only record the short sleeping time for each task, to better honor the
burst sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for runqueue's cache-hot timeout value.
(Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
(K Prateek Nayak)

Thanks for your comments and review!

----------------------------------------------------------------------

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Inhibiting task migration while keeping the work conservation of the
scheduler could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch considers
the task's average sleep duration. If the task is a short-sleeping one,
its previous CPU is tagged as cache-hot for a short while. During this
reservation period, other wakees are not allowed to pick this idle CPU
until a timeout. Later, if the task is woken up again, it can find its
previous CPU still idle and choose it in select_idle_sibling().
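
In simplified form (condensed from the two patches below; see the diffs for
the exact code), the mechanism is:

  /* patch 1/2, enqueue at wakeup: track the task's average short sleep time */
  if (sleep_time < sysctl_sched_migration_cost)
          update_avg(&p->se.sis_rsv_avg, sleep_time);
  else
          p->se.sis_rsv_avg >>= 1;

  /* patch 2/2, dequeue when going to sleep: reserve the now-idle rq */
  if (task_sleep && !rq->nr_running && p->se.sis_rsv_avg)
          rq->cache_hot_timeout = max(rq->cache_hot_timeout,
                                      now + p->se.sis_rsv_avg);

  /* patch 2/2, select_idle_cpu(): a CPU with sched_clock_cpu(cpu) below
   * rq->cache_hot_timeout is treated as cache-hot; wakees prefer cache-cold
   * idle CPUs and fall back to the first cache-hot idle CPU only if no
   * cache-cold one exists. */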

This test is based on tip/sched/core, on top of
commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg").

Commit afc1996859a2 has already significantly reduced the cost of task
migration, and SIS_CACHE further reduces that cost. SIS_CACHE shows a
noticeable throughput improvement for netperf/tbench at around 100% load.

[patch 1/2] records the task's average short sleeping time in
its sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot
idle CPUs during wakeup.

Link: https://lore.kernel.org/lkml/[email protected]/ #1

Chen Yu (2):
sched/fair: Record the short sleeping time of a task
sched/fair: skip the cache hot CPU in select_idle_cpu()

include/linux/sched.h | 3 ++
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 87 insertions(+), 4 deletions(-)

--
2.25.1


2023-09-26 08:42:54

by Chen Yu

Subject: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Problem statement:
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Proposal:
Inspired by Mathieu's idea[1], introduce SIS_CACHE. It considers
the sleep time of the task for better task placement. Based on the
task's short sleeping history, keep p's previous CPU idle for a short
while. Later, when p is woken up, it can choose its previous CPU in
select_idle_sibling(). While p's previous CPU is reserved, other wakees
are not allowed to choose this CPU in select_idle_cpu(). The reservation
period is set to the task's average short sleep time, i.e. se->sis_rsv_avg.

This does not break the work conservation of the scheduler, because a
wakee will still try its best to find an idle CPU. The difference is that
different idle CPUs might have different priorities.

Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
which brings task stacking. Mitigate this by returning the first cache-hot
idle CPU if no cache-cold CPU is found.

Benchmarks:
Tested on Intel Xeon Sapphire Rapids, 56C/112T x 2 sockets. The CPU frequency
governor is set to performance, and C-states deeper than C1 are disabled.
There is a noticeable improvement in netperf/tbench at around 100% load.
A 0.7% improvement is observed on an OLTP (on-line transaction processing)
benchmark.

hackbench throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 4.34) +8.43 ( 10.67)
process-pipe 2-groups 1.00 ( 7.02) +0.48 ( 6.66)
process-pipe 4-groups 1.00 ( 1.69) +3.06 ( 5.26)
process-sockets 1-groups 1.00 ( 3.35) +0.51 ( 0.99)
process-sockets 2-groups 1.00 ( 3.56) +1.56 ( 0.85)
process-sockets 4-groups 1.00 ( 0.10) -0.10 ( 0.07)
threads-pipe 1-groups 1.00 ( 9.86) +8.21 ( 0.46)
threads-pipe 2-groups 1.00 ( 12.26) +1.95 ( 7.06)
threads-pipe 4-groups 1.00 ( 8.33) -5.81 ( 7.16)
threads-sockets 1-groups 1.00 ( 1.77) +0.41 ( 2.29)
threads-sockets 2-groups 1.00 ( 5.26) -7.86 ( 9.41)
threads-sockets 4-groups 1.00 ( 0.04) +0.02 ( 0.08)

netperf throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.57) +0.08 ( 1.23)
TCP_RR 112-threads 1.00 ( 1.01) -0.41 ( 0.28)
TCP_RR 168-threads 1.00 ( 0.95) -1.00 ( 0.70)
TCP_RR 224-threads 1.00 ( 0.70) +119.95 ( 34.26)
TCP_RR 280-threads 1.00 ( 16.77) -0.37 ( 17.40)
TCP_RR 336-threads 1.00 ( 20.37) -0.33 ( 22.35)
TCP_RR 392-threads 1.00 ( 28.30) -0.48 ( 31.66)
TCP_RR 448-threads 1.00 ( 39.48) -0.13 ( 38.72)
UDP_RR 56-threads 1.00 ( 1.47) -0.09 ( 1.57)
UDP_RR 112-threads 1.00 ( 7.88) -0.05 ( 9.98)
UDP_RR 168-threads 1.00 ( 10.48) -1.27 ( 16.89)
UDP_RR 224-threads 1.00 ( 15.08) -2.13 ( 23.49)
UDP_RR 280-threads 1.00 ( 21.16) +0.17 ( 23.05)
UDP_RR 336-threads 1.00 ( 24.41) +0.46 ( 22.11)
UDP_RR 392-threads 1.00 ( 27.58) +0.01 ( 29.47)
UDP_RR 448-threads 1.00 ( 37.33) +0.08 ( 56.11)

tbench throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 1.87) -0.45 ( 2.10)
loopback 112-threads 1.00 ( 0.21) -0.86 ( 0.12)
loopback 168-threads 1.00 ( 0.08) -2.47 ( 0.21)
loopback 224-threads 1.00 ( 2.57) +13.39 ( 0.79)
loopback 280-threads 1.00 ( 0.20) -0.07 ( 0.39)
loopback 336-threads 1.00 ( 0.14) +0.14 ( 0.22)
loopback 392-threads 1.00 ( 0.32) +0.25 ( 0.23)
loopback 448-threads 1.00 ( 0.87) +1.45 ( 0.17)

schbench 99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.36) +1.02 ( 0.63)
normal 2-mthreads 1.00 ( 2.44) -3.74 ( 3.92)
normal 4-mthreads 1.00 ( 3.14) +4.17 ( 3.87)

Analysis:
Gautham is interested in the reason why waking up the task on its previous
CPU brings benefits. The following digs into that.

Take the netperf 224-thread case as an example: netperf improves a lot
because less task migration brings better cache locality, as described
below.

bpftrace script to track the number of task migrations in each 10-second window:
kretfunc:select_task_rq_fair
{
    $p = (struct task_struct *)args->p;
    if ($p->comm == "netperf") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_netperf = count();
        } else {
            @wakeup_prev_netperf = count();
        }
    }
}

interval:s:10
{
    time("\n%H:%M:%S Wakeup statistics: \n");
    print(@wakeup_migrate_netperf);
    clear(@wakeup_migrate_netperf);
    print(@wakeup_prev_netperf);
    clear(@wakeup_prev_netperf);
}
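
For reference, the script above can be saved to a file (the name wakeup.bt is
just an example) and run as root; kretfunc probes rely on kernel BTF, so a
kernel built with CONFIG_DEBUG_INFO_BTF is assumed:

bpftrace wakeup.bt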

NO_SIS_CACHE:
@wakeup_migrate_netperf: 48002669
@wakeup_prev_netperf: 14853103

SIS_CACHE:
@wakeup_migrate_netperf: 571
@wakeup_prev_netperf: 136620030

The task migration ratio has been reduced a lot with SIS_CACHE.

Use perf topdown to track the PMU events by running the following commands:
perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_backend_bound_group -- sleep 10
perf stat -M tma_core_bound_group -- sleep 10
perf stat -M tma_ports_utilization_group -- sleep 10
perf stat -M tma_memory_bound_group -- sleep 10

NO_SIS_CACHE:
19.1 % tma_backend_bound <--------
7.1 % tma_core_bound
0.2 % tma_divider
35.0 % tma_ports_utilization
13.9 % tma_ports_utilized_0
15.0 % tma_ports_utilized_1
16.5 % tma_ports_utilized_3m
8.3 % tma_memory_bound
4.8 % tma_dram_bound
2.1 % tma_l2_bound
14.4 % tma_bad_speculation
43.0 % tma_frontend_bound
23.4 % tma_retiring

SIS_CACHE:
5.4 % tma_backend_bound <--------
7.5 % tma_core_bound
0.2 % tma_divider
39.8 % tma_ports_utilization
13.9 % tma_ports_utilized_0
15.3 % tma_ports_utilized_1
16.5 % tma_ports_utilized_3m
6.7 % tma_memory_bound
4.5 % tma_dram_bound
2.2 % tma_l2_bound
16.6 % tma_bad_speculation
44.7 % tma_frontend_bound
33.3 % tma_retiring

The tma_backend_bound ratio has decreased a lot. Within the backend-bound
metrics, tma_ports_utilized_x is the fraction of cycles in which the CPU
executed x uops per cycle across all execution ports (on this platform
these are Logical Processor cycles); it corresponds to the RESOURCE_STALLS
cycles on this Logical Processor. This indicates that waking the task up on
its previous logical CPU reduces the cycles spent on uops execution.
Meanwhile, the l2_bound is also reduced due to better L2 cache locality.

The reason why netperf shows more improvement on TCP_RR than on UDP_RR
might be that TCP_RR has a shorter sleep time than UDP_RR (per the
bpftrace result), which is easier for SIS_CACHE to detect and take care
of. TCP has to maintain a connection between the sender and receiver, so
the CPU has to do more work before falling asleep. UDP finishes sending
a packet much faster and enters sleep more frequently.

Limitations:
As pointed out by Gautham, other tasks could still be woken up on that
reserved CPU due to:

1) wake-affine choosing that CPU
2) newidle-balance pulling the other task on that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

Currently we do not deal with these cases and only propose a simple
method; incremental optimizations could follow later.

Link: https://lore.kernel.org/lkml/[email protected]/ #1
Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 297b9470829c..12a8b4594dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6632,6 +6632,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
now = sched_clock_cpu(cpu_of(rq));
p->se.prev_dequeue_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+ /*
+ * If this rq will become idle, and if the dequeued
+ * task is a short sleeping one, check if we can reserve
+ * this idle CPU for that task for a short while.
+ * During this reservation period, other wakees will
+ * skip this 'idle' CPU in select_idle_cpu(), and this
+ * short sleeping task can pick its previous CPU in
+ * select_idle_sibling(), which brings better cache
+ * locality.
+ */
+ if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+ p->se.sis_rsv_avg)
+ rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sis_rsv_avg);
+#endif
}

#ifdef CONFIG_SMP
@@ -6982,6 +6997,25 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return new_cpu;
}

+/*
+ * If a short sleeping task has once run on this idle CPU
+ * not long ago, return 1 to indicate that the CPU is still
+ * cache-hot for that task, and no one except for that task
+ * should pick this cache-hot CPU during wakeup.
+ *
+ * The CPU is expected to be idle when invoking this function.
+ */
+static int cache_hot_cpu(int cpu)
+{
+ if (!sched_feat(SIS_CACHE))
+ return 0;
+
+ if (sched_clock_cpu(cpu) >= cpu_rq(cpu)->cache_hot_timeout)
+ return 0;
+
+ return 1;
+}
+
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
@@ -7125,7 +7159,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ int i, cpu, idle_cpu = -1, nr = INT_MAX, first_hot_cpu = -1;
struct sched_domain_shared *sd_share;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
@@ -7180,18 +7214,41 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
- if ((unsigned int)i < nr_cpumask_bits)
+ /*
+ * Only the cache-cold idle CPU is returned. If no
+ * cache-cold CPU is found, choose the first idle cpu
+ * stored in idle_cpu.
+ */
+ if ((unsigned int)i < nr_cpumask_bits && !cache_hot_cpu(i))
return i;

} else {
if (!--nr)
return -1;
idle_cpu = __select_idle_cpu(cpu, p);
- if ((unsigned int)idle_cpu < nr_cpumask_bits)
- break;
+ if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+ if (!cache_hot_cpu(idle_cpu)) {
+ first_hot_cpu = -1;
+ break;
+ }
+
+ /*
+ * Record the first cache-hot idle CPU
+ * as the last resort. This is to deal
+ * with the case that, every CPU is
+ * cache-hot, and we want to choose an
+ * idle CPU over a non-idle target CPU.
+ */
+ if (first_hot_cpu == -1)
+ first_hot_cpu = idle_cpu;
+ }
}
}

+ /* all idle CPUs are cache-hot, choose the first cache-hot one */
+ if (first_hot_cpu != -1)
+ idle_cpu = first_hot_cpu;
+
if (has_idle_core)
set_idle_cores(target, false);

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 887468c48ff6..7aa728289395 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,6 +1079,7 @@ struct rq {
#endif
u64 idle_stamp;
u64 avg_idle;
+ u64 cache_hot_timeout;

unsigned long wake_stamp;
u64 wake_avg_idle;
--
2.25.1

2023-09-27 05:18:20

by Chen Yu

Subject: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

During task wakeup, the wakee first checks whether its previous
running CPU is idle. If yes, that CPU is its first
choice. However, in most cases, the wakee's previous CPU
could already have been chosen by someone else, which breaks the
cache locality.

Propose a mechanism to reserve the task's previous
CPU for a short while. During this reservation period, other
tasks are not allowed to pick that CPU until a timeout.
The reservation period is defined as the average short
sleep time of the task, that is, the time delta
between this task being dequeued and enqueued.
Only sleep times shorter than sysctl_sched_migration_cost
are recorded. If the sleep time is longer than
sysctl_sched_migration_cost, the reservation period is
penalized by shrinking it to half. In this way, the 'burst'
sleeping time of the task is honored; meanwhile, if the
task becomes a long sleeper, its reservation time
shrinks to reduce the impact on task wakeup.
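
For reference, update_avg() used in the diff below follows the scheduler's
usual 1/8-weight moving-average form. A minimal user-space sketch of how
sis_rsv_avg would evolve is shown here; the sample values and the 500000 ns
stand-in for sysctl_sched_migration_cost are illustrative only:

#include <stdio.h>
#include <stdint.h>

/* Same shape as the scheduler's update_avg(): EWMA with 1/8 weight. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
    int64_t diff = (int64_t)(sample - *avg);
    *avg += diff / 8;
}

int main(void)
{
    uint64_t migration_cost = 500000;   /* default sysctl value, in ns */
    uint64_t sleeps[] = { 20000, 30000, 25000, 2000000, 22000 };
    uint64_t sis_rsv_avg = 0;

    for (unsigned int i = 0; i < sizeof(sleeps) / sizeof(sleeps[0]); i++) {
        if (sleeps[i] < migration_cost)
            update_avg(&sis_rsv_avg, sleeps[i]);  /* short sleep: average it */
        else
            sis_rsv_avg >>= 1;                    /* long sleep: shrink reservation */
        printf("sleep=%llu ns -> sis_rsv_avg=%llu ns\n",
               (unsigned long long)sleeps[i],
               (unsigned long long)sis_rsv_avg);
    }
    return 0;
}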

Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc37ae787e33..4a0ac0276384 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
u64 vruntime;
s64 vlag;
u64 slice;
+ u64 prev_dequeue_time;
+ /* the reservation period of this task during wakeup */
+ u64 sis_rsv_avg;

u64 nr_migrations;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0877878bcdb..297b9470829c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ u64 last_dequeue = p->se.prev_dequeue_time;
+ u64 now = sched_clock_cpu(task_cpu(p));
+
+ /*
+ * If the task is a short-sleeping task, there is no need
+ * to migrate it to other CPUs. Estimate the average short sleeping
+ * time of the wakee. This sleep time is used as a hint to reserve
+ * the dequeued task's previous CPU for a short while. During this
+ * reservation period, select_idle_cpu() prevents other wakees from
+ * choosing this CPU. This could bring a better cache locality.
+ */
+ if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
+ now > last_dequeue) {
+ if (now - last_dequeue < sysctl_sched_migration_cost)
+ update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
+ else
+ p->se.sis_rsv_avg >>= 1;
+ }

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
+ u64 now;

util_est_dequeue(&rq->cfs, p);

@@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+ now = sched_clock_cpu(cpu_of(rq));
+ p->se.prev_dequeue_time = task_sleep ? now : 0;
}

#ifdef CONFIG_SMP
--
2.25.1

2023-09-27 08:55:04

by Aaron Lu

Subject: Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
>
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
>
> Suggested-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Chen Yu <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 21 +++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
> u64 vruntime;
> s64 vlag;
> u64 slice;
> + u64 prev_dequeue_time;
> + /* the reservation period of this task during wakeup */
> + u64 sis_rsv_avg;

Nit: this info is only relevant for tasks, not groups, so it might be better
to move it to task_struct to save a little memory?

>
> u64 nr_migrations;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int idle_h_nr_running = task_has_idle_policy(p);
> int task_new = !(flags & ENQUEUE_WAKEUP);
> + u64 last_dequeue = p->se.prev_dequeue_time;
> + u64 now = sched_clock_cpu(task_cpu(p));

I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
seems to suggest task_cpu(p) can be different from cpu_of(rq).

> +
> + /*
> + * If the task is a short-sleepting task, there is no need
> + * to migrate it to other CPUs. Estimate the average short sleeping
> + * time of the wakee. This sleep time is used as a hint to reserve
> + * the dequeued task's previous CPU for a short while. During this
> + * reservation period, select_idle_cpu() prevents other wakees from
> + * choosing this CPU. This could bring a better cache locality.
> + */
> + if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&

Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
will be enqueued there, right?

Thanks,
Aaron

> + now > last_dequeue) {
> + if (now - last_dequeue < sysctl_sched_migration_cost)
> + update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> + else
> + p->se.sis_rsv_avg >>= 1;
> + }
>
> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> int task_sleep = flags & DEQUEUE_SLEEP;
> int idle_h_nr_running = task_has_idle_policy(p);
> bool was_sched_idle = sched_idle_rq(rq);
> + u64 now;
>
> util_est_dequeue(&rq->cfs, p);
>
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> + now = sched_clock_cpu(cpu_of(rq));
> + p->se.prev_dequeue_time = task_sleep ? now : 0;
> }
>
> #ifdef CONFIG_SMP
> --
> 2.25.1
>

2023-09-27 09:25:47

by Ingo Molnar

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup


* Chen Yu <[email protected]> wrote:

> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inhibit the task migration while keeping the work conservation of
> scheduler could benefit many workloads. Inspired by Mathieu's
> proposal to limit the task migration ratio[1], this patch considers
> the task average sleep duration. If the task is a short sleeping one,
> then tag its previous CPU as cache hot for a short while. During this
> reservation period, other wakees are not allowed to pick this idle CPU
> until a timeout. Later if the task is woken up again, it can find its
> previous CPU still idle, and choose it in select_idle_sibling().

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

Ingo

2023-09-27 21:42:27

by Tim Chen

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <[email protected]> wrote:
>
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Ingo,

Mathieu's patches, which detect overly high task migration rates and then
rate-limit migration, are a way to detect that tasks are playing
CPU musical chairs and are in a pathological state.

Would the migration rate be a reasonable indicator that we need to
do something like the SIS_CACHE proposal to reduce pathological migrations,
so the tasks don't get jerked all over?
Or do you have some other, better indicators in mind?

We did some experiments with the OLTP workload on a 112 core, 2 socket
SPR machine. The OLTP workload has a mixture of threads
handling database updates on disks and handling transaction
queries over the network.

For Mathieu's original task migration rate limit patches
we saw a 1.2% improvement, and for Chen Yu's SIS_CACHE proposal we
saw a 0.7% improvement. The system is running at
~94% busy, so it is under high utilization. The variation of this workload
is less than 0.2%. There are improvements for such a mixed workload,
though not as much as for the microbenchmarks. These
data are preliminary and we are still doing more experiments.

For the OLTP experiments, each socket's 64 cores are divided
into sub-NUMA clusters of 4 nodes with 16 cores each, so the scheduling
overhead in the idle CPU search is much less than if SNC were off.

Thanks.

Tim

2023-09-27 22:49:30

by Mathieu Desnoyers

Subject: Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

On 9/26/23 06:11, Chen Yu wrote:
> Problem statement:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Proposal:
> Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> the sleep time of the task for better task placement. Based on the
> task's short sleeping history, keep p's previous CPU idle for a short
> while. Later when p is woken up, it can choose its previous CPU in
> select_idle_sibling(). When p's previous CPU is reserved, another wakee
> is not allowed to choose this CPU in select_idle_idle(). The reservation
> period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
>
> This does not break the work conservation of the scheduler, because
> wakee will still try its best to find an idle CPU. The difference is that
> different idle CPUs might have different priorities.
>
> Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> which brings task stacking. Mitigate this by returning the first cache-hot
> idle CPU if no cache-cold CPU is found.

I've tried your patches on my reference hackbench workload:

./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

Unfortunately they don't appear to help for that specific load.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-09-28 10:02:49

by Chen Yu

Subject: Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

Hi Aaron,

On 2023-09-27 at 15:53:33 +0800, Aaron Lu wrote:
> On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> >
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> >
> > Suggested-by: Mathieu Desnoyers <[email protected]>
> > Signed-off-by: Chen Yu <[email protected]>
> > ---
> > include/linux/sched.h | 3 +++
> > kernel/sched/fair.c | 21 +++++++++++++++++++++
> > 2 files changed, 24 insertions(+)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> > u64 vruntime;
> > s64 vlag;
> > u64 slice;
> > + u64 prev_dequeue_time;
> > + /* the reservation period of this task during wakeup */
> > + u64 sis_rsv_avg;
>
> Nit: these info are only relavant for task, not group so might be better
> to move them to task_struct to save a little memory?
>

Yes, I'll try to do this.

> >
> > u64 nr_migrations;
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > struct sched_entity *se = &p->se;
> > int idle_h_nr_running = task_has_idle_policy(p);
> > int task_new = !(flags & ENQUEUE_WAKEUP);
> > + u64 last_dequeue = p->se.prev_dequeue_time;
> > + u64 now = sched_clock_cpu(task_cpu(p));
>
> I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
> seems to suggest task_cpu(p) can be different from cpu_of(rq).
>

You are right. My original thought was to use the task's previous CPU rather than
the current rq, but at this stage the task's CPU has already been updated
to be the same as the rq's. I'll think more about how to deal with this properly.

> > +
> > + /*
> > + * If the task is a short-sleepting task, there is no need
> > + * to migrate it to other CPUs. Estimate the average short sleeping
> > + * time of the wakee. This sleep time is used as a hint to reserve
> > + * the dequeued task's previous CPU for a short while. During this
> > + * reservation period, select_idle_cpu() prevents other wakees from
> > + * choosing this CPU. This could bring a better cache locality.
> > + */
> > + if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
>
> Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
> will be enqueued there, right?
>

Right. If rq and the task's CPU are the same, there is no need to check cpu_online.

thanks,
Chenyu

2023-09-28 15:09:48

by Chen Yu

Subject: Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Mathieu,

On 2023-09-27 at 12:11:33 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > Problem statement:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Proposal:
> > Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> > the sleep time of the task for better task placement. Based on the
> > task's short sleeping history, keep p's previous CPU idle for a short
> > while. Later when p is woken up, it can choose its previous CPU in
> > select_idle_sibling(). When p's previous CPU is reserved, another wakee
> > is not allowed to choose this CPU in select_idle_idle(). The reservation
> > period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
> >
> > This does not break the work conservation of the scheduler, because
> > wakee will still try its best to find an idle CPU. The difference is that
> > different idle CPUs might have different priorities.
> >
> > Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> > cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> > which brings task stacking. Mitigate this by returning the first cache-hot
> > idle CPU if no cache-cold CPU is found.
>
> I've tried your patches on my reference hackbench workload:
>
> ./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> Unfortunately they don't appear to help for that specific load.
>

I just ran the same test on a 224 CPU system and it seems that
there is not much difference with/without SIS_CACHE. To figure out
the reason, I used bpftrace to track how often hackbench is woken
up on its previous CPU:

kretfunc:select_task_rq_fair
{
    $p = (struct task_struct *)args->p;
    if ($p->comm == "sender") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_sender = count();
        } else {
            @wakeup_prev_sender = count();
        }
    }

    if ($p->comm == "receiver") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_receiver = count();
        } else {
            @wakeup_prev_receiver = count();
        }
    }
}

and print the data every 10 seconds:

NO_SIS_CACHE:
23:50:24 Wakeup statistics:
@wakeup_migrate_sender: 9043961
@wakeup_prev_sender: 20073128
@wakeup_migrate_receiver: 12071462
@wakeup_prev_receiver: 19587895

sender: migration/previous = 45.06%
receiver: migration/previous = 61.612%


SIS_CACHE:
23:49:21 Wakeup statistics:
@wakeup_migrate_sender: 6716902
@wakeup_prev_sender: 22727513
@wakeup_migrate_receiver: 11547623
@wakeup_prev_receiver: 24615810

sender: migration/previous = 29.55%
receiver: migration/previous = 46.91%

Both the sender and receiver in hackbench have a higher chance
of being woken up on their previous CPU, but not as much as netperf.
Why is there not much score difference? I checked the bottleneck
via perf topdown.

perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_frontend_bound_group -- sleep 10
perf stat -M tma_fetch_latency_group -- sleep 10

NO_SIS_CACHE:
15.2 % tma_backend_bound
14.9 % tma_bad_speculation
43.9 % tma_frontend_bound
30.3 % tma_fetch_latency
9.7 % tma_ms_switches
14.0 % tma_fetch_bandwidth
26.1 % tma_retiring


SIS_CACHE:
14.5 % tma_backend_bound
15.3 % tma_bad_speculation
44.5 % tma_frontend_bound
31.5 % tma_fetch_latency
10.6 % tma_ms_switches
13.0 % tma_fetch_bandwidth
25.8 % tma_retiring

The ratios do not change much with/without SIS_CACHE enabled.
This is because SIS_CACHE might bring benefit if tasks have a large
cache footprint (backend bound, like netperf), but it seems that hackbench
pipe mode is frontend bound and its bottleneck is the complexity of the
instructions being executed (tma_ms_switches: the MS decodes complex
instructions, and an increase in the MS switch counter usually means
the workload is running some complex instructions). That is to say, the
pipe_read/write code path could be the bottleneck. Your original
rate limit on task migration might be more aggressive in reducing
the tma_backend_bound ratio, and that might bring the score benefit,
as in the netperf case. Let me apply your original patch to confirm
whether this is the case.

thanks,
Chenyu

2023-09-29 00:26:43

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
>
> * Chen Yu <[email protected]> wrote:
>
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>

Currently the code does not really reserve the idle CPU or force it
to stay idle. We just give other wakees a suggested search order to find
an idle CPU. If all idle CPUs are in the reserved state, the first reserved
idle CPU will be picked rather than left idle, so the idle CPU resource is
still fully utilized. The main impact is on wakeup latency, if I understand
correctly. Let me run the latest schbench and monitor these latency statistics
in detail.

> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Thanks for the suggestion, Ingo. Yes, we should launch more tests to evaluate this
proposal. As Tim mentioned, we have previously tested it using an OLTP benchmark
as described in PATCH [2/2]. I'm thinking of running more benchmarks to get
a wider understanding of how this change would impact them, both the positive and
the negative parts.

thanks,
Chenyu

2023-10-05 16:42:26

by K Prateek Nayak

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hello Chenyu,

On 9/26/2023 10:40 AM, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
> burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
> rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
> for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> (K Prateek Nayak)
>
> Thanks for your comments and review!

Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.

tl;dr

- Small regression in tbench and netperf, possibly due to more searching
for an idle CPU.

- Small regression in schbench (old) at 256 workers, albeit with large
run-to-run variance.

- Other benchmarks are more or less the same.

I'll leave the full results below

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip: tip:sched/core at commit 5fe7765997b1 (sched/deadline: Make
dl_rq->pushable_dl_tasks update drive dl_rq->overloaded)

- SIS_CACHE: tip + this series


o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.36) 1.01 [ -1.47]( 3.02)
2-groups 1.00 [ -0.00]( 2.35) 0.99 [ 0.92]( 1.01)
4-groups 1.00 [ -0.00]( 1.79) 0.98 [ 2.34]( 0.63)
8-groups 1.00 [ -0.00]( 0.84) 0.98 [ 1.73]( 1.02)
16-groups 1.00 [ -0.00]( 2.39) 0.97 [ 2.76]( 2.33)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ 0.00]( 0.86) 0.97 [ -2.68]( 0.74)
2 1.00 [ 0.00]( 0.99) 0.98 [ -2.18]( 0.17)
4 1.00 [ 0.00]( 0.49) 0.98 [ -2.47]( 1.15)
8 1.00 [ 0.00]( 0.96) 0.96 [ -3.81]( 0.24)
16 1.00 [ 0.00]( 1.38) 0.96 [ -4.33]( 1.31)
32 1.00 [ 0.00]( 1.64) 0.95 [ -4.70]( 1.59)
64 1.00 [ 0.00]( 0.92) 0.97 [ -2.97]( 0.49)
128 1.00 [ 0.00]( 0.57) 0.99 [ -1.15]( 0.57)
256 1.00 [ 0.00]( 0.38) 1.00 [ 0.03]( 0.79)
512 1.00 [ 0.00]( 0.04) 1.00 [ 0.43]( 0.34)
1024 1.00 [ 0.00]( 0.20) 1.00 [ 0.41]( 0.13)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 2.52) 0.93 [ -6.90]( 6.75)
Scale 1.00 [ 0.00]( 6.38) 0.99 [ -1.18]( 7.45)
Add 1.00 [ 0.00]( 6.54) 0.97 [ -2.55]( 7.34)
Triad 1.00 [ 0.00]( 5.18) 0.95 [ -4.64]( 6.81)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 0.74) 1.00 [ -0.20]( 1.69)
Scale 1.00 [ 0.00]( 6.25) 1.03 [ 3.46]( 0.55)
Add 1.00 [ 0.00]( 6.53) 1.05 [ 4.58]( 0.43)
Triad 1.00 [ 0.00]( 5.14) 0.98 [ -1.78]( 6.24)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.27) 0.98 [ -1.50]( 0.14)
2-clients 1.00 [ 0.00]( 1.32) 0.98 [ -2.35]( 0.54)
4-clients 1.00 [ 0.00]( 0.40) 0.98 [ -2.35]( 0.56)
8-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.72]( 0.50)
16-clients 1.00 [ 0.00]( 0.54) 0.96 [ -3.92]( 0.86)
32-clients 1.00 [ 0.00]( 1.38) 0.97 [ -3.10]( 0.44)
64-clients 1.00 [ 0.00]( 1.78) 0.97 [ -3.44]( 1.70)
128-clients 1.00 [ 0.00]( 1.09) 0.94 [ -5.75]( 2.67)
256-clients 1.00 [ 0.00]( 4.45) 0.97 [ -2.61]( 4.93)
512-clients 1.00 [ 0.00](54.70) 0.98 [ -1.64](55.09)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)
512 1.00 [ -0.00]( 2.93) 0.96 [ 3.52]( 3.56)

==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: 0.27%)


==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: -0.45%)


==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip SIS_CACHE(pct imp)
1CCD 1 1.00 1.00 (%diff: -0.47%)
2CCD 2 1.00 0.98 (%diff: -2.34%)
4CCD 4 1.00 1.00 (%diff: -0.29%)
8CCD 8 1.00 1.01 (%diff: 0.54%)

>
> ----------------------------------------------------------------------
>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2023-10-07 03:23:49

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/26/2023 10:40 AM, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> > select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
>
> Sorry for the delay! I'll leave the test results from a 3rd Generation
> EPYC system below.
>
> tl;dr
>
> - Small regression in tbench and netperf possible due to more searching
> for an idle CPU.
>
> - Small regression in schbench (old) at 256 workers albeit with large
> run to run variance.
>
> - Other benchmarks are more or less same.
>
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
> 2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
> 4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
> 8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
> 16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
> 32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
> 64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
> 128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
> 256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)

Thanks for the testing. The latency regression from schbench is
quite obvious and, as you mentioned, it is possibly due to the longer
scan time in select_idle_cpu(). I'll run the same test with a split
LLC to see whether I can reproduce the issue.
I'm also working with Mathieu on another direction, choosing the previous CPU
over the current CPU when the system is overloaded; that should be
more moderate, and I'll post the test results later.

thanks,
Chenyu

2023-10-17 09:53:10

by Madadi Vineeth Reddy

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
> burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
> rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
> for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> (K Prateek Nayak)
>
> Thanks for your comments and review!
>
> ----------------------------------------------------------------------

Regarding the trade-off between a longer scan for an idle CPU and the cache
benefits, I ran some benchmarks.

Tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.

Below are some of the benchmark results

schbench 99.0th latency (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)


The schbench results show that there is not much impact on wakeup latencies due to more iterations
in the search for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers are slightly better
for SIS_CACHE in the 4-mthreads case. I think we can ignore the last case due to huge run-to-run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)

The main goal of the patch, improving cache locality, is reflected here, as SIS_CACHE improves this workload
mainly when the loads per consumer iteration are lower.

hackbench normalized time in seconds (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)

The hackbench results are similar on both kernels, except for the 29% improvement
in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========

As per Ingo's suggestion, I ran a real-life workload, daytrader.

baseline:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10124.5 2 0 3970

SIS_CACHE:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10319.5 2 0 5771

In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks and Regards
Madadi Vineeth Reddy

2023-10-17 11:10:56

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 26/09/23 10:40, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> > select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
> >
> > ----------------------------------------------------------------------
>
> Regarding making the scan for finding an idle cpu longer vs cache benefits,
> I ran some benchmarks.
>

Thanks very much for your interest and your time on the patch.

> Tested the patch on power system with 12 cores. Total of 96 CPU's.
> System has two NUMA nodes.
>
> Below are some of the benchmark results
>
> schbench 99.0th latency (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>
>
> schbench results are showing that there is not much impact in wakeup latencies due to more iterations
> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better
> for SIS_CACHE in case of 4-mthreads.

The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run to run variations.

Although the run-to-run variation is large, it seems that the decrease is within that range.
Prateek has also reported that there could be some regression in schbench
when the system is overloaded:
https://lore.kernel.org/lkml/[email protected]/
Could you also post the raw data printed by schbench? Using the latest schbench might also give the
latency in more detail.

> producer_consumer avg time/access (lower is better)
> ========
> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>
> The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload,
> mainly when loads per consumer iteration is lower.
>
> hackbench normalized time in seconds (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>
> hackbench results are similar in both kernels except the case where there is an improvement of
> 29% in case of threads-pipe case with 1 groups.
>
> Daytrader throughput (higher is better)
> ========
>
> As per Ingo suggestion, ran a real life workload daytrader
>
> baseline:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10124.5 2 0 3970
>
> SIS_CACHE:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10319.5 2 0 5771
>
> In the above run, daytrader perfomance was 2% better in case of SIS_CACHE.
>

Thanks for bringing the good news that a real-life workload benefits from this change.
I'll tune this patch a little to address the regression from schbench. Also,
I'm working with Mathieu on his proposal to make it easier for the wakee to choose its previous
CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to make more
platforms benefit from this change.
https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2023-10-18 19:39:52

by Madadi Vineeth Reddy

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
>
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> RFC -> v1:
>>> - drop RFC
>>> - Only record the short sleeping time for each task, to better honor the
>>> burst sleeping tasks. (Mathieu Desnoyers)
>>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>>> (Mathieu Desnoyers, Aaron Lu)
>>> - Introduce a new helper function cache_hot_cpu() that considers
>>> rq->cache_hot_timeout. (Aaron Lu)
>>> - Add analysis of why inhibiting task migration could bring better throughput
>>> for some benchmarks. (Gautham R. Shenoy)
>>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>>> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>>> (K Prateek Nayak)
>>>
>>> Thanks for your comments and review!
>>>
>>> ----------------------------------------------------------------------
>>
>> Regarding the trade-off between a longer scan for an idle CPU and the cache benefits,
>> I ran some benchmarks.
>>
>
> Thanks very much for your interest and your time on the patch.
>
>> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
>> The system has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
>> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
>> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
>> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>>
>>
>> schbench results show that there is not much impact on wakeup latencies from the extra iterations
>> spent searching for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers
>> are slightly better for SIS_CACHE in the 4-mthreads case.
>
> The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.
>
>> I think we can ignore the last case due to huge run to run variations.
>
> Although the run-to-run variation is large, the decrease still seems to fall within that range.
> Prateek has also reported that there could be some regression from schbench when the system
> is overloaded:
> https://lore.kernel.org/lkml/[email protected]/
> Could you also post the raw data printed by schbench? Using the latest schbench might also
> give a more detailed breakdown of the latency.
>

raw data by schbench(old) with 6-mthreads
======================

Baseline (5 runs)
========
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 981
99.5000th: 4424
99.9000th: 9200
min=0, max=29497

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 35
95.0000th: 38
*99.0000th: 495
99.5000th: 3924
99.9000th: 9872
min=0, max=29997

Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 36
95.0000th: 39
*99.0000th: 1326
99.5000th: 4744
99.9000th: 10000
min=0, max=23394

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 55
99.5000th: 3292
99.9000th: 9104
min=0, max=25196

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 711
99.5000th: 4600
99.9000th: 9424
min=0, max=19997

SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 35
95.0000th: 38
*99.0000th: 1894
99.5000th: 5464
99.9000th: 10000
min=0, max=19157

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2396
99.5000th: 6664
99.9000th: 10000
min=0, max=24029

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2132
99.5000th: 6296
99.9000th: 10000
min=0, max=25313

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 1090
99.5000th: 6232
99.9000th: 9744
min=0, max=27264

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 38
*99.0000th: 1786
99.5000th: 5240
99.9000th: 9968
min=0, max=24754

As indicated, the above data has large run-to-run variation, and in general the 99th percentile
latency is higher with SIS_CACHE.
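
For reference, the small sketch below (plugging in the *99.0000th values from the five runs
above) makes the spread and the shift in the mean easier to see:

import statistics

baseline_p99 = [981, 495, 1326, 55, 711]        # usec, old schbench, baseline runs
sis_cache_p99 = [1894, 2396, 2132, 1090, 1786]  # usec, old schbench, SIS_CACHE runs

for name, p99 in (("baseline", baseline_p99), ("SIS_CACHE", sis_cache_p99)):
    mean = statistics.mean(p99)
    std = statistics.stdev(p99)
    print(f"{name:10s} mean={mean:7.1f}us std={std:6.1f}us "
          f"({100.0 * std / mean:.0f}% of mean) min={min(p99)} max={max(p99)}")

Both sets swing by several hundred microseconds between runs, but the SIS_CACHE runs are
centered noticeably higher.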


schbench(new) with 6-mthreads
=============

Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
50.0th: 8 (43672 samples)
90.0th: 13 (83908 samples)
* 99.0th: 20 (18323 samples)
99.9th: 775 (1785 samples)
min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
50.0th: 13648 (59873 samples)
90.0th: 14000 (82767 samples)
* 99.0th: 14320 (16342 samples)
99.9th: 18720 (1670 samples)
min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (8 samples)
* 50.0th: 6984 (23 samples)
90.0th: 6984 (0 samples)
min=6835, max=6991
average rps: 6984.77


SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
50.0th: 9 (49267 samples)
90.0th: 14 (86522 samples)
* 99.0th: 21 (14091 samples)
99.9th: 1146 (1722 samples)
min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
50.0th: 13616 (62838 samples)
90.0th: 14000 (85301 samples)
* 99.0th: 14352 (16149 samples)
99.9th: 21408 (1660 samples)
min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (7 samples)
* 50.0th: 6984 (21 samples)
90.0th: 6984 (0 samples)
min=6672, max=6996
average rps: 6981.07

With the new schbench, I didn't observe run-to-run variation, and there was also no regression
for SIS_CACHE at the 99th percentile.
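
To put rough numbers on that, here is a small sketch comparing the headline values from the
two new-schbench reports above:

# Baseline vs SIS_CACHE, values taken from the new schbench output above.
metrics = {
    "wakeup p99 (usec)":  (20, 21),
    "request p99 (usec)": (14320, 14352),
    "average rps":        (6984.77, 6981.07),
}

for name, (base, test) in metrics.items():
    delta = test - base
    print(f"{name:20s} {base:>9} -> {test:<9} "
          f"delta={delta:+.2f} ({100.0 * delta / base:+.2f}%)")

The differences are 1 usec at the wakeup 99th percentile, about 0.2% for the request 99th
percentile, and about 0.05% for RPS.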


>> producer_consumer avg time/access (lower is better)
>> ========
>> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
>> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
>> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>>
>> The patch's main goal of improving cache locality is reflected here, as SIS_CACHE shows an improvement
>> in this workload, mainly when the loads per consumer iteration are lower.
>>
>> hackbench normalized time in seconds (lower is better)
>> ========
>> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
>> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
>> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
>> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
>> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
>> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
>> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
>> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>>
>> hackbench results are similar for both kernels, except for the threads-pipe case with 1 group,
>> which shows a 29% improvement.
>>
>> Daytrader throughput (higher is better)
>> ========
>>
>> As per Ingo suggestion, ran a real life workload daytrader
>>
>> baseline:
>> ===================================================================================
>> Instance 1
>> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
>> ================ =============== =============== ===============
>> 10124.5 2 0 3970
>>
>> SIS_CACHE:
>> ===================================================================================
>> Instance 1
>> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
>> ================ =============== =============== ===============
>> 10319.5 2 0 5771
>>
>> In the above run, daytrader performance was 2% better with SIS_CACHE.
>>
>
> Thanks for bringing this good news: a real-life workload benefits from this change.
> I'll tune this patch a little to address the regression from schbench. I should also mention
> that I'm working with Mathieu on his proposal to make it easier for the wakee to choose its
> previous CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to make more
> platforms benefit from this change.
> https://lore.kernel.org/lkml/[email protected]/

Oh, OK. Thanks for the pointer!

>
> thanks,
> Chenyu
>

Thanks and Regards
Madadi Vineeth Reddy

2023-10-19 10:58:03

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> >
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>> burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>> (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>> rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>> for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >>> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >>> (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding the trade-off between a longer scan for an idle CPU and the cache benefits,
> >> I ran some benchmarks.
> >>
> >
> > Thanks very much for your interest and your time on the patch.
> >
> >> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> >> The system has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> >> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> >> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> >> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> >> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
> >>
> >>
> >> schbench results show that there is not much impact on wakeup latencies from the extra iterations
> >> spent searching for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers
> >> are slightly better for SIS_CACHE in the 4-mthreads case.
> >
> > The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.
> >
> >> I think we can ignore the last case due to huge run to run variations.
> >
> > Although the run-to-run variation is large, the decrease still seems to fall within that range.
> > Prateek has also reported that there could be some regression from schbench when the system
> > is overloaded:
> > https://lore.kernel.org/lkml/[email protected]/
> > Could you also post the raw data printed by schbench? Using the latest schbench might also
> > give a more detailed breakdown of the latency.
> >
>
> raw data by schbench(old) with 6-mthreads
> ======================
>
> Baseline (5 runs)
> ========
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 981
> 99.5000th: 4424
> 99.9000th: 9200
> min=0, max=29497
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 495
> 99.5000th: 3924
> 99.9000th: 9872
> min=0, max=29997
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 36
> 95.0000th: 39
> *99.0000th: 1326
> 99.5000th: 4744
> 99.9000th: 10000
> min=0, max=23394
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 55
> 99.5000th: 3292
> 99.9000th: 9104
> min=0, max=25196
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 711
> 99.5000th: 4600
> 99.9000th: 9424
> min=0, max=19997
>
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 1894
> 99.5000th: 5464
> 99.9000th: 10000
> min=0, max=19157
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2396
> 99.5000th: 6664
> 99.9000th: 10000
> min=0, max=24029
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2132
> 99.5000th: 6296
> 99.9000th: 10000
> min=0, max=25313
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 1090
> 99.5000th: 6232
> 99.9000th: 9744
> min=0, max=27264
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 38
> *99.0000th: 1786
> 99.5000th: 5240
> 99.9000th: 9968
> min=0, max=24754
>
> As indicated, the above data has large run-to-run variation, and in general the 99th percentile
> latency is higher with SIS_CACHE.
>
>
> schbench(new) with 6-mthreads
> =============
>
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
> 50.0th: 8 (43672 samples)
> 90.0th: 13 (83908 samples)
> * 99.0th: 20 (18323 samples)
> 99.9th: 775 (1785 samples)
> min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
> 50.0th: 13648 (59873 samples)
> 90.0th: 14000 (82767 samples)
> * 99.0th: 14320 (16342 samples)
> 99.9th: 18720 (1670 samples)
> min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (8 samples)
> * 50.0th: 6984 (23 samples)
> 90.0th: 6984 (0 samples)
> min=6835, max=6991
> average rps: 6984.77
>
>
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
> 50.0th: 9 (49267 samples)
> 90.0th: 14 (86522 samples)
> * 99.0th: 21 (14091 samples)
> 99.9th: 1146 (1722 samples)
> min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
> 50.0th: 13616 (62838 samples)
> 90.0th: 14000 (85301 samples)
> * 99.0th: 14352 (16149 samples)
> 99.9th: 21408 (1660 samples)
> min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (7 samples)
> * 50.0th: 6984 (21 samples)
> 90.0th: 6984 (0 samples)
> min=6672, max=6996
> average rps: 6981.07
>
> With the new schbench, I didn't observe run-to-run variation, and there was also no regression
> for SIS_CACHE at the 99th percentile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench
in the future. I'll double-check on my test machine.

thanks,
Chenyu