2023-11-21 07:40:46

by Chen Yu

Subject: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

v1 -> v2:
- Move the task sleep duration from sched_entity to task_struct. (Aaron Lu)
- Refine the task sleep duration calculation based on the task's previous
  running CPU. (Aaron Lu)
- Limit the cache-hot idle CPU scan depth to reduce the time spent on
  searching, to fix the regression. (K Prateek Nayak)
- Add test results of real-life workloads per request from Ingo:
  Daytrader on a power system. (Madadi Vineeth Reddy)
  OLTP workload on Xeon Sapphire Rapids.
- Refine the commit log, add Reviewed-by tag to PATCH 1/3.
  (Mathieu Desnoyers)

RFC -> v1:
- Drop RFC.
- Only record the short sleeping time for each task, to better honor
  burst-sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for the runqueue's cache-hot timeout
  value. (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
  for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs are cache-hot in
  select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
  (K Prateek Nayak)

Thanks for the comments and tests!

----------------------------------------------------------------------

This series aims to continue the discussion of how to make it easier
for the wakee to choose its previous CPU.

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Inhibiting task migration could benefit many workloads. Inspired by
Mathieu's proposal to limit the task migration ratio[1], introduce
SIS_CACHE. It considers the sleep time of the task for better
task placement: based on the task's short sleeping history, tag p's
previous CPU as cache-hot. Later when p is woken up, it can choose
its previous CPU in select_idle_sibling(). When another task is
woken up, skip this cache-hot idle CPU and try the next idle CPU
when possible. The idea of SIS_CACHE is to optimize the idle CPU
scan sequence; the extra scan time is minimized by restricting the
scan depth for cache-hot CPUs to 50% of the scan depth of SIS_UTIL.
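
In rough pseudo-C, the mechanism looks like this (a simplified sketch of
the three patches below, not the literal kernel code):

	/* 1) on sleep-dequeue: remember where and when p slept; on the
	 *    next wakeup, fold the sleep length into p->avg_hot_dur */
	p->last_dequeue_cpu = cpu_of(rq);
	p->last_dequeue_time = now;

	/* 2) when the rq becomes idle, it stays cache-hot for the
	 *    average short-sleep duration of the dequeued task */
	rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->avg_hot_dur);

	/* 3) in select_idle_cpu(): an idle CPU whose timeout has not
	 *    expired is skipped, up to a bounded scan depth */
	if (sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
		continue;	/* try the next idle CPU */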

This series is based on tip/sched/core, on top of
commit ada87d23b734
("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram")

This patch set has shown 15% ~ 70% improvements for client/server
workloads like netperf and tbench. It shows a 0.7% improvement for
OLTP with 0.2% run-to-run variation on a 240-CPU Xeon system.
There is a 2% improvement in another real-life workload, Daytrader,
per Madadi's test on a power system with 96 CPUs. Prateek
has helped verify that there is no obvious microbenchmark regression
of v2 on a 3rd Generation EPYC system with 128 CPUs.

Link: https://lore.kernel.org/lkml/[email protected]/ #1

Chen Yu (3):
sched/fair: Record the task sleeping time as the cache hot duration
sched/fair: Calculate the cache-hot time of the idle CPU
sched/fair: skip the cache hot CPU in select_idle_cpu()

include/linux/sched.h | 4 ++
kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 91 insertions(+), 3 deletions(-)

--
2.25.1


2023-11-21 07:43:01

by Chen Yu

Subject: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

Problem statement:
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Proposal:
Introduce SIS_CACHE. It considers the sleep time of the task for
better task placement. Based on the task's short sleeping history,
tag p's previous CPU as cache-hot. Later when p is woken up, it can
choose its previous CPU in select_idle_sibling(). When another task
is woken up, skip this cache-hot idle CPU when possible.

SIS_CACHE still prefers to choose an idle CPU during task wakeup;
the idea is to optimize the idle CPU scan sequence.

As pointed out by Prateek, this has the potential downside that all idle
CPUs are cache-hot and get skipped. Mitigate this by returning the first
cache-hot idle CPU. Meanwhile, to reduce the time spent on scanning,
limit the cache-hot CPU search depth to half of the number suggested
by SIS_UTIL.
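
As an illustration with hypothetical numbers: if SIS_UTIL suggests
scanning nr = 16 CPUs, then

	nr_hot = nr >> 1;	/* = 8 */

so at most 8 idle-but-cache-hot CPUs are skipped (the first of them is
remembered as a backup); once nr_hot is exhausted, cache_hot_cpu() is no
longer consulted and any idle CPU found is returned as usual.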

Tested on a 2 x 60C/120T Xeon platform:

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 60-threads 1.00 ( 1.37) +0.04 ( 1.47)
TCP_RR 120-threads 1.00 ( 1.77) -1.03 ( 1.31)
TCP_RR 180-threads 1.00 ( 2.03) +1.25 ( 1.66)
TCP_RR 240-threads 1.00 ( 41.31) +73.71 ( 22.02)
TCP_RR 300-threads 1.00 ( 12.79) -0.11 ( 15.84)

tbench
======
case load baseline(std%) compare%( std%)
loopback 60-threads 1.00 ( 0.35) +0.40 ( 0.31)
loopback 120-threads 1.00 ( 1.94) -1.89 ( 1.17)
loopback 180-threads 1.00 ( 13.59) +9.97 ( 0.93)
loopback 240-threads 1.00 ( 11.68) +42.85 ( 7.28)
loopback 300-threads 1.00 ( 4.47) +15.12 ( 1.40)

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 9.21) -7.88 ( 2.03)
process-pipe 2-groups 1.00 ( 7.09) +5.47 ( 9.02)
process-pipe 4-groups 1.00 ( 1.60) +1.53 ( 1.70)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.98) +0.26 ( 0.37)
normal 2-mthreads 1.00 ( 3.99) -7.97 ( 7.33)
normal 4-mthreads 1.00 ( 3.07) -1.55 ( 3.27)

Also did some experiments on the OLTP workload on a 112-core, 2-socket
SPR machine. The OLTP workload has a mixture of threads handling
database updates on disk and handling transaction queries over the
network. Around 0.7% improvement is observed, with less than 0.2%
run-to-run variation.

Thanks to Madadi for testing SIS_CACHE on a power system with 96 CPUs.
The results showed a maximum of 29% improvement in hackbench, 13%
improvement in the producer_consumer workload, and 2% improvement in a
real-life workload named Daytrader.

Thanks to Prateek for running microbenchmarks on top of the latest patch on
a 3rd Generation EPYC system:
- 2 sockets, each with 64C/128T
- NPS1 (each socket is a NUMA node)
- C2 disabled (POLL and C1 (MWAIT) remained enabled)
No consistent regression was observed in the v2 version.

Analysis:
The reason why waking up the task on its previous CPU brings benefits
is less task migration and better local resource locality.

Take the netperf 240-threads case as an example. Run the following
bpftrace script to track the number of migrations within 10 seconds,
and use perf topdown to track the PMU events. Task migrations and
stall cycles are reduced a lot with SIS_CACHE:

kretfunc:select_task_rq_fair
{
	$p = (struct task_struct *)args->p;

	if ($p->comm == "netperf") {
		/* the CPU returned by select_task_rq_fair() differs
		 * from the task's previous CPU: a wakeup migration */
		if ($p->thread_info.cpu != retval) {
			@wakeup_migrate_netperf = count();
		} else {
			@wakeup_prev_netperf = count();
		}
	}
}
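
Assuming the script is saved as wakeup.bt (a hypothetical file name), it
can be run for a 10-second window with something like:

	timeout 10 bpftrace wakeup.bt

bpftrace prints the two counters when it exits.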

NO_SIS_CACHE:
@wakeup_migrate_netperf: 57473509
@wakeup_prev_netperf: 14935964
RESOURCE_STALLS: 19.1% * 7.1% * 35.0%

SIS_CACHE:
@wakeup_migrate_netperf: 799
@wakeup_prev_netperf: 132937436
RESOURCE_STALLS: 5.4% * 7.5% * 39.8%

Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c309b3d203c0..d149eca74fca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7360,7 +7360,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  * Return true if the idle CPU is cache-hot for someone,
  * return false otherwise.
  */
-static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
+static bool cache_hot_cpu(int cpu, int *hot_cpu)
 {
 	if (!sched_feat(SIS_CACHE))
 		return false;
@@ -7383,7 +7383,7 @@ static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, nr_hot = 0, hot_cpu = -1;
 	struct sched_domain_shared *sd_share;

 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -7396,6 +7396,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 			/* overloaded LLC is unlikely to have idle cpu/core */
 			if (nr == 1)
 				return -1;
+
+			/* max number of cache-hot idle cpu check */
+			nr_hot = nr >> 1;
 		}
 	}

@@ -7426,18 +7429,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			if ((unsigned int)i < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(i, &hot_cpu))
+					continue;
+
 				return i;
+			}

 		} else {
 			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
-			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+			if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+				if (--nr_hot >= 0 && cache_hot_cpu(idle_cpu, &hot_cpu))
+					continue;
+
 				break;
+			}
 		}
 	}

+	/* pick the first cache-hot CPU as the last resort */
+	if (idle_cpu == -1 && hot_cpu != -1)
+		idle_cpu = hot_cpu;
+
 	if (has_idle_core)
 		set_idle_cores(target, false);

--
2.25.1

2023-11-21 07:44:34

by Chen Yu

Subject: [PATCH v2 1/3] sched/fair: Record the task sleeping time as the cache hot duration

The cache hot duration is calculated as the average sleeping
time of a task, which is the time delta between the task
being dequeued and enqueued.

The cache hot duration of a task is introduced to describe
how soon this dequeued task could be woken up. During this
cache hot period, the task's previous CPU is regarded as
still cache-hot for the task. This information will be used
by SIS_CACHE to improve cache locality for short-sleeping tasks.
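
For reference, the averaging uses the scheduler's existing update_avg()
helper, which keeps a simple exponential moving average; roughly:

	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		/* new avg = old avg + (sample - old avg) / 8 */
		*avg += diff / 8;
	}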

Suggested-by: Mathieu Desnoyers <[email protected]>
Suggested-by: Aaron Lu <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0..7d0fafd29345 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1347,6 +1347,10 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif

+	u64				last_dequeue_time;
+	u64				avg_hot_dur;	/* Average cache hot duration */
+	int				last_dequeue_cpu;
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53e7bf2ccc44..672616503e35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6667,6 +6667,36 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	u64 last_dequeue = p->last_dequeue_time;
+
+	if ((flags & ENQUEUE_WAKEUP) && last_dequeue &&
+	    cpu_online(p->last_dequeue_cpu)) {
+		/*
+		 * task_cpu(p) of the enqueued task has already been
+		 * assigned a new CPU, so calculate the task's sleeping
+		 * time based on its previous running CPU.
+		 */
+		u64 now = sched_clock_cpu(p->last_dequeue_cpu);
+
+		/*
+		 * Record the task's short sleep time. This sleep time
+		 * indicates how soon this task might be woken up again.
+		 * The task's previous running CPU is regarded as cache-hot
+		 * during the sleep time. So, define the average sleep time
+		 * of the task as its cache-hot duration. SIS can leverage
+		 * the cache-hot duration for better idle CPU selection.
+		 * This improves cache locality for short-sleeping tasks.
+		 *
+		 * If the sleep time is longer than sysctl_sched_migration_cost,
+		 * give the cache hot duration a penalty by cutting it in half.
+		 */
+		if (now > last_dequeue) {
+			if (now - last_dequeue < sysctl_sched_migration_cost)
+				update_avg(&p->avg_hot_dur, now - last_dequeue);
+			else
+				p->avg_hot_dur >>= 1;
+		}
+	}

 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6821,6 +6851,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+
+	if (task_sleep) {
+		p->last_dequeue_time = sched_clock_cpu(cpu_of(rq));
+		p->last_dequeue_cpu = cpu_of(rq);
+	} else {
+		/* 0 indicates the dequeue is not caused by sleep */
+		p->last_dequeue_time = 0;
+	}
+
 	hrtick_update(rq);
 }

--
2.25.1

2023-11-21 07:44:42

by Chen Yu

Subject: [PATCH v2 2/3] sched/fair: Calculate the cache-hot time of the idle CPU

When a CPU is about to become idle due to task dequeue, use
the dequeued task's average sleep time to set the cache-hot
timeout of this idle CPU. This information can help SIS
skip the cache-hot idle CPU and scan for the next
cache-cold one. When that task is woken up again, it can choose
its previous CPU and reuse its hot cache.
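
For illustration, with hypothetical numbers: if p is dequeued at
now = 1000us, the rq becomes idle, and p->avg_hot_dur is 500us, then

	rq->cache_hot_timeout = max(rq->cache_hot_timeout, 1000us + 500us);

so this CPU is regarded as cache-hot until at least t = 1500us.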

This is a preparation for the next patch to introduce SIS_CACHE
based task wakeup.

Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 672616503e35..c309b3d203c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6853,8 +6853,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	util_est_update(&rq->cfs, p, task_sleep);

 	if (task_sleep) {
-		p->last_dequeue_time = sched_clock_cpu(cpu_of(rq));
+		u64 now = sched_clock_cpu(cpu_of(rq));
+
+		p->last_dequeue_time = now;
 		p->last_dequeue_cpu = cpu_of(rq);
+
+#ifdef CONFIG_SMP
+		/* this rq becomes idle, update its cache hot timeout */
+		if (sched_feat(SIS_CACHE) && !rq->nr_running &&
+		    p->avg_hot_dur)
+			rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->avg_hot_dur);
+#endif
 	} else {
 		/* 0 indicates the dequeue is not caused by sleep */
 		p->last_dequeue_time = 0;
@@ -7347,6 +7356,25 @@ static inline int select_idle_smt(struct task_struct *p, int target)

 #endif /* CONFIG_SCHED_SMT */

+/*
+ * Return true if the idle CPU is cache-hot for someone,
+ * return false otherwise.
+ */
+static __maybe_unused bool cache_hot_cpu(int cpu, int *hot_cpu)
+{
+	if (!sched_feat(SIS_CACHE))
+		return false;
+
+	if (sched_clock_cpu(cpu) >= cpu_rq(cpu)->cache_hot_timeout)
+		return false;
+
+	/* record the first cache hot idle cpu as the backup */
+	if (*hot_cpu == -1)
+		*hot_cpu = cpu;
+
+	return true;
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..0af282712cd1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)

 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..191ed62ef06d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1083,6 +1083,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			cache_hot_timeout;

 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
--
2.25.1

2023-11-25 07:13:47

by Madadi Vineeth Reddy

Subject: Re: [PATCH v2 2/3] sched/fair: Calculate the cache-hot time of the idle CPU

Hi Chen Yu,

On 21/11/23 13:09, Chen Yu wrote:
[snip...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 672616503e35..c309b3d203c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6853,8 +6853,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 	util_est_update(&rq->cfs, p, task_sleep);
> 
> 	if (task_sleep) {
> -		p->last_dequeue_time = sched_clock_cpu(cpu_of(rq));
> +		u64 now = sched_clock_cpu(cpu_of(rq));
> +
> +		p->last_dequeue_time = now;
> 		p->last_dequeue_cpu = cpu_of(rq);
> +
> +#ifdef CONFIG_SMP
> +		/* this rq becomes idle, update its cache hot timeout */
> +		if (sched_feat(SIS_CACHE) && !rq->nr_running &&
> +		    p->avg_hot_dur)
> +			rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->avg_hot_dur);

As per the discussion in the RFC patch, you mentioned that SIS_CACHE only
honors the average sleep time of the latest dequeued task, and that we
don't know how much of the cache is polluted by the latest task.

So I was wondering what made you put max() here.

Thanks and Regards
Madadi Vineeth Reddy

2023-11-26 07:22:19

by Chen Yu

Subject: Re: [PATCH v2 2/3] sched/fair: Calculate the cache-hot time of the idle CPU

Hi Madadi,

On 2023-11-25 at 12:40:18 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 21/11/23 13:09, Chen Yu wrote:
[snip...]
> > +#ifdef CONFIG_SMP
> > +		/* this rq becomes idle, update its cache hot timeout */
> > +		if (sched_feat(SIS_CACHE) && !rq->nr_running &&
> > +		    p->avg_hot_dur)
> > +			rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->avg_hot_dur);
>
> As per the discussion in the RFC patch, you mentioned that SIS_CACHE only
> honors the average sleep time of the latest dequeued task, and that we
> don't know how much of the cache is polluted by the latest task.
>
> So I was wondering what made you put max() here.
>

Thanks for taking a look. Yes, previously SIS_CACHE only honored the latest
dequeued task. But as Mathieu pointed out[1], the latest dequeued task might
not have had enough time to scribble over the cache footprint of older
dequeued tasks, so we should also honor the sleep time of those older
dequeued tasks. Consider the following scenario:

task p1 is dequeued with an average sleep time of 2 msec. Then p2 is
scheduled in on this CPU, but only runs for 10 usec (it does not pollute
the cache footprint) and is dequeued with an average sleep time of 1 msec.
Should we tag the CPU runqueue's timeout as 2 msec or 1 msec later? We
choose 2 msec. The idea is to keep the timeout moving forward, so that
SIS_CACHE makes it easier for p1 to be woken up on its previous CPU.
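
With the numbers above, and taking the dequeue of p1 as t = 0, the
timeline looks roughly like this:

	t = 0:    p1 dequeued, avg sleep 2000us
	          cache_hot_timeout = max(timeout, 0 + 2000us)   = 2000us
	t = 10us: p2 dequeued, avg sleep 1000us
	          cache_hot_timeout = max(2000us, 10us + 1000us) = 2000us

Without max(), the second dequeue would pull the timeout back to about
1010us and cut p1's cache-hot window short.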

[1] https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2023-11-26 08:46:01

by Madadi Vineeth Reddy

Subject: Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,

On 21/11/23 13:09, Chen Yu wrote:
[snip...]
>
> This patch set has shown 15% ~ 70% improvements for client/server
> workloads like netperf and tbench. It shows a 0.7% improvement for
> OLTP with 0.2% run-to-run variation on a 240-CPU Xeon system.
> There is a 2% improvement in another real-life workload, Daytrader,
> per Madadi's test on a power system with 96 CPUs. Prateek
> has helped verify that there is no obvious microbenchmark regression
> of v2 on a 3rd Generation EPYC system with 128 CPUs.
>

Tested the patch on a power system with 46 cores, 368 CPUs in total.
The system has 8 NUMA nodes.

Below are some of the benchmark results.

schbench(new) 99.0th latency (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
normal 1-mthreads 1.00 [ 0.00]( 4.34) 1.02 [ -2.00]( 5.98)
normal 2-mthreads 1.00 [ 0.00]( 13.95) 1.08 [ -8.00]( 10.39)
normal 4-mthreads 1.00 [ 0.00]( 6.20) 0.94 [ +6.00]( 10.90)
normal 6-mthreads 1.00 [ 0.00]( 12.76) 1.03 [ -3.00]( 9.33)

It seems like schbench is not much impacted by this patch (the pct imp of schbench is within the std%).
I expected some regression in wakeup latency while searching for an idle CPU which is not cache-hot,
but I guess limiting the search depth has helped.


producer_consumer avg time/access (lower is better)
========
loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
5 1.00 [ 0.00]( 0.00) 0.93 [ +7.00]( 4.77)
10 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
20 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)

The patch's main goal of improving cache locality shows up in this workload:
SIS_CACHE improves it only when the loads per consumer iteration are low.


hackbench normalized time in seconds (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
process-sockets 1-groups 1.00 [ 0.00]( 4.78) 0.99 [ +1.00]( 6.45)
process-sockets 2-groups 1.00 [ 0.00]( 0.97) 1.02 [ -2.00]( 1.87)
process-sockets 4-groups 1.00 [ 0.00]( 3.63) 1.01 [ -1.00]( 2.96)
process-sockets 8-groups 1.00 [ 0.00]( 0.43) 1.00 [ 0.00]( 0.27)
process-pipe 1-groups 1.00 [ 0.00](23.77) 0.88 [+12.00](22.77)
process-pipe 2-groups 1.00 [ 0.00]( 3.44) 1.03 [ -3.00]( 4.00)
process-pipe 4-groups 1.00 [ 0.00]( 2.41) 0.98 [ +2.00]( 3.88)
process-pipe 8-groups 1.00 [ 0.00]( 7.09) 1.07 [ -7.00]( 4.25)
threads-pipe 1-groups 1.00 [ 0.00](18.47) 1.11 [-11.00](24.21)
threads-pipe 2-groups 1.00 [ 0.00]( 6.45) 0.97 [ +3.00]( 5.58)
threads-pipe 4-groups 1.00 [ 0.00]( 5.63) 0.96 [ +2.00]( 5.90)
threads-pipe 8-groups 1.00 [ 0.00]( 1.65) 1.03 [ -3.00]( 3.97)
threads-sockets 1-groups 1.00 [ 0.00]( 2.00) 1.00 [ 0.00]( 0.65)
threads-sockets 2-groups 1.00 [ 0.00]( 1.69) 1.02 [ -2.00]( 1.48)
threads-sockets 4-groups 1.00 [ 0.00]( 5.66) 1.01 [ -1.00]( 3.56)
threads-sockets 8-groups 1.00 [ 0.00]( 0.26) 0.99 [ +1.00]( 0.36)

hackbench is not impacted.


Daytrader throughput (higher is better)
========
instances,users baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
3,30 1.00 [ 0.00]( 2.30) 1.02 [ +2.00]( 1.64)
3,60 1.00 [ 0.00]( 0.55) 1.01 [ +1.00]( 1.41)
3,90 1.00 [ 0.00]( 1.20) 1.02 [ +2.00]( 1.04)
3,120 1.00 [ 0.00]( 0.84) 1.02 [ +2.00]( 1.02)

A real-life workload like Daytrader benefits slightly from this patch.


Tested-by: Madadi Vineeth Reddy <[email protected]>

Thanks and Regards
Madadi Vineeth Reddy

2023-11-26 09:26:59

by Chen Yu

Subject: Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

On 2023-11-26 at 14:14:20 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 21/11/23 13:09, Chen Yu wrote:
[snip...]
>
> Tested the patch on a power system with 46 cores, 368 CPUs in total.
> The system has 8 NUMA nodes.
>
> Below are some of the benchmark results.
>
> schbench(new) 99.0th latency (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> normal 1-mthreads 1.00 [ 0.00]( 4.34) 1.02 [ -2.00]( 5.98)
> normal 2-mthreads 1.00 [ 0.00]( 13.95) 1.08 [ -8.00]( 10.39)
> normal 4-mthreads 1.00 [ 0.00]( 6.20) 0.94 [ +6.00]( 10.90)
> normal 6-mthreads 1.00 [ 0.00]( 12.76) 1.03 [ -3.00]( 9.33)
>
> It seems like schbench is not much impacted by this patch (the pct imp of schbench is within the std%).
> I expected some regression in wakeup latency while searching for an idle CPU which is not cache-hot,
> but I guess limiting the search depth has helped.
>

I think so. Cutting the cache-hot cpu scan depth to 50% seems to also cure the regression
reported by Prateek.

[snip...]
> Tested-by: Madadi Vineeth Reddy <[email protected]>
>

Thanks!

Best,
Chenyu
> Thanks and Regards
> Madadi Vineeth Reddy

2023-11-29 17:29:26

by Madadi Vineeth Reddy

Subject: Re: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Chen Yu,

On 21/11/23 13:10, Chen Yu wrote:
[snip...]
>
> @@ -7426,18 +7429,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> 	for_each_cpu_wrap(cpu, cpus, target + 1) {
> 		if (has_idle_core) {
> 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> -			if ((unsigned int)i < nr_cpumask_bits)
> +			if ((unsigned int)i < nr_cpumask_bits) {
> +				if (--nr_hot >= 0 && cache_hot_cpu(i, &hot_cpu))
> +					continue;
> +
> 				return i;
> +			}
> 
> 		} else {
> 			if (--nr <= 0)
> 				return -1;
> 			idle_cpu = __select_idle_cpu(cpu, p);
> -			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> +			if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> +				if (--nr_hot >= 0 && cache_hot_cpu(idle_cpu, &hot_cpu))
> +					continue;
> +
> 				break;
> +			}
> 		}
> 	}
> 
> +	/* pick the first cache-hot CPU as the last resort */
> +	if (idle_cpu == -1 && hot_cpu != -1)
> +		idle_cpu = hot_cpu;
> +
> 	if (has_idle_core)
> 		set_idle_cores(target, false);
> 

As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?

If correct, then why don't we let the task select the CPU if it's the one that tagged it?

Thanks and Regards
Madadi Vineeth Reddy

2023-11-30 06:43:20

by Yu Chen

Subject: Re: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Madadi,

On Thu, Nov 30, 2023 at 1:26 AM Madadi Vineeth Reddy
<[email protected]> wrote:
>
> Hi Chen Yu,
>
[snip...]

>
> As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
> the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?
>

Not exactly. By the time we reach select_idle_cpu(), we have already
checked in select_idle_sibling() whether the wakee's previous CPU is
idle. If it is idle, we don't check the cache-hot tag and return the
wakee's previous CPU directly from select_idle_sibling().
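
For reference, this is the existing check in select_idle_sibling() that
returns the idle previous CPU early (the same snippet Hillf quotes later
in the thread):

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev))
		return prev;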

thanks,
Chenyu

> If correct, then why don't we let the task select the CPU if it's the one that tagged it?
>
> Thanks and Regards
> Madadi Vineeth Reddy

2023-12-01 13:57:15

by Madadi Vineeth Reddy

Subject: Re: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

On 30/11/23 12:13, Chen Yu wrote:
> Hi Madadi,
>
> On Thu, Nov 30, 2023 at 1:26 AM Madadi Vineeth Reddy
> <[email protected]> wrote:
>>
>> Hi Chen Yu,
>>
> [snip...]
>
>>
>> As per my understanding, if the task which tagged a particular CPU as cache-hot has woken up before
>> the cache_hot_timeout, we still don't allow that task to run on that particular CPU. Is this correct?
>>
>
> Not exactly. By the time we reach select_idle_cpu(), we have already
> checked in select_idle_sibling() whether the wakee's previous CPU is
> idle. If it is idle, we don't check the cache-hot tag and return the
> wakee's previous CPU directly from select_idle_sibling().

Yeah, got it. Thanks.

So, the way another task (t') can select the cache-hot tagged CPU (tagged
for task t) is when t' gets this cache-hot tagged CPU as its target CPU
from wake_affine() in select_task_rq_fair(). The other way is the
"/* pick the first cache-hot CPU as the last resort */" path.

>
> thanks,
> Chenyu
>
>> If correct, then why don't we let the task select the CPU if it's the one that tagged it?
>>
>> Thanks and Regards
>> Madadi Vineeth Reddy

Thanks and Regards
Madadi Vineeth Reddy

2024-02-18 09:28:21

by Madadi Vineeth Reddy

Subject: Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,

On 21/11/23 13:09, Chen Yu wrote:
[snip...]

Any update or progress regarding this patch?

I was working on a patch that improves scheduler performance on power10 by
changing the order in which domains are accessed for CPU selection during
wakeup. It turns out that this patch is helpful in that regard, and my
patch gives better performance on top of it.

So, looking forward to knowing the progress/status of this patch.

Thanks and Regards
Madadi Vineeth Reddy


2024-02-18 13:03:39

by Chen Yu

Subject: Re: [PATCH v2 0/3] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Madadi,

On 2024-02-18 at 14:57:17 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 21/11/23 13:09, Chen Yu wrote:
>
> Any update or progress regarding this patch?
>
> I was working on a patch that improves scheduler performance on power10 by
> changing the order in which domains are accessed for CPU selection during
> wakeup. It turns out that this patch is helpful in that regard, and my
> patch gives better performance on top of it.
>
> So, looking forward to knowing the progress/status of this patch.
>

Thank you for your continuous interest in this proposal. Glad to hear
that your patch gets better performance when applied on top of this one.
Let me rebase the patch on the latest kernel, do some code cleanup, and
send the new version out.

thanks,
Chenyu

2024-02-19 11:56:06

by Hillf Danton

Subject: Re: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

On Tue, 21 Nov 2023 15:40:14 +0800 Chen Yu <[email protected]>
> Problem statement:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually preferred
> because it can improve cache locality. However, in many cases the
> previous CPU has already been taken by other wakees, so p has to
> find another idle CPU.
>
> Proposal:
> Introduce SIS_CACHE. It considers the sleep time of the task for
> better task placement. Based on the task's short sleeping history,
> tag p's previous CPU as cache-hot. Later when p is woken up, it can
> choose its previous CPU in select_idle_sibling(). When another task
> is woken up, skip this cache-hot idle CPU when possible.
>
> SIS_CACHE still prefers to choose an idle CPU during task wakeup;
> the idea is to optimize the idle CPU scan sequence.

Could you specify why the currently selected cpu fails to work in the
scenario described above?

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev))
		return prev;

2024-02-19 14:26:48

by Chen Yu

Subject: Re: [PATCH v2 3/3] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Hillf,

On 2024-02-19 at 19:50:14 +0800, Hillf Danton wrote:
> On Tue, 21 Nov 2023 15:40:14 +0800 Chen Yu <[email protected]>
[snip...]
>
> Could you specify why the currently selected cpu fails to work in the
> scenario described above?
>

Thank you for your review.

I assume that "currently select cpu" means "target". First, the "target"
could be chosen by wake_affine() -> wake_affine_weight(), which can
return non-idle candidate CPU. That is to say, when we reached below code,
the target and the prev could both be non-idle. Second, when target and
prev are both non-idle, select_idle_sibling() has to traverse the other CPUs
to find an idle one. What we do in SIS_CACHE is to increase the possibility
that the prev is idle and return directly in below code path.

Say, task p1's previous CPU is CPU1, and p1 is sleeping. With SIS_CACHE,
when another task p2 is woken up in select_idle_sibling()->select_idle_cpu(),
p2 tries to find CPU2 or CPU3 or CPU4... but not CPU1. This makes it easier
for p1 to find that CPU1 is idle when p1 is woken up, and choose CPU1 when
possible.

thanks,
Chenyu
> 	/*
> 	 * If the previous CPU is cache affine and idle, don't be stupid:
> 	 */
> 	if (prev != target && cpus_share_cache(prev, target) &&
> 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> 	    asym_fits_cpu(task_util, util_min, util_max, prev))
> 		return prev;