2019-04-25 09:38:57

by Viresh Kumar

Subject: [RFC V2 0/2] sched/fair: Fallback to sched-idle CPU for better performance

Hi,

Here is another attempt to get some benefit out of the sched-idle
policy. The previous version [1] focused on getting better power numbers
and this version tries to get better performance or lower response time
for the tasks.

The first patch is unchanged from v1 and accumulates
information about sched-idle tasks per CPU.

The second patch changes the way the target CPU is selected in the fast
path. Currently we look for an idle CPU in select_idle_sibling() to run
the next task, but when no idle CPU is found it is better, for
performance reasons, to pick a CPU which will run the task the soonest.
A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target by this criterion, as any normal fair task will
most likely preempt the currently running SCHED_IDLE task immediately.
In fact, choosing a SCHED_IDLE CPU should give better results, as it can
run the task sooner than an idle CPU (which first needs to be woken up
from its idle state).
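
To make the idea concrete, here is a minimal sketch. sched_idle_cpu()
mirrors what patch 2/2 below adds; pick_cpu() is only an illustrative
wrapper (not part of the patches) showing how a caller would prefer a
fully idle CPU and remember a sched-idle one as a fallback:

/* A CPU is "sched-idle" when it is busy, but only with SCHED_IDLE tasks. */
static int sched_idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

/*
 * Illustrative wrapper: a fully idle CPU wins; otherwise remember a
 * sched-idle CPU in *si_cpu as the fallback candidate.
 */
static int pick_cpu(struct task_struct *p, int cpu, int *si_cpu)
{
	if (available_idle_cpu(cpu))
		return cpu;
	if (*si_cpu == -1 && sched_idle_cpu(cpu))
		*si_cpu = cpu;
	return -1;
}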

Basic testing has currently been done with the help of rt-app, to make
sure the tasks are placed correctly.

--
viresh

Viresh Kumar (2):
sched: Start tracking SCHED_IDLE tasks count in cfs_rq
sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
2 files changed, 35 insertions(+), 9 deletions(-)

--
2.21.0.rc0.269.g1a574e7a288b

[1] https://lore.kernel.org/lkml/[email protected]/


2019-04-25 09:39:03

by Viresh Kumar

Subject: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

We look for an idle CPU in select_idle_sibling() to run the next task,
but when no idle CPU is found it is better, for performance reasons, to
pick a CPU which will run the task the soonest. A CPU which isn't idle
but has only SCHED_IDLE activity queued on it should be a good target by
this criterion, as any normal fair task will most likely preempt the
currently running SCHED_IDLE task immediately. In fact, choosing a
SCHED_IDLE CPU should give better results, as it can run the task sooner
than an idle CPU (which first needs to be woken up from its idle state).

This patch updates the fast path to fall back to a sched-idle CPU when no
idle CPU is found; the slow path can be updated separately later.

Following is the order in which select_idle_sibling() now picks the next
CPU to run the task:

1. idle_cpu(target) OR sched_idle_cpu(target)
2. idle_cpu(prev) OR sched_idle_cpu(prev)
3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
4. idle core(sd)
5. idle_cpu(sd)
6. sched_idle_cpu(sd)
7. idle_cpu(p) - smt
8. sched_idle_cpu(p) - smt

The policy can be tweaked a bit if we want different priorities.

Signed-off-by: Viresh Kumar <[email protected]>
---
kernel/sched/fair.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6511cb57acdd..fbaefb9a9296 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6057,6 +6057,15 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+			rq->nr_running);
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6154,7 +6163,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, int target)
 {
-	int cpu;
+	int cpu, si_cpu = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6164,9 +6173,11 @@ static int select_idle_smt(struct task_struct *p, int target)
 			continue;
 		if (available_idle_cpu(cpu))
 			return cpu;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
-	return -1;
+	return si_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6194,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, nr = INT_MAX, si_cpu = -1;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6222,11 +6233,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!--nr)
-			return -1;
+			return si_cpu;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (available_idle_cpu(cpu))
 			break;
+		if (si_cpu == -1 && sched_idle_cpu(cpu))
+			si_cpu = cpu;
 	}
 
 	time = local_clock() - time;
@@ -6245,13 +6258,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
-	if (available_idle_cpu(target))
+	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
 	/* Check a recently used CPU as a potential idle candidate: */
@@ -6259,7 +6273,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    available_idle_cpu(recent_used_cpu) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
 		/*
 		 * Replace recent_used_cpu with prev as it is a potential
--
2.21.0.rc0.269.g1a574e7a288b

2019-04-25 09:39:35

by Viresh Kumar

Subject: [RFC V2 1/2] sched: Start tracking SCHED_IDLE tasks count in cfs_rq

Start tracking how many tasks with SCHED_IDLE policy are present in each
cfs_rq. This will be used by later commits.

Signed-off-by: Viresh Kumar <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++++++--
kernel/sched/sched.h | 2 ++
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4275eb07c0b2..6511cb57acdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4475,7 +4475,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, dequeue = 1;
+	long task_delta, idle_task_delta, dequeue = 1;
 	bool empty;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
@@ -4486,6 +4486,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
+	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -4495,6 +4496,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		if (dequeue)
 			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
 		qcfs_rq->h_nr_running -= task_delta;
+		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 
 		if (qcfs_rq->load.weight)
 			dequeue = 0;
@@ -4534,7 +4536,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	int enqueue = 1;
-	long task_delta;
+	long task_delta, idle_task_delta;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
@@ -4554,6 +4556,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		return;
 
 	task_delta = cfs_rq->h_nr_running;
+	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			enqueue = 0;
@@ -4562,6 +4565,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		if (enqueue)
 			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
 		cfs_rq->h_nr_running += task_delta;
+		cfs_rq->idle_h_nr_running += idle_task_delta;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
@@ -5166,6 +5170,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -5198,6 +5203,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
 		flags = ENQUEUE_WAKEUP;
 	}
@@ -5205,6 +5211,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
@@ -5266,6 +5273,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -5280,6 +5288,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -5299,6 +5308,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..0f07d9b55a79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -490,6 +490,8 @@ struct cfs_rq {
 	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;
+	/* h_nr_running for SCHED_IDLE tasks */
+	unsigned int		idle_h_nr_running;
 
 	u64			exec_clock;
 	u64			min_vruntime;
--
2.21.0.rc0.269.g1a574e7a288b

2019-05-09 21:55:54

by Song Liu

Subject: Re: [RFC V2 0/2] sched/fair: Fallback to sched-idle CPU for better performance

On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <[email protected]> wrote:
>
> Hi,
>
> Here is another attempt to get some benefit out of the sched-idle
> policy. The previous version [1] focused on getting better power numbers
> and this version tries to get better performance or lower response time
> for the tasks.
>
> The first patch is unchanged from v1 and accumulates
> information about sched-idle tasks per CPU.
>
> The second patch changes the way the target CPU is selected in the fast
> path. Currently, we target for an idle CPU in select_idle_sibling() to
> run the next task, but in case we don't find idle CPUs it is better to
> pick a CPU which will run the task the soonest, for performance reason.
> A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> should be a good target based on this criteria as any normal fair task
> will most likely preempt the currently running SCHED_IDLE task
> immediately. In fact, choosing a SCHED_IDLE CPU shall give better
> results as it should be able to run the task sooner than an idle CPU
> (which requires to be woken up from an idle state).
>
> Basic testing is done with the help of rt-app currently to make sure the
> task is getting placed correctly.

I ran some tests with this set on our (Facebook's) web service (main job)
and ffmpeg (side job). The results look promising.

For all the tests below, I ran the web service at the same load level, and
the side job with SCHED_IDLE. I present the scheduling latency distribution
of the main job, measured with the runqlat tool:
https://github.com/iovisor/bpftrace/blob/master/tools/runqlat.bt

I modified the tool to only track wakeup latency of the main job.
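
(For reference, the side job is switched to SCHED_IDLE with the usual
sched_setscheduler() call; a minimal sketch, assuming glibc's <sched.h>
exposes SCHED_IDLE when _GNU_SOURCE is defined. `chrt -i 0 <cmd>` does the
equivalent from the shell.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Move the calling process (e.g. the side-job wrapper) to SCHED_IDLE. */
static int make_sched_idle(void)
{
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_IDLE, &param)) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}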

4 cases are compared here:

1. w/o this set, w/o side job;
2. w/ this set, w/o side job;
3. w/o this set, w/ side job;
4. w/ this set, w/ side job;


Case #1. w/o this set, w/o side job
@usecs:
[1] 1705 | |
[2, 4) 1102086 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8) 329160 |@@@@@@@@@@@@@@@ |
[8, 16) 34135 |@ |
[16, 32) 37466 |@ |
[32, 64) 15700 | |
[64, 128) 8759 | |
[128, 256) 5714 | |
[256, 512) 3718 | |
[512, 1K) 2152 | |
[1K, 2K) 882 | |
[2K, 4K) 256 | |
[4K, 8K) 48 | |
[8K, 16K) 2 | |

Case #2. w/ this set, w/o side job;
@usecs:
[1] 2121 | |
[2, 4) 1251877 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8) 401517 |@@@@@@@@@@@@@@@@ |
[8, 16) 64325 |@@ |
[16, 32) 74352 |@@@ |
[32, 64) 40583 |@ |
[64, 128) 26129 |@ |
[128, 256) 18612 | |
[256, 512) 12863 | |
[512, 1K) 8304 | |
[1K, 2K) 4072 | |
[2K, 4K) 1569 | |
[4K, 8K) 411 | |
[8K, 16K) 70 | |
[16K, 32K) 1 | |

From #1 and #2, we see this set probably adds a little overhead to
scheduling when there is no side job. But the overhead is clearly very
small.


Case #3. w/o this set, w/ side job;
@usecs:
[1] 1282 | |
[2, 4) 260977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 446120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 136927 |@@@@@@@@@@@@@@@ |
[16, 32) 148758 |@@@@@@@@@@@@@@@@@ |
[32, 64) 160291 |@@@@@@@@@@@@@@@@@@ |
[64, 128) 177292 |@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 184573 |@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 173405 |@@@@@@@@@@@@@@@@@@@@ |
[512, 1K) 149662 |@@@@@@@@@@@@@@@@@ |
[1K, 2K) 120217 |@@@@@@@@@@@@@@ |
[2K, 4K) 80275 |@@@@@@@@@ |
[4K, 8K) 36108 |@@@@ |
[8K, 16K) 11121 |@ |
[16K, 32K) 736 | |
[32K, 64K) 19 | |

Case #4. w/ this set, w/ side job;
@usecs:
[1] 407 | |
[2, 4) 535378 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 795526 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 93032 |@@@@@@ |
[16, 32) 89890 |@@@@@ |
[32, 64) 82775 |@@@@@ |
[64, 128) 84413 |@@@@@ |
[128, 256) 84413 |@@@@@ |
[256, 512) 77202 |@@@@@ |
[512, 1K) 66043 |@@@@ |
[1K, 2K) 49276 |@@@ |
[2K, 4K) 30114 |@ |
[4K, 8K) 11145 | |
[8K, 16K) 2328 | |
[16K, 32K) 88 | |

#3 and #4 clearly show the benefit of this set: with it, we see
significantly fewer latency samples in the 8 usecs and above ranges.

Thanks,
Song

>
> --
> viresh
>
> Viresh Kumar (2):
> sched: Start tracking SCHED_IDLE tasks count in cfs_rq
> sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
>
> kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++++++---------
> kernel/sched/sched.h | 2 ++
> 2 files changed, 35 insertions(+), 9 deletions(-)
>
> --
> 2.21.0.rc0.269.g1a574e7a288b
>
> [1] https://lore.kernel.org/lkml/[email protected]/

2019-05-10 06:57:24

by Peter Zijlstra

Subject: Re: [RFC V2 1/2] sched: Start tracking SCHED_IDLE tasks count in cfs_rq

On Thu, Apr 25, 2019 at 03:07:39PM +0530, Viresh Kumar wrote:
> @@ -5166,6 +5170,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> + int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
>
> /*
> * The code below (indirectly) updates schedutil which looks at

> @@ -5266,6 +5273,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
> + int idle_h_nr_running = unlikely(task_has_idle_policy(p)) ? 1 : 0;
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);

That's a completely pointless branch there (and I suspect the compiler
will see that too); just write:

int idle_h_nr_running = task_has_idle_policy(p);

2019-05-10 07:32:52

by Peter Zijlstra

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:
> We target for an idle CPU in select_idle_sibling() to run the next task,
> but in case we don't find idle CPUs it is better to pick a CPU which
> will run the task the soonest, for performance reason. A CPU which isn't
> idle but has only SCHED_IDLE activity queued on it should be a good
> target based on this criteria as any normal fair task will most likely
> preempt the currently running SCHED_IDLE task immediately. In fact,
> choosing a SCHED_IDLE CPU shall give better results as it should be able
> to run the task sooner than an idle CPU (which requires to be woken up
> from an idle state).
>
> This patch updates the fast path to fallback to a sched-idle CPU if the
> idle CPU isn't found, the slow path can be updated separately later.
>
> Following is the order in which select_idle_sibling() picks up next CPU
> to run the task now:
>
> 1. idle_cpu(target) OR sched_idle_cpu(target)
> 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> 4. idle core(sd)
> 5. idle_cpu(sd)
> 6. sched_idle_cpu(sd)
> 7. idle_cpu(p) - smt
> 8. sched_idle_cpu(p)- smt
>
> Though the policy can be tweaked a bit if we want to have different
> priorities.

I don't hate this per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked interesting and
that would render your approach unfeasible.

2019-05-10 12:32:22

by Vincent Guittot

Subject: Re: [RFC V2 0/2] sched/fair: Fallback to sched-idle CPU for better performance

Hi Song,

On Thu, 9 May 2019 at 23:54, Song Liu <[email protected]> wrote:
>
> On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <[email protected]> wrote:
> >
> > Hi,
> >
> > Here is another attempt to get some benefit out of the sched-idle
> > policy. The previous version [1] focused on getting better power numbers
> > and this version tries to get better performance or lower response time
> > for the tasks.
> >
> > The first patch is unchanged from v1 and accumulates
> > information about sched-idle tasks per CPU.
> >
> > The second patch changes the way the target CPU is selected in the fast
> > path. Currently, we target for an idle CPU in select_idle_sibling() to
> > run the next task, but in case we don't find idle CPUs it is better to
> > pick a CPU which will run the task the soonest, for performance reason.
> > A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> > should be a good target based on this criteria as any normal fair task
> > will most likely preempt the currently running SCHED_IDLE task
> > immediately. In fact, choosing a SCHED_IDLE CPU shall give better
> > results as it should be able to run the task sooner than an idle CPU
> > (which requires to be woken up from an idle state).
> >
> > Basic testing is done with the help of rt-app currently to make sure the
> > task is getting placed correctly.
>
> I run some test with this set on our (Facebook's) web service (main job)
> and ffmpeg (side job). The result looks promising.
>
> For all the tests below, I run the web service with same load level; and
> the side job with SCHED_IDLE. I presented schedule latency distribution
> of the main job. The latency distribution is measured with the runqlat tool:
> https://github.com/iovisor/bpftrace/blob/master/tools/runqlat.bt
>
> I modified the tool to only track wake up latency of the main job.
>
> 4 cases are compared here:
>
> 1. w/o this set, w/o side job;
> 2. w/ this set, w/o side job;
> 3. w/o this set, w/ side job;
> 4. w/ this set, w/ side job;
>
>
> Case #1. w/o this set, w/o side job
> @usecs:
> [1] 1705 | |
> [2, 4) 1102086 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8) 329160 |@@@@@@@@@@@@@@@ |
> [8, 16) 34135 |@ |
> [16, 32) 37466 |@ |
> [32, 64) 15700 | |
> [64, 128) 8759 | |
> [128, 256) 5714 | |
> [256, 512) 3718 | |
> [512, 1K) 2152 | |
> [1K, 2K) 882 | |
> [2K, 4K) 256 | |
> [4K, 8K) 48 | |
> [8K, 16K) 2 | |
>
> Case #2. w/ this set, w/o side job;
> @usecs:
> [1] 2121 | |
> [2, 4) 1251877 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4, 8) 401517 |@@@@@@@@@@@@@@@@ |
> [8, 16) 64325 |@@ |
> [16, 32) 74352 |@@@ |
> [32, 64) 40583 |@ |
> [64, 128) 26129 |@ |
> [128, 256) 18612 | |
> [256, 512) 12863 | |
> [512, 1K) 8304 | |
> [1K, 2K) 4072 | |
> [2K, 4K) 1569 | |
> [4K, 8K) 411 | |
> [8K, 16K) 70 | |
> [16K, 32K) 1 | |
>
> From #1 and #2, we see this set probably adds a little overhead to
> scheduling when there is no side job. But the overhead is clearly very
> small.
>
>
> Case #3. w/o this set, w/ side job;
> @usecs:
> [1] 1282 | |
> [2, 4) 260977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [4, 8) 446120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8, 16) 136927 |@@@@@@@@@@@@@@@ |
> [16, 32) 148758 |@@@@@@@@@@@@@@@@@ |
> [32, 64) 160291 |@@@@@@@@@@@@@@@@@@ |
> [64, 128) 177292 |@@@@@@@@@@@@@@@@@@@@ |
> [128, 256) 184573 |@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 173405 |@@@@@@@@@@@@@@@@@@@@ |
> [512, 1K) 149662 |@@@@@@@@@@@@@@@@@ |
> [1K, 2K) 120217 |@@@@@@@@@@@@@@ |
> [2K, 4K) 80275 |@@@@@@@@@ |
> [4K, 8K) 36108 |@@@@ |
> [8K, 16K) 11121 |@ |
> [16K, 32K) 736 | |
> [32K, 64K) 19 | |
>
> Case #4. w/ this set, w/ side job;
> @usecs:
> [1] 407 | |
> [2, 4) 535378 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [4, 8) 795526 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8, 16) 93032 |@@@@@@ |
> [16, 32) 89890 |@@@@@ |
> [32, 64) 82775 |@@@@@ |
> [64, 128) 84413 |@@@@@ |
> [128, 256) 84413 |@@@@@ |
> [256, 512) 77202 |@@@@@ |
> [512, 1K) 66043 |@@@@ |
> [1K, 2K) 49276 |@@@ |
> [2K, 4K) 30114 |@ |
> [4K, 8K) 11145 | |
> [8K, 16K) 2328 | |
> [16K, 32K) 88 | |
>
> #3 and #4 clearly showed the benefit of this set. With this set, we see
> significantly fewer latency values in the 8usecs+ ranges.
>

Thanks for running tests with this patchset. Your results look good, with
a significant decrease in long wakeup latencies.

Vincent

> Thanks,
> Song
>
> >
> > --
> > viresh
> >
> > Viresh Kumar (2):
> > sched: Start tracking SCHED_IDLE tasks count in cfs_rq
> > sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
> >
> > kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++++++---------
> > kernel/sched/sched.h | 2 ++
> > 2 files changed, 35 insertions(+), 9 deletions(-)
> >
> > --
> > 2.21.0.rc0.269.g1a574e7a288b
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/

2019-05-13 10:22:25

by Viresh Kumar

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

On 10-05-19, 09:21, Peter Zijlstra wrote:
> On Thu, Apr 25, 2019 at 03:07:40PM +0530, Viresh Kumar wrote:
> > We target for an idle CPU in select_idle_sibling() to run the next task,
> > but in case we don't find idle CPUs it is better to pick a CPU which
> > will run the task the soonest, for performance reason. A CPU which isn't
> > idle but has only SCHED_IDLE activity queued on it should be a good
> > target based on this criteria as any normal fair task will most likely
> > preempt the currently running SCHED_IDLE task immediately. In fact,
> > choosing a SCHED_IDLE CPU shall give better results as it should be able
> > to run the task sooner than an idle CPU (which requires to be woken up
> > from an idle state).
> >
> > This patch updates the fast path to fallback to a sched-idle CPU if the
> > idle CPU isn't found, the slow path can be updated separately later.
> >
> > Following is the order in which select_idle_sibling() picks up next CPU
> > to run the task now:
> >
> > 1. idle_cpu(target) OR sched_idle_cpu(target)
> > 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> > 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> > 4. idle core(sd)
> > 5. idle_cpu(sd)
> > 6. sched_idle_cpu(sd)
> > 7. idle_cpu(p) - smt
> > 8. sched_idle_cpu(p)- smt
> >
> > Though the policy can be tweaked a bit if we want to have different
> > priorities.
>
> I don't hate his per se; but the whole select_idle_sibling() thing is
> something that needs looking at.
>
> There was the task stealing thing from Steve that looked interesting and
> that would render your apporach unfeasible.

I am surely missing something, as I don't see how that patchset would make
this patchset perform any worse than it already does.

The idea of this patchset is to find a CPU which can run the task the
soonest if no other CPU is idle. If a CPU is idle we still want to run the
task on that one to finish work asap. This patchset only updates the fast
path right now and doesn't touch the slow path or the periodic/idle
load-balance paths. That would be the next step for sure though.

Steve's patchset (IIUC) adds a new, fast way of doing idle load balancing
at the LLC level, which is no different from normal idle load balancing as
far as this patchset is concerned. In fact, I would say that Steve's
patchset makes our work easier to extend going forward, as we can
capitalize on the new *fast* infrastructure to pull tasks even when a CPU
isn't fully idle but only has sched-idle work on it.

Does this make sense?

@Song: Thanks for giving this a try, and I am really happy to see your
results. I do see that we still don't get the performance we wanted,
perhaps because we only touch the fast path. Maybe load balancing screws
it up for us later on and CPUs are left with only sched-idle tasks. Not
sure though.

--
viresh

2019-05-13 12:38:39

by Peter Zijlstra

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
> On 10-05-19, 09:21, Peter Zijlstra wrote:

> > I don't hate his per se; but the whole select_idle_sibling() thing is
> > something that needs looking at.
> >
> > There was the task stealing thing from Steve that looked interesting and
> > that would render your apporach unfeasible.
>
> I am surely missing something as I don't see how that patchset will
> make this patchset perform badly, than what it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
one.

Basically everybody is complaining that select_idle_sibling() is too
expensive because it checks the entire LLC domain, except for FB (and thus
likely some other workloads too) who depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.
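
For illustration, roughly the below (a sketch only; limited_scan() is a
made-up name, nr is whatever budget the limited scan ends up using, and
sched_idle_cpu() is the helper from this series):

static int limited_scan(struct task_struct *p, struct sched_domain *sd,
			int target, int nr)
{
	int cpu, si_cpu = -1;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (!--nr)
			break;			/* out of budget */
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		if (available_idle_cpu(cpu))
			return cpu;		/* a real idle CPU always wins */
		if (si_cpu == -1 && sched_idle_cpu(cpu))
			si_cpu = cpu;		/* remember the fallback */
	}

	/* No idle CPU within budget: best we have is a sched-idle CPU (or -1). */
	return si_cpu;
}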


2019-05-14 16:07:33

by Steven Sistare

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>
>>> I don't hate his per se; but the whole select_idle_sibling() thing is
>>> something that needs looking at.
>>>
>>> There was the task stealing thing from Steve that looked interesting and
>>> that would render your apporach unfeasible.
>>
>> I am surely missing something as I don't see how that patchset will
>> make this patchset perform badly, than what it already does.
>
> Nah; I just misremembered. I know Oracle has a patch set poking at
> select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
> one.
>
> Basically everybody is complaining select_idle_sibling() is too
> expensive for checking the entire LLC domain, except for FB (and thus
> likely some other workloads too) that depend on it to kill their tail
> latency.
>
> But I suppose we could still do this, even if we scan only a subset of
> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
> tasks and pick that if you do not (in your limited scan) find a better
> candidate.

Subhra posted a patch that incrementally searches for an idle CPU in the LLC,
remembering the last CPU examined, and searching a fixed number of CPUs from there.
That technique is compatible with the one that Viresh suggests; the incremental
search would stop if a SCHED_IDLE cpu was found.

I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs,
updated with atomic operations. Performance was basically unchanged for the
workloads I tested, and I inserted timers around the idle search showing it was
a very small fraction of time both before and after my changes. That led me to
ignore the push side and optimize the pull side with task stealing.
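
(For illustration only -- this is not the actual patch, and the names are
made up -- the bitmap idea amounts to something like:)

/* One shared mask per LLC; cpumask_{set,clear}_cpu() are atomic bitops. */
struct llc_idle_mask {
	cpumask_t	idle;		/* bit set => CPU is idle */
};

static void mark_cpu_idle(struct llc_idle_mask *m, int cpu, bool idle)
{
	if (idle)
		cpumask_set_cpu(cpu, &m->idle);
	else
		cpumask_clear_cpu(cpu, &m->idle);
}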

I would be very interested in hearing from folks that have workloads that demonstrate
that select_idle_sibling is too expensive.

- Steve

2019-05-14 17:34:34

by Subhra Mazumdar

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found


On 5/14/19 9:03 AM, Steven Sistare wrote:
> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>>>> I don't hate his per se; but the whole select_idle_sibling() thing is
>>>> something that needs looking at.
>>>>
>>>> There was the task stealing thing from Steve that looked interesting and
>>>> that would render your apporach unfeasible.
>>> I am surely missing something as I don't see how that patchset will
>>> make this patchset perform badly, than what it already does.
>> Nah; I just misremembered. I know Oracle has a patch set poking at
>> select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
>> one.
>>
>> Basically everybody is complaining select_idle_sibling() is too
>> expensive for checking the entire LLC domain, except for FB (and thus
>> likely some other workloads too) that depend on it to kill their tail
>> latency.
>>
>> But I suppose we could still do this, even if we scan only a subset of
>> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
>> tasks and pick that if you do not (in your limited scan) find a better
>> candidate.
> Subhra posted a patch that incrementally searches for an idle CPU in the LLC,
> remembering the last CPU examined, and searching a fixed number of CPUs from there.
> That technique is compatible with the one that Viresh suggests; the incremental
> search would stop if a SCHED_IDLE cpu was found.
This was the last version of the patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also, select_idle_core() is a net negative for certain workloads like
OLTP, so I had put it behind a SCHED_FEAT to be able to disable it.

Thanks,
Subhra
>
> I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs,
> updated with atomic operations. Performance was basically unchanged for the
> workloads I tested, and I inserted timers around the idle search showing it was
> a very small fraction of time both before and after my changes. That led me to
> ignore the push side and optimize the pull side with task stealing.
>
> I would be very interested in hearing from folks that have workloads that demonstrate
> that select_idle_sibling is too expensive.
>
> - Steve

2019-05-14 17:43:17

by Subhra Mazumdar

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found


On 5/14/19 10:27 AM, Subhra Mazumdar wrote:
>
> On 5/14/19 9:03 AM, Steven Sistare wrote:
>> On 5/13/2019 7:35 AM, Peter Zijlstra wrote:
>>> On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:
>>>> On 10-05-19, 09:21, Peter Zijlstra wrote:
>>>>> I don't hate his per se; but the whole select_idle_sibling() thing is
>>>>> something that needs looking at.
>>>>>
>>>>> There was the task stealing thing from Steve that looked
>>>>> interesting and
>>>>> that would render your apporach unfeasible.
>>>> I am surely missing something as I don't see how that patchset will
>>>> make this patchset perform badly, than what it already does.
>>> Nah; I just misremembered. I know Oracle has a patch set poking at
>>> select_idle_siblings() _somewhere_ (as do I), and I just found the
>>> wrong
>>> one.
>>>
>>> Basically everybody is complaining select_idle_sibling() is too
>>> expensive for checking the entire LLC domain, except for FB (and thus
>>> likely some other workloads too) that depend on it to kill their tail
>>> latency.
>>>
>>> But I suppose we could still do this, even if we scan only a subset of
>>> the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
>>> tasks and pick that if you do not (in your limited scan) find a better
>>> candidate.
>> Subhra posted a patch that incrementally searches for an idle CPU in
>> the LLC,
>> remembering the last CPU examined, and searching a fixed number of
>> CPUs from there.
>> That technique is compatible with the one that Viresh suggests; the
>> incremental
>> search would stop if a SCHED_IDLE cpu was found.
> This was the last version of patchset I sent:
> https://lkml.org/lkml/2018/6/28/810
>
> Also select_idle_core is a net -ve for certain workloads like OLTP. So I
> had put a SCHED_FEAT to be able to disable it.
Forgot to add: the cpumask_weight() computation may not be O(1) with a
large number of CPUs, so it needs to be precomputed in a per-CPU variable
to optimize further. That part is missing from the above patchset.
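
Roughly like this (illustrative sketch only; llc_span_weight and
update_llc_span_weight() are made-up names):

static DEFINE_PER_CPU(int, llc_span_weight);

/*
 * Called when the sched domains are (re)built, so the wakeup path can read
 * a cached weight instead of recomputing cpumask_weight() (O(nr_cpus)).
 */
static void update_llc_span_weight(struct sched_domain *llc_sd, int cpu)
{
	per_cpu(llc_span_weight, cpu) =
		llc_sd ? cpumask_weight(sched_domain_span(llc_sd)) : 1;
}
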
>
> Thanks,
> Subhra
>>
>> I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap
>> of idle CPUs,
>> updated with atomic operations.  Performance was basically unchanged
>> for the
>> workloads I tested, and I inserted timers around the idle search
>> showing it was
>> a very small fraction of time both before and after my changes.  That
>> led me to
>> ignore the push side and optimize the pull side with task stealing.
>>
>> I would be very interested in hearing from folks that have workloads
>> that demonstrate
>> that select_idle_sibling is too expensive.
>>
>> - Steve

2019-05-15 11:19:08

by Viresh Kumar

Subject: Re: [RFC V2 0/2] sched/fair: Fallback to sched-idle CPU for better performance

On 25-04-19, 15:07, Viresh Kumar wrote:
> Hi,
>
> Here is another attempt to get some benefit out of the sched-idle
> policy. The previous version [1] focused on getting better power numbers
> and this version tries to get better performance or lower response time
> for the tasks.
>
> The first patch is unchanged from v1 and accumulates
> information about sched-idle tasks per CPU.
>
> The second patch changes the way the target CPU is selected in the fast
> path. Currently, we target for an idle CPU in select_idle_sibling() to
> run the next task, but in case we don't find idle CPUs it is better to
> pick a CPU which will run the task the soonest, for performance reason.
> A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> should be a good target based on this criteria as any normal fair task
> will most likely preempt the currently running SCHED_IDLE task
> immediately. In fact, choosing a SCHED_IDLE CPU shall give better
> results as it should be able to run the task sooner than an idle CPU
> (which requires to be woken up from an idle state).
>
> Basic testing is done with the help of rt-app currently to make sure the
> task is getting placed correctly.

More results here:

- Tested on Octacore Hikey platform (all CPUs change frequency
together).

- rt-app json attached here. It creates a few tasks and we monitor their
scheduling latency by looking at the "wu_lat" field (usec).

- The histograms are created using
https://github.com/adkein/textogram: textogram -a 0 -z 1000 -n 10

- The stats are accumulated using: https://github.com/nferraz/st

- NOTE: The % values shown don't add up; just look at the total numbers
instead


Test 1: Create 8 CFS tasks (no SCHED_IDLE tasks) without this
patchset:

0 - 100 : ################################################## 72% (3688)
100 - 200 : ################ 24% (1253)
200 - 300 : ## 2% (149)
300 - 400 : 0% (22)
400 - 500 : 0% (1)
500 - 600 : 0% (3)
600 - 700 : 0% (1)
700 - 800 : 0% (1)
800 - 900 :
900 - 1000 : 0% (1)
>1000 : 0% (17)

N min max sum mean stddev
5136 0 2452 535985 104.358 104.585


Test 2: Create 8 CFS tasks and 5 SCHED_IDLE tasks:

A. Without sched-idle patchset:

0 - 100 : ################################################## 88% (3102)
100 - 200 : ## 4% (148)
200 - 300 : 1% (41)
300 - 400 : 0% (27)
400 - 500 : 0% (33)
500 - 600 : 0% (32)
600 - 700 : 1% (36)
700 - 800 : 0% (27)
800 - 900 : 0% (19)
900 - 1000 : 0% (26)
>1000 : 34% (1218)

N min max sum mean stddev
4710 0 67664 5.25956e+06 1116.68 2315.09


B. With sched-idle patchset:

0 - 100 : ################################################## 99% (5042)
100 - 200 : 0% (8)
200 - 300 :
300 - 400 :
400 - 500 : 0% (2)
500 - 600 : 0% (1)
600 - 700 :
700 - 800 : 0% (1)
800 - 900 : 0% (1)
900 - 1000 :
>1000 : 0% (40)

N min max sum mean stddev
5095 0 7773 523170 102.683 475.482


With this patchset, the mean latency dropped to roughly 10% of its earlier
value, and the stddev to roughly 20%.

I have tried more combinations of CFS and SCHED_IDLE tasks and see the
expected improvement in scheduling latency for all of them.

--
viresh


Attachments:
sched-idle.json (840.00 B)

2019-06-04 03:00:22

by Viresh Kumar

Subject: Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

On 25-04-19, 15:07, Viresh Kumar wrote:
> We target for an idle CPU in select_idle_sibling() to run the next task,
> but in case we don't find idle CPUs it is better to pick a CPU which
> will run the task the soonest, for performance reason. A CPU which isn't
> idle but has only SCHED_IDLE activity queued on it should be a good
> target based on this criteria as any normal fair task will most likely
> preempt the currently running SCHED_IDLE task immediately. In fact,
> choosing a SCHED_IDLE CPU shall give better results as it should be able
> to run the task sooner than an idle CPU (which requires to be woken up
> from an idle state).
>
> This patch updates the fast path to fallback to a sched-idle CPU if the
> idle CPU isn't found, the slow path can be updated separately later.
>
> Following is the order in which select_idle_sibling() picks up next CPU
> to run the task now:
>
> 1. idle_cpu(target) OR sched_idle_cpu(target)
> 2. idle_cpu(prev) OR sched_idle_cpu(prev)
> 3. idle_cpu(recent_used_cpu) OR sched_idle_cpu(recent_used_cpu)
> 4. idle core(sd)
> 5. idle_cpu(sd)
> 6. sched_idle_cpu(sd)
> 7. idle_cpu(p) - smt
> 8. sched_idle_cpu(p)- smt
>
> Though the policy can be tweaked a bit if we want to have different
> priorities.
>
> Signed-off-by: Viresh Kumar <[email protected]>
> ---
> kernel/sched/fair.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)

Hi Peter,

I was looking to send V3 with the changes you suggested for patch 1/2. Are
there any changes I should make in this patch along with it?

--
viresh