2020-05-07 13:40:16

by Parth Shah

Subject: [RFC 0/4] IDLE gating in presence of latency-sensitive tasks

Abstract:
=========
The IDLE states provide a way to save power at the cost of wakeup
latency. The deeper the IDLE state selected by the CPUIDLE governor, the
higher the exit latency of that state. This exit latency adds to the
task's wakeup latency. Hence, choosing the best trade-off between the power
savings and the increase in observable latency is what the CPUIDLE governor
focuses upon.

The CPUIDLE governor is generic in nature and tries to provide the best of
both worlds. However, it does not have the capability to distinguish the
latency sensitivity of the tasks queued on a runqueue.

With the introduction of the latency-nice attribute, which conveys the
latency requirements from userspace, the CPUIDLE governor can use it to
identify latency-sensitive tasks.

Hence, this patch-set restricts a CPU running latency-sensitive tasks from
entering any IDLE state, thus minimizing the impact of exit latency on such
tasks. The expected result is better power saving and comparable performance
when compared to disabling all IDLE states.

- Why not use PM_QoS?
PM_QoS provides ways to assist the CPUIDLE governor's decisions with
per-device (CPU, network, etc.) attributes. Because this assistance works at
the "per device" level, it cannot deliver both power saving and latency
minimization for specific tasks only. Hence, a PM_QoS-like feature can be
combined with the scheduler to retrieve the latency_nice information of a
task and make a better decision at the per-task level. This leads to
providing better performance only on those CPUs which have latency-sensitive
tasks queued up, and thus saving power on the other CPUs.
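For context, the usual way to constrain CPU idle exit latency today is the
system-wide PM_QoS interface, which illustrates the "per device, not per
task" limitation described above. A minimal userspace sketch (illustrative
only, not part of this series):

/*
 * Hold a system-wide CPU latency constraint via /dev/cpu_dma_latency.
 * The constraint applies to every CPU for as long as the fd stays open,
 * no matter which tasks actually need the low latency.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	int32_t max_latency_us = 0;	/* 0 => no idle-state exit latency allowed */
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0)
		return 1;
	pause();	/* the constraint is dropped once the fd is closed */
	return 0;
}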


Implementation:
===============
- Latency-sensitive tasks are the tasks marked with latency_nice == -20.
- Use a per-CPU variable to keep track of the (to be) queued up
latency-sensitive tasks.
- The CPUIDLE governor does not choose a non-polling idle state on such
marked CPUs until the per-CPU counter goes back to zero.
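For reference, a minimal sketch of the counter plumbing (patch 1 is only
summarized here; the exact declarations used by the series may differ):

/* kernel/sched/sched.h (sketch) */
DECLARE_PER_CPU(int, nr_lat_sensitive);

static inline int task_is_lat_sensitive(struct task_struct *p)
{
	/* MIN_LATENCY_NICE == -20 */
	return p->latency_nice == MIN_LATENCY_NICE;
}

/* kernel/sched/core.c (sketch) */
DEFINE_PER_CPU(int, nr_lat_sensitive);

A non-zero counter on a CPU means "keep this CPU in the polling idle loop";
it is incremented when a latency-sensitive task is (about to be) placed on
that CPU and decremented when such a task leaves it.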

This strategy solves many latency-related problems for tasks showing a
sleep-wake-sleep pattern (basically most GPU workloads, schbench, rt-app,
database workloads, and many more).

This series is based on the latency_nice patch-set,
PATCH v5: https://lkml.org/lkml/2020/2/28/166

One may use one of the programs below to set the latency_nice value:
- lnice.c: lnice -l -20 <workload>
https://github.com/parthsl/tools/blob/master/misc_scripts/lnice.c
or
- relnice.c: relnice -p <PID> -l 19
https://github.com/parthsl/tools/blob/master/misc_scripts/relnice.c
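In essence, such a tool just calls sched_setattr() with the new attribute.
A minimal sketch (the sched_attr layout and the SCHED_FLAG_LATENCY_NICE
value below follow the latency_nice v5 series and are assumptions here, not
mainline uapi):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* utilization clamps */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* added by the latency_nice series */
	int32_t  sched_latency_nice;
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, per the v5 series */

int main(int argc, char **argv)
{
	struct sched_attr attr = {
		.size			= sizeof(attr),
		.sched_flags		= SCHED_FLAG_LATENCY_NICE,
		.sched_latency_nice	= -20,
	};

	if (argc < 2) {
		fprintf(stderr, "usage: %s <workload> [args...]\n", argv[0]);
		return 1;
	}

	/* mark the calling task latency sensitive, then exec the workload */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return execvp(argv[1], &argv[1]);
}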


Results:
========
# Baseline = tip/sched/core + latency_nice patches
# w/ patch = Baseline + this patch-set + workload's latency_nice = -20

=> Schbench (Lower is better)
-----------------------------
- Baseline:
$> schbench -r 30
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  | 371      | 21                      | 22        |
| 90                  | 729      | 33                      | 37        |
| 99                  | 889      | 40 (-95%)               | 48 (-94%) |
| max                 | 2197     | 59                      | 655       |
| Avg. Energy (Watts) | 73       | 96 (+31%)               | 88 (+20%) |
+---------------------+----------+-------------------------+-----------+

$> schbench -r 10 -m 2 -t 1
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  | 336      | 5                       | 4         |
| 90                  | 464      | 7                       | 6         |
| 99                  | 579      | 10 (-98%)               | 11 (-98%) |
| max                 | 691      | 12                      | 489       |
| Avg. Energy (Watts) | 28       | 40 (+42%)               | 33 (+17%) |
+---------------------+----------+-------------------------+-----------+


=> PostgreSQL (lower is better):
----------------------------------
- 44 Clients running in parallel
$> pgbench -T 30 -S -n -R 10 -c 44
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 | w/ patch     |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   | 2.028    | 0.424 (-80%)            | 1.202 (-40%) |
| latency stddev      | 3.149    | 0.473                   | 0.234        |
| trans. completed    | 294      | 304 (+3%)               | 300 (+2%)    |
| Avg. Energy (Watts) | 23.6     | 42.5 (+80%)             | 26.5 (+20%)  |
+---------------------+----------+-------------------------+--------------+

- 1 Client running
$> pgbench -T 30 -S -n -R 10 -c 1
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 | w/ patch     |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   | 1.292    | 0.282 (-78%)            | 0.237 (-81%) |
| latency stddev      | 0.572    | 0.126                   | 0.116        |
| trans. completed    | 294      | 268 (-8%)               | 315 (+7%)    |
| Avg. Energy (Watts) | 9.8      | 29.6 (+302%)            | 27.7 (+282%) |
+---------------------+----------+-------------------------+--------------+
*trans. completed = Total transactions processed (Higher is better)


Parth Shah (4):
sched/core: Introduce per_cpu counter to track latency sensitive tasks
sched/core: Set nr_lat_sensitive counter at various scheduler
entry/exit points
sched/idle: Disable idle call on least latency requirements
sched/idle: Add debugging bits to validate inconsistency in latency
sensitive task calculations

kernel/sched/core.c | 32 ++++++++++++++++++++++++++++++--
kernel/sched/idle.c | 8 +++++++-
kernel/sched/sched.h | 7 +++++++
3 files changed, 44 insertions(+), 3 deletions(-)

--
2.17.2


2020-05-07 13:40:54

by Parth Shah

Subject: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

Restrict the call to deeper idle states when the given CPU has been marked
as having the least latency requirement.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/idle.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf38f08f..85d72a6e2521 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -262,7 +262,8 @@ static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
+		    per_cpu(nr_lat_sensitive, cpu)) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
--
2.17.2

2020-05-07 13:41:25

by Parth Shah

Subject: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

Monitor tasks at:
1. wake_up_new_task() - forked tasks

2. set_task_cpu() - task migrations, Load balancer

3. __sched_setscheduler() - set/unset latency_nice value
Increment the nr_lat_sensitive count on the task's CPU when the task is
marked with latency_nice == -20.
Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
with a latency_nice value greater than -20.

4. finish_task_switch() - dying task

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
kernel/sched/sched.h | 5 +++++
2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8b76f41d61..ad396c36eba6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (task_is_lat_sensitive(p)) {
+			per_cpu(nr_lat_sensitive, task_cpu(p))--;
+			per_cpu(nr_lat_sensitive, new_cpu)++;
+		}
+
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
@@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
+	int target_cpu = 0;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	p->state = TASK_RUNNING;
@@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
+	__set_task_cpu(p, target_cpu);
+
 #endif
 	rq = __task_rq_lock(p, &rf);
+
+#ifdef CONFIG_SMP
+	if (task_is_lat_sensitive(p))
+		per_cpu(nr_lat_sensitive, target_cpu)++;
+#endif
+
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
@@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		if (task_is_lat_sensitive(prev))
+			per_cpu(nr_lat_sensitive, prev->cpu)--;
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
@@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
 
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (p->state != TASK_DEAD &&
+		    attr->sched_latency_nice != p->latency_nice) {
+			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
+				per_cpu(nr_lat_sensitive, task_cpu(p))++;
+			else if (task_is_lat_sensitive(p))
+				per_cpu(nr_lat_sensitive, task_cpu(p))--;
+		}
+
 		p->latency_nice = attr->sched_latency_nice;
+	}
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c41020c530e..56f885e37451 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+static inline int task_is_lat_sensitive(struct task_struct *p)
+{
+	return p->latency_nice == MIN_LATENCY_NICE;
+}
+
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
--
2.17.2

2020-05-07 13:42:58

by Parth Shah

Subject: [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations

We monitor tasks entering/exiting the scheduler, but there might be
unhandled situations which can lead to an inconsistent value of the
nr_lat_sensitive counter. This may lead to restricting the use of IDLE
states despite the absence of any latency-sensitive workload.
Hence, print a pr_info() if a negative value of the nr_lat_sensitive counter
is found.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/idle.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 85d72a6e2521..7aa0775e69c0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -231,6 +231,11 @@ static void cpuidle_idle_call(void)
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	int pm_disabled = per_cpu(nr_lat_sensitive, cpu);
+
+	if (pm_disabled < 0)
+		pr_info("Inconsistent value of nr_lat_sensitive counter\n");
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *
@@ -263,7 +268,7 @@ static void do_idle(void)
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
 		if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
-		    per_cpu(nr_lat_sensitive, cpu)) {
+		    pm_disabled) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
--
2.17.2

2020-05-08 08:36:17

by Pavankumar Kondeti

Subject: Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

Hi Parth,

On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> Monitor tasks at:
> 1. wake_up_new_task() - forked tasks
>
> 2. set_task_cpu() - task migrations, Load balancer
>
> 3. __sched_setscheduler() - set/unset latency_nice value
> Increment the nr_lat_sensitive count on the CPU with task marked with
> latency_nice == -20.
> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> with >-20 latency_nice task.
>
> 4. finish_task_switch() - dying task
>


> Signed-off-by: Parth Shah <[email protected]>
> ---
> kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 5 +++++
> 2 files changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8b76f41d61..ad396c36eba6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> trace_sched_migrate_task(p, new_cpu);
>
> if (task_cpu(p) != new_cpu) {
> + if (task_is_lat_sensitive(p)) {
> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> + per_cpu(nr_lat_sensitive, new_cpu)++;
> + }
> +

Since we can come here without rq locks, there is a possibility
of a race and incorrect updates can happen. Since the counters
are used to prevent C-states, we don't want that to happen.

> if (p->sched_class->migrate_task_rq)
> p->sched_class->migrate_task_rq(p, new_cpu);
> p->se.nr_migrations++;
> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
> {
> struct rq_flags rf;
> struct rq *rq;
> + int target_cpu = 0;
>
> raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> p->state = TASK_RUNNING;
> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
> * as we're not fully set-up yet.
> */
> p->recent_used_cpu = task_cpu(p);
> - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> + target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
> + __set_task_cpu(p, target_cpu);
> +

The target_cpu variable can be eliminated by using task_cpu(p) directly
in the below update.

> #endif
> rq = __task_rq_lock(p, &rf);
> +
> +#ifdef CONFIG_SMP
> + if (task_is_lat_sensitive(p))
> + per_cpu(nr_lat_sensitive, target_cpu)++;
> +#endif
> +

Is the SMP check intentional? In some parts of this patch, updates to
nr_lat_sensitive are done without SMP checks. For example,
finish_task_switch() below.

> update_rq_clock(rq);
> post_init_entity_util_avg(p);
>
> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> if (prev->sched_class->task_dead)
> prev->sched_class->task_dead(prev);
>
> + if (task_is_lat_sensitive(prev))
> + per_cpu(nr_lat_sensitive, prev->cpu)--;
> +
> /*
> * Remove function-return probe instances associated with this
> * task and put them back on the free list.
> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
> p->normal_prio = normal_prio(p);
> set_load_weight(p, true);
>
> - if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> + if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> + if (p->state != TASK_DEAD &&
> + attr->sched_latency_nice != p->latency_nice) {
> + if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> + per_cpu(nr_lat_sensitive, task_cpu(p))++;
> + else if (task_is_lat_sensitive(p))
> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> + }
> +
> p->latency_nice = attr->sched_latency_nice;
> + }
> }

There is a potential race here due to which we can mess up the refcount.

- A latency sensitive task is marked TASK_DEAD
<snip>
- sched_setattr() called on the task to clear the latency nice. Since
we check the task state here, we skip the decrement.
- The task is finally context switched out and we skip the decrement again
since it is not a latency sensitive task.

>
> /* Actually do priority change: must hold pi & rq lock. */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5c41020c530e..56f885e37451 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
> return dl_policy(p->policy);
> }
>
> +static inline int task_is_lat_sensitive(struct task_struct *p)
> +{
> + return p->latency_nice == MIN_LATENCY_NICE;
> +}
> +
> #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
> /*
> --
> 2.17.2
>

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-08 08:38:17

by Pavankumar Kondeti

Subject: Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> Restrict the call to deeper idle states when the given CPU has been set for
> the least latency requirements
>
> Signed-off-by: Parth Shah <[email protected]>
> ---
> kernel/sched/idle.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..85d72a6e2521 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -262,7 +262,8 @@ static void do_idle(void)
> * broadcast device expired for us, we don't want to go deep
> * idle as we know that the IPI is going to arrive right away.
> */
> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> + if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> + per_cpu(nr_lat_sensitive, cpu)) {
> tick_nohz_idle_restart_tick();
> cpu_idle_poll();
> } else {
> --
> 2.17.2
>

Since nr_lat_sensitive updates can happen remotely (when a latency-sensitive
task becomes a non-latency-sensitive task), we may need to add this condition
in cpu_idle_poll() as well.
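
(For illustration, the polling loop in cpu_idle_poll() spins on the same
conditions, so the suggestion would amount to something like the sketch
below; not part of the posted series:)

	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired() ||
	        per_cpu(nr_lat_sensitive, smp_processor_id())))
		cpu_relax();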

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-08 11:18:04

by Parth Shah

Subject: Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

Hi Pavan,

Thanks for going through this patch-set.

On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> Hi Parth,
>
> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>> Monitor tasks at:
>> 1. wake_up_new_task() - forked tasks
>>
>> 2. set_task_cpu() - task migrations, Load balancer
>>
>> 3. __sched_setscheduler() - set/unset latency_nice value
>> Increment the nr_lat_sensitive count on the CPU with task marked with
>> latency_nice == -20.
>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
>> with >-20 latency_nice task.
>>
>> 4. finish_task_switch() - dying task
>>
>
>
>> Signed-off-by: Parth Shah <[email protected]>
>> ---
>> kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 5 +++++
>> 2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2d8b76f41d61..ad396c36eba6 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> trace_sched_migrate_task(p, new_cpu);
>>
>> if (task_cpu(p) != new_cpu) {
>> + if (task_is_lat_sensitive(p)) {
>> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> + per_cpu(nr_lat_sensitive, new_cpu)++;
>> + }
>> +
>
> Since we can come here without rq locks, there is a possibility
> of a race and incorrect updates can happen. Since the counters
> are used to prevent C-states, we don't want that to happen.

I did try using task_lock(p) wherever we change the refcount and wherever
the latency_nice value is set. There I was using nr_lat_sensitive with an
atomic_t type.

After a lot of thinking about how to optimize it, and given that we anyway
hold the rq lock in most of these paths, I decided not to use any lock here
and to see if the scheduler community has a well-known solution for this :-)

But in brief, using an atomic_t nr_lat_sensitive and task_lock(p) when
changing the refcount should solve the problem, right?

If you or anyone else has a solution for this kind of pattern, that would
surely be helpful.

>
>> if (p->sched_class->migrate_task_rq)
>> p->sched_class->migrate_task_rq(p, new_cpu);
>> p->se.nr_migrations++;
>> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
>> {
>> struct rq_flags rf;
>> struct rq *rq;
>> + int target_cpu = 0;
>>
>> raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>> p->state = TASK_RUNNING;
>> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
>> * as we're not fully set-up yet.
>> */
>> p->recent_used_cpu = task_cpu(p);
>> - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> + target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
>> + __set_task_cpu(p, target_cpu);
>> +
>
> The target_cpu variable can be eliminated by using task_cpu(p) directly
> in the below update.

Right. Will change it, thus saving one diff line.

>
>> #endif
>> rq = __task_rq_lock(p, &rf);
>> +
>> +#ifdef CONFIG_SMP
>> + if (task_is_lat_sensitive(p))
>> + per_cpu(nr_lat_sensitive, target_cpu)++;
>> +#endif
>> +
>
> Is the SMP check intentional? In some parts of this patch, updates to
> nr_lat_sensitive are done without SMP checks. For example,
> finish_task_switch() below.

No, I just forgot to remove it. I will remove the SMP checks in the next revision.

>
>> update_rq_clock(rq);
>> post_init_entity_util_avg(p);
>>
>> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>> if (prev->sched_class->task_dead)
>> prev->sched_class->task_dead(prev);
>>
>> + if (task_is_lat_sensitive(prev))
>> + per_cpu(nr_lat_sensitive, prev->cpu)--;
>> +
>> /*
>> * Remove function-return probe instances associated with this
>> * task and put them back on the free list.
>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>> p->normal_prio = normal_prio(p);
>> set_load_weight(p, true);
>>
>> - if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>> + if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>> + if (p->state != TASK_DEAD &&
>> + attr->sched_latency_nice != p->latency_nice) {
>> + if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>> + per_cpu(nr_lat_sensitive, task_cpu(p))++;
>> + else if (task_is_lat_sensitive(p))
>> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> + }
>> +
>> p->latency_nice = attr->sched_latency_nice;
>> + }
>> }
>
> There is a potential race here due to which we can mess up the refcount.
>
> - A latency sensitive task is marked TASK_DEAD
> <snip>
> - sched_setattr() called on the task to clear the latency nice. Since
> we check the task state here, we skip the decrement.
> - The task is finally context switched out and we skip the decrement again
> since it is not a latency senstivie task.

If the task is already marked TASK_DEAD then we should have already
decremented its refcount in finish_task_switch().
Am I missing something?

>
>>
>> /* Actually do priority change: must hold pi & rq lock. */
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 5c41020c530e..56f885e37451 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
>> return dl_policy(p->policy);
>> }
>>
>> +static inline int task_is_lat_sensitive(struct task_struct *p)
>> +{
>> + return p->latency_nice == MIN_LATENCY_NICE;
>> +}
>> +
>> #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>>
>> /*
>> --
>> 2.17.2
>>
>
> Thanks,
> Pavan
>

Thanks,
Parth

2020-05-08 11:23:11

by Parth Shah

Subject: Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

Hi Pavan,

On 5/8/20 2:06 PM, Pavan Kondeti wrote:
> On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
>> Restrict the call to deeper idle states when the given CPU has been set for
>> the least latency requirements
>>
>> Signed-off-by: Parth Shah <[email protected]>
>> ---
>> kernel/sched/idle.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index b743bf38f08f..85d72a6e2521 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -262,7 +262,8 @@ static void do_idle(void)
>> * broadcast device expired for us, we don't want to go deep
>> * idle as we know that the IPI is going to arrive right away.
>> */
>> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> + if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
>> + per_cpu(nr_lat_sensitive, cpu)) {
>> tick_nohz_idle_restart_tick();
>> cpu_idle_poll();
>> } else {
>> --
>> 2.17.2
>>
>
> Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
> task becomes non-latency sensitive task), we may need to add this condition
> in cpu_idle_poll() as well.
>

Right. But if the CPU is running the idle task then it will come back to
do_idle() and read the refcount again. It's a power-saving penalty for one
do_idle() loop, but it is difficult to put up checks for every change in the
latency_nice value.

Thanks,
Parth

2020-05-09 02:22:44

by Pavankumar Kondeti

Subject: Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

On Fri, May 08, 2020 at 04:49:04PM +0530, Parth Shah wrote:
> Hi Pavan,
>
> On 5/8/20 2:06 PM, Pavan Kondeti wrote:
> > On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> >> Restrict the call to deeper idle states when the given CPU has been set for
> >> the least latency requirements
> >>
> >> Signed-off-by: Parth Shah <[email protected]>
> >> ---
> >> kernel/sched/idle.c | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index b743bf38f08f..85d72a6e2521 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -262,7 +262,8 @@ static void do_idle(void)
> >> * broadcast device expired for us, we don't want to go deep
> >> * idle as we know that the IPI is going to arrive right away.
> >> */
> >> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> >> + if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> >> + per_cpu(nr_lat_sensitive, cpu)) {
> >> tick_nohz_idle_restart_tick();
> >> cpu_idle_poll();
> >> } else {
> >> --
> >> 2.17.2
> >>
> >
> > Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
> > task becomes non-latency sensitive task), we may need to add this condition
> > in cpu_idle_poll() as well.
> >
>
> Right. but if the CPU is running idle_task then it will again come back to
> do_idle and read the refcount. Its a penalty in power-saving for 1
> do_idle() loop but it is difficult to put up checks for any change in
> latency_nice value.
>

Yes, when the CPU exits the polling loop to serve an interrupt, we note the
change in the latency-sensitive task count and enter a C-state.

I asked this because all checks that are present here are also present in
cpu_idle_poll().

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-09 02:43:24

by Pavankumar Kondeti

Subject: Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
> Hi Pavan,
>
> Thanks for going through this patch-set.
>
> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> > Hi Parth,
> >
> > On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> >> Monitor tasks at:
> >> 1. wake_up_new_task() - forked tasks
> >>
> >> 2. set_task_cpu() - task migrations, Load balancer
> >>
> >> 3. __sched_setscheduler() - set/unset latency_nice value
> >> Increment the nr_lat_sensitive count on the CPU with task marked with
> >> latency_nice == -20.
> >> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> >> with >-20 latency_nice task.
> >>
> >> 4. finish_task_switch() - dying task
> >>
> >
> >
> >> Signed-off-by: Parth Shah <[email protected]>
> >> ---
> >> kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
> >> kernel/sched/sched.h | 5 +++++
> >> 2 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8b76f41d61..ad396c36eba6 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >> trace_sched_migrate_task(p, new_cpu);
> >>
> >> if (task_cpu(p) != new_cpu) {
> >> + if (task_is_lat_sensitive(p)) {
> >> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> + per_cpu(nr_lat_sensitive, new_cpu)++;
> >> + }
> >> +
> >
> > Since we can come here without rq locks, there is a possibility
> > of a race and incorrect updates can happen. Since the counters
> > are used to prevent C-states, we don't want that to happen.
>
> I did tried using task_lock(p) wherever we do change refcount and when
> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
> type.
>
> After lots of thinking to optimize it and thinking that we anyways hold rq
> lock, I thought of not using any lock here and see if scheduler community
> has well known solution for this :-)
>
> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
> refcount should solve problem, right?
>
> If you or anyone have solution for this kind of pattern, then that surely
> will be helpful.
>
I am not sure if task_lock() can help here, because we are operating on the
counter on a per-CPU basis. Maybe a cmpxchg-based loop works here to make
sure that the increment/decrement operation happens atomically.
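
(For illustration, the kind of update meant here might look like the sketch
below; purely illustrative, not part of the posted series:)

static inline void inc_nr_lat_sensitive(int cpu)
{
	int *ptr = &per_cpu(nr_lat_sensitive, cpu);
	int old;

	/* retry until we atomically swap in old + 1 */
	do {
		old = READ_ONCE(*ptr);
	} while (cmpxchg(ptr, old, old + 1) != old);
}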

> >
> >> if (p->sched_class->migrate_task_rq)
> >> p->sched_class->migrate_task_rq(p, new_cpu);
> >> p->se.nr_migrations++;

[...]

> >> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
> >> p->normal_prio = normal_prio(p);
> >> set_load_weight(p, true);
> >>
> >> - if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> >> + if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> >> + if (p->state != TASK_DEAD &&
> >> + attr->sched_latency_nice != p->latency_nice) {
> >> + if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> >> + per_cpu(nr_lat_sensitive, task_cpu(p))++;
> >> + else if (task_is_lat_sensitive(p))
> >> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> + }
> >> +
> >> p->latency_nice = attr->sched_latency_nice;
> >> + }
> >> }
> >
> > There is a potential race here due to which we can mess up the refcount.
> >
> > - A latency sensitive task is marked TASK_DEAD
> > <snip>
> > - sched_setattr() called on the task to clear the latency nice. Since
> > we check the task state here, we skip the decrement.
> > - The task is finally context switched out and we skip the decrement again
> > since it is not a latency senstivie task.
>
> if task is already marked TASK_DEAD then we should have already decremented
> its refcount in finish_task_switch().
> am I missing something?

There is a window (the context switch and dropping of the rq lock) between
marking a task DEAD (in do_task_dead()) and dropping the ref counter
(in finish_task_switch()) during which we can run into this path and skip
the update because the task is marked as DEAD.

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-12 07:56:34

by Parth Shah

Subject: Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points



On 5/9/20 8:09 AM, Pavan Kondeti wrote:
> On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
>> Hi Pavan,
>>
>> Thanks for going through this patch-set.
>>
>> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
>>> Hi Parth,
>>>
>>> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>>>> Monitor tasks at:
>>>> 1. wake_up_new_task() - forked tasks
>>>>
>>>> 2. set_task_cpu() - task migrations, Load balancer
>>>>
>>>> 3. __sched_setscheduler() - set/unset latency_nice value
>>>> Increment the nr_lat_sensitive count on the CPU with task marked with
>>>> latency_nice == -20.
>>>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
>>>> with >-20 latency_nice task.
>>>>
>>>> 4. finish_task_switch() - dying task
>>>>
>>>
>>>
>>>> Signed-off-by: Parth Shah <[email protected]>
>>>> ---
>>>> kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
>>>> kernel/sched/sched.h | 5 +++++
>>>> 2 files changed, 33 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8b76f41d61..ad396c36eba6 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>>> trace_sched_migrate_task(p, new_cpu);
>>>>
>>>> if (task_cpu(p) != new_cpu) {
>>>> + if (task_is_lat_sensitive(p)) {
>>>> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> + per_cpu(nr_lat_sensitive, new_cpu)++;
>>>> + }
>>>> +
>>>
>>> Since we can come here without rq locks, there is a possibility
>>> of a race and incorrect updates can happen. Since the counters
>>> are used to prevent C-states, we don't want that to happen.
>>
>> I did tried using task_lock(p) wherever we do change refcount and when
>> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
>> type.
>>
>> After lots of thinking to optimize it and thinking that we anyways hold rq
>> lock, I thought of not using any lock here and see if scheduler community
>> has well known solution for this :-)
>>
>> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
>> refcount should solve problem, right?
>>
>> If you or anyone have solution for this kind of pattern, then that surely
>> will be helpful.
>>
> I am not sure if task_lock() can help here, because we are operating the
> counter on per CPU basis here. May be cmpxchg based loop works here to make
> sure that increment/decrement operation happens atomically here.
>
>>>
>>>> if (p->sched_class->migrate_task_rq)
>>>> p->sched_class->migrate_task_rq(p, new_cpu);
>>>> p->se.nr_migrations++;
>
> [...]
>
>>>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>>>> p->normal_prio = normal_prio(p);
>>>> set_load_weight(p, true);
>>>>
>>>> - if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>>>> + if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>>>> + if (p->state != TASK_DEAD &&
>>>> + attr->sched_latency_nice != p->latency_nice) {
>>>> + if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>>>> + per_cpu(nr_lat_sensitive, task_cpu(p))++;
>>>> + else if (task_is_lat_sensitive(p))
>>>> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> + }
>>>> +
>>>> p->latency_nice = attr->sched_latency_nice;
>>>> + }
>>>> }
>>>
>>> There is a potential race here due to which we can mess up the refcount.
>>>
>>> - A latency sensitive task is marked TASK_DEAD
>>> <snip>
>>> - sched_setattr() called on the task to clear the latency nice. Since
>>> we check the task state here, we skip the decrement.
>>> - The task is finally context switched out and we skip the decrement again
>>> since it is not a latency senstivie task.
>>
>> if task is already marked TASK_DEAD then we should have already decremented
>> its refcount in finish_task_switch().
>> am I missing something?
>
> There is a window (context switch and dropping rq lock) between
> marking a task DEAD (in do_task_dead()) and dropping the ref counter
> (in finish_task_switch()) during which we can run into here and skip
> the checking because task is marked as DEAD.
>

Yeah, the TASK_DEAD case does seem to be a genuine race condition. At every
other place we do hold task_rq_lock(), except when the task is dying. There
is a window between do_task_dead() and finish_task_switch() which may create
the race condition, so if marking TASK_DEAD were protected under
task_rq_lock() then this could be prevented. I will have to look at the code
more thoroughly and figure out a way to protect the refcount under such
circumstances.


Thanks,
Parth