2020-08-06 16:32:32

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

Hi,

> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>
>>
>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>> From: Jiang Biao <[email protected]>
>>>>
>>>> No need to preempt when there is only one runnable CFS task with
>>>> other IDLE tasks on the runqueue. The only CFS task would always
>>>> be picked in that case.
>>>>
>>>> Signed-off-by: Jiang Biao <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 04fa8dbcfa4d..8fb80636b010 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4527,7 +4527,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>> return;
>>>> #endif
>>>>
>>>> - if (cfs_rq->nr_running > 1)
>>>> + if (cfs_rq->nr_running > cfs_rq.idle_h_nr_running + 1)
>>>
>>> cfs_rq is a pointer.
>> It is. Sorry about that. :)
>>
>>>
>>>> check_preempt_tick(cfs_rq, curr);
>>>> }
>>>
>>> You can't compare cfs_rq->nr_running with cfs_rq->idle_h_nr_running!
>>>
>>> There is a difference between cfs_rq->h_nr_running and
>>> cfs_rq->nr_running. The '_h_' stands for hierarchical.
>>>
>>> The former gives you hierarchical task accounting whereas the latter is
>>> the number of sched entities (representing tasks or taskgroups) enqueued
>>> in cfs_rq.
>>>
>>> In entity_tick(), cfs_rq->nr_running has to be used for the condition to
>>> call check_preempt_tick(). We want to check if curr has to be preempted
>>> by __pick_first_entity(cfs_rq) on this cfs_rq.
>>>
>>> entity_tick() is called for each sched entity (and so for each
>>> cfs_rq_of(se)) of the task group hierarchy (e.g. task p running in
>>> taskgroup /A/B : se(p) -> se(A/B) -> se(A)).
>> That’s true. I was thinking adding a new cfs_rq->idle_nr_running member to
>> track the per cfs_rq's IDLE task number, and reducing preemption here based
>> on that.
>
> How would you deal with se's representing taskgroups which contain
> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that?
I’m not sure I get the point. :) How about the following patch,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fa8dbcfa4d..8715f03ed6d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 list_add(&se->group_node, &rq->cfs_tasks);
         }
 #endif
+        if (task_has_idle_policy(task_of(se)))
+                cfs_rq->idle_nr_running++;
+
         cfs_rq->nr_running++;
 }

@@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 list_del_init(&se->group_node);
         }
 #endif
+        if (task_has_idle_policy(task_of(se)))
+                cfs_rq->idle_nr_running--;
+
         cfs_rq->nr_running--;
 }

@@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
                 return;
 #endif

-        if (cfs_rq->nr_running > 1)
+        if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
+            cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
                 check_preempt_tick(cfs_rq, curr);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 877fb08eb1b0..401090393e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -500,6 +500,7 @@ struct cfs_bandwidth { };
 struct cfs_rq {
         struct load_weight      load;
         unsigned int            nr_running;
+        unsigned int            idle_nr_running;
         unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
         unsigned int            idle_h_nr_running; /* SCHED_IDLE */
>
>> I’m not sure if it’s OK to do that, because the IDLE class doesn't seem to be
>> pure enough that it could tolerate starving.
>
> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE
> (policy)?
The case is that we need low-priority tasks (called offline tasks) to utilize the
spare CPU left by CFS SCHED_NORMAL tasks (called online tasks) without
interfering with the online tasks.
Offline tasks should only run when there are no runnable online tasks, and
offline tasks should never preempt online tasks.
The SCHED_IDLE policy does not seem able to meet that requirement: it still has
a weight (3), and even though that weight is small, a SCHED_IDLE task can still
preempt online tasks in the name of fairness. In that way, offline tasks using
the SCHED_IDLE policy could interfere with the online tasks.
On the other hand, idle_sched_class does not seem qualified either. It's too
simple and is currently only used for the per-cpu idle task.

Thx.
Regards,
Jiang

>
>> We need an absolutely low priority class that could tolerate starving, which
>> could be used to co-locate offline tasks. But the IDLE class seems not to be
>> *low* enough, considering the fairness of CFS, and the IDLE class still has a
>> weight.
>
> [...]
>


2020-08-10 13:28:37

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>
>>>
>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>> From: Jiang Biao <[email protected]>

[...]

>> How would you deal with se's representing taskgroups which contain
>> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that?
> I’m not sure I get the point. :) How about the following patch,
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04fa8dbcfa4d..8715f03ed6d7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> list_add(&se->group_node, &rq->cfs_tasks);
> }
> #endif
> + if (task_has_idle_policy(task_of(se)))
> + cfs_rq->idle_nr_running++;
> +
> cfs_rq->nr_running++;
> }
>
> @@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> list_del_init(&se->group_node);
> }
> #endif
> + if (task_has_idle_policy(task_of(se)))
> + cfs_rq->idle_nr_running--;
> +
> cfs_rq->nr_running--;
> }
>
> @@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> return;
> #endif
>
> - if (cfs_rq->nr_running > 1)
> + if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
> + cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
> check_preempt_tick(cfs_rq, curr);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 877fb08eb1b0..401090393e09 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -500,6 +500,7 @@ struct cfs_bandwidth { };
> struct cfs_rq {
> struct load_weight load;
> unsigned int nr_running;
> + unsigned int idle_nr_running;
> unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
> unsigned int idle_h_nr_running; /* SCHED_IDLE */

        /
      / | \
     A  n0 i0
    / \
   n1  i1

I don't think this will work. E.g. the patch would prevent tick
preemption between 'A' and 'n0' on '/' as well: there nr_running = 3,
idle_nr_running = 1, h_nr_running = 4 and idle_h_nr_running = 2, so

(3 > 1 + 1) && (4 - 2 > 1 + 1)

is false and check_preempt_tick() would not be called.

You also have to make sure that a SCHED_IDLE task can tick preempt
another SCHED_IDLE task.

>>> I’m not sure if it’s ok to do that, because the IDLE class seems not to be so
>>> pure that could tolerate starving.
>>
>> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE
>> (policy)?
> The case is that we need tasks(low priority, called offline tasks) to utilize the
> spare cpu left by CFS SCHED_NORMAL tasks(called online tasks) without
> interfering the online tasks.
> Offline tasks only run when there’s no runnable online tasks, and offline tasks
> never preempt online tasks.
> The SCHED_IDLE policy seems not to be abled to be qualified for that requirement,
> because it has a weight(3), even though it’s small, but it can still preempt online
> tasks considering the fairness. In that way, offline tasks of SCHED_IDLE policy
> could interfere the online tasks.

Because of this very small weight (weight=3), compared to a SCHED_NORMAL
nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge
se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task).
This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task.

It's different when the SCHED_NORMAL task has nice 19 (weight=15) but
that's part of the CFS design.
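
As an illustration, a user-space sketch (added here, not kernel code): CFS
advances a sched entity's vruntime by roughly delta_exec * NICE_0_LOAD /
se->load.weight (the kernel's __calc_delta() uses pre-computed inverse weights,
but the ratio is the same), so per millisecond of runtime a weight-3 SCHED_IDLE
entity ages about 341 times faster than a nice 0 entity:

#include <stdio.h>

#define NICE_0_LOAD 1024ULL

/* vruntime advance for delta_exec ns of runtime at a given load weight */
static unsigned long long vruntime_delta(unsigned long long delta_exec_ns,
                                         unsigned long long weight)
{
        return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
        unsigned long long tick_ns = 1000000ULL;  /* 1ms tick (1000 Hz) */

        printf("nice 0     (w=1024): +%llu ns vruntime per 1ms runtime\n",
               vruntime_delta(tick_ns, 1024));
        printf("nice 19    (w=15):   +%llu ns vruntime per 1ms runtime\n",
               vruntime_delta(tick_ns, 15));
        printf("SCHED_IDLE (w=3):    +%llu ns vruntime per 1ms runtime\n",
               vruntime_delta(tick_ns, 3));
        return 0;
}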

> On the other hand, idle_sched_class seems not to be qualified either. It’s too
> simple and only used for per-cpu idle task currently.

Yeah, leave this for the rq->idle task (swapper/X).

>>> We need an absolutely low priority class that could tolerate starving, which
>>> could be used to co-locate offline tasks. But IDLE class seems to be not
>>> *low* enough, if considering the fairness of CFS, and IDLE class still has a
>>> weight.

Understood. But this (tick) preemption should happen extremely rarely,
especially if you have SCHED_NORMAL nice 0 tasks, right?

2020-08-11 00:44:58

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

Hi,

> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>
>>>>
>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
>>> How would you deal with se's representing taskgroups which contain
>>> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that?
>> I’m not sure I get the point. :) How about the following patch,
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04fa8dbcfa4d..8715f03ed6d7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> list_add(&se->group_node, &rq->cfs_tasks);
>> }
>> #endif
>> + if (task_has_idle_policy(task_of(se)))
>> + cfs_rq->idle_nr_running++;
>> +
>> cfs_rq->nr_running++;
>> }
>>
>> @@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> list_del_init(&se->group_node);
>> }
>> #endif
>> + if (task_has_idle_policy(task_of(se)))
>> + cfs_rq->idle_nr_running--;
>> +
>> cfs_rq->nr_running--;
>> }
>>
>> @@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> return;
>> #endif
>>
>> - if (cfs_rq->nr_running > 1)
>> + if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
>> + cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
>> check_preempt_tick(cfs_rq, curr);
>> }
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 877fb08eb1b0..401090393e09 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -500,6 +500,7 @@ struct cfs_bandwidth { };
>> struct cfs_rq {
>> struct load_weight load;
>> unsigned int nr_running;
>> + unsigned int idle_nr_running;
>> unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
>> unsigned int idle_h_nr_running; /* SCHED_IDLE */
>
>         /
>       / | \
>      A  n0 i0
>     / \
>    n1  i1
>
> I don't think this will work. E.g. the patch would prevent tick
> preemption between 'A' and 'n0' on '/' as well
>
> (3 > 1 + 1) && (4 - 2 > 1 + 1)
>
> You also have to make sure that a SCHED_IDLE task can tick preempt
> another SCHED_IDLE task.

That’s right. :)

>
>>>> I’m not sure if it’s ok to do that, because the IDLE class seems not to be so
>>>> pure that could tolerate starving.
>>>
>>> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE
>>> (policy)?
>> The case is that we need tasks(low priority, called offline tasks) to utilize the
>> spare cpu left by CFS SCHED_NORMAL tasks(called online tasks) without
>> interfering the online tasks.
>> Offline tasks only run when there’s no runnable online tasks, and offline tasks
>> never preempt online tasks.
>> The SCHED_IDLE policy seems not to be abled to be qualified for that requirement,
>> because it has a weight(3), even though it’s small, but it can still preempt online
>> tasks considering the fairness. In that way, offline tasks of SCHED_IDLE policy
>> could interfere the online tasks.
>
> Because of this very small weight (weight=3), compared to a SCHED_NORMAL
> nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge
> se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task).
> This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task.
Could you please explain how the huge vruntime penalization (1024/3) makes
sure that a SCHED_IDLE task does not tick-preempt a SCHED_NORMAL nice 0 task?

Thanks a lot.

Regards,
Jiang

>
> It's different when the SCHED_NORMAL task has nice 19 (weight=15) but
> that's part of the CFS design.
>
>> On the other hand, idle_sched_class seems not to be qualified either. It’s too
>> simple and only used for per-cpu idle task currently.
>
> Yeah, leave this for the rq->idle task (swapper/X).
Got it.

>
>>>> We need an absolutely low priority class that could tolerate starving, which
>>>> could be used to co-locate offline tasks. But IDLE class seems to be not
>>>> *low* enough, if considering the fairness of CFS, and IDLE class still has a
>>>> weight.
>
> Understood. But this (tick) preemption should happen extremely rarely,
> especially if you have SCHED_NORMAL nice 0 tasks, right?

2020-08-11 15:58:09

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>
>>>>>
>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>
>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>> From: Jiang Biao <[email protected]>

[...]

>> Because of this very small weight (weight=3), compared to a SCHED_NORMAL
>> nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge
>> se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task).
>> This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task.
> Could you please explain how the huge penalization of vruntime(1024/3) could
> make sure SCHED_IDLE not tick preempting SCHED_NORMAL nice 0 task?
>
> Thanks a lot.

Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
single CPU and trace_printk the conditions 'if (delta < 0)' and ' if
(delta > ideal_runtime)' in check_preempt_tick().

Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
the niceness of the 2 SCHED_OTHER tasks to 19 to see some differences in
kernelshark's task layout.

rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
craft those artificial use cases.
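
For reference, a sketch of where those two conditions sit in check_preempt_tick()
(based on the mainline fair.c of this era; the trace_printk() calls are
illustrative additions, not existing kernel code):

static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                /* curr has used up its slice, reschedule */
                resched_curr(rq_of(cfs_rq));
                clear_buddies(cfs_rq, curr);
                return;
        }

        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0) {
                trace_printk("curr still behind leftmost: delta=%lld\n", delta);
                return;
        }

        if (delta > ideal_runtime) {
                trace_printk("tick preemption: delta=%lld ideal=%lu\n",
                             delta, ideal_runtime);
                resched_curr(rq_of(cfs_rq));
        }
}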

[...]

2020-08-12 03:20:58

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

Hi,

> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>
>>>>>>
>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
>>> Because of this very small weight (weight=3), compared to a SCHED_NORMAL
>>> nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge
>>> se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task).
>>> This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task.
>> Could you please explain how the huge penalization of vruntime(1024/3) could
>> make sure SCHED_IDLE not tick preempting SCHED_NORMAL nice 0 task?
>>
>> Thanks a lot.
>
> Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
> single CPU and trace_printk the conditions 'if (delta < 0)' and ' if
> (delta > ideal_runtime)' in check_preempt_tick().
>
> Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
> the niceness of the 2 SCHED_OTHER task to 19 to see some differences in
> the kernelshark's task layout.
>
> rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
> craft those artificial use cases.
With rt-app tool, sched_switch traced by ftrace, the result is as what I expected,

** 1normal+1idle: idle preempt normal every 200ms **
<...>-92016 [002] d... 2398066.902477: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.902527: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.922472: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.922522: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.942292: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.942343: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.962331: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120

** 2normal+1idle: idle preempt normal every 600+ms **
<...>-49009 [002] d... 2400562.746640: sched_switch: prev_comm=normal0-0 prev_pid=49009 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400562.747502: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=S ==> next_comm=normal1-0 next_pid=198658 next_prio=120
<...>-198658 [002] d... 2400563.335262: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400563.336258: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=120
<...>-198658 [002] d... 2400564.017663: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400564.018661: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=120
<...>-198658 [002] d... 2400564.701063: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120

** 3normal+idle: idle preempt normal every 1000+ms **
<...>-198658 [002] d... 2400415.780701: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400415.781699: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal2-0 next_pid=46478 next_prio=120
<...>-49009 [002] d... 2400416.806298: sched_switch: prev_comm=normal0-0 prev_pid=49009 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400416.807297: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal2-0 next_pid=46478 next_prio=120
<...>-198658 [002] d... 2400417.826910: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400417.827911: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal2-0 next_pid=46478 next_prio=120
<...>-49009 [002] d... 2400418.857497: sched_switch: prev_comm=normal0-0 prev_pid=49009 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120

** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
<...>-187466 [002] d... 2401740.134249: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=139
<...>-198658 [002] d... 2401740.162182: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=139 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2401740.165177: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=139
<...>-49009 [002] d... 2401740.193110: sched_switch: prev_comm=normal0-0 prev_pid=49009 prev_prio=139 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2401740.196104: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal1-0 next_pid=198658 next_prio=139
<...>-198658 [002] d... 2401740.228029: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=139 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2401740.231022: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=139
<...>-198658 [002] d... 2401740.262946: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=139 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120

SCHED_IDLE tasks do tick-preempt only rarely, but that cannot be avoided as long as they have a weight.

I wonder if the result is what you expected? :)

Thanks a lot.
Regards,
Jiang

>
> [...]

2020-08-12 18:42:58

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>
>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>> From: Jiang Biao <[email protected]>

[...]

>> Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
>> single CPU and trace_printk the conditions 'if (delta < 0)' and ' if
>> (delta > ideal_runtime)' in check_preempt_tick().
>>
>> Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
>> the niceness of the 2 SCHED_OTHER task to 19 to see some differences in
>> the kernelshark's task layout.
>>
>> rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
>> craft those artificial use cases.
> With rt-app tool, sched_switch traced by ftrace, the result is as what I expected,

I use:

{
    "tasks" : {
        "task_other" : {
            "instance" : 2,
            "loop" : 200,
            "policy" : "SCHED_OTHER",
            "run" : 8000,
            "timer" : { "ref" : "unique1" , "period" : 16000, "mode" : "absolute" },
            "priority" : 0
        },
        "task_idle" : {
            "instance" : 1,
            "loop" : 200,
            "policy" : "SCHED_IDLE",
            "run" : 8000,
            "timer" : { "ref" : "unique2" , "period" : 16000, "mode" : "absolute" }
        }
    },
    "global" : {
        "calibration" : 243, <-- Has to be calibrated against the CPU you run on !!!
        "default_policy" : "SCHED_OTHER",
        "duration" : -1
    }
}

to have 2 (periodic) SCHED_OTHER and 1 SCHED_IDLE task.

> ** 2normal+1idle: idle preempt normal every 600+ms **

During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
switched in only once, after ~2.5s.

> ** 3normal+idle: idle preempt normal every 1000+ms **

Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
in behavior.

> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **

During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
switched in every ~45ms.

[...]

2020-08-13 23:57:14

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

Hi,

> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>
> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
>>> Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
>>> single CPU and trace_printk the conditions 'if (delta < 0)' and ' if
>>> (delta > ideal_runtime)' in check_preempt_tick().
>>>
>>> Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
>>> the niceness of the 2 SCHED_OTHER task to 19 to see some differences in
>>> the kernelshark's task layout.
>>>
>>> rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
>>> craft those artificial use cases.
>> With rt-app tool, sched_switch traced by ftrace, the result is as what I expected,
>
> I use:
>
> {
> "tasks" : {
> "task_other" : {
> "instance" : 2,
> "loop" : 200,
> "policy" : "SCHED_OTHER",
> "run" : 8000,
> "timer" : { "ref" : "unique1" , "period" : 16000, "mode" : "absolute" },
> "priority" : 0
> },
> "task_idle" : {
> "instance" : 1,
> "loop" : 200,
> "policy" : "SCHED_IDLE",
> "run" : 8000,
> "timer" : { "ref" : "unique2" , "period" : 16000, "mode" : "absolute" }
> }
> },
> "global" : {
> "calibration" : 243, <-- Has to be calibrated against the CPU you run on !!!
> "default_policy" : "SCHED_OTHER",
> "duration" : -1
> }
> }
>
> to have 2 (periodic) SCHED_OTHER and 1 SCHED_IDLE task.
>
>> ** 2normal+1idle: idle preempt normal every 600+ms **
>
> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
> switched in only once, after ~2.5s.
Use your config with loop increased from 200 to 2000, to observe longer,

<...>-37620 [002] d... 47950.446191: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=S ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37619 [002] d... 47955.687709: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
// The first preemption interval is 5.2s.
<...>-37620 [002] d... 47956.375716: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37619 [002] d... 47957.060722: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47957.747728: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47958.423734: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47959.119740: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
// After the first preemption, the rest preemption intervals are all about 600ms+. :)

>
>> ** 3normal+idle: idle preempt normal every 1000+ms **
>
> Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
> in behavior.
With 3 SCHED_OTHER tasks only, the SCHED_OTHER task is switched in
every 27ms.

>
>> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
>
> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
> switched in every ~45ms.
That’s as what I expected. :)

Thx.
Regards,
Jiang
>
> [...]

2020-08-17 09:00:55

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> Hi,
>
>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>
>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>> From: Jiang Biao <[email protected]>

[...]

>>> ** 2normal+1idle: idle preempt normal every 600+ms **
>>
>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>> switched in only once, after ~2.5s.
> Use your config with loop increased from 200 to 2000, to observe longer,
>
> <...>-37620 [002] d... 47950.446191: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=S ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> <...>-37619 [002] d... 47955.687709: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> // The first preemption interval is 5.2s.
> <...>-37620 [002] d... 47956.375716: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> <...>-37619 [002] d... 47957.060722: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> <...>-37620 [002] d... 47957.747728: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> <...>-37620 [002] d... 47958.423734: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> <...>-37620 [002] d... 47959.119740: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
> // After the first preemption, the rest preemption intervals are all about 600ms+. :)

Are you sure about this?

The math is telling me for the:

idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms

normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms

(4ms - 250 Hz)

>>> ** 3normal+idle: idle preempt normal every 1000+ms **
>>
>> Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
>> in behavior.
> With 3 SCHED_OTHER tasks only, the SCHED_OTHER task is switched in
> every 27ms.

normal task: (1024 / (1024 + 1024 + 1024))^(-1) * 4ms = 12ms

>>> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
>>
>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>> switched in every ~45ms.
> That’s as what I expected. :)

idle task: (3 / (15 + 15 + 3))^(-1) * 4ms = 44ms

normal task: (15 / (15 + 15 + 3))^(-1) * 4ms = 9ms

2020-08-17 12:07:34

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)



> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
>>>> ** 2normal+1idle: idle preempt normal every 600+ms **
>>>
>>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>>> switched in only once, after ~2.5s.
>> Use your config with loop increased from 200 to 2000, to observe longer,
>>
>> <...>-37620 [002] d... 47950.446191: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=S ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37619 [002] d... 47955.687709: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> // The first preemption interval is 5.2s.
>> <...>-37620 [002] d... 47956.375716: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37619 [002] d... 47957.060722: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47957.747728: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47958.423734: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47959.119740: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> // After the first preemption, the rest preemption intervals are all about 600ms+. :)
>
> Are you sure about this?
Yes. :)
>
> The math is telling me for the:
>
> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>
> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>
> (4ms - 250 Hz)
My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)

Thx.
Regards,
Jiang
>
>>>> ** 3normal+idle: idle preempt normal every 1000+ms **
>>>
>>> Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
>>> in behavior.
>> With 3 SCHED_OTHER tasks only, the SCHED_OTHER task is switched in
>> every 27ms.
>
> normal task: (1024 / (1024 + 1024 + 1024))^(-1) * 4ms = 12ms
>
>>>> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
>>>
>>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>>> switched in every ~45ms.
>> That’s as what I expected. :)
>
> idle task: (3 / (15 + 15 + 3))^(-1) * 4ms = 44ms
>
> normal task: (15 / (15 + 15 + 3))^(-1) * 4ms = 9ms

2020-08-19 10:47:53

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>
>
>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>> Hi,
>>>
>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>
>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>> From: Jiang Biao <[email protected]>

[...]

>> Are you sure about this?
> Yes. :)
>>
>> The math is telling me for the:
>>
>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>
>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>
>> (4ms - 250 Hz)
> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)

OK, I see.

But here the different sched slices (check_preempt_tick()->
sched_slice()) between normal tasks and the idle task play a role too.

Normal tasks get ~3ms whereas the idle task gets <0.01ms.

So the idle task runs every ~680ms but only for 1 tick (1ms) (4 times
less than the normal tasks). The condition 'if (delta_exec >
ideal_runtime)' in check_preempt_tick() is only true at the 4th tick
when a normal task runs even though the slice is 3ms.

In the 250 Hz example the sched slice diffs are hidden behind the 4ms tick.
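
Putting the numbers together, a rough user-space sketch of that arithmetic
(assuming sysctl_sched_latency = 6ms on a single CPU and that the idle task
always runs for one full tick when it is picked):

#include <stdio.h>

int main(void)
{
        double latency_ms = 6.0;                 /* single-CPU default */
        double w_normal = 1024.0, w_idle = 3.0;  /* 2 nice-0 tasks + 1 SCHED_IDLE task */
        double w_total = 2 * w_normal + w_idle;

        /* sched_slice() ~= latency * weight / total weight (3 tasks < sched_nr_latency) */
        printf("slice(normal) = %.3f ms\n", latency_ms * w_normal / w_total);
        printf("slice(idle)   = %.4f ms\n", latency_ms * w_idle / w_total);

        /*
         * The idle task runs for about one tick each time it is picked, so it
         * gets picked roughly every total/idle-weight ticks.
         */
        printf("idle interval @ 250 Hz  (4ms tick): ~%.0f ms\n", w_total / w_idle * 4.0);
        printf("idle interval @ 1000 Hz (1ms tick): ~%.0f ms\n", w_total / w_idle * 1.0);
        return 0;
}

This reproduces the ~3ms vs <0.01ms slices and the ~680ms idle interval seen
with a 1000 Hz tick.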

[...]

2020-08-19 11:06:53

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
>
> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> >
> >
> >> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> >>> Hi,
> >>>
> >>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> >>>>
> >>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> >>>>> Hi,
> >>>>>
> >>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>
> >>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> >>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
> >> Are you sure about this?
> > Yes. :)
> >>
> >> The math is telling me for the:
> >>
> >> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> >>
> >> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> >>
> >> (4ms - 250 Hz)
> > My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>
> OK, I see.
>
> But here the different sched slices (check_preempt_tick()->
> sched_slice()) between normal tasks and the idle task play a role to.
>
> Normal tasks get ~3ms whereas the idle task gets <0.01ms.

In fact that depends on the number of CPUs on the system:
sysctl_sched_latency = 6ms * (1 + ilog(ncpus)). On an 8-core system, a
normal task will run around 12ms in one shot while the idle task still
runs for one tick period.

Also, you can increase the period between 2 runs of the idle task even
more by using cgroups and the minimum shares value: 2
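
A small user-space sketch of that scaling (added here; the kernel caps the
scaling factor at 8 online CPUs), showing how the nice-0 slice grows to ~12ms
on an 8-core system for the 2 nice-0 + 1 SCHED_IDLE case:

#include <stdio.h>

/* integer log2, as used by the latency scaling factor */
static unsigned int ilog2_u(unsigned int x)
{
        unsigned int r = 0;

        while (x >>= 1)
                r++;
        return r;
}

int main(void)
{
        const double w_normal = 1024.0, w_idle = 3.0;
        const double w_total = 2 * w_normal + w_idle;
        unsigned int cpus;

        for (cpus = 1; cpus <= 8; cpus *= 2) {
                double latency_ms = 6.0 * (1 + ilog2_u(cpus));

                printf("%u CPU(s): sched_latency = %2.0f ms, nice-0 slice = %5.2f ms\n",
                       cpus, latency_ms, latency_ms * w_normal / w_total);
        }
        return 0;
}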

>
> So the idle task runs every ~680ms but only for 1 tick (1ms) (4 times
> less than the normal tasks). The condition 'if (delta_exec >
> ideal_runtime)' in check_preempt_tick() is only true at the 4th tick
> when a normal task runs even though the slice is 3ms.
>
> In the 250 Hz example the sched slice diffs are hidden behind the 4ms tick.
>
> [...]

2020-08-19 11:57:31

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On 19/08/2020 13:05, Vincent Guittot wrote:
> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>>
>>>
>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>>> Hi,
>>>>>
>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>
>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>>
>> [...]
>>
>>>> Are you sure about this?
>>> Yes. :)
>>>>
>>>> The math is telling me for the:
>>>>
>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>>
>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>>
>>>> (4ms - 250 Hz)
>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>>
>> OK, I see.
>>
>> But here the different sched slices (check_preempt_tick()->
>> sched_slice()) between normal tasks and the idle task play a role to.
>>
>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>
> In fact that depends on the number of CPUs on the system
> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> normal task will run around 12ms in one shoot and the idle task still
> one tick period

True. This is on a single CPU.

> Also, you can increase even more the period between 2 runs of idle
> task by using cgroups and min shares value : 2

Ah yes, maybe this is what Jiang wants to do then? If his runtime does
not have other requirements preventing this.

[...]

2020-08-19 14:13:48

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)



> On Aug 19, 2020, at 6:46 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>
>>
>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>
> [...]
>
>>> Are you sure about this?
>> Yes. :)
>>>
>>> The math is telling me for the:
>>>
>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>
>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>
>>> (4ms - 250 Hz)
>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>
> OK, I see.
>
> But here the different sched slices (check_preempt_tick()->
> sched_slice()) between normal tasks and the idle task play a role to.
>
> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>
> So the idle task runs every ~680ms but only for 1 tick (1ms) (4 times
> less than the normal tasks). The condition 'if (delta_exec >
> ideal_runtime)' in check_preempt_tick() is only true at the 4th tick
> when a normal task runs even though the slice is 3ms.
>
> In the 250 Hz example the sched slice diffs are hidden behind the 4ms tick.
Exactly. The tick size decides both the single run time and the interval
between two runs of the idle task.
That makes sense for most of the test results above.

Thx.
Regards,
Jiang


2020-08-19 14:31:24

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)



> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
>
> On 19/08/2020 13:05, Vincent Guittot wrote:
>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>>>
>>>>
>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>>>
>>> [...]
>>>
>>>>> Are you sure about this?
>>>> Yes. :)
>>>>>
>>>>> The math is telling me for the:
>>>>>
>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>>>
>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>>>
>>>>> (4ms - 250 Hz)
>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>>>
>>> OK, I see.
>>>
>>> But here the different sched slices (check_preempt_tick()->
>>> sched_slice()) between normal tasks and the idle task play a role to.
>>>
>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>>
>> In fact that depends on the number of CPUs on the system
>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
>> normal task will run around 12ms in one shoot and the idle task still
>> one tick period
>
> True. This is on a single CPU.
Agree. :)

>
>> Also, you can increase even more the period between 2 runs of idle
>> task by using cgroups and min shares value : 2
>
> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> not have other requirements preventing this.
That could work for increasing the period between 2 runs, but I guess it
could not reduce the single run time of the idle task, which means a normal
task could still see 1 tick of scheduling latency because of the idle task.
OTOH, cgroups (shares) could introduce extra complexity. :)

I wonder if there’s any possibility to make SCHED_IDLE tasks’ priorities
absolutely lower than SCHED_NORMAL (SCHED_OTHER), meaning no weights/shares
for them, so that they run only when no other task is runnable.
I guess there may be a priority inversion issue if we do that, but maybe we
could avoid it by load balancing more aggressively, or the priority inversion
could be ignored in some special cases.

Thx.
Regard,
Jiang

>
> [...]

2020-08-19 14:58:08

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
>
>
>
> > On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> >
> > On 19/08/2020 13:05, Vincent Guittot wrote:
> >> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> >>>
> >>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> >>>>
> >>>>
> >>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>
> >>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> >>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> >>>
> >>> [...]
> >>>
> >>>>> Are you sure about this?
> >>>> Yes. :)
> >>>>>
> >>>>> The math is telling me for the:
> >>>>>
> >>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> >>>>>
> >>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> >>>>>
> >>>>> (4ms - 250 Hz)
> >>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> >>>
> >>> OK, I see.
> >>>
> >>> But here the different sched slices (check_preempt_tick()->
> >>> sched_slice()) between normal tasks and the idle task play a role to.
> >>>
> >>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> >>
> >> In fact that depends on the number of CPUs on the system
> >> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> >> normal task will run around 12ms in one shoot and the idle task still
> >> one tick period
> >
> > True. This is on a single CPU.
> Agree. :)
>
> >
> >> Also, you can increase even more the period between 2 runs of idle
> >> task by using cgroups and min shares value : 2
> >
> > Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> > not have other requirements preventing this.
> That could work for increasing the period between 2 runs. But could not
> reduce the single runtime of idle task I guess, which means normal task
> could have 1-tick schedule latency because of idle task.

Yes. An idle task will preempt an always-running task for 1 tick
every 680ms. But you should also keep in mind that a waking normal
task will preempt the idle task immediately, which means that it will
not add scheduling latency to a normal task but will "steal" at most
0.14% of the normal task's throughput (1/680).
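
The immediate wakeup preemption mentioned here comes from this check in
check_preempt_wakeup() (quoted as a reference sketch from the mainline fair.c
of this era; curr is the running task, p the waking one):

        if (test_tsk_need_resched(curr))
                return;

        /* Idle tasks are by definition preempted by non-idle tasks. */
        if (unlikely(task_has_idle_policy(curr)) &&
            likely(!task_has_idle_policy(p)))
                goto preempt;

        /*
         * Batch and idle tasks do not preempt non-idle tasks (their preemption
         * is driven by the tick):
         */
        if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
                return;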

> OTOH, cgroups(shares) could introduce extra complexity. :)
>
> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
> lower than SCHED_NORMAL(OTHER), which means no weights/shares
> for them, and they run only when no other task’s runnable.
> I guess there may be priority inversion issue if we do that. But maybe we

Exactly, that's why we must ensure a minimum running time for sched_idle task

> could avoid it by load-balance more aggressively, or it(priority inversion)
> could be ignored in some special case.
>
> Thx.
> Regard,
> Jiang
>
> >
> > [...]
>

2020-08-20 00:14:41

by benbjiang(蒋彪)

Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)



> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
>
> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
>>
>>
>>
>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 19/08/2020 13:05, Vincent Guittot wrote:
>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>>>>>
>>>>>>
>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>>>>>
>>>>> [...]
>>>>>
>>>>>>> Are you sure about this?
>>>>>> Yes. :)
>>>>>>>
>>>>>>> The math is telling me for the:
>>>>>>>
>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>>>>>
>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>>>>>
>>>>>>> (4ms - 250 Hz)
>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>>>>>
>>>>> OK, I see.
>>>>>
>>>>> But here the different sched slices (check_preempt_tick()->
>>>>> sched_slice()) between normal tasks and the idle task play a role to.
>>>>>
>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>>>>
>>>> In fact that depends on the number of CPUs on the system
>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
>>>> normal task will run around 12ms in one shoot and the idle task still
>>>> one tick period
>>>
>>> True. This is on a single CPU.
>> Agree. :)
>>
>>>
>>>> Also, you can increase even more the period between 2 runs of idle
>>>> task by using cgroups and min shares value : 2
>>>
>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
>>> not have other requirements preventing this.
>> That could work for increasing the period between 2 runs. But could not
>> reduce the single runtime of idle task I guess, which means normal task
>> could have 1-tick schedule latency because of idle task.
>
> Yes. An idle task will preempt an always running task during 1 tick
> every 680ms. But also you should keep in mind that a waking normal
> task will preempt the idle task immediately which means that it will
> not add scheduling latency to a normal task but "steal" 0.14% of
> normal task throughput (1/680) at most
That’s true. But in the VM case, when VMs are busy (MWAIT passthrough
or running CPU-eating workloads), the 1-tick scheduling latency can be
detected by cyclictest running in the VM.

OTOH, we compensate vruntime in place_entity() to boost waking tasks
without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
do that? Like:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
                 vruntime += sched_vslice(cfs_rq, se);

         /* sleeps up to a single latency don't count. */
-        if (!initial) {
+        if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
                 unsigned long thresh = sysctl_sched_latency;

>
>> OTOH, cgroups(shares) could introduce extra complexity. :)
>>
>> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
>> lower than SCHED_NORMAL(OTHER), which means no weights/shares
>> for them, and they run only when no other task’s runnable.
>> I guess there may be priority inversion issue if we do that. But maybe we
>
> Exactly, that's why we must ensure a minimum running time for sched_idle task

Still, for the VM case, different VMs are well isolated from each other, so
the priority inversion issue should be very rare; we’re trying to make offline
tasks absolutely harmless to online tasks. :)
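
For completeness, the cgroup mitigation mentioned above (min shares value of 2)
would look roughly like the sketch below; the cgroup v1 mount point and the
"offline" group name are assumptions made only for illustration.

/* Sketch: give an "offline" cpu cgroup the minimum shares (2), so its tasks
 * get the smallest possible weight relative to online groups. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/cpu/offline/cpu.shares", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "2\n");	/* 2 is the minimum cpu.shares value */
	fclose(f);
	/* Offline task PIDs would then be written to the group's tasks file. */
	return 0;
}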

Thanks a lot for your time.
Regards,
Jiang

>
>> could avoid it by load-balance more aggressively, or it(priority inversion)
>> could be ignored in some special case.
>>
>>>

2020-08-20 07:39:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
>
>
>
> > On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
> >
> > On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
> >>
> >>
> >>
> >>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>
> >>> On 19/08/2020 13:05, Vincent Guittot wrote:
> >>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> >>>>>
> >>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> >>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>>> Are you sure about this?
> >>>>>> Yes. :)
> >>>>>>>
> >>>>>>> The math is telling me for the:
> >>>>>>>
> >>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> >>>>>>>
> >>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> >>>>>>>
> >>>>>>> (4ms - 250 Hz)
> >>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> >>>>>
> >>>>> OK, I see.
> >>>>>
> >>>>> But here the different sched slices (check_preempt_tick()->
> >>>>> sched_slice()) between normal tasks and the idle task play a role to.
> >>>>>
> >>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> >>>>
> >>>> In fact that depends on the number of CPUs on the system
> >>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> >>>> normal task will run around 12ms in one shoot and the idle task still
> >>>> one tick period
> >>>
> >>> True. This is on a single CPU.
> >> Agree. :)
> >>
> >>>
> >>>> Also, you can increase even more the period between 2 runs of idle
> >>>> task by using cgroups and min shares value : 2
> >>>
> >>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> >>> not have other requirements preventing this.
> >> That could work for increasing the period between 2 runs. But could not
> >> reduce the single runtime of idle task I guess, which means normal task
> >> could have 1-tick schedule latency because of idle task.
> >
> > Yes. An idle task will preempt an always running task during 1 tick
> > every 680ms. But also you should keep in mind that a waking normal
> > task will preempt the idle task immediately which means that it will
> > not add scheduling latency to a normal task but "steal" 0.14% of
> > normal task throughput (1/680) at most
> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
> or running CPU-eating workloads), the 1-tick scheduling latency could be
> detected by cyclictest running inside the VM.
>
> OTOH, we compensate vruntime in place_entity() to boost waking tasks
> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
> do that? Like:
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> vruntime += sched_vslice(cfs_rq, se);
>
> /* sleeps up to a single latency don't count. */
> - if (!initial) {
> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
> unsigned long thresh = sysctl_sched_latency;

Yeah, this is a good improvement.
Does it solve your problem ?

>
> >
> >> OTOH, cgroups(shares) could introduce extra complexity. :)
> >>
> >> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
> >> lower than SCHED_NORMAL(OTHER), which means no weights/shares
> >> for them, and they run only when no other task’s runnable.
> >> I guess there may be priority inversion issue if we do that. But maybe we
> >
> > Exactly, that's why we must ensure a minimum running time for sched_idle task
>
> Still, for the VM case, different VMs are well isolated from each other, so
> the priority inversion issue should be very rare; we’re trying to make offline
> tasks absolutely harmless to online tasks. :)
>
> Thanks a lot for your time.
> Regards,
> Jiang
>
> >
> >> could avoid it by load-balance more aggressively, or it(priority inversion)
> >> could be ignored in some special case.
> >>
> >>>
>

2020-08-20 11:32:58

by benbjiang(蒋彪)

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)



> On Aug 20, 2020, at 3:35 PM, Vincent Guittot <[email protected]> wrote:
>
> On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
>>
>>
>>
>>> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
>>>
>>> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 19/08/2020 13:05, Vincent Guittot wrote:
>>>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>>> Are you sure about this?
>>>>>>>> Yes. :)
>>>>>>>>>
>>>>>>>>> The math is telling me for the:
>>>>>>>>>
>>>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>>>>>>>>>
>>>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>>>>>>>>>
>>>>>>>>> (4ms - 250 Hz)
>>>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
>>>>>>>
>>>>>>> OK, I see.
>>>>>>>
>>>>>>> But here the different sched slices (check_preempt_tick()->
>>>>>>> sched_slice()) between normal tasks and the idle task play a role to.
>>>>>>>
>>>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>>>>>>
>>>>>> In fact that depends on the number of CPUs on the system
>>>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
>>>>>> normal task will run around 12ms in one shoot and the idle task still
>>>>>> one tick period
>>>>>
>>>>> True. This is on a single CPU.
>>>> Agree. :)
>>>>
>>>>>
>>>>>> Also, you can increase even more the period between 2 runs of idle
>>>>>> task by using cgroups and min shares value : 2
>>>>>
>>>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
>>>>> not have other requirements preventing this.
>>>> That could work for increasing the period between 2 runs. But could not
>>>> reduce the single runtime of idle task I guess, which means normal task
>>>> could have 1-tick schedule latency because of idle task.
>>>
>>> Yes. An idle task will preempt an always running task during 1 tick
>>> every 680ms. But also you should keep in mind that a waking normal
>>> task will preempt the idle task immediately which means that it will
>>> not add scheduling latency to a normal task but "steal" 0.14% of
>>> normal task throughput (1/680) at most
>> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
>> or running CPU-eating workloads), the 1-tick scheduling latency could be
>> detected by cyclictest running inside the VM.
>>
>> OTOH, we compensate vruntime in place_entity() to boost waking tasks
>> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
>> do that? Like:
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>> vruntime += sched_vslice(cfs_rq, se);
>>
>> /* sleeps up to a single latency don't count. */
>> - if (!initial) {
>> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
>> unsigned long thresh = sysctl_sched_latency;
>
> Yeah, this is a good improvement.
Thanks, I’ll send a patch for that. :)

> Does it solve your problem ?
>
Not exactly. :) I wonder if we can make SCHED_IDLE more pure(harmless)?
Or introduce a switch(or flag) to control it, and make it available for cases like us.
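
As a rough illustration of such a switch, the place_entity() change above could
be gated behind a scheduler feature flag, toggleable at runtime through
/sys/kernel/debug/sched_features. SKIP_IDLE_WAKE_BONUS is a made-up name and
this is only a sketch, not a tested patch:

/* kernel/sched/features.h */
SCHED_FEAT(SKIP_IDLE_WAKE_BONUS, false)

/* kernel/sched/fair.c, place_entity(), following the proposal above:
 * only skip the wakeup bonus for SCHED_IDLE tasks when the feature is set. */
if (!initial && !(sched_feat(SKIP_IDLE_WAKE_BONUS) &&
		  task_has_idle_policy(task_of(se)))) {
	unsigned long thresh = sysctl_sched_latency;
	...
}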

Thanks a lot.
Regards,
Jiang

>>
>>>
>>>> OTOH, cgroups(shares) could introduce extra complexity. :)
>>>>
>>>> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
>>>> lower than SCHED_NORMAL(OTHER), which means no weights/shares
>>>> for them, and they run only when no other task’s runnable.
>>>> I guess there may be priority inversion issue if we do that. But maybe we
>>>
>>> Exactly, that's why we must ensure a minimum running time for sched_idle task
>>
>> Still, for the VM case, different VMs are well isolated from each other, so
>> the priority inversion issue should be very rare; we’re trying to make offline
>> tasks absolutely harmless to online tasks. :)
>>
>> Thanks a lot for your time.
>> Regards,
>> Jiang
>>
>>>
>>>> could avoid it by load-balance more aggressively, or it(priority inversion)
>>>> could be ignored in some special case.

2020-08-20 12:48:06

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Thu, 20 Aug 2020 at 13:28, benbjiang(蒋彪) <[email protected]> wrote:
>
>
>
> > On Aug 20, 2020, at 3:35 PM, Vincent Guittot <[email protected]> wrote:
> >
> > On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
> >>
> >>
> >>
> >>> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
> >>>
> >>> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>
> >>>>> On 19/08/2020 13:05, Vincent Guittot wrote:
> >>>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> >>>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> >>>>>>>
> >>>>>>> [...]
> >>>>>>>
> >>>>>>>>> Are you sure about this?
> >>>>>>>> Yes. :)
> >>>>>>>>>
> >>>>>>>>> The math is telling me for the:
> >>>>>>>>>
> >>>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> >>>>>>>>>
> >>>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> >>>>>>>>>
> >>>>>>>>> (4ms - 250 Hz)
> >>>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> >>>>>>>
> >>>>>>> OK, I see.
> >>>>>>>
> >>>>>>> But here the different sched slices (check_preempt_tick()->
> >>>>>>> sched_slice()) between normal tasks and the idle task play a role to.
> >>>>>>>
> >>>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> >>>>>>
> >>>>>> In fact that depends on the number of CPUs on the system
> >>>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> >>>>>> normal task will run around 12ms in one shoot and the idle task still
> >>>>>> one tick period
> >>>>>
> >>>>> True. This is on a single CPU.
> >>>> Agree. :)
> >>>>
> >>>>>
> >>>>>> Also, you can increase even more the period between 2 runs of idle
> >>>>>> task by using cgroups and min shares value : 2
> >>>>>
> >>>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> >>>>> not have other requirements preventing this.
> >>>> That could work for increasing the period between 2 runs. But could not
> >>>> reduce the single runtime of idle task I guess, which means normal task
> >>>> could have 1-tick schedule latency because of idle task.
> >>>
> >>> Yes. An idle task will preempt an always running task during 1 tick
> >>> every 680ms. But also you should keep in mind that a waking normal
> >>> task will preempt the idle task immediately which means that it will
> >>> not add scheduling latency to a normal task but "steal" 0.14% of
> >>> normal task throughput (1/680) at most
> >> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
> >> or running CPU-eating workloads), the 1-tick scheduling latency could be
> >> detected by cyclictest running inside the VM.
> >>
> >> OTOH, we compensate vruntime in place_entity() to boost waking tasks
> >> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
> >> do that? Like:
> >>
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >> vruntime += sched_vslice(cfs_rq, se);
> >>
> >> /* sleeps up to a single latency don't count. */
> >> - if (!initial) {
> >> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
> >> unsigned long thresh = sysctl_sched_latency;
> >
> > Yeah, this is a good improvement.
> Thanks, I’ll send a patch for that. :)
>
> > Does it solve your problem ?
> >
> Not exactly. :) I wonder if we can make SCHED_IDLE more pure(harmless)?

We can't prevent it from running from time to time. The proxy execution
feature could be a step toward relaxing this constraint.

> Or introduce a switch(or flag) to control it, and make it available for cases like us.
>
> Thanks a lot.
> Regards,
> Jiang
>
> >>
> >>>
> >>>> OTOH, cgroups(shares) could introduce extra complexity. :)
> >>>>
> >>>> I wonder if there’s any possibility to make SCHED_IDLEs’ priorities absolutely
> >>>> lower than SCHED_NORMAL(OTHER), which means no weights/shares
> >>>> for them, and they run only when no other task’s runnable.
> >>>> I guess there may be priority inversion issue if we do that. But maybe we
> >>>
> >>> Exactly, that's why we must ensure a minimum running time for sched_idle task
> >>
> >> Still, for the VM case, different VMs are well isolated from each other, so
> >> the priority inversion issue should be very rare; we’re trying to make offline
> >> tasks absolutely harmless to online tasks. :)
> >>
> >> Thanks a lot for your time.
> >> Regards,
> >> Jiang
> >>
> >>>
> >>>> could avoid it by load-balance more aggressively, or it(priority inversion)
> >>>> could be ignored in some special case.
>

2020-08-20 14:31:56

by Jiang Biao

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Thu, 20 Aug 2020 at 20:46, Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 20 Aug 2020 at 13:28, benbjiang(蒋彪) <[email protected]> wrote:
> >
> >
> >
> > > On Aug 20, 2020, at 3:35 PM, Vincent Guittot <[email protected]> wrote:
> > >
> > > On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >>> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
> > >>>
> > >>> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>
> > >>>>> On 19/08/2020 13:05, Vincent Guittot wrote:
> > >>>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> > >>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> > >>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> > >>>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> > >>>>>>>
> > >>>>>>> [...]
> > >>>>>>>
> > >>>>>>>>> Are you sure about this?
> > >>>>>>>> Yes. :)
> > >>>>>>>>>
> > >>>>>>>>> The math is telling me for the:
> > >>>>>>>>>
> > >>>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> > >>>>>>>>>
> > >>>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> > >>>>>>>>>
> > >>>>>>>>> (4ms - 250 Hz)
> > >>>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> > >>>>>>>
> > >>>>>>> OK, I see.
> > >>>>>>>
> > >>>>>>> But here the different sched slices (check_preempt_tick()->
> > >>>>>>> sched_slice()) between normal tasks and the idle task play a role to.
> > >>>>>>>
> > >>>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> > >>>>>>
> > >>>>>> In fact that depends on the number of CPUs on the system
> > >>>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> > >>>>>> normal task will run around 12ms in one shoot and the idle task still
> > >>>>>> one tick period
> > >>>>>
> > >>>>> True. This is on a single CPU.
> > >>>> Agree. :)
> > >>>>
> > >>>>>
> > >>>>>> Also, you can increase even more the period between 2 runs of idle
> > >>>>>> task by using cgroups and min shares value : 2
> > >>>>>
> > >>>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> > >>>>> not have other requirements preventing this.
> > >>>> That could work for increasing the period between 2 runs. But could not
> > >>>> reduce the single runtime of idle task I guess, which means normal task
> > >>>> could have 1-tick schedule latency because of idle task.
> > >>>
> > >>> Yes. An idle task will preempt an always running task during 1 tick
> > >>> every 680ms. But also you should keep in mind that a waking normal
> > >>> task will preempt the idle task immediately which means that it will
> > >>> not add scheduling latency to a normal task but "steal" 0.14% of
> > >>> normal task throughput (1/680) at most
> > >> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
> > >> or running CPU-eating workloads), the 1-tick scheduling latency could be
> > >> detected by cyclictest running inside the VM.
> > >>
> > >> OTOH, we compensate vruntime in place_entity() to boost waking tasks
> > >> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
> > >> do that? Like:
> > >>
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > >> vruntime += sched_vslice(cfs_rq, se);
> > >>
> > >> /* sleeps up to a single latency don't count. */
> > >> - if (!initial) {
> > >> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
> > >> unsigned long thresh = sysctl_sched_latency;
> > >
> > > Yeah, this is a good improvement.
> > Thanks, I’ll send a patch for that. :)
> >
> > > Does it solve your problem ?
> > >
> > Not exactly. :) I wonder if we can make SCHED_IDLE more pure(harmless)?
>
> We can't prevent it from running from time to time. The proxy execution
> feature could be a step toward relaxing this constraint.
>
Could you please explain a bit more about the *Proxy execution feature*?
I'm not sure I got the point right.

Thanks a lot.
Regards,
Jiang

2020-08-20 14:40:40

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Thu, 20 Aug 2020 at 16:28, Jiang Biao <[email protected]> wrote:
>
> On Thu, 20 Aug 2020 at 20:46, Vincent Guittot
> <[email protected]> wrote:
> >
> > On Thu, 20 Aug 2020 at 13:28, benbjiang(蒋彪) <[email protected]> wrote:
> > >
> > >
> > >
> > > > On Aug 20, 2020, at 3:35 PM, Vincent Guittot <[email protected]> wrote:
> > > >
> > > > On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
> > > >>
> > > >>
> > > >>
> > > >>> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
> > > >>>
> > > >>> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>
> > > >>>>> On 19/08/2020 13:05, Vincent Guittot wrote:
> > > >>>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> > > >>>>>>>>>> Hi,
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> > > >>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> > > >>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> > > >>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> > > >>>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> > > >>>>>>>
> > > >>>>>>> [...]
> > > >>>>>>>
> > > >>>>>>>>> Are you sure about this?
> > > >>>>>>>> Yes. :)
> > > >>>>>>>>>
> > > >>>>>>>>> The math is telling me for the:
> > > >>>>>>>>>
> > > >>>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> > > >>>>>>>>>
> > > >>>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> > > >>>>>>>>>
> > > >>>>>>>>> (4ms - 250 Hz)
> > > >>>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> > > >>>>>>>
> > > >>>>>>> OK, I see.
> > > >>>>>>>
> > > >>>>>>> But here the different sched slices (check_preempt_tick()->
> > > >>>>>>> sched_slice()) between normal tasks and the idle task play a role to.
> > > >>>>>>>
> > > >>>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> > > >>>>>>
> > > >>>>>> In fact that depends on the number of CPUs on the system
> > > >>>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> > > >>>>>> normal task will run around 12ms in one shoot and the idle task still
> > > >>>>>> one tick period
> > > >>>>>
> > > >>>>> True. This is on a single CPU.
> > > >>>> Agree. :)
> > > >>>>
> > > >>>>>
> > > >>>>>> Also, you can increase even more the period between 2 runs of idle
> > > >>>>>> task by using cgroups and min shares value : 2
> > > >>>>>
> > > >>>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> > > >>>>> not have other requirements preventing this.
> > > >>>> That could work for increasing the period between 2 runs. But could not
> > > >>>> reduce the single runtime of idle task I guess, which means normal task
> > > >>>> could have 1-tick schedule latency because of idle task.
> > > >>>
> > > >>> Yes. An idle task will preempt an always running task during 1 tick
> > > >>> every 680ms. But also you should keep in mind that a waking normal
> > > >>> task will preempt the idle task immediately which means that it will
> > > >>> not add scheduling latency to a normal task but "steal" 0.14% of
> > > >>> normal task throughput (1/680) at most
> > > >> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
> > > >> or running CPU-eating workloads), the 1-tick scheduling latency could be
> > > >> detected by cyclictest running inside the VM.
> > > >>
> > > >> OTOH, we compensate vruntime in place_entity() to boost waking tasks
> > > >> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
> > > >> do that? Like:
> > > >>
> > > >> --- a/kernel/sched/fair.c
> > > >> +++ b/kernel/sched/fair.c
> > > >> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > > >> vruntime += sched_vslice(cfs_rq, se);
> > > >>
> > > >> /* sleeps up to a single latency don't count. */
> > > >> - if (!initial) {
> > > >> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
> > > >> unsigned long thresh = sysctl_sched_latency;
> > > >
> > > > Yeah, this is a good improvement.
> > > Thanks, I’ll send a patch for that. :)
> > >
> > > > Does it solve your problem ?
> > > >
> > > Not exactly. :) I wonder if we can make SCHED_IDLE more pure(harmless)?
> >
> > We can't prevent it from running from time to time. The proxy execution
> > feature could be a step toward relaxing this constraint.
> >
> Could you please explain a bit more about the *Proxy execution feature*?
> I'm not sure I got the point right.

https://lwn.net/Articles/820575/
https://lwn.net/Articles/793502/

>
> Thanks a lot.
> Regards,
> Jiang

2020-08-21 00:15:46

by Jiang Biao

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

On Thu, 20 Aug 2020 at 22:36, Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 20 Aug 2020 at 16:28, Jiang Biao <[email protected]> wrote:
> >
> > On Thu, 20 Aug 2020 at 20:46, Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > On Thu, 20 Aug 2020 at 13:28, benbjiang(蒋彪) <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > On Aug 20, 2020, at 3:35 PM, Vincent Guittot <[email protected]> wrote:
> > > > >
> > > > > On Thu, 20 Aug 2020 at 02:13, benbjiang(蒋彪) <[email protected]> wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On Aug 19, 2020, at 10:55 PM, Vincent Guittot <[email protected]> wrote:
> > > > >>>
> > > > >>> On Wed, 19 Aug 2020 at 16:27, benbjiang(蒋彪) <[email protected]> wrote:
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>> On 19/08/2020 13:05, Vincent Guittot wrote:
> > > > >>>>>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>> On 17/08/2020 14:05, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> On 14/08/2020 01:55, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>>> Hi,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On 12/08/2020 05:19, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On 11/08/2020 02:41, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On 06/08/2020 17:52, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On 01/08/2020 04:32, Jiang Biao wrote:
> > > > >>>>>>>>>>>>>>>>>>>> From: Jiang Biao <[email protected]>
> > > > >>>>>>>
> > > > >>>>>>> [...]
> > > > >>>>>>>
> > > > >>>>>>>>> Are you sure about this?
> > > > >>>>>>>> Yes. :)
> > > > >>>>>>>>>
> > > > >>>>>>>>> The math is telling me for the:
> > > > >>>>>>>>>
> > > > >>>>>>>>> idle task: (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
> > > > >>>>>>>>>
> > > > >>>>>>>>> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
> > > > >>>>>>>>>
> > > > >>>>>>>>> (4ms - 250 Hz)
> > > > >>>>>>>> My tick is 1ms - 1000HZ, which seems reasonable for 600ms? :)
> > > > >>>>>>>
> > > > >>>>>>> OK, I see.
> > > > >>>>>>>
> > > > >>>>>>> But here the different sched slices (check_preempt_tick()->
> > > > >>>>>>> sched_slice()) between normal tasks and the idle task play a role to.
> > > > >>>>>>>
> > > > >>>>>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
> > > > >>>>>>
> > > > >>>>>> In fact that depends on the number of CPUs on the system
> > > > >>>>>> :sysctl_sched_latency = 6ms * (1 + ilog(ncpus)) . On a 8 cores system,
> > > > >>>>>> normal task will run around 12ms in one shoot and the idle task still
> > > > >>>>>> one tick period
> > > > >>>>>
> > > > >>>>> True. This is on a single CPU.
> > > > >>>> Agree. :)
> > > > >>>>
> > > > >>>>>
> > > > >>>>>> Also, you can increase even more the period between 2 runs of idle
> > > > >>>>>> task by using cgroups and min shares value : 2
> > > > >>>>>
> > > > >>>>> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> > > > >>>>> not have other requirements preventing this.
> > > > >>>> That could work for increasing the period between 2 runs. But could not
> > > > >>>> reduce the single runtime of idle task I guess, which means normal task
> > > > >>>> could have 1-tick schedule latency because of idle task.
> > > > >>>
> > > > >>> Yes. An idle task will preempt an always running task during 1 tick
> > > > >>> every 680ms. But also you should keep in mind that a waking normal
> > > > >>> task will preempt the idle task immediately which means that it will
> > > > >>> not add scheduling latency to a normal task but "steal" 0.14% of
> > > > >>> normal task throughput (1/680) at most
> > > > >> That’s true. But in the VM case, when the VMs are busy (MWAIT passthrough
> > > > >> or running CPU-eating workloads), the 1-tick scheduling latency could be
> > > > >> detected by cyclictest running inside the VM.
> > > > >>
> > > > >> OTOH, we compensate vruntime in place_entity() to boost waking tasks
> > > > >> without distinguishing SCHED_IDLE tasks. Do you think it’s necessary to
> > > > >> do that? Like:
> > > > >>
> > > > >> --- a/kernel/sched/fair.c
> > > > >> +++ b/kernel/sched/fair.c
> > > > >> @@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > > > >> vruntime += sched_vslice(cfs_rq, se);
> > > > >>
> > > > >> /* sleeps up to a single latency don't count. */
> > > > >> - if (!initial) {
> > > > >> + if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
> > > > >> unsigned long thresh = sysctl_sched_latency;
> > > > >
> > > > > Yeah, this is a good improvement.
> > > > Thanks, I’ll send a patch for that. :)
> > > >
> > > > > Does it solve your problem ?
> > > > >
> > > > Not exactly. :) I wonder if we can make SCHED_IDLE more pure(harmless)?
> > >
> > > We can't prevent it from running from time to time. The proxy execution
> > > feature could be a step toward relaxing this constraint.
> > >
> > Could you please explain a bit more about the *Proxy execution feature*?
> > I'm not sure I got the point right.
>
> https://lwn.net/Articles/820575/
> https://lwn.net/Articles/793502/
Good to hear about that. I guess there would still be a long way to go. :)

Thanks again for your kind patience.
Regards,
Jiang