From: Byungchul Park <[email protected]>
Hello all,
I asked the question below in the last version (v2) of this patch.
***
sysctl_sched_min_granularity must be defined clearly first; only after
defining it clearly can we decide how the code should work. The definition
can be either case 1 or case 2 below.
Case 1: every task must get a slice of at least sysctl_sched_min_granularity,
which is currently 0.75ms. In this case, increasing the number of tasks on a rq
stretches the whole latency, which most of you dislike because it can stretch
the whole latency too much. But it looks normal to me, since the same thing
already happens in the !CONFIG_FAIR_GROUP_SCHED world with a large number of
tasks. I wonder why the CONFIG_FAIR_GROUP_SCHED world must behave differently
from the !CONFIG_FAIR_GROUP_SCHED world? Anyway...
Case 2: a task's slice can be much smaller than sysctl_sched_min_granularity,
depending on its position in the hierarchy. If a rq has 8 equally weighted sched
entities, each of those again has 8 equally weighted sched entities, and you
nest one level more, then a task can end up with a very small slice,
e.g. 0.75ms / 64 ~ 0.01ms. Adding more cgroup levels makes it worse. In this
situation, the context-switching overhead becomes very large. What does
sysctl_sched_min_granularity even mean here? Anyway...
I am not sure which of case 1 and case 2 is the right definition of
sysctl_sched_min_granularity. What do you think about this?
***
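To make the two cases above concrete, here is a rough standalone sketch. The
8-wide fanout, the two nested levels, and the 0.75ms minimum granularity are
just the illustrative numbers from the question, and the arithmetic is a
simplification, not the exact kernel computation:

#include <stdio.h>

/*
 * Rough illustration of case 1 vs. case 2: 8 equally weighted entities
 * per level, two nested levels (64 leaf tasks), and a minimum
 * granularity of 0.75ms.  Purely illustrative numbers.
 */
int main(void)
{
    double min_gran_ms = 0.75;          /* sysctl_sched_min_granularity */
    int fanout = 8, levels = 2;
    int leaf_tasks = 1;
    double case2_slice_ms = min_gran_ms;

    for (int i = 0; i < levels; i++) {
        leaf_tasks *= fanout;
        case2_slice_ms /= fanout;       /* case 2: the slice shrinks per level */
    }

    /*
     * Case 1: every leaf still runs at least 0.75ms, so one full round
     * over all leaves stretches the overall scheduling latency instead.
     */
    printf("case 1: %d tasks x %.2fms = %.1fms round-trip latency\n",
           leaf_tasks, min_gran_ms, leaf_tasks * min_gran_ms);
    printf("case 2: per-task slice shrinks to ~%.4fms\n", case2_slice_ms);
    return 0;
}

Run as is, it prints about 48ms of round-trip latency for case 1 and a
~0.012ms slice for case 2, which is the trade-off the question is about.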
I wrote this v3 patch based on case 1, assuming case 1 is right.
If case 2 is right, then the modifications in check_preempt_tick() should
be ignored.
Doesn't it make sense?
Thank you,
Byungchul
---------------->8----------------
From 7ebce566af9b952d24494cd1258b481ec6639cc1 Mon Sep 17 00:00:00 2001
From: Byungchul Park <[email protected]>
Date: Sun, 19 Jul 2015 17:11:37 +0900
Subject: [PATCH v3] sched: modify how to compute a slice and check
 preemptibility
Make the cfs scheduler use the rq-level nr_running to compute a period in the
CONFIG_FAIR_GROUP_SCHED case. Using the local cfs_rq's nr_running to get the
period is very odd. For example, imagine the cgroup structure below.
root(=rq.cfs)--group1----a
                     |---b
                     |---c
                     |---d
                     |---e
                     |---f
                     |---g
                     |---h
                     |---i
                     |---j
                     |---k
                     |---l
                     |---m
In this case, with the current code, group1's slice is not comparable to
(a's slice + ... + m's slice). It makes the code using sum_exec_runtime odd,
too. This happens because the current code does not use a consistent, rq-wide
quantity to derive an rq-wide period.
In addition, modify the preemption-check code to ensure that a sched entity
runs for at least sysctl_sched_min_granularity before it can be preempted.
Signed-off-by: Byungchul Park <[email protected]>
---
kernel/sched/fair.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09456fc..41c619f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -635,7 +635,7 @@ static u64 __sched_period(unsigned long nr_running)
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+ u64 slice = __sched_period(rq_of(cfs_rq)->cfs.nr_running + !se->on_rq);
for_each_sched_entity(se) {
struct load_weight *load;
@@ -3226,6 +3226,12 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
struct sched_entity *se;
s64 delta;
+ /*
+ * Ensure that a task executes at least for sysctl_sched_min_granularity
+ */
+ if (delta_exec < sysctl_sched_min_granularity)
+ return;
+
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
@@ -3243,9 +3249,6 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
- if (delta_exec < sysctl_sched_min_granularity)
- return;
-
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
--
1.7.9.5
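For reference, here is a minimal userspace model of how the pre-patch
sched_slice() behaves for the hierarchy in the commit message: the root cfs_rq
holds only group1 and group1 holds the 13 equally weighted tasks a..m. The 6ms
latency, 0.75ms minimum granularity, and nr_latency of 8 are the usual
defaults, assumed here purely for illustration; this is a simplified model,
not the kernel source:

#include <stdio.h>

/*
 * Simplified model of the pre-patch period/slice computation.  The
 * constants are the common defaults (6ms latency, 0.75ms minimum
 * granularity, nr_latency = 8) and are only assumptions for this sketch.
 */
static double sched_period_ms(unsigned long nr_running)
{
    double latency = 6.0, min_gran = 0.75;
    unsigned long nr_latency = 8;

    return nr_running > nr_latency ? min_gran * nr_running : latency;
}

int main(void)
{
    /* group1 is the only entity on the root cfs_rq */
    double group1_slice = sched_period_ms(1);

    /*
     * Each of a..m gets 1/13 of a period sized by group1's *local*
     * nr_running, then scaled by group1's share of the root rq (100%
     * here, since group1 is alone at the top level).
     */
    double task_slice = sched_period_ms(13) / 13.0;
    double sum_children = task_slice * 13.0;

    printf("group1 slice: %.2fms, sum of a..m slices: %.2fms\n",
           group1_slice, sum_children);
    return 0;
}

It prints 6.00ms for group1 against 9.75ms for the sum of its children, which
is exactly the "not comparable" inconsistency the commit message complains
about.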
On Sun, Jul 19, 2015 at 06:11:00PM +0900, [email protected] wrote:
> doesn't it make sense?
No, people have already given you all kinds of reasons why this isn't a
good change. It is also a very invasive change and you have no
performance numbers one way or another.
I'm going to fully ignore this thread from now on.
On Sun, 2015-07-19 at 18:11 +0900, [email protected] wrote:
> @@ -3226,6 +3226,12 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> struct sched_entity *se;
> s64 delta;
>
> + /*
> + * Ensure that a task executes at least for sysctl_sched_min_granularity
> + */
> + if (delta_exec < sysctl_sched_min_granularity)
> + return;
> +
Think about what this does to a low weight task, or any task in a low
weight group. The scheduler equalizes runtimes for a living, there is
no free lunch. Any runtime larger than fair share that you graciously
grant to random task foo doesn't magically appear out of the vacuum, it
comes out of task foo's wallet. If you drag that hard coded minimum down
into the depths of group scheduling, yeah, every task will get a nice
juicy slice of CPU.. eventually, though you may not live to see it.
(yeah, overrun can and will happen at all depths due to tick
granularity, but you guaranteed it, so I inflated severity a bit;)
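(A rough, hypothetical illustration of that point, not taken from this mail:
assume one nice-0 task of weight 1024 and one nice-19 task of weight 15
sharing a 6ms period. The low-weight task's fair share per period is tiny, so
any hard 0.75ms minimum has to be paid back over many periods.)

#include <stdio.h>

/*
 * Back-of-the-envelope sketch of the "no free lunch" argument, assuming
 * two runnable tasks on one cfs_rq: a nice-0 task (weight 1024) and a
 * nice-19 task (weight 15), with a 6ms period.  Everything here is a
 * deliberate simplification of what CFS actually does.
 */
int main(void)
{
    double period_ms = 6.0;
    double w_normal = 1024.0, w_low = 15.0;
    double fair_low_ms = period_ms * w_low / (w_normal + w_low);

    printf("fair share of the nice-19 task: %.3fms per %.1fms period\n",
           fair_low_ms, period_ms);

    /*
     * Forcing a hard 0.75ms minimum run overruns that fair share, and
     * fairness means the task is then kept off the CPU for roughly this
     * many periods before it runs again.
     */
    double min_gran_ms = 0.75;
    printf("periods needed to pay back one forced %.2fms run: ~%.1f\n",
           min_gran_ms, min_gran_ms / fair_low_ms);
    return 0;
}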
> ideal_runtime = sched_slice(cfs_rq, curr);
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> if (delta_exec > ideal_runtime) {
> @@ -3243,9 +3249,6 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> * narrow margin doesn't have to wait for a full slice.
> * This also mitigates buddy induced latencies under load.
> */
> - if (delta_exec < sysctl_sched_min_granularity)
> - return;
> -
That was about something entirely different. Feel free to remove it
after verifying that it has outlived its original purpose, but please
don't just move it about at random.
-Mike
On Sun, Jul 19, 2015 at 01:15:09PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 19, 2015 at 06:11:00PM +0900, [email protected] wrote:
> > doesn't it make sense?
>
> No, people have already given you all kinds of reasons why this isn't a
I feel sorry, but all kinds? I got only one reason, namely the latency problem,
and I already described both cases because of that feedback. Furthermore,
nobody has given me a reason why the code should use the local cfs_rq's
nr_running to get the period in the first place.
> good change. It is also a very invasive change and you have no
> performance numbers one way or another.
>
> I'm going to fully ignore this thread from now on.
Sorry for bothering you.
On Sun, Jul 19, 2015 at 01:57:14PM +0200, Mike Galbraith wrote:
> On Sun, 2015-07-19 at 18:11 +0900, [email protected] wrote:
>
> > @@ -3226,6 +3226,12 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > struct sched_entity *se;
> > s64 delta;
> >
> > + /*
> > + * Ensure that a task executes at least for sysctl_sched_min_granularity
> > + */
> > + if (delta_exec < sysctl_sched_min_granularity)
> > + return;
> > +
>
> Think about what this does to a low weight task, or any task in a low
> weight group. The scheduler equalizes runtimes for a living, there is
> no free lunch. Any runtime larger than fair share that you graciously
> grant to random task foo doesn't magically appear out of the vacuum, it
> comes out of task foo's wallet. If you drag that hard coded minimum down
> into the depths of group scheduling, yeah, every task will get a nice
> juicy slice of CPU.. eventually, though you may not live to see it.
Hello Mike,
Then I will not raise the question about guaranteeing a minimum slice any more;
case 2 must be the one to take.
>
> (yeah, overrun can and will happen at all depths due to tick
> granularity, but you guaranteed it, so I inflated severity a bit;)
Yes, I also think a preemption granularity has little meaning in practice,
actually because of the tick granularity. So, to be honest, my attempt is a
rather trivial one; it just tries to fix code that looks wrong.
>
> > ideal_runtime = sched_slice(cfs_rq, curr);
> > delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> > if (delta_exec > ideal_runtime) {
> > @@ -3243,9 +3249,6 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > * narrow margin doesn't have to wait for a full slice.
> > * This also mitigates buddy induced latencies under load.
> > */
> > - if (delta_exec < sysctl_sched_min_granularity)
> > - return;
> > -
>
> That was about something entirely different. Feel free to remove it
> after verifying that it has outlived it's original purpose, but please
> don't just move it about at random.
Yes, I will not try to guarantee a minimum preemption granularity any more.
If I have to choose case 2, I would rather remove it.
Thank you,
Byungchul
>
> -Mike
>
On Mon, 2015-07-20 at 09:34 +0900, Byungchul Park wrote:
> Yes, I also think a preemption granularity has little meaning in practice,
> actually because of the tick granularity.
See HR_TICK. It's not cheap though, which is why it's off by default.
-Mike
On Mon, Jul 20, 2015 at 05:42:38AM +0200, Mike Galbraith wrote:
> On Mon, 2015-07-20 at 09:34 +0900, Byungchul Park wrote:
>
> > Yes, I also think a preemption granularity has little meaning in practice,
> > actually because of the tick granularity.
>
> See HR_TICK. It's not cheap though, which is why it's off by default.
Yes, I thought so. Now, thanks to you, I am sure.
Thank you,
Byungchul
>
> -Mike
>