2021-05-19 18:08:09

by Odin Ugedal

Subject: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

Make sure that the cfs_rq no longer contributes to the task group load avg
when checking whether it is decayed. Due to how PELT tracking works, a
change of the PELT divider can result in a situation where:

cfs_rq->avg.load_sum = 0
cfs_rq->avg.load_avg = 4
cfs_rq->avg.tg_load_avg_contrib = 4

If PELT tracking in this case does not cross a period boundary, there is no
"change" in load_sum, so load_avg is not recalculated and keeps its stale
value.
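
To illustrate the relation (a minimal user-space sketch, assuming the
v5.12-era get_pelt_divider() formula LOAD_AVG_MAX - 1024 + period_contrib;
this is an illustration, not kernel code):

	#include <stdio.h>

	#define LOAD_AVG_MAX 47742	/* maximum attainable PELT sum */

	static unsigned long get_pelt_divider(unsigned long period_contrib)
	{
		return LOAD_AVG_MAX - 1024 + period_contrib;
	}

	int main(void)
	{
		/* a load_avg of 4 corresponds to this load_sum: */
		unsigned long load_sum = 4 * get_pelt_divider(821);
		unsigned long load_avg = load_sum / get_pelt_divider(821);

		printf("load_sum=%lu load_avg=%lu\n", load_sum, load_avg);

		/*
		 * load_avg is only rebuilt from load_sum when a PELT
		 * period boundary is crossed; if none is crossed before
		 * the cfs_rq leaves the leaf list, load_avg keeps its
		 * stale value of 4 forever.
		 */
		return 0;
	}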

If this cfs_rq is then removed from the leaf list, it results in a
situation where the load is never removed from the tg. If that happens,
the fairness is permanently skewed.

Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24a90b0..ceda53c2a87a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8004,6 +8004,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
if (cfs_rq->avg.runnable_sum)
return false;

+ if (cfs_rq->tg_load_avg_contrib)
+ return false;
+
return true;
}

--
2.31.1



2021-05-25 10:33:49

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Tue, 18 May 2021 at 14:54, Odin Ugedal <[email protected]> wrote:
>
> Make sure that the cfs_rq no longer contributes to the task group load avg
> when checking whether it is decayed. Due to how PELT tracking works, a
> change of the PELT divider can result in a situation where:
>
> cfs_rq->avg.load_sum = 0
> cfs_rq->avg.load_avg = 4

Could you give more details about how cfs_rq->avg.load_avg = 4 but
cfs_rq->avg.load_sum = 0 ?

cfs_rq->avg.load_sum is decayed and can become null when crossing a
period, which implies an update of cfs_rq->avg.load_avg. This means
that your case is generated by something outside the pelt formula ...
like maybe the propagation of load in the tree. If this is the case,
we should find the error and fix it.

> cfs_rq->avg.tg_load_avg_contrib = 4
>
> If PELT tracking in this case does not cross a period boundary, there is no
> "change" in load_sum, so load_avg is not recalculated and keeps its stale
> value.
>
> If this cfs_rq is then removed from the leaf list, it results in a
> situation where the load is never removed from the tg. If that happens,
> the fairness is permanently skewed.
>
> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> Signed-off-by: Odin Ugedal <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3248e24a90b0..ceda53c2a87a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8004,6 +8004,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (cfs_rq->avg.runnable_sum)
> return false;
>
> + if (cfs_rq->tg_load_avg_contrib)
> + return false;
> +
> return true;
> }
>
> --
> 2.31.1
>

2021-05-25 15:08:12

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

Hi,

On Tue, 25 May 2021 at 11:58, Vincent Guittot <[email protected]> wrote:
> Could you give more details about how cfs_rq->avg.load_avg = 4 but
> cfs_rq->avg.load_sum = 0 ?
>
> cfs_rq->avg.load_sum is decayed and can become null when crossing a
> period, which implies an update of cfs_rq->avg.load_avg. This means
> that your case is generated by something outside the pelt formula ...
> like maybe the propagation of load in the tree. If this is the case,
> we should find the error and fix it.

Ahh, yeah, that could probably be described better.

It is (as far as I have found out) because the pelt divider is changed,
and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
in a different value being removed than added.
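
For example (illustrative numbers, using the v5.12-era formula
get_pelt_divider() = LOAD_AVG_MAX - 1024 + period_contrib, with
LOAD_AVG_MAX = 47742): if period_contrib moves from 821 to 223, the
divider shrinks from 47539 to 46941, so a contribution added as X * 47539
but removed as X * 46941 leaves a residue of X * 598 behind.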

Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
recalculates the load_avg based on load_sum, and not the delta, afaik.

And as you say, the "issue" is therefore (as I see it) outside of PELT. Due to
how the PELT divider is changed, I assume it is hard to pinpoint where the
issue is. I can try to find a clear path where we can see what is added
and what is removed from both cfs_rq->avg.load_sum and cfs_rq->avg.load_avg,
to better be able to pinpoint what is happening.

Previously I thought this was a result of precision loss due to division and
multiplication during load add/remove inside fair.c, but I am not sure that
is the issue, or is it?
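
To spell out the kind of precision loss I mean, a tiny standalone example
(illustrative numbers only):

	#include <stdio.h>

	int main(void)
	{
		long weight = 1024, sum = 100000, divider = 46941;

		/* multiply before dividing ... */
		long a = (weight * sum) / divider;	/* 2181 */
		/* ... vs. dividing before multiplying */
		long b = weight * (sum / divider);	/* 2048 */

		printf("a=%ld b=%ld diff=%ld\n", a, b, a - b);
		return 0;
	}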

If my above line of thought makes sense, do you still view this as an error
outside PELT, or do you see another possible/better solution?

Will investigate further.

Thanks
Odin

2021-05-25 17:11:28

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Tue, 25 May 2021 at 12:34, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> On Tue, 25 May 2021 at 11:58, Vincent Guittot <[email protected]> wrote:
> > Could you give more details about how cfs_rq->avg.load_avg = 4 but
> > cfs_rq->avg.load_sum = 0 ?

> >
> > cfs_rq->avg.load_sum is decayed and can become null when crossing a
> > period, which implies an update of cfs_rq->avg.load_avg. This means
> > that your case is generated by something outside the pelt formula ...
> > like maybe the propagation of load in the tree. If this is the case,
> > we should find the error and fix it.
>
> Ahh, yeah, that could probably be described better.
>
> It is (as far as I have found out) because the pelt divider is changed,
> and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
> in a different value being removed than added.

ok so IIUC, it happens during the adding/removing/propagating
entities' load in the cfs_rq.

>
> Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
> recalculates the load_avg based on load_sum, and not the delta, afaik.
>
> And as you say, the "issue" is therefore (as I see it) outside of PELT. Due to
> how the PELT divider is changed, I assume it is hard to pinpoint where the
> issue is. I can try to find a clear path where we can see what is added
> and what is removed from both cfs_rq->avg.load_sum and cfs_rq->avg.load_avg,
> to better be able to pinpoint what is happening.
>
> Previously I thought this was a result of precision loss due to division and
> multiplication during load add/remove inside fair.c, but I am not sure that
> is the issue, or is it?

I don't think the precision loss is the problem because
adding/removing load in fair.c could truncate load_sum but it stays
in sync with load_avg. I will have a look to see if I can see something
weird.

>
> If my above line of thought makes sense, do you still view this as an error
> outside PELT, or do you see another possible/better solution?
>
> Will investigate further.

Thanks

>
> Thanks
> Odin

2021-05-26 10:52:01

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Tue, 25 May 2021 at 16:30, Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 25 May 2021 at 12:34, Odin Ugedal <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, 25 May 2021 at 11:58, Vincent Guittot <[email protected]> wrote:
> > > Could you give more details about how cfs_rq->avg.load_avg = 4 but
> > > cfs_rq->avg.load_sum = 0 ?
>
> > >
> > > cfs_rq->avg.load_sum is decayed and can become null when crossing a
> > > period, which implies an update of cfs_rq->avg.load_avg. This means
> > > that your case is generated by something outside the pelt formula ...
> > > like maybe the propagation of load in the tree. If this is the case,
> > > we should find the error and fix it.
> >
> > Ahh, yeah, that could probably be described better.
> >
> > It is (as far as I have found out) because the pelt divider is changed,
> > and the output from "get_pelt_divider(&cfs_rq->avg)" is changed, resulting
> > in a different value being removed than added.
>
> ok so IIUC, it happens during the adding/removing/propagating
> entities' load in the cfs_rq.
>
> >
> > Inside pelt itself, this cannot happen. When pelt changes the load_sum, it
> > recalculates the load_avg based on load_sum, and not the delta, afaik.
> >
> > And as you say, the "issue" is therefore (as I see it) outside of PELT. Due to
> > how the PELT divider is changed, I assume it is hard to pinpoint where the
> > issue is. I can try to find a clear path where we can see what is added
> > and what is removed from both cfs_rq->avg.load_sum and cfs_rq->avg.load_avg,
> > to better be able to pinpoint what is happening.
> >
> > Previously I thought this was a result of precision loss due to division and
> > multiplication during load add/remove inside fair.c, but I am not sure that
> > is the issue, or is it?
>
> I don't think the precision loss is the problem because
> adding/removing load in fair.c could truncate load_sum but it stays
> in sync with load_avg. I will have a look to see if I can see something
> weird.

I have added a trace in cfs_rq_is_decayed() but I'm not able to
reproduce a situation where load_sum == 0 but not load_avg even with
the script in the cover letter


>
> >
> > If my above line of thought makes sense, do you still view this as an error
> > outside PELT, or do you see another possible/better solution?
> >
> > Will investigate further.
>
> Thanks
>
> >
> > Thanks
> > Odin

2021-05-27 07:56:44

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

Hi,

Did some quick testing with bpftrace, and it looks like it is
"update_tg_cfs_load" that does this:

kretfunc:attach_entity_load_avg: cfs_rq: 0xffff8a3e6773e400,
se.load_sum: 0, se.load_avg: 0, se->load.weight: 1048576 diff.sum: 0,
diff.load: 0, new_sum: 0, new_load: 0, period_contrib: 821

// quite some time passes

kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
0, old load_avg: 0, new load_sum: 47876096, new load_avg: 1022
tg_load_avg_contrib: 0, period_contrib: 82, on_list: 0, throttled: 0/0
stack:
bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
bpf_get_stackid_raw_tp+115
bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
bpf_trampoline_6442466486_1+144
update_tg_cfs_load+5
update_load_avg+450
enqueue_entity+91
enqueue_task_fair+134
move_queued_task+172
affine_move_task+1282
__set_cpus_allowed_ptr+317
update_tasks_cpumask+70
update_cpumasks_hier+448
update_cpumask+345
cpuset_write_resmask+796
cgroup_file_write+162
kernfs_fop_write_iter+284
new_sync_write+345
vfs_write+511
ksys_write+103
do_syscall_64+51
entry_SYSCALL_64_after_hwframe+68

// There might be more stuff happening here
// __update_load_avg_cfs_rq runs way too often to be able to
// trace it without filtering, and I didn't look into that.

kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
48117265, old load_avg: 1025, new load_sum: 0, new load_avg: 2
tg_load_avg_contrib: 1022, period_contrib: 223, on_list: 1, throttled:
1/1 stack:
bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
bpf_get_stackid_raw_tp+115
bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
bpf_trampoline_6442466486_1+144
update_tg_cfs_load+5
update_load_avg+450
update_blocked_averages+723
newidle_balance+533
pick_next_task_fair+57
__schedule+376
schedule+91
schedule_hrtimeout_range_clock+164
do_sys_poll+1043
__x64_sys_poll+179
do_syscall_64+51
entry_SYSCALL_64_after_hwframe+68

This particular example resulted in a 75/25 share of cpu time for the
two groups.

From this, it also looks like it doesn't matter if the new load_avg is 0 or
not (when the new load_sum is 0), since tg_load_avg_contrib will not be
properly removed either way (without this patch). There might be some other
precision issues as well.


> I have added a trace in cfs_rq_is_decayed() but I'm not able to
> reproduce a situation where load_sum == 0 but not load_avg even with
> the script in the cover letter

Hmm, strange. I am able to reproduce on both v5.12 and v5.13. I am running
on a non-SMT 4-core CPU (4670k), with CONFIG_HZ=250. Since this is timing
sensitive, you might need to tweak the timings for either the CFS bandwidth
period/quota, or the time waited between changing what CPU it runs on. If
you have more than 4 CPUs, you can also try to start on CPU 0, move it
through all cores, and then onto CPU n. Maybe that makes it possible to
reproduce.

Since bpftrace doesn't work properly in v5.13, this particular test was
run on v5.12 with a cherry-pick of 0258bdfaff5bd13c4d2383150b7097aecd6b6d82
and the other patch in this patchset.

testscript.bt:

kfunc:attach_entity_load_avg {
	@a_sum[tid] = args->cfs_rq->avg.load_sum;
	@a_load[tid] = args->cfs_rq->avg.load_avg;
}
kretfunc:attach_entity_load_avg {
	printf("%s: cfs_rq: %p, se.load_sum: %llu, se.load_avg: %llu, se->load.weight: %llu diff.sum: %llu, diff.load: %llu, new_sum: %llu, new_load: %llu, period_contrib: %llu\n",
	       probe, args->cfs_rq, args->se->avg.load_sum,
	       args->se->avg.load_avg, args->se->load.weight,
	       args->cfs_rq->avg.load_sum - @a_sum[tid],
	       args->cfs_rq->avg.load_avg - @a_load[tid],
	       args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg,
	       args->cfs_rq->avg.period_contrib);
}

kfunc:update_tg_cfs_load {
	@beg_load_avg[tid] = args->cfs_rq->avg.load_avg;
	@beg_load_sum[tid] = args->cfs_rq->avg.load_sum;
}
kretfunc:update_tg_cfs_load {
	printf("%s: cfs_rq: %p, old load_sum %llu, old load_avg: %llu, new load_sum: %llu, new load_avg: %llu tg_load_avg_contrib: %llu, period_contrib: %llu, on_list: %d, throttled: %d/%d stack: %s\n",
	       probe, args->cfs_rq, @beg_load_sum[tid], @beg_load_avg[tid],
	       args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg,
	       args->cfs_rq->tg_load_avg_contrib,
	       args->cfs_rq->avg.period_contrib, args->cfs_rq->on_list,
	       args->cfs_rq->throttled, args->cfs_rq->throttle_count,
	       kstack);
}
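
To run it (assuming a bpftrace build with kfunc/kretfunc support and a
kernel built with CONFIG_DEBUG_INFO_BTF):

	sudo bpftrace testscript.bt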

Thanks
Odin

2021-05-27 09:48:19

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

Hi,

> I finally got it this morning with your script and I confirm that the
> problem of load_sum == 0 but load_avg != 0 comes from
> update_tg_cfs_load(). Then, it seems that we don't call
> update_tg_load_avg for this cfs_rq in __update_blocked_fair() because
> of a recent update while propagating child's load changes. At the end
> we remove the cfs_rq from the list without updating its contribution.
>
> I'm going to prepare a patch to fix this

Yeah, that is another way to look at it. Have not verified, but wouldn't
update_tg_load_avg() in this case just remove the diff
(load_avg - tg_load_avg_contrib)? Wouldn't we still see some
tg_load_avg_contrib after the cfs_rq is removed from the list then? Eg. in
my example above, the cfs_rq will be removed from the list while
tg_load_avg_contrib=2, or am I missing something? That was my thought when
I looked at it last week at least..
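
For reference, v5.12's update_tg_load_avg() is roughly the following
(quoted from memory, so details may differ); note that the contrib is set
to the current load_avg, and that small deltas are filtered out:

	static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
	{
		long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

		/*
		 * No need to update load_avg for root_task_group as it
		 * is not used.
		 */
		if (cfs_rq->tg == &root_task_group)
			return;

		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		}
	}

So with a stale load_avg of 2, the contrib would indeed end up as 2, not 0.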

Thanks
Odin

2021-05-27 11:06:41

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

> 1st : ensure that cfs_rq->load_sum is not null if cfs_rq->load_avg isn't null too
> 2nd : call update_tg_load_avg() during child update so we will be sure
> to update tg_load_avg_contrib before removing the cfs_rq from the list

Ahh, yeah, with "1st" that would work. Yeah, that was my initial
implementation of the change, but I thought that it was better to keep
the logic away from the "hot path". We can verify this in
update_tg_cfs_load(), and then force update_tg_load_avg() inside
__update_blocked_fair() when avg.load_avg is 0 (given that this is
the only place where we can end up in this situation). I can update
this patch to do that instead.

Another solution is to update avg.load_avg
inside __update_blocked_fair() when load_sum is 0, and then propagate
that with update_tg_load_avg(). This removes the logic from the hot
path altogether.
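
A rough sketch of that idea (hypothetical and untested, not an actual
patch) inside __update_blocked_fair():

	/*
	 * Hypothetical: if the sum has fully decayed but a stale avg
	 * remains, zero the avg so that update_tg_load_avg() removes
	 * the remaining tg contribution.
	 */
	if (!cfs_rq->avg.load_sum && cfs_rq->avg.load_avg) {
		cfs_rq->avg.load_avg = 0;
		update_tg_load_avg(cfs_rq);
	}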

Not sure what the preferred way is. I have not found any other places
where this situation _should_ occur, but who knows..

Odin

2021-05-27 12:39:57

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

Hi again,

Saw your patchset now, and that is essentially the same as I did. I
guess you want to keep that patchset instead of me updating this
series then?

Also, could you take a look at patches 2 and 3 in this series? Should I
resend those in a new series, or?

Odin

2021-05-27 12:42:22

by Odin Ugedal

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

> For the 1st problem, the way we were updating load_avg and load_sum, we
> were losing the sync between both values

Yeah, that would be a natural way to fix that.

> In fact, the update was already there but not always called (see the
> patchset I just sent)

Yeah, that was my exact point.

Odin

2021-05-27 15:41:16

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Thu, 27 May 2021 at 09:51, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> Did some quick testing with bpftrace, and it looks like it is
> "update_tg_cfs_load" that does this:
>
> kretfunc:attach_entity_load_avg: cfs_rq: 0xffff8a3e6773e400,
> se.load_sum: 0, se.load_avg: 0, se->load.weight: 1048576 diff.sum: 0,
> diff.load: 0, new_sum: 0, new_load: 0, period_contrib: 821
>
> // quite some time passes
>
> kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
> 0, old load_avg: 0, new load_sum: 47876096, new load_avg: 1022
> tg_load_avg_contrib: 0, period_contrib: 82, on_list: 0, throttled: 0/0
> stack:
> bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
> bpf_get_stackid_raw_tp+115
> bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
> bpf_trampoline_6442466486_1+144
> update_tg_cfs_load+5
> update_load_avg+450
> enqueue_entity+91
> enqueue_task_fair+134
> move_queued_task+172
> affine_move_task+1282
> __set_cpus_allowed_ptr+317
> update_tasks_cpumask+70
> update_cpumasks_hier+448
> update_cpumask+345
> cpuset_write_resmask+796
> cgroup_file_write+162
> kernfs_fop_write_iter+284
> new_sync_write+345
> vfs_write+511
> ksys_write+103
> do_syscall_64+51
> entry_SYSCALL_64_after_hwframe+68
>
> // There might be more stuff happening here
> // __update_load_avg_cfs_rq runs way too often to be able to
> // trace it without filtering, and I didn't look into that.
>
> kretfunc:update_tg_cfs_load: cfs_rq: 0xffff8a3e6773e400, old load_sum
> 48117265, old load_avg: 1025, new load_sum: 0, new load_avg: 2
> tg_load_avg_contrib: 1022, period_contrib: 223, on_list: 1, throttled:
> 1/1 stack:
> bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
> bpf_get_stackid_raw_tp+115
> bpf_prog_a37cf01922e02958_update_tg_cfs_l+504
> bpf_trampoline_6442466486_1+144
> update_tg_cfs_load+5
> update_load_avg+450
> update_blocked_averages+723
> newidle_balance+533
> pick_next_task_fair+57
> __schedule+376
> schedule+91
> schedule_hrtimeout_range_clock+164
> do_sys_poll+1043
> __x64_sys_poll+179
> do_syscall_64+51
> entry_SYSCALL_64_after_hwframe+68
>
> This particular example resulted in a 75/25 share of cpu time for the
> two groups.
>
> From this, it also looks like it doesn't matter if the new load_avg is 0 or
> not (when the new load_sum is 0), since tg_load_avg_contrib will not be
> properly removed either way (without this patch). There might be some other
> precision issues as well.
>
>
> > I have added a trace in cfs_rq_is_decayed() but I'm not able to
> > reproduce a situation where load_sum == 0 but not load_avg even with
> > the script in the cover letter
>
> Hmm, strange. I am able to reproduce on both v5.12 and v5.13. I am running
> on a non-SMT 4-core CPU (4670k), with CONFIG_HZ=250. Since this is timing
> sensitive, you might need to tweak the timings for either the CFS bandwidth
> period/quota, or the time waited between changing what CPU it runs on. If
> you have more than 4 CPUs, you can also try to start on CPU 0, move it
> through all cores, and then onto CPU n. Maybe that makes it possible to
> reproduce.

I finally got it this morning with your script and I confirm that the
problem of load_sum == 0 but load_avg != 0 comes from
update_tg_cfs_load(). Then, it seems that we don't call
update_tg_load_avg for this cfs_rq in __update_blocked_fair() because
of a recent update while propagating child's load changes. At the end
we remove the cfs_rq from the list without updating its contribution.

I'm going to prepare a patch to fix this


>
> Since bpftrace doesn't work properly in v5.13, this particular test was
> run on v5.12 with a cherry-pick of 0258bdfaff5bd13c4d2383150b7097aecd6b6d82
> and the other patch in this patchset.
>
> testscript.bt:
>
> kfunc:attach_entity_load_avg {
> 	@a_sum[tid] = args->cfs_rq->avg.load_sum;
> 	@a_load[tid] = args->cfs_rq->avg.load_avg;
> }
> kretfunc:attach_entity_load_avg {
> 	printf("%s: cfs_rq: %p, se.load_sum: %llu, se.load_avg: %llu, se->load.weight: %llu diff.sum: %llu, diff.load: %llu, new_sum: %llu, new_load: %llu, period_contrib: %llu\n",
> 	       probe, args->cfs_rq, args->se->avg.load_sum,
> 	       args->se->avg.load_avg, args->se->load.weight,
> 	       args->cfs_rq->avg.load_sum - @a_sum[tid],
> 	       args->cfs_rq->avg.load_avg - @a_load[tid],
> 	       args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg,
> 	       args->cfs_rq->avg.period_contrib);
> }
>
> kfunc:update_tg_cfs_load {
> 	@beg_load_avg[tid] = args->cfs_rq->avg.load_avg;
> 	@beg_load_sum[tid] = args->cfs_rq->avg.load_sum;
> }
> kretfunc:update_tg_cfs_load {
> 	printf("%s: cfs_rq: %p, old load_sum %llu, old load_avg: %llu, new load_sum: %llu, new load_avg: %llu tg_load_avg_contrib: %llu, period_contrib: %llu, on_list: %d, throttled: %d/%d stack: %s\n",
> 	       probe, args->cfs_rq, @beg_load_sum[tid], @beg_load_avg[tid],
> 	       args->cfs_rq->avg.load_sum, args->cfs_rq->avg.load_avg,
> 	       args->cfs_rq->tg_load_avg_contrib,
> 	       args->cfs_rq->avg.period_contrib, args->cfs_rq->on_list,
> 	       args->cfs_rq->throttled, args->cfs_rq->throttle_count,
> 	       kstack);
> }
>
> Thanks
> Odin

2021-05-27 15:46:42

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Thu, 27 May 2021 at 11:46, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> > I finally got it this morning with your script and I confirm that the
> > problem of load_sum == 0 but load_avg != 0 comes from
> > update_tg_cfs_load(). Then, it seems that we don't call
> > update_tg_load_avg for this cfs_rq in __update_blocked_fair() because
> > of a recent update while propagating child's load changes. At the end
> > we remove the cfs_rq from the list without updating its contribution.
> >
> > I'm going to prepare a patch to fix this
>
> Yeah, that is another way to look at it. Have not verified, but wouldn't
> update_tg_load_avg() in this case just remove the diff
> (load_avg - tg_load_avg_contrib)? Wouldn't we still see some
> tg_load_avg_contrib after the cfs_rq is removed from the list then? Eg. in
> my example above, the cfs_rq will be removed from the list while
> tg_load_avg_contrib=2, or am I missing something? That was my thought when
> I looked at it last week at least..

1st : ensure that cfs_rq->load_sum is not null if cfs_rq->load_avg isn't null too
2nd : call update_tg_load_avg() during child update so we will be sure
to update tg_load_avg_contrib before removing the cfs_rq from the list
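
A hypothetical sketch of the 2nd point in __update_blocked_fair() (the
actual patch may do this differently; "propagated" is a placeholder for
whether propagating the child's load changed this cfs_rq):

	/*
	 * Sync tg_load_avg_contrib even when the cfs_rq was updated
	 * while propagating a child's load changes, so the contrib is
	 * current before the cfs_rq can leave the leaf list.
	 */
	if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(rq), cfs_rq) || propagated)
		update_tg_load_avg(cfs_rq);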

>
> Thanks
> Odin

2021-05-27 16:11:01

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Thu, 27 May 2021 at 14:38, Odin Ugedal <[email protected]> wrote:
>
> Hi again,
>
> Saw your patchset now, and that is essentially the same as I did. I
> guess you want to keep that patchset instead of me updating this
> series then?

yes

>
> Also, could you take a look at patches 2 and 3 in this series? Should I

Yes, I'm going to have a look at patches 2 and 3

> resend those in a new series, or?
>
> Odin

2021-05-27 20:49:16

by Vincent Guittot

Subject: Re: [PATCH 1/3] sched/fair: Add tg_load_contrib cfs_rq decay checking

On Thu, 27 May 2021 at 13:04, Odin Ugedal <[email protected]> wrote:
>
> > 1st : ensure that cfs_rq->load_sum is not null if cfs_rq->load_avg isn't null too
> > 2nd : call update_tg_load_avg() during child update so we will be sure
> > to update tg_load_avg_contrib before removing the cfs_rq from the list
>
> Ahh, yeah, with "1st" that would work. Yeah, that was my initial
> implementation of the change, but I thought that it was better to keep
> the logic away from the "hot path". We can verify this in
> update_tg_cfs_load(), and then force update_tg_load_avg() inside

For the 1st problem, the way we were updating load_avg and load_sum, we
were losing the sync between both values

> __update_blocked_fair() when avg.load_avg is 0 (given that this is
> the only place where we can end up in this situation). I can update
> this patch to do that instead.

In fact, the update was already there but not always called (see the
patchset I just sent)


>
> Another solution is to update avg.load_avg
> inside __update_blocked_fair() when load_sum is 0, and then propagate
> that with update_tg_load_avg(). This removes the logic from the hot
> path altogether.
>
> Not sure what the preferred way is. I have not found any other places
> where this situation _should_ occur, but who knows..
>
> Odin