hi folks,

I found a problem with s64 overflow in max_vruntime().

I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
group on each cpu; each task is a while loop at 100% cpu usage.

When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
waiting on the runqueue and never gets scheduled.

After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
cfs_rq->min_vruntime is 0x124b17fd59db8d02.

Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
faster, and cfs_rq->min_vruntime also grows rapidly.
3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
as follows:
---------
static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
{
	/*
	 * vruntime=0x124b17fd59db8d02
	 * min_vruntime=0x918fdb05287da7c3
	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
	 */
	s64 delta = (s64)(vruntime - min_vruntime);
	if (delta < 0)
		min_vruntime = vruntime;

	return min_vruntime;
}
----------
max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
causes the kworker thread to starve.
Does anyone have a good suggestion or a bugfix patch for this problem?
On 12/21/22 10:19, Zhang Qiao wrote:
> hi folks,
>
> I found a problem with s64 overflow in max_vruntime().
>
> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
> group on each cpu; each task is a while loop at 100% cpu usage.
>
> When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
> and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
> waiting on the runqueue and never gets scheduled.
>
> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
>
> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
> 1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
This is an interesting problem. It means that if the kworker had been
sleeping even longer, say 600 days, it might overflow u64 as well. My
suggestion is to cap the sleep-time contribution to the vruntime
computation at a maximum value that cannot overflow s64 when combined
with a maximum load.weight. IOW, if tasks have been sleeping long
enough, they are all treated the same.
> 2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
> a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
> faster, and cfs_rq->min_vruntime also grows rapidly.
> 3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
> as follows:
>
> ---------
>
> static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> {
> 	/*
> 	 * vruntime=0x124b17fd59db8d02
> 	 * min_vruntime=0x918fdb05287da7c3
> 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
> 	 */
> 	s64 delta = (s64)(vruntime - min_vruntime);
> 	if (delta < 0)
> 		min_vruntime = vruntime;
>
> 	return min_vruntime;
> }
>
> ----------
>
> max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
> should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
> causes the kworker thread to starve.
>
> Does anyone have a good suggestion or a bugfix patch for this problem?
>
Cheers,
Longman
On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
> hi folks,
>
> I found a problem with s64 overflow in max_vruntime().
>
> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
> group on each cpu; each task is a while loop at 100% cpu usage.
>
> When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
> and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
> waiting on the runqueue and never gets scheduled.
>
> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
>
> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
> 1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
> 2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
> a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
> faster, and cfs_rq->min_vruntime also grows rapidly.
> 3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
> as follows:
>
> ---------
>
> static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> {
> 	/*
> 	 * vruntime=0x124b17fd59db8d02
> 	 * min_vruntime=0x918fdb05287da7c3
> 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
> 	 */
> 	s64 delta = (s64)(vruntime - min_vruntime);
> 	if (delta < 0)
> 		min_vruntime = vruntime;
>
> 	return min_vruntime;
> }
>
> ----------
>
> max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
> should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
> causes the kworker thread to starve.
>
> Does anyone have a good suggestion or a bugfix patch for this problem?
I don't understand what you think the problem is. Signed overflow is
perfectly fine and works as designed here.
On 2022/12/22 20:45, Peter Zijlstra wrote:
> On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
>> hi folks,
>>
>> I found a problem with s64 overflow in max_vruntime().
>>
>> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
>> group on each cpu; each task is a while loop at 100% cpu usage.
>>
>> When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
>> and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
>> waiting on the runqueue and never gets scheduled.
>>
>> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
>> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
>>
>> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
>> 1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
>> 2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
>> a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
>> faster, and cfs_rq->min_vruntime also grows rapidly.
>> 3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
>> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
>> as follows:
>>
>> ---------
>>
>> static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
>> {
>> 	/*
>> 	 * vruntime=0x124b17fd59db8d02
>> 	 * min_vruntime=0x918fdb05287da7c3
>> 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
>> 	 */
>> 	s64 delta = (s64)(vruntime - min_vruntime);
>> 	if (delta < 0)
>> 		min_vruntime = vruntime;
>>
>> 	return min_vruntime;
>> }
>>
>> ----------
>>
>> max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
>> should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
>> causes the kworker thread to starve.
>>
>> Does anyone have a good suggestion or a bugfix patch for this problem?
>
> I don't understand what you think the problem is. Signed overflow is
> perfectly fine and works as designed here.
hi, Peter and Waiman,

This problem occurs in a production environment that deploys some dpdk services. When it
occurs, the system becomes unavailable (for example, many network-related commands get stuck),
so I think it is a real problem.

Most network commands (such as "ip") require rtnl_mutex, but the rtnl_mutex owner is waiting
at flush_all_backlogs() for the kworker of the system_highpri_wq, until this highpri kworker
finishes flushing the network packets.

However, this highpri kworker has been sleeping for a long time, so the difference between the
kworker's vruntime and cfs_rq->min_vruntime is very large. On wakeup, it keeps its old vruntime
due to the s64 overflow in max_vruntime(). Because of this incorrect vruntime, the kworker
might never be scheduled.

Is it necessary to deal with this problem in the kernel?
If so, to fix it, when a task has been sleeping long enough, we can set its vruntime to
cfs_rq->min_vruntime on wakeup, avoiding the s64 overflow in max_vruntime(), as follows:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e16e9f0124b0..89df8d7bae66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4336,10 +4336,14 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
+/* when a task sleeps over 200 days, its vruntime will be set to cfs_rq->min_vruntime. */
+#define WAKEUP_REINIT_THRESHOLD_NS (200LL * 24 * 3600 * NSEC_PER_SEC)
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
+	struct rq *rq = rq_of(cfs_rq);
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -4364,8 +4368,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		vruntime -= thresh;
 	}
 
-	/* ensure we never gain time by being placed backwards. */
-	se->vruntime = max_vruntime(se->vruntime, vruntime);
+	if (unlikely(!initial && (s64)(rq_clock_task(rq) - se->exec_start) > WAKEUP_REINIT_THRESHOLD_NS))
+		se->vruntime = vruntime;
+	else
+		/* ensure we never gain time by being placed backwards. */
+		se->vruntime = max_vruntime(se->vruntime, vruntime);
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
>
> .
>
On 2022/12/23 21:57, Zhang Qiao wrote:
>
>
> On 2022/12/22 20:45, Peter Zijlstra wrote:
>> On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
>>> hi folks,
>>>
>>> I found a problem with s64 overflow in max_vruntime().
>>>
>>> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
>>> group on each cpu; each task is a while loop at 100% cpu usage.
>>>
>>> When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
>>> and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
>>> waiting on the runqueue and never gets scheduled.
>>>
>>> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
>>> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
>>>
>>> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
>>> 1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
>>> 2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
>>> a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
>>> faster, and cfs_rq->min_vruntime also grows rapidly.
>>> 3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
>>> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
>>> as follows:
>>>
>>> ---------
>>>
>>> static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
>>> {
>>> 	/*
>>> 	 * vruntime=0x124b17fd59db8d02
>>> 	 * min_vruntime=0x918fdb05287da7c3
>>> 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
>>> 	 */
>>> 	s64 delta = (s64)(vruntime - min_vruntime);
>>> 	if (delta < 0)
>>> 		min_vruntime = vruntime;
>>>
>>> 	return min_vruntime;
>>> }
>>>
>>> ----------
>>>
>>> max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
>>> should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
>>> causes the kworker thread to starve.
>>>
>>> Does anyone have a good suggestion or a bugfix patch for this problem?
>>
>> I don't understand what you think the problem is. Signed overflow is
>> perfectly fine and works as designed here.
>
> hi, Peter and Waiman,
>
> This problem occurs in a production environment that deploys some dpdk services. When it
> occurs, the system becomes unavailable (for example, many network-related commands get stuck),
> so I think it is a real problem.
>
> Most network commands (such as "ip") require rtnl_mutex, but the rtnl_mutex owner is waiting
> at flush_all_backlogs() for the kworker of the system_highpri_wq, until this highpri kworker
> finishes flushing the network packets.
>
> However, this highpri kworker has been sleeping for a long time, so the difference between the
> kworker's vruntime and cfs_rq->min_vruntime is very large. On wakeup, it keeps its old vruntime
> due to the s64 overflow in max_vruntime(). Because of this incorrect vruntime, the kworker
> might never be scheduled.
>
> Is it necessary to deal with this problem in the kernel?
> If so, to fix it, when a task has been sleeping long enough, we can set its vruntime to
> cfs_rq->min_vruntime on wakeup, avoiding the s64 overflow in max_vruntime(), as follows:
>
hi,
Gentle ping. Please let me know if you have any comments on this issue.
Thanks,
Zhang Qiao.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e16e9f0124b0..89df8d7bae66 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4336,10 +4336,14 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
> #endif
> }
>
> +/* when a task sleeps over 200 days, its vruntime will be set to cfs_rq->min_vruntime. */
> +#define WAKEUP_REINIT_THRESHOLD_NS (200LL * 24 * 3600 * NSEC_PER_SEC)
> +
> static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> {
> u64 vruntime = cfs_rq->min_vruntime;
> + struct rq *rq = rq_of(cfs_rq);
>
> /*
> * The 'current' period is already promised to the current tasks,
> @@ -4364,8 +4368,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> vruntime -= thresh;
> }
>
> - /* ensure we never gain time by being placed backwards. */
> - se->vruntime = max_vruntime(se->vruntime, vruntime);
> + if (unlikely(!initial && (s64)(rq_clock_task(rq) - se->exec_start) > WAKEUP_REINIT_THRESHOLD_NS))
> + se->vruntime = vruntime;
> + else
> + /* ensure we never gain time by being placed backwards. */
> + se->vruntime = max_vruntime(se->vruntime, vruntime);
> }
>
> static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
>
>
>
>>
>> .
>>
>
> .
>
Bumping the thread, as we're hitting this problem too.
On Thu, Dec 22, 2022 at 01:45:48PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
> > I found a problem with s64 overflow in max_vruntime().
> >
[...]
> > static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> > {
> > 	/*
> > 	 * vruntime=0x124b17fd59db8d02
> > 	 * min_vruntime=0x918fdb05287da7c3
> > 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
> > 	 */
> > 	s64 delta = (s64)(vruntime - min_vruntime);
> > 	if (delta < 0)
> > 		min_vruntime = vruntime;
> >
> > 	return min_vruntime;
> > }
> >
> > ----------
> >
> > max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
> > should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
> > causes the kworker thread to starve.
> >
> > Does anyone have a good suggestion or a bugfix patch for this problem?
>
> I don't understand what you think the problem is. Signed overflow is
> perfectly fine and works as designed here.
Disagreed.
The calculation is indeed safe against the overflow of the vruntimes
themselves. However, when the two vruntimes are more than 2^63 apart,
their comparison gets inverted due to that s64 overflow.
And this is what happens here: one scheduling entity has accumulated a
vruntime more than 2^63 ahead of another. Now the comparison is
inverted due to s64 overflow, and the latter can't get to the cpu,
because it appears to have vruntime (much) bigger than that of the
former.
This situation is reproducible e.g. when one scheduling entity is a
multi-cpu hog, and the other is woken up from a long sleep. Normally,
when a task is placed on a cfs_rq, its vruntime is pulled up to
min_vruntime, to avoid boosting the woken-up task. However, in this case
the task is so far behind in vruntime that it appears ahead instead, its
vruntime is not adjusted in place_entity(), and then it loses the
cpu to the current scheduling entity.
Thanks,
Roman.
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Fri, Dec 23, 2022 at 09:57:33PM +0800, Zhang Qiao wrote:
> On 2022/12/22 20:45, Peter Zijlstra wrote:
> > On Wed, Dec 21, 2022 at 11:19:31PM +0800, Zhang Qiao wrote:
> >> hi folks,
> >>
> >> I found a problem with s64 overflow in max_vruntime().
> >>
> >> I created a task group GROUPA (path: /system.slice/xxx/yyy/CGROUPA) and ran a task in this
> >> group on each cpu; each task is a while loop at 100% cpu usage.
> >>
> >> When unregistering net devices, flush_all_backlogs() queues a work item on system_highpri_wq
> >> and wakes up a high-priority kworker thread on each cpu. However, the kworker thread keeps
> >> waiting on the runqueue and never gets scheduled.
> >>
> >> After parsing the vmcore, the vruntime of the kworker is 0x918fdb05287da7c3 and the
> >> cfs_rq->min_vruntime is 0x124b17fd59db8d02.
> >>
> >> Why is the difference between cfs_rq->min_vruntime and the kworker's vruntime so large?
> >> 1) The kworker of the system_highpri_wq slept for a very long time (about 300 days).
> >> 2) cfs_rq->curr is an ancestor of GROUPA, and cfs_rq->curr->load.weight is 2494, so when
> >> a task belonging to GROUPA runs for a long time, its vruntime increases about 420 times
> >> faster, and cfs_rq->min_vruntime also grows rapidly.
> >> 3) When the kworker thread is woken up, its vruntime is set to the maximum of its old
> >> vruntime and cfs_rq->min_vruntime. But in max_vruntime() there is an s64 overflow issue,
> >> as follows:
> >>
> >> ---------
> >>
> >> static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
> >> {
> >> 	/*
> >> 	 * vruntime=0x124b17fd59db8d02
> >> 	 * min_vruntime=0x918fdb05287da7c3
> >> 	 * vruntime - min_vruntime = 9276074894177461567 > S64_MAX, so the s64 cast overflows
> >> 	 */
> >> 	s64 delta = (s64)(vruntime - min_vruntime);
> >> 	if (delta < 0)
> >> 		min_vruntime = vruntime;
> >>
> >> 	return min_vruntime;
> >> }
> >>
> >> ----------
> >>
> >> max_vruntime() returns the kworker's old vruntime, which is incorrect; the correct result
> >> should be cfs_rq->min_vruntime. This incorrect result is greater than cfs_rq->min_vruntime and
> >> causes the kworker thread to starve.
> >>
> >> Does anyone have a good suggestion or a bugfix patch for this problem?
> >
> > I don't understand what you think the problem is. Signed overflow is
> > perfectly fine and works as designed here.
>
> hi, Peter and Waiman,
>
> This problem occurs in a production environment that deploys some dpdk services. When it
> occurs, the system becomes unavailable (for example, many network-related commands get stuck),
> so I think it is a real problem.
>
> Most network commands (such as "ip") require rtnl_mutex, but the rtnl_mutex owner is waiting
> at flush_all_backlogs() for the kworker of the system_highpri_wq, until this highpri kworker
> finishes flushing the network packets.
>
> However, this highpri kworker has been sleeping for a long time, so the difference between the
> kworker's vruntime and cfs_rq->min_vruntime is very large. On wakeup, it keeps its old vruntime
> due to the s64 overflow in max_vruntime(). Because of this incorrect vruntime, the kworker
> might never be scheduled.
>
> Is it necessary to deal with this problem in the kernel?
> If so, to fix it, when a task has been sleeping long enough, we can set its vruntime to
> cfs_rq->min_vruntime on wakeup, avoiding the s64 overflow in max_vruntime(), as follows:
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e16e9f0124b0..89df8d7bae66 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4336,10 +4336,14 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
> #endif
> }
>
> +/* when a task sleeps over 200 days, its vruntime will be set to cfs_rq->min_vruntime. */
> +#define WAKEUP_REINIT_THRESHOLD_NS (200LL * 24 * 3600 * NSEC_PER_SEC)
I wonder where these 200 days come from?
E.g. in our setup we've observed the problem on a 448 cpu system, with
all the cpus being occupied by tasks in a single cpu cgroup (and
therefore contributing to its weight), when the other task (kworker)
slept for around 209 days. IOW presumably adding a few more cpus or
just running the whole cgroup at an elevated nice level will make the
difference accumulate faster.
Thanks,
Roman.
> +
> static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> {
> u64 vruntime = cfs_rq->min_vruntime;
> + struct rq *rq = rq_of(cfs_rq);
>
> /*
> * The 'current' period is already promised to the current tasks,
> @@ -4364,8 +4368,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> vruntime -= thresh;
> }
>
> - /* ensure we never gain time by being placed backwards. */
> - se->vruntime = max_vruntime(se->vruntime, vruntime);
> + if (unlikely(!initial && (s64)(rq_clock_task(rq) - se->exec_start) > WAKEUP_REINIT_THRESHOLD_NS))
> + se->vruntime = vruntime;
> + else
> + /* ensure we never gain time by being placed backwards. */
> + se->vruntime = max_vruntime(se->vruntime, vruntime);
> }
>
> static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
>
>
>
> >
> > .
> >
On Wed, Jan 25, 2023 at 08:45:32PM +0100, Roman Kagan wrote:
> The calculation is indeed safe against the overflow of the vruntimes
> themselves. However, when the two vruntimes are more than 2^63 apart,
> their comparison gets inverted due to that s64 overflow.
Yes, but that's a whole different issue. vruntimes are not expected to be
*that* far apart.
That is surely the abnormal case. The normal case is wrap around, and
that happens 'often' and should continue working.
> And this is what happens here: one scheduling entity has accumulated a
> vruntime more than 2^63 ahead of another. Now the comparison is
> inverted due to s64 overflow, and the latter can't get to the cpu,
> because it appears to have vruntime (much) bigger than that of the
> former.
If it can be 2^63 ahead, it can also be 2^(64+) ahead and nothing will
help.
> This situation is reproducible e.g. when one scheduling entity is a
> multi-cpu hog, and the other is woken up from a long sleep. Normally
A very low weight CPU hog?
> when a task is placed on a cfs_rq, its vruntime is pulled to
> min_vruntime, to avoid boosting the woken up task. However in this case
> the task is so much behind in vruntime that it appears ahead instead,
> its vruntime is not adjusted in place_entity(), and then it looses the
> cpu to the current scheduling entity.
What I think might be a way out here is passing the sleep wall-time
(cfs_rq_clock_pelt() time I suppose) to place entity and simply skip the
magic if 'big'.
All that only matters for small sleeps anyway.
Something like:
	sleep_time = U64_MAX;
	if (se->avg.last_update_time)
		sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;

	if (sleep_time > 60*NSEC_PER_SEC) { // 1 minute is huge
		se->vruntime = cfs_rq->min_vruntime;
		return;
	}

	// ... rest of place_entity()
Hmm... ?
On Thu, Jan 26, 2023 at 01:49:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 25, 2023 at 08:45:32PM +0100, Roman Kagan wrote:
>
> > The calculation is indeed safe against the overflow of the vruntimes
> > themselves. However, when the two vruntimes are more than 2^63 apart,
> > their comparison gets inverted due to that s64 overflow.
>
> Yes, but that's a whole different issue. vruntime are not expected to be
> *that* far apart.
>
> That is surely the abnormal case. The normal case is wrap around, and
> that happens 'often' and should continue working.
>
> > And this is what happens here: one scheduling entity has accumulated a
> > vruntime more than 2^63 ahead of another. Now the comparison is
> > inverted due to s64 overflow, and the latter can't get to the cpu,
> > because it appears to have vruntime (much) bigger than that of the
> > former.
>
> If it can be 2^63 ahead, it can also be 2^(64+) ahead and nothing will
> help.
>
> > This situation is reproducible e.g. when one scheduling entity is a
> > multi-cpu hog, and the other is woken up from a long sleep. Normally
>
> A very low weight CPU hog?
Right. In our case this weight was due to the task group consuming
all 448 cpus on the machine; presumably one can achieve this on a smaller
machine by tweaking the shares of the cgroup.
> > when a task is placed on a cfs_rq, its vruntime is pulled to
> > min_vruntime, to avoid boosting the woken up task. However in this case
> > the task is so much behind in vruntime that it appears ahead instead,
> > its vruntime is not adjusted in place_entity(), and then it loses the
> > cpu to the current scheduling entity.
>
> What I think might be a way out here is passing the sleep wall-time
> (cfs_rq_clock_pelt() time I suppose) to place entity and simply skip the
> magic if 'big'.
>
> All that only matters for small sleeps anyway.
>
> Something like:
>
> sleep_time = U64_MAX;
> if (se->avg.last_update_time)
> sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
others were suggesting? It appears to better match the notion of sleep
wall-time, no?
Thanks,
Roman.
>
> if (sleep_time > 60*NSEC_PER_SEC) { // 1 minute is huge
> se->vruntime = cfs_rq->min_vruntime;
> return;
> }
>
> // ... rest of place_entity()
>
> Hmm... ?
On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > All that only matters for small sleeps anyway.
> >
> > Something like:
> >
> > sleep_time = U64_MAX;
> > if (se->avg.last_update_time)
> > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
>
> Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> others were suggesting? It appears to better match the notion of sleep
> wall-time, no?
Should also work I suppose. cfs_rq_clock takes throttling into account,
but that should hopefully also not be *that* long, so either should
work.
On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
>
> > > All that only matters for small sleeps anyway.
> > >
> > > Something like:
> > >
> > > sleep_time = U64_MAX;
> > > if (se->avg.last_update_time)
> > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> >
> > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > others were suggesting? It appears to better match the notion of sleep
> > wall-time, no?
>
> Should also work I suppose. cfs_rq_clock takes throttling into account,
> but that should hopefully also not be *that* long, so either should
> work.
yes, rq_clock_task(rq_of(cfs_rq)) should be fine too.

Another thing to take into account is the sleeper credit that the
waking task deserves, so the detection should be done after it has been
subtracted from vruntime.

Last point: when a nice -20 task runs on a rq, it will take a bit more
than 2 seconds for the vruntime to increase by more than 24ms (the
maximum credit that a waking task can get), so the threshold must be
significantly higher than 2 sec. On the opposite side, the lowest
possible weight of a cfs_rq is 2, which means the problem appears for a
sleep longer than or equal to 2^54 ns = 2^63*2/1024. We should use this
value instead of an arbitrary 200 days.
Vincent Guittot <[email protected]> writes:
> On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
>>
>> On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
>>
>> > > All that only matters for small sleeps anyway.
>> > >
>> > > Something like:
>> > >
>> > > sleep_time = U64_MAX;
>> > > if (se->avg.last_update_time)
>> > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
>> >
>> > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
>> > others were suggesting? It appears to better match the notion of sleep
>> > wall-time, no?
>>
>> Should also work I suppose. cfs_rq_clock takes throttling into account,
>> but that should hopefully also not be *that* long, so either should
>> work.
>
> yes rq_clock_task(rq_of(cfs_rq)) should be fine too
No, last_update_time is based on cfs_rq_clock_pelt(cfs_rq), and it will
get more and more out of sync as time goes on, every time the cfs_rq
throttles. It won't reset when the throttle is done.
>
> Another thing to take into account is the sleeper credit that the
> waking task deserves, so the detection should be done after it has been
> subtracted from vruntime.
>
> Last point: when a nice -20 task runs on a rq, it will take a bit more
> than 2 seconds for the vruntime to increase by more than 24ms (the
> maximum credit that a waking task can get), so the threshold must be
> significantly higher than 2 sec. On the opposite side, the lowest
> possible weight of a cfs_rq is 2, which means the problem appears for
> a sleep longer than or equal to 2^54 ns = 2^63*2/1024. We should use
> this value instead of an arbitrary 200 days.
On Fri, 27 Jan 2023 at 23:10, Benjamin Segall <[email protected]> wrote:
>
> Vincent Guittot <[email protected]> writes:
>
> > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> >>
> >> > > All that only matters for small sleeps anyway.
> >> > >
> >> > > Something like:
> >> > >
> >> > > sleep_time = U64_MAX;
> >> > > if (se->avg.last_update_time)
> >> > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> >> >
> >> > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> >> > others were suggesting? It appears to better match the notion of sleep
> >> > wall-time, no?
> >>
> >> Should also work I suppose. cfs_rq_clock takes throttling into account,
> >> but that should hopefully also not be *that* long, so either should
> >> work.
> >
> > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
>
> No, last_update_time is based on cfs_rq_clock_pelt(cfs_rq), and it will
> get more and more out of sync as time goes on, every time the cfs_rq
> throttles. It won't reset when the throttle is done.
I was referring to the rq_clock_task(rq_of(cfs_rq)) - se->exec_start
that Roman was asking about.
>
> >
> > Another thing to take into account is the sleeper credit that the
> > waking task deserves so the detection should be done once it has been
> > subtracted from vruntime.
> >
> > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > maximum credit that a waking task can get) so threshold must be
> > significantly higher than 2 sec. On the opposite side, the lowest
> > possible weight of a cfs rq is 2 which means that the problem appears
> > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > value instead of an arbitrary 200 days
On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> >
> > > > All that only matters for small sleeps anyway.
> > > >
> > > > Something like:
> > > >
> > > > sleep_time = U64_MAX;
> > > > if (se->avg.last_update_time)
> > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > >
> > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > others were suggesting? It appears to better match the notion of sleep
> > > wall-time, no?
> >
> > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > but that should hopefully also not be *that* long, so either should
> > work.
>
> yes rq_clock_task(rq_of(cfs_rq)) should be fine too
>
> Another thing to take into account is the sleeper credit that the
> waking task deserves so the detection should be done once it has been
> subtracted from vruntime.
>
> Last point, when a nice -20 task runs on a rq, it will take a bit more
> than 2 seconds for the vruntime to be increased by more than 24ms (the
> maximum credit that a waking task can get) so threshold must be
> significantly higher than 2 sec. On the opposite side, the lowest
> possible weight of a cfs rq is 2 which means that the problem appears
> for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> value instead of an arbitrary 200 days
Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
1. Any task that sleeps longer than 2 sec will get at most 24 ms (sysctl_sched_latency)
of 'vruntime bonus' when enqueued.
2. Although a low-weight cfs_rq running for 2^54 nsec could trigger the overflow,
we can choose a threshold lower than 2^54 to avoid any overflow.
thanks,
Chenyu
On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > >
> > > > > All that only matters for small sleeps anyway.
> > > > >
> > > > > Something like:
> > > > >
> > > > > sleep_time = U64_MAX;
> > > > > if (se->avg.last_update_time)
> > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > >
> > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > others were suggesting? It appears to better match the notion of sleep
> > > > wall-time, no?
> > >
> > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > but that should hopefully also not be *that* long, so either should
> > > work.
> >
> > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> >
> > Another thing to take into account is the sleeper credit that the
> > waking task deserves so the detection should be done once it has been
> > subtracted from vruntime.
> >
> > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > maximum credit that a waking task can get) so threshold must be
> > significantly higher than 2 sec. On the opposite side, the lowest
> > possible weight of a cfs rq is 2 which means that the problem appears
> > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > value instead of an arbitrary 200 days
> Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> 'vruntime bonus' when enqueued.
> 2. Although a low weight cfs rq run for 2^54 nsec could trigger the overflow,
> we can choose threshold lower than 2^54 to avoid any overflow.
This matches my understanding too, so I went ahead with the value
proposed by Peter (1 min), which looked sufficiently far away from either
side.
Roman.
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
>
> On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > >
> > > > > > All that only matters for small sleeps anyway.
> > > > > >
> > > > > > Something like:
> > > > > >
> > > > > > sleep_time = U64_MAX;
> > > > > > if (se->avg.last_update_time)
> > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > >
> > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > wall-time, no?
> > > >
> > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > but that should hopefully also not be *that* long, so either should
> > > > work.
> > >
> > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > >
> > > Another thing to take into account is the sleeper credit that the
> > > waking task deserves so the detection should be done once it has been
> > > subtracted from vruntime.
> > >
> > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > maximum credit that a waking task can get) so threshold must be
> > > significantly higher than 2 sec. On the opposite side, the lowest
> > > possible weight of a cfs rq is 2 which means that the problem appears
> > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > value instead of an arbitrary 200 days
> > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > 'vruntime bonus' when enqueued.
This means that if a nice -20 task runs on the cfs rq while your task is
sleeping for 2 seconds, the min_vruntime of the cfs rq will increase by
24ms. If there are 2 nice -20 tasks then the min_vruntime will
increase by 24ms after 4 seconds, and so on ...
On the other side, a task nice 19 that runs 1ms will increase its
vruntime by around 68ms.
So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
rq, the nice -19 one should run 1ms every 65 seconds and this also
means that the vruntime of task nice -19 should still be above
min_vruntime after sleeping 60 seconds. Of course this is even worse
with a child cgroup with the lowest weight (weight of 2 instead of 15)
Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
> > 2. Although a low weight cfs rq run for 2^54 nsec could trigger the overflow,
> > we can choose threshold lower than 2^54 to avoid any overflow.
>
> This matches my understanding too, so I went ahead with the value
> proposed by Peter (1 min) which looked sufficiently far away from either
> side.
>
> Roman.
On 2023-01-31 at 12:10:29 +0100, Vincent Guittot wrote:
> On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> >
> > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > >
> > > > > > > All that only matters for small sleeps anyway.
> > > > > > >
> > > > > > > Something like:
> > > > > > >
> > > > > > > sleep_time = U64_MAX;
> > > > > > > if (se->avg.last_update_time)
> > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > >
> > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > wall-time, no?
> > > > >
> > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > but that should hopefully also not be *that* long, so either should
> > > > > work.
> > > >
> > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > >
> > > > Another thing to take into account is the sleeper credit that the
> > > > waking task deserves so the detection should be done once it has been
> > > > subtracted from vruntime.
> > > >
> > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > maximum credit that a waking task can get) so threshold must be
> > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > value instead of an arbitrary 200 days
> > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > 'vruntime bonus' when enqueued.
>
> This means that if a task nice -20 runs on cfs rq while your task is
> sleeping 2seconds, the min vruntime of the cfs rq will increase by
> 24ms. If there are 2 nice -20 tasks then the min vruntime will
> increase by 24ms after 4 seconds and so on ...
>
Got it, thanks for this example.
> On the other side, a task nice 19 that runs 1ms will increase its
> vruntime by around 68ms.
>
> So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> rq, the nice -19 one should run 1ms every 65 seconds and this also
I assume that you were referring to the nice 19 task, and also in the
following '-19'.
> means that the vruntime of task nice -19 should still be above
> min_vruntime after sleeping 60 seconds.
So even if the nice 19 task sleeps for a very long time, cfs_rq->min_vruntime
cannot take the lead, and the s64 overflow of (min_vruntime - se->vruntime) will not happen.
> Of course this is even worse
> with a child cgroup with the lowest weight (weight of 2 instead of 15)
>
> Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
>
2^54 could be the "earliest" interval that could trigger the s64 overflow (because
any weight greater than 2 will not trigger the overflow even after sleeping for 2^54).
thanks,
Chenyu
On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > >
> > > > > > > All that only matters for small sleeps anyway.
> > > > > > >
> > > > > > > Something like:
> > > > > > >
> > > > > > > sleep_time = U64_MAX;
> > > > > > > if (se->avg.last_update_time)
> > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > >
> > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > wall-time, no?
> > > > >
> > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > but that should hopefully also not be *that* long, so either should
> > > > > work.
> > > >
> > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > >
> > > > Another thing to take into account is the sleeper credit that the
> > > > waking task deserves so the detection should be done once it has been
> > > > subtracted from vruntime.
> > > >
> > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > maximum credit that a waking task can get) so threshold must be
> > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > value instead of an arbitrary 200 days
> > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > 'vruntime bonus' when enqueued.
>
> This means that if a task nice -20 runs on cfs rq while your task is
> sleeping 2seconds, the min vruntime of the cfs rq will increase by
> 24ms. If there are 2 nice -20 tasks then the min vruntime will
> increase by 24ms after 4 seconds and so on ...
>
> On the other side, a task nice 19 that runs 1ms will increase its
> vruntime by around 68ms.
>
> So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> rq, the nice -19 one should run 1ms every 65 seconds and this also
> means that the vruntime of task nice -19 should still be above
> min_vruntime after sleeping 60 seconds. Of course this is even worse
> with a child cgroup with the lowest weight (weight of 2 instead of 15)
>
> Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
If we go this route, what would be the proper way to infer this value?
Looks like
(1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
Is there any other definition that stipulates the lowest weight to be 2?
Besides, MIN_SHARES is under #ifdef CONFIG_FAIR_GROUP_SCHED, so the
above expression would require more #ifdef-s.
(That said, I'm still not convinced that being math-precise here is
practical, or that slightly violating fairness in such a skewed setup is
really something to be afraid of.)
Thanks,
Roman.
On Tue, 7 Feb 2023 at 20:37, Roman Kagan <[email protected]> wrote:
>
> On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> > On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > > >
> > > > > > > > All that only matters for small sleeps anyway.
> > > > > > > >
> > > > > > > > Something like:
> > > > > > > >
> > > > > > > > sleep_time = U64_MAX;
> > > > > > > > if (se->avg.last_update_time)
> > > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > > >
> > > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > > wall-time, no?
> > > > > >
> > > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > > but that should hopefully also not be *that* long, so either should
> > > > > > work.
> > > > >
> > > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > > >
> > > > > Another thing to take into account is the sleeper credit that the
> > > > > waking task deserves so the detection should be done once it has been
> > > > > subtracted from vruntime.
> > > > >
> > > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > > maximum credit that a waking task can get) so threshold must be
> > > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > > value instead of an arbitrary 200 days
> > > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > > 'vruntime bonus' when enqueued.
> >
> > This means that if a task nice -20 runs on cfs rq while your task is
> > sleeping 2seconds, the min vruntime of the cfs rq will increase by
> > 24ms. If there are 2 nice -20 tasks then the min vruntime will
> > increase by 24ms after 4 seconds and so on ...
> >
> > On the other side, a task nice 19 that runs 1ms will increase its
> > vruntime by around 68ms.
> >
> > So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> > rq, the nice -19 one should run 1ms every 65 seconds and this also
> > means that the vruntime of task nice -19 should still be above
> > min_vruntime after sleeping 60 seconds. Of course this is even worse
> > with a child cgroup with the lowest weight (weight of 2 instead of 15)
> >
> > Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
>
> If we go this route, what would be the proper way to infer this value?
> Looks like
>
> (1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
(1ull << 63) / NICE_0_LOAD * MIN_SHARES
>
> Is there any other definition that stipulates the lowest weight to be 2?
No, at the task level the min weight is 3, for sched_idle tasks.
> Besides, MIN_SHARES is under #ifdef CONFIG_FAIR_GROUP_SCHED, so the
> above expression would require more #ifdef-s.
(1ull << 63) / NICE_0_LOAD
could be a reasonable shortcut I think
>
> (That said, I'm still not convinced being math-precise here is
> practical, and slightly violating fairness in such a skewed setup is
> really something to be afraid of.)
We regularly have people complaining that sched_idle tasks (with a
weight of 3) wake up too often and steal time. A 60-second threshold may
just make that situation happen more frequently.
Vincent
>
> Thanks,
> Roman.
On Wed, Feb 08, 2023 at 11:13:35AM +0100, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 20:37, Roman Kagan <[email protected]> wrote:
> >
> > On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> > > On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > > > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > > > >
> > > > > > > > > All that only matters for small sleeps anyway.
> > > > > > > > >
> > > > > > > > > Something like:
> > > > > > > > >
> > > > > > > > > sleep_time = U64_MAX;
> > > > > > > > > if (se->avg.last_update_time)
> > > > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > > > >
> > > > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > > > wall-time, no?
> > > > > > >
> > > > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > > > but that should hopefully also not be *that* long, so either should
> > > > > > > work.
> > > > > >
> > > > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > > > >
> > > > > > Another thing to take into account is the sleeper credit that the
> > > > > > waking task deserves so the detection should be done once it has been
> > > > > > subtracted from vruntime.
> > > > > >
> > > > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > > > maximum credit that a waking task can get) so threshold must be
> > > > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > > > value instead of an arbitrary 200 days
> > > > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > > > 'vruntime bonus' when enqueued.
> > >
> > > This means that if a task nice -20 runs on cfs rq while your task is
> > > sleeping 2seconds, the min vruntime of the cfs rq will increase by
> > > 24ms. If there are 2 nice -20 tasks then the min vruntime will
> > > increase by 24ms after 4 seconds and so on ...
> > >
> > > On the other side, a task nice 19 that runs 1ms will increase its
> > > vruntime by around 68ms.
> > >
> > > So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> > > rq, the nice -19 one should run 1ms every 65 seconds and this also
> > > means that the vruntime of task nice -19 should still be above
> > > min_vruntime after sleeping 60 seconds. Of course this is even worse
> > > with a child cgroup with the lowest weight (weight of 2 instead of 15)
> > >
> > > Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
> >
> > If we go this route, what would be the proper way to infer this value?
> > Looks like
> >
> > (1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
>
> (1ull << 63) / NICE_0_LOAD * MIN_SHARES
On 64-bit platforms NICE_0_LOAD == 1L << 20 (i.e. it's also scaled) for
better precision. So this will yield 2^63 / 2^20 * 2 = 2^44. Probably
good enough, but confusing.
Thanks,
Roman.
On Wed, 8 Feb 2023 at 19:09, Roman Kagan <[email protected]> wrote:
>
> On Wed, Feb 08, 2023 at 11:13:35AM +0100, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 20:37, Roman Kagan <[email protected]> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> > > > On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > > > > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > > > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > > > > >
> > > > > > > > > > All that only matters for small sleeps anyway.
> > > > > > > > > >
> > > > > > > > > > Something like:
> > > > > > > > > >
> > > > > > > > > > sleep_time = U64_MAX;
> > > > > > > > > > if (se->avg.last_update_time)
> > > > > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > > > > >
> > > > > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > > > > wall-time, no?
> > > > > > > >
> > > > > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > > > > but that should hopefully also not be *that* long, so either should
> > > > > > > > work.
> > > > > > >
> > > > > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > > > > >
> > > > > > > Another thing to take into account is the sleeper credit that the
> > > > > > > waking task deserves so the detection should be done once it has been
> > > > > > > subtracted from vruntime.
> > > > > > >
> > > > > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > > > > maximum credit that a waking task can get) so threshold must be
> > > > > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > > > > value instead of an arbitrary 200 days
> > > > > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > > > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > > > > 'vruntime bonus' when enqueued.
> > > >
> > > > This means that if a task nice -20 runs on cfs rq while your task is
> > > > sleeping 2seconds, the min vruntime of the cfs rq will increase by
> > > > 24ms. If there are 2 nice -20 tasks then the min vruntime will
> > > > increase by 24ms after 4 seconds and so on ...
> > > >
> > > > On the other side, a task nice 19 that runs 1ms will increase its
> > > > vruntime by around 68ms.
> > > >
> > > > So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> > > > rq, the nice -19 one should run 1ms every 65 seconds and this also
> > > > means that the vruntime of task nice -19 should still be above
> > > > min_vruntime after sleeping 60 seconds. Of course this is even worse
> > > > with a child cgroup with the lowest weight (weight of 2 instead of 15)
> > > >
> > > > Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
> > >
> > > If we go this route, what would be the proper way to infer this value?
> > > Looks like
> > >
> > > (1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
> >
> > (1ull << 63) / NICE_0_LOAD * MIN_SHARES
>
> On 64bit platforms NICE_0_LOAD == 1L << 20 (i.e. it's also scaled) for
> better precision. So this will yield 2^63 / 2^20 * 2 = 2^44. Good
> enough probably but confusing.
Something like the below should be enough to explain the value
/*
* min_vruntime can move forward much faster than real time. The worst case
* happens when an entity with the min weight always runs on the cfs rq. In this
* case, the max comparison between vruntime and min_vruntime can fail after a
* sleep greater than:
* (1ull << 63) / NICE_0_LOAD * MIN_SHARES
* We can simplify this to:
* (1ull << 63) / NICE_0_LOAD
*/
#define SLEEP_VRUNTIME_OVERFLOW ((1ull << 63) / NICE_0_LOAD)
>
> Thanks,
> Roman.
On Thu, Feb 09, 2023 at 12:26:12PM +0100, Vincent Guittot wrote:
> On Wed, 8 Feb 2023 at 19:09, Roman Kagan <[email protected]> wrote:
> > On Wed, Feb 08, 2023 at 11:13:35AM +0100, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 20:37, Roman Kagan <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> > > > > On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > > > > > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > > > > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > > > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > > > > > >
> > > > > > > > > > > All that only matters for small sleeps anyway.
> > > > > > > > > > >
> > > > > > > > > > > Something like:
> > > > > > > > > > >
> > > > > > > > > > > sleep_time = U64_MAX;
> > > > > > > > > > > if (se->avg.last_update_time)
> > > > > > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > > > > > >
> > > > > > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > > > > > wall-time, no?
> > > > > > > > >
> > > > > > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > > > > > but that should hopefully also not be *that* long, so either should
> > > > > > > > > work.
> > > > > > > >
> > > > > > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > > > > > >
> > > > > > > > Another thing to take into account is the sleeper credit that the
> > > > > > > > waking task deserves so the detection should be done once it has been
> > > > > > > > subtracted from vruntime.
> > > > > > > >
> > > > > > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > > > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > > > > > maximum credit that a waking task can get) so threshold must be
> > > > > > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > > > > > possible weight of a cfs rq is 2 which means that the problem appears
> > > > > > > > for a sleep longer or equal to 2^54 = 2^63*2/1024. We should use this
> > > > > > > > value instead of an arbitrary 200 days
> > > > > > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > > > > > 1. Any task sleeps longer than 2 sec will get at most 24 ms(sysctl_sched_latency)
> > > > > > > 'vruntime bonus' when enqueued.
> > > > >
> > > > > This means that if a task nice -20 runs on cfs rq while your task is
> > > > > sleeping 2seconds, the min vruntime of the cfs rq will increase by
> > > > > 24ms. If there are 2 nice -20 tasks then the min vruntime will
> > > > > increase by 24ms after 4 seconds and so on ...
> > > > >
> > > > > On the other side, a task nice 19 that runs 1ms will increase its
> > > > > vruntime by around 68ms.
> > > > >
> > > > > So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> > > > > rq, the nice -19 one should run 1ms every 65 seconds and this also
> > > > > means that the vruntime of task nice -19 should still be above
> > > > > min_vruntime after sleeping 60 seconds. Of course this is even worse
> > > > > with a child cgroup with the lowest weight (weight of 2 instead of 15)
> > > > >
> > > > > Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
> > > >
> > > > If we go this route, what would be the proper way to infer this value?
> > > > Looks like
> > > >
> > > > (1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
> > >
> > > (1ull << 63) / NICE_0_LOAD * MIN_SHARES
> >
> > On 64bit platforms NICE_0_LOAD == 1L << 20 (i.e. it's also scaled) for
> > better precision. So this will yield 2^63 / 2^20 * 2 = 2^44. Good
> > enough probably but confusing.
>
> Something like the below should be enough to explain the value
>
> /*
> * min_vruntime can move forward much faster than real time. The worst case
> * happens when an entity with the min weight always runs on the cfs rq. In this
> * case, the max comparison between vruntime and min_vruntime can fail after a
> * sleep greater than:
> * ((1ull << 63) / NICE_0_LOAD) * MIN_SHARES
Sorry if I'm being dense, but aren't NICE_0_LOAD and MIN_SHARES measured
in different units: the former is scaled while the latter is not?
> * We can simplify this to:
> * ((1ull << 63) / NICE_0_LOAD)
> */
> #define SLEEP_VRUNTIME_OVERFLOW ((1ull << 63) / NICE_0_LOAD)
Thanks,
Roman.
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Thu, 9 Feb 2023 at 14:33, Roman Kagan <[email protected]> wrote:
>
> On Thu, Feb 09, 2023 at 12:26:12PM +0100, Vincent Guittot wrote:
> > On Wed, 8 Feb 2023 at 19:09, Roman Kagan <[email protected]> wrote:
> > > On Wed, Feb 08, 2023 at 11:13:35AM +0100, Vincent Guittot wrote:
> > > > On Tue, 7 Feb 2023 at 20:37, Roman Kagan <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 12:10:29PM +0100, Vincent Guittot wrote:
> > > > > > On Tue, 31 Jan 2023 at 11:00, Roman Kagan <[email protected]> wrote:
> > > > > > > On Tue, Jan 31, 2023 at 11:21:17AM +0800, Chen Yu wrote:
> > > > > > > > On 2023-01-27 at 17:18:56 +0100, Vincent Guittot wrote:
> > > > > > > > > On Fri, 27 Jan 2023 at 12:44, Peter Zijlstra <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 26, 2023 at 07:31:02PM +0100, Roman Kagan wrote:
> > > > > > > > > >
> > > > > > > > > > > > All that only matters for small sleeps anyway.
> > > > > > > > > > > >
> > > > > > > > > > > > Something like:
> > > > > > > > > > > >
> > > > > > > > > > > > sleep_time = U64_MAX;
> > > > > > > > > > > > if (se->avg.last_update_time)
> > > > > > > > > > > > sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;
> > > > > > > > > > >
> > > > > > > > > > > Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
> > > > > > > > > > > others were suggesting? It appears to better match the notion of sleep
> > > > > > > > > > > wall-time, no?
> > > > > > > > > >
> > > > > > > > > > Should also work I suppose. cfs_rq_clock takes throttling into account,
> > > > > > > > > > but that should hopefully also not be *that* long, so either should
> > > > > > > > > > work.
> > > > > > > > >
> > > > > > > > > yes rq_clock_task(rq_of(cfs_rq)) should be fine too
> > > > > > > > >
> > > > > > > > > Another thing to take into account is the sleeper credit that the
> > > > > > > > > waking task deserves so the detection should be done once it has been
> > > > > > > > > subtracted from vruntime.
> > > > > > > > >
> > > > > > > > > Last point, when a nice -20 task runs on a rq, it will take a bit more
> > > > > > > > > than 2 seconds for the vruntime to be increased by more than 24ms (the
> > > > > > > > > maximum credit that a waking task can get) so the threshold must be
> > > > > > > > > significantly higher than 2 sec. On the opposite side, the lowest
> > > > > > > > > possible weight of a cfs rq is 2, which means that the problem appears
> > > > > > > > > for a sleep longer than or equal to 2^54 = 2^63*2/1024. We should use this
> > > > > > > > > value instead of an arbitrary 200 days.
> > > > > > > > Does it mean any threshold between 2 sec and 2^54 nsec should be fine? Because
> > > > > > > > 1. Any task that sleeps longer than 2 sec will get at most 24 ms (sysctl_sched_latency)
> > > > > > > > of 'vruntime bonus' when enqueued.
> > > > > >
> > > > > > This means that if a task nice -20 runs on cfs rq while your task is
> > > > > > sleeping 2 seconds, the min vruntime of the cfs rq will increase by
> > > > > > 24ms. If there are 2 nice -20 tasks then the min vruntime will
> > > > > > increase by 24ms after 4 seconds and so on ...
> > > > > >
> > > > > > On the other side, a task nice 19 that runs 1ms will increase its
> > > > > > vruntime by around 68ms.
> > > > > >
> > > > > > So if there is 1 task nice 19 with 11 tasks nice -20 on the same cfs
> > > > > > rq, the nice 19 one should run 1ms every 65 seconds and this also
> > > > > > means that the vruntime of the nice 19 task should still be above
> > > > > > min_vruntime after sleeping 60 seconds. Of course this is even worse
> > > > > > with a child cgroup with the lowest weight (weight of 2 instead of 15)
> > > > > >
> > > > > > Just to say that 60 seconds is not so far away and 2^54 should be better IMHO
> > > > >
> > > > > If we go this route, what would be the proper way to infer this value?
> > > > > Looks like
> > > > >
> > > > > (1ull << 63) / NICE_0_LOAD * scale_load(MIN_SHARES)
> > > >
> > > > (1ull << 63) / NICE_0_LOAD * MIN_SHARES
> > >
> > > On 64bit platforms NICE_0_LOAD == 1L << 20 (i.e. it's also scaled) for
> > > better precision. So this will yield 2^63 / 2^20 * 2 = 2^44. Good
> > > enough probably but confusing.
> >
> > Something like the below should be enough to explain the value
> >
> > /*
> > * min_vruntime can move forward much faster than real time. The worst case
> > * happens when an entity with the min weight always runs on the cfs rq. In this
> > * case, the max comparison between vruntime and min_vruntime can fail after a
> > * sleep greater than:
> > * ((1ull << 63) / NICE_0_LOAD) * MIN_SHARES
>
> Sorry if I'm being dense, but aren't NICE_0_LOAD and MIN_SHARES measured
> in different units: the former is scaled while the latter is not?
There are 2 usages of MIN_SHARES:
- one when setting the cgroup weight in __sched_group_set_shares(), which
uses scale_load(MIN_SHARES)
- one when sharing this weight between the cfs_rqs of the group in
calc_group_shares(): clamp_t(long, shares, MIN_SHARES, tg_shares)
The 2nd one is the most important in our case; that's why I use
MIN_SHARES and not scale_load(MIN_SHARES).
>
> > * We can simplify this to:
> > * ((1ull << 63) / NICE_0_LOAD)
> > */
> > #define SLEEP_VRUNTIME_OVERFLOW ((1ull << 63) / NICE_0_LOAD)
>
> Thanks,
> Roman.
On Thu, Feb 09, 2023 at 02:44:49PM +0100, Vincent Guittot wrote:
> On Thu, 9 Feb 2023 at 14:33, Roman Kagan <[email protected]> wrote:
> >
> > On Thu, Feb 09, 2023 at 12:26:12PM +0100, Vincent Guittot wrote:
> > > Something like the below should be enough to explain the value
> > >
> > > /*
> > > * min_vruntime can move forward much faster than real time. The worst case
> > > * happens when an entity with the min weight always runs on the cfs rq. In this
> > > * case, the max comparison between vruntime and min_vruntime can fail after a
> > > * sleep greater than:
> > > * ((1ull << 63) / NICE_0_LOAD) * MIN_SHARES
> >
> > Sorry if I'm being dense, but aren't NICE_0_LOAD and MIN_SHARES measured
> > in different units: the former is scaled while the latter is not?
>
> There are 2 usages of MIN_SHARES:
> - one when setting the cgroup weight in __sched_group_set_shares(), which
> uses scale_load(MIN_SHARES)
> - one when sharing this weight between the cfs_rqs of the group in
> calc_group_shares(): clamp_t(long, shares, MIN_SHARES, tg_shares)
>
> The 2nd one is the most important in our case; that's why I use
> MIN_SHARES and not scale_load(MIN_SHARES).
I see now, thanks a lot for explaining!
I'll post an updated patch later today.
Thanks,
Roman.